<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Dataset best practices in Splunk Enterprise</title>
    <link>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746191#M22252</link>
    <description>&lt;P&gt;Hello!&lt;/P&gt;&lt;P&gt;I maintain Splunk reports. Some of the Pivot reports are based on a dataset generated by a simple search. Duplicate values were not taken into account when the dataset was generated.&lt;/P&gt;&lt;P&gt;Due to an error, there were two data sources for a few weeks. This resulted in identical duplicate rows in the dataset.&lt;/P&gt;&lt;P&gt;Going forward, duplicate rows can be removed from the dataset with a simple dedup. However, are there any best practices for fixing this?&lt;/P&gt;</description>
    <pubDate>Wed, 14 May 2025 14:15:39 GMT</pubDate>
    <dc:creator>RdomSplunkUser7</dc:creator>
    <dc:date>2025-05-14T14:15:39Z</dc:date>
    <item>
      <title>Dataset best practices</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746191#M22252</link>
      <description>&lt;P&gt;Hello!&lt;/P&gt;&lt;P&gt;I maintain Splunk reports. Some of the Pivot reports are based on a dataset generated by a simple search. Duplicate values were not taken into account when the dataset was generated.&lt;/P&gt;&lt;P&gt;Due to an error, there were two data sources for a few weeks. This resulted in identical duplicate rows in the dataset.&lt;/P&gt;&lt;P&gt;Going forward, duplicate rows can be removed from the dataset with a simple dedup. However, are there any best practices for fixing this?&lt;/P&gt;</description>
      <pubDate>Wed, 14 May 2025 14:15:39 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746191#M22252</guid>
      <dc:creator>RdomSplunkUser7</dc:creator>
      <dc:date>2025-05-14T14:15:39Z</dc:date>
    </item>
    <item>
      <title>Re: Dataset best practices</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746199#M22253</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/310168"&gt;@RdomSplunkUser7&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I think this ultimately depends on what your searches are doing. If there is a risk of pulling in duplicate data, then dedup is a good option, or you could look at using something like &lt;EM&gt;stats latest(fieldName) as latestFieldName&lt;/EM&gt;.&lt;/P&gt;&lt;P&gt;It really depends on your search(es). If you'd like to share the SPL, we might be able to help further.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-unicode-emoji" title=":glowing_star:"&gt;🌟&lt;/span&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Did this answer help you?&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;If so, please consider:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Adding karma to show it was useful&lt;/LI&gt;&lt;LI&gt;Marking it as the solution if it resolved your issue&lt;/LI&gt;&lt;LI&gt;Commenting if you need any clarification&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Your feedback encourages the volunteers in this community to continue contributing.&lt;/P&gt;</description>
      <pubDate>Wed, 14 May 2025 16:11:11 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746199#M22253</guid>
      <dc:creator>livehybrid</dc:creator>
      <dc:date>2025-05-14T16:11:11Z</dc:date>
    </item>
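The reply above contrasts two deduplication approaches. A minimal sketch in SPL, assuming hypothetical index, sourcetype, and field names:

```spl
index=your_index sourcetype=your_sourcetype
| dedup _raw

index=your_index sourcetype=your_sourcetype
| stats latest(fieldName) as latestFieldName by event_id
```

dedup keeps the first event seen for each distinct _raw value, while the stats form keeps only the latest value per key, which also collapses duplicates but discards the other raw fields.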
    <item>
      <title>Re: Dataset best practices</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746258#M22260</link>
      <description>&lt;P&gt;It is based on a very simple search:&lt;/P&gt;&lt;P&gt;index=&amp;lt;index_name&amp;gt; sourcetype=&amp;lt;blaahaa&amp;gt; field2. After this, a number of fields are extracted using rex.&lt;/P&gt;&lt;P&gt;I would like to include in the search, as a new constraint, a very simple dedup clause: "| dedup _raw".&lt;/P&gt;&lt;P&gt;Is this advisable?&lt;/P&gt;</description>
      <pubDate>Thu, 15 May 2025 14:45:47 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746258#M22260</guid>
      <dc:creator>RdomSplunkUser7</dc:creator>
      <dc:date>2025-05-15T14:45:47Z</dc:date>
    </item>
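A sketch of the search described above with the proposed dedup constraint appended (the index, sourcetype, and field names are placeholders, and the rex field extractions are omitted):

```spl
index=your_index sourcetype=your_sourcetype field2
| dedup _raw
```

Note that dedup takes no trailing pipe unless another command follows it.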
    <item>
      <title>Re: Dataset best practices</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746317#M22266</link>
      <description>&lt;P&gt;I have a situation identical to the one described here: &lt;A href="https://community.splunk.com/t5/Reporting/How-to-not-include-the-duplicated-events-while-accelerating-the/m-p/244884" target="_blank"&gt;https://community.splunk.com/t5/Reporting/How-to-not-include-the-duplicated-events-while-accelerating-the/m-p/244884&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 16 May 2025 12:37:27 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746317#M22266</guid>
      <dc:creator>RdomSplunkUser7</dc:creator>
      <dc:date>2025-05-16T12:37:27Z</dc:date>
    </item>
    <item>
      <title>Re: Dataset best practices</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746318#M22267</link>
      <description>&lt;P&gt;Here is some guidance on how to resolve the problem:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.splunk.com/t5/Splunk-Search/How-to-delete-duplicate-events/td-p/70656" target="_blank"&gt;https://community.splunk.com/t5/Splunk-Search/How-to-delete-duplicate-events/td-p/70656&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 16 May 2025 12:41:11 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746318#M22267</guid>
      <dc:creator>RdomSplunkUser7</dc:creator>
      <dc:date>2025-05-16T12:41:11Z</dc:date>
    </item>
    <item>
      <title>Re: Dataset best practices</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746326#M22268</link>
      <description>IMHO: I don't recommend giving delete permissions to anyone permanently! It isn't a great idea to run scheduled jobs that remove events from Splunk.</description>
      <pubDate>Fri, 16 May 2025 13:54:41 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746326#M22268</guid>
      <dc:creator>isoutamo</dc:creator>
      <dc:date>2025-05-16T13:54:41Z</dc:date>
    </item>
    <item>
      <title>Re: Dataset best practices</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746327#M22269</link>
      <description>&lt;P&gt;I don't like that you cannot add dedup with pipes in the simple base search of a dataset.&lt;/P&gt;&lt;P&gt;Splunk should offer a ready-made method to deduplicate an index.&lt;/P&gt;</description>
      <pubDate>Fri, 16 May 2025 13:58:53 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746327#M22269</guid>
      <dc:creator>RdomSplunkUser7</dc:creator>
      <dc:date>2025-05-16T13:58:53Z</dc:date>
    </item>
    <item>
      <title>Re: Dataset best practices</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746329#M22270</link>
      <description>&lt;P&gt;Unfortunately, at least I don't know of any generic answer for this.&lt;/P&gt;&lt;P&gt;The method presented there is one option, but as said, you need to be 100% sure that it works with your data, and test it several times to be sure!&lt;/P&gt;&lt;P&gt;And of course you must first get rid of the source of the new duplicates and ensure that all your inputs work as they should, without duplicating new events.&lt;/P&gt;&lt;P&gt;After that you could probably do the delete, if you are absolutely sure that it also works in your case. I propose using a temporary account with the can_delete role, granted only for the time needed to do the cleanup.&lt;/P&gt;</description>
      <pubDate>Fri, 16 May 2025 14:10:07 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746329#M22270</guid>
      <dc:creator>isoutamo</dc:creator>
      <dc:date>2025-05-16T14:10:07Z</dc:date>
    </item>
    <item>
      <title>Re: Dataset best practices</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746330#M22271</link>
      <description>Of course you can add dedup to those queries, but it will hurt performance!&lt;BR /&gt;It also depends on how the duplication happened and how you can identify those events. Can you e.g. dedup on _raw, or on a set of fields, or does it need some calculations/modifications (e.g. to timestamps) too?</description>
      <pubDate>Fri, 16 May 2025 14:12:43 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746330#M22271</guid>
      <dc:creator>isoutamo</dc:creator>
      <dc:date>2025-05-16T14:12:43Z</dc:date>
    </item>
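The choice of dedup key discussed above can be sketched like this (index, sourcetype, and field names are hypothetical):

```spl
index=your_index sourcetype=your_sourcetype
| dedup _raw

index=your_index sourcetype=your_sourcetype
| dedup host source event_id
```

Deduplicating on _raw only catches byte-identical events; if timestamps or other fields differ slightly between the duplicated sources, dedup on a chosen set of fields (or on values normalized with eval) is needed instead.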
    <item>
      <title>Re: Dataset best practices</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746364#M22279</link>
      <description>&lt;P&gt;Maybe I can make a new dataset which is the original dataset minus the identical duplicate log lines. Are there any tutorials for this? I am a newbie super user, responsible for just some reports. I hate this role.&lt;/P&gt;</description>
      <pubDate>Fri, 16 May 2025 19:37:38 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746364#M22279</guid>
      <dc:creator>RdomSplunkUser7</dc:creator>
      <dc:date>2025-05-16T19:37:38Z</dc:date>
    </item>
    <item>
      <title>Re: Dataset best practices</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746366#M22280</link>
      <description>How big and complex is your dataset, and how much does its content change? And how long a time span does it cover?</description>
      <pubDate>Fri, 16 May 2025 19:46:53 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746366#M22280</guid>
      <dc:creator>isoutamo</dc:creator>
      <dc:date>2025-05-16T19:46:53Z</dc:date>
    </item>
    <item>
      <title>Re: Dataset best practices</title>
      <link>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746371#M22281</link>
      <description>&lt;P&gt;My situation is quite simple. I lack the basic training. ChatGPT showed me the way. Maybe this can be an answer for another Splunk report newbie super user.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;”Ah, so you want to deduplicate data and use it in a Pivot table – great clarification! Pivot in Splunk is based on Data Models, and indeed, there are limitations on SPL commands (like | dedup _raw) in that context.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;🔍 The Problem: Pivot uses a Data Model, and in the base search of a Data Model you cannot use pipe (|) commands like dedup.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;🎯 Your Goal: Remove duplicates based on _raw and still use the data in a Pivot table.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;✅ Solution Options for Use in Pivot:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;🔁 Option 1: Create a Saved Search with dedup, then build a Data Model on top of it. This is the recommended method.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Step 1: Create a Saved Search&lt;/SPAN&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;SPAN&gt;Go to Splunk’s Search view.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Enter the SPL: index=your_index sourcetype=your_sourcetype | dedup _raw&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Click Save As → Report.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Name it, for example: Deduped Raw Events.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;SPAN&gt;Step 2: Create a new Data Model based on that Saved Search&lt;/SPAN&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;SPAN&gt;Go to Settings → Data Models → New Data Model.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Give it a name and save it.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Add a new Object and choose Object Type: Event, with the constraint savedsearch="Deduped Raw Events". (NOTE: savedsearch="your_report_name" references the saved search.)&lt;/SPAN&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;SPAN&gt;Step 3: Use Pivot on top of this Data Model&lt;/SPAN&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;SPAN&gt;Go to Pivot → Select your new Data Model → Deduped Raw Events.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Build your table as desired.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;SPAN&gt;⚠️ Notes:&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;This only works if the saved search is public (shared) or you have permission to use it.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;The Saved Search must return fields that you can use in Pivot (like _time, host, source, custom fields, etc.).&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN&gt;🧪 Option 2: Simulate Dedup within the Data Model (if possible). Data Models do not allow | dedup, but you can:&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;Add an auto-extracted field, which lets you group by that field in Pivot.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Or, if you have a unique identifier (e.g., event_id), use first-value or latest-value aggregations in Pivot to simulate deduplication.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN&gt;📌 Summary:&lt;/SPAN&gt;&lt;/P&gt;&lt;TABLE cellspacing="0" cellpadding="0"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;SPAN&gt;Method&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;SPAN&gt;Dedup Allowed?&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;SPAN&gt;Usable in Pivot?&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;SPAN&gt;Saved Search + dedup&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;SPAN&gt;✅&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;SPAN&gt;✅&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;SPAN&gt;Native Data Model search&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;SPAN&gt;❌&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;SPAN&gt;✅&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;SPAN&gt;SPL with pipes in Pivot UI&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;SPAN&gt;❌ (not allowed)&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;SPAN&gt;✅ but very limited&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&lt;SPAN&gt;If you’d like, I can also help you write the full search or configure it for a specific type of data or log source – just let me know what you’re using it for in Pivot!”&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The data is very simple event-log-type data, and the amount of data is small. There is a unique field in the log lines (an event ID). The question was about how to tweak an existing dataset. Splunk is not good for these types of business reports, which should be moved to another reporting platform (e.g. MSBI).&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 16 May 2025 20:19:20 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Enterprise/Dataset-best-practices/m-p/746371#M22281</guid>
      <dc:creator>RdomSplunkUser7</dc:creator>
      <dc:date>2025-05-16T20:19:20Z</dc:date>
    </item>
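Since the post above mentions a unique event ID in the log lines, a hedged sketch of building a deduplicated result set (index, sourcetype, and the event_id field name are placeholders):

```spl
index=your_index sourcetype=your_sourcetype
| dedup event_id

index=your_index sourcetype=your_sourcetype
| stats latest(_raw) as _raw latest(_time) as _time by event_id
```

Saved as a report, either search could serve as the base of a data model, as outlined in the quoted answer.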
  </channel>
</rss>

