Splunk Search

What is the best strategy for handling overlapping data?

Marinus
Communicator

Some sources will produce data that overlaps, i.e. you get some of the data you have already indexed. This can have quite a negative effect on search performance, especially if you have to dedup whole events. Is there a best practice for dealing with such a scenario?

highiqboy
Explorer

@cfrantsen - I think maverick is saying you can use Splunk's alerting feature, only instead of sending an email alert when the scheduled search finds the duplicates, you choose the last option in the scheduled saved search popup window and tell Splunk to add the dedup'd search results to a secondary index that you create on the index management page exactly for this purpose.
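As a sketch of that approach: the `collect` command can write a search's results into another index. Assuming a summary index named `deduped` has already been created, and using `_raw` plus hypothetical metadata fields as the dedup key, a scheduled search run every 10 minutes might look like:

```
index=main earliest=-10m@m latest=@m
| dedup host source _raw
| collect index=deduped
```

Ad-hoc searches would then run against `index=deduped` instead of `index=main`. The index name and time window here are illustrative, not prescribed by the thread.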


cfrantsen
Explorer

Regarding option c, what is the best way to "save off the results into a new index"?


Marinus
Communicator

Thanks for the feedback, very useful. Let's assume that you can't control your source to remove duplicates. One idea I've been toying with is to create a new search command like dedup, say dedup2, that returns every event in each duplicate group except one, so that the extras can be deleted, e.g. * | dedup2 fieldx | delete. You could run that on a regular schedule; I just don't know what it would actually do in the index and whether it would help over time.
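For what it's worth, something close to the proposed dedup2 can be sketched with stock SPL, using `streamstats` to number the copies within each duplicate group and keeping only the extras for deletion (`fieldx` is the example field from the post above):

```
index=main
| streamstats count AS copy_number BY fieldx
| search copy_number > 1
| delete
```

Two caveats, hedged but worth knowing: `delete` requires the can_delete capability, and it only makes events unsearchable; it does not reclaim disk space in the index, which may bear on whether running this regularly "actually helps" over time.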


splunkedout
Explorer

My first thought was the same as maverick's option c above. This is what I would do if I were really constrained.


maverick
Splunk Employee

There are a couple of ways to approach this type of scenario.

The first is to address the root cause of the duplicate events and try to resolve it at the source. Overlapping events, where one event is an exact replica of one or more events generated elsewhere, are not typical, at least not in my experience. Perhaps you could post a separate question describing that situation in more detail, and we can help you resolve it.

The second is to assume that you cannot resolve the duplicate events and instead optimize search performance in other ways, such as:

a) turning off auto key/value extraction

b) piping to the "fields" command and listing only the two or three fields you need in your search results (this also avoids auto key/value extraction by default)

c) setting up a scheduled saved search that dedups all events every few minutes in the background, saves the results (i.e. only the unique events) into a new index you create for this purpose, and then running your ad-hoc searches against that new index instead of the main index.
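Option (b) above can be illustrated with a minimal example; the sourcetype and field names here are hypothetical placeholders, not anything specified in the thread:

```
index=main sourcetype=access_combined
| fields host, status, uri
```

Restricting the pipeline to an explicit field list keeps Splunk from spending time on automatic key/value extraction for fields the search never uses, which is where much of the per-event cost goes when events are large.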
