I suspect that I may have duplicate events indexed by Splunk. The cause may be dupes in my originating files, OR my Splunk configuration may be indexing some events two or more times.
To be sure, what search can I run to find all my duplicate events currently within my Splunk index?
I think it's safe to assume that if an event is duplicated (same value for _raw), then the duplicates and the original should have the same timestamp. Therefore, it should be possible to include maxspan=1s, like so:
... | eval dupfield=_raw | transaction dupfield maxspan=1s keepevicted=true
I'm not sure about Gerald's comment about multi-line events, since my duplicate checking was limited to single-line events, but it seems to me that some kind of sed trick could be used, like so:
... | eval dupfield=_raw | rex mode=sed field=dupfield "s/[\r\n]/<EOL>/g" | transaction dupfield maxspan=1s keepevicted=true
BTW, I found the transaction-based approach to be much faster than the stats approach suggested in the comments above, and much less restrictive. (It seems like stats has a 10,000 entry limit on the "by" clause.)
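For comparison, a stats-based search would look roughly like the sketch below. This is my own minimal version, not necessarily the exact search from the comments; dupfield is just the same helper field name used above.

```
... | eval dupfield=_raw | stats count by dupfield | where count > 1
```

Each result row is one distinct event text that appears more than once, with count giving the number of copies. Unlike the transaction version, this does not require the copies to share a timestamp, so it may also catch duplicates that were indexed at different times.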
Also, in my case I was trying to not only get a count of duplicate events but figure out the extra volume (in bytes) that could have been avoided if the data was de-duped externally before being loaded. I used a search like this:
sourcetype=my_source_type | rename _raw as raw | eval raw_bytes=len(raw) | transaction raw maxspan=1s keepevicted=true | search eventcount>1 | eval extra_events=eventcount-1 | eval extra_bytes=extra_events*raw_bytes | timechart span=1d sum(extra_events) as extra_events, sum(eval(extra_bytes/1024.0/1024.0)) as extra_mb
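This shows you the impact in megabytes per day. Note that len() counts characters rather than bytes, so the figure is approximate for data containing multi-byte characters.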