I suspect that I may have duplicate events indexed by Splunk. The cause may be that my originating files contain dupes, OR my Splunk configuration may be indexing some events two or more times.
To be sure, what search can I run to find all my duplicate events currently within my Splunk index?
I think it's safe to assume that if an event is duplicated (same value for _raw), then the duplicates and the original should have the same timestamp. Therefore, it should be possible to include maxspan=1s, like so:
... | eval dupfield=_raw | transaction dupfield maxspan=1s keepevicted=true
I'm not sure about Gerald's comment about multi-line events, since my duplicate catching was limited to single-line events, but it seems to me that some kind of sed trick could be used, like so:
... | eval dupfield=_raw | rex mode=sed field=dupfield "s/[\r\n]/<EOL>/g" | transaction dupfield maxspan=1s keepevicted=true
BTW, I found the transaction-based approach to be much faster than the stats approach suggested in the comments above, and much less restrictive. (It seems like stats has a 10,000 entry limit on the "by" clause.)
Also, in my case I was trying to not only get a count of duplicate events but figure out the extra volume (in bytes) that could have been avoided if the data was de-duped externally before being loaded. I used a search like this:
sourcetype=my_source_type | rename _raw as raw | eval raw_bytes=len(raw) | transaction raw maxspan=1s keepevicted=true | search eventcount>1 | eval extra_events=eventcount-1 | eval extra_bytes=extra_events*raw_bytes | timechart span=1d sum(extra_events) as extra_events, sum(eval(extra_bytes/1024.0/1024.0)) as extra_mb
This shows you the impact in megabytes per day.
For a simpler and easier approach, you can use the SPL below (replace the index name with your own) to find any duplicates in Splunk.
index=* | stats count by _raw, index, sourcetype, source, host | where count>1

I hate to dig up an old thread, but this appears in the most-voted list. Is it still valid to say that transaction is more effective than stats in this instance?
For example, the following search should return the same results:
sourcetype=* | streamstats count as dupes by _time,_raw
| search dupes > 1
| stats count as extra_events by _raw,host,source
| eval raw_bytes=len(_raw) | eval extra_mb=extra_events*raw_bytes/1024/1024
| stats sum(extra_events) as extra_events, sum(extra_mb) as extra_mb, values(source) by source
I ran a 4-hour search over one of our higher-volume indexes, and it took about 10 minutes to run over about 30 million events. Using the accepted answer to this question, I cancelled the search when it was 3% complete after 15 minutes.
So is transaction no longer appropriate for finding duplicates? Or is this an edge case for very large indices?

It would be very nice to have an answer to this question, as I've seen similar numbers. As jplumsdaine22 also points out, if this is a dead thread, a pointer to a better dupe-search resource would be valuable. Thanks!
Hi folks!
I am working on a problem where one transaction may get logged several times, and I need to find events with identical transaction IDs. What I have managed to do is:
index=myindex loglines STATUS_CODE=200
| top TRXID
| search count > 1
This gives me the transactions that have been logged multiple times, but when I try doing what is suggested earlier, I only find identical log lines.
index=myindex loglines STATUS_CODE=200
| eval dupfield = _raw
| transaction dupfield maxspan=1m keepevicted=true
| search eventcount > 1
| eval extra_events=eventcount-1
| stats sum(extra_events) as extra_events by CLIENTID
What I am trying to find out is how to tell transaction to treat two such lines as belonging to one single transaction. The TRXID and CLIENTID mentioned in the examples are present on all the lines matched by the keyword loglines.
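If I understand the transaction docs right, I could give it the ID field to group on instead of the copied _raw; a rough, unverified sketch of what I have in mind (TRXID and CLIENTID are my own field names) would be:
index=myindex loglines STATUS_CODE=200
| transaction TRXID maxspan=1m keepevicted=true
| search eventcount > 1
| eval extra_events=eventcount-1
| stats sum(extra_events) as extra_events by CLIENTID
Is grouping on TRXID like this the right way to make transaction treat the repeated log lines as one transaction?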
I know it's an old topic, but I'll chip in as well here. In addition to dmaislin_splunk's suggestions, I've added source values to show the two or more sources where the duplicates are. This gives me a better understanding of why items are duplicated (e.g. are they in a cluster?):
sourcetype=*
| rename _raw as raw
| eval raw_bytes=len(raw)
| transaction raw maxspan=1s keepevicted=true
| search eventcount>1
| eval extra_events=eventcount-1
| eval extra_bytes=extra_events*raw_bytes
| stats sum(extra_events) as extra_events, sum(eval(extra_bytes/1024.0/1024.0)) as extra_mb values(source) by source
| rename "values(source)" as "Duplicated in"
Regards,
Ken

Original fixed due to some typos:
sourcetype=* | rename _raw as raw | eval raw_bytes=len(raw) | transaction raw maxspan=1s keepevicted=true | search eventcount>1 | eval extra_events=eventcount-1 | eval extra_bytes=extra_events*raw_bytes | timechart span=1s sum(extra_events) as extra_events, sum(eval(extra_bytes/1024.0/1024.0)) as extra_mb
To show the number of events and size by host and sourcetype:
sourcetype=* | rename _raw as raw | eval raw_bytes=len(raw) | transaction raw maxspan=1s keepevicted=true | search eventcount>1 | eval extra_events=eventcount-1 | eval extra_bytes=extra_events*raw_bytes |stats sum(extra_events) as extra_events, sum(eval(extra_bytes/1024.0/1024.0)) as extra_mb by host,sourcetype
Stephen, it would be nice if there was a search command that could remove the duplicates minus one; I'm not sure what the impact would be.
* | tag_dupes | delete
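In the meantime, a sketch of a search-time workaround that keeps only the first copy in the results (it only filters what the search returns; nothing is removed from the index):
* | streamstats count as copy_number by _time,_raw | where copy_number=1
Flipping the filter to where copy_number>1 lists just the extra copies instead.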

Lowell is absolutely right that this transaction approach will be MUCH, MUCH faster than anything involving stats, because of its favorable eviction policy. transaction, especially with maxspan set, will only keep data for the current second in memory as the search scans backwards through time.

Try appending this search string to your current search to find duplicates:
| transaction fields="_time,_raw" connected=f keepevicted=t | search linecount > 1
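For example, appended to a simple base search (the index and sourcetype here are just placeholders):
index=main sourcetype=access_combined | transaction fields="_time,_raw" connected=f keepevicted=t | search linecount > 1
For single-line data, each result with linecount > 1 is a group of identical events sharing the same timestamp, i.e. duplicates.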

+1, needed. Has it been filed?
(Don't forget to accept your current answer, unless it doesn't satisfy.)

Agreed. showdupes filter=all|latest would be very beneficial, especially when debugging input configs.

Actually, now that I think about it, this might be better:
| stats count by _time,_raw | rename _raw as raw | where count > 1
But an ER for a showdupes search command might be best.

This won't work if the original data is multiline. But you could fix that with:
| rename duration as original_duration | transaction _time,_raw | search duration=*
The transaction will also be rather more efficient if you set maxspan=0 and maxopentxn=1, provided your duplicates are consecutive.
