I know it's an old topic, but I'll chip in as well. In addition to dmaislin_splunk's suggestions, I've added source values to show the two or more sources where the duplicates are. This gives me a better understanding of why items are duplicated (e.g. are they in a cluster?).
sourcetype=* | rename _raw as raw | eval raw_bytes=len(raw) | transaction raw maxspan=1s keepevicted=true | search eventcount>1 | eval extra_events=eventcount-1 | eval extra_bytes=extra_events*raw_bytes | stats sum(extra_events) as extra_events, sum(eval(extra_bytes/1024.0/1024.0)) as extra_mb values(source) by source | rename "values(source)" as "Duplicated in"
I hate to dig up an old thread, but this appears in the most voted list. Is it still valid to say that transaction is more effective than stats in this instance?
For example, the following search should return the same results:
sourcetype=* | streamstats count as dupes by _time,_raw | search dupes>1 | stats count as extra_events by _raw,host,source | eval raw_bytes=len(_raw) | eval extra_mb=extra_events*raw_bytes/1024.0/1024.0 | stats sum(extra_events) as extra_events, sum(extra_mb) as extra_mb, values(source) by source
I ran this search over a 4-hour window on one of our higher-volume indexes, and it took about 10 minutes to get through about 30 million events. Using the accepted answer to this question, I cancelled the search when it was 3% complete after 15 minutes.
So is transaction no longer appropriate for finding duplicates? Or is this an edge case for very large indexes?
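For comparison, here is a minimal sketch of a stats-only variant of the same idea. It assumes, as the streamstats search above does, that a duplicate is an event with the same _time and _raw as another event. The field names eventcount and sources are just labels chosen for this sketch, and I have not timed it against either search above.

```spl
sourcetype=*
| eval raw_bytes=len(_raw)
| stats count as eventcount, max(raw_bytes) as raw_bytes, values(source) as sources by _time, _raw
| where eventcount > 1
| eval extra_events=eventcount-1
| eval extra_mb=extra_events*raw_bytes/1024.0/1024.0
| stats sum(extra_events) as extra_events, sum(extra_mb) as extra_mb by sources
```

Because sources can be multivalued, the final stats counts a duplicated event once under each source it appeared in.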
I am working on a problem where one transaction may get logged several times, and I need to find events with identical transaction IDs. What I have managed so far is:
index=myindex loglines STATUS_CODE=200 | top TRXID | search count > 1
This gives me the transactions that have been logged multiple times, but when I try what was suggested earlier, I only find identical log lines:
index=myindex loglines STATUS_CODE=200 | eval dupfield = _raw | transaction dupfield maxspan=1m keepevicted=true | search eventcount > 1 | eval extra_events=eventcount-1 | stats sum(extra_events) as extra_events by CLIENTID
What I am trying to find out is how to tell transaction to treat two lines as belonging to one single transaction. The TRXID and CLIENTID fields used in the examples are present on all the lines matched by the keyword loglines.
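A possible sketch for the TRXID case, assuming TRXID and CLIENTID are already extracted as fields at search time: the earlier search only finds identical log lines because it groups on a copy of _raw, so grouping on TRXID instead should tie together lines that share a transaction ID even when the rest of the line differs. The 1-minute maxspan is carried over from the search above and may need adjusting.

```spl
index=myindex loglines STATUS_CODE=200
| transaction TRXID maxspan=1m keepevicted=true
| search eventcount > 1
| eval extra_events=eventcount-1
| stats sum(extra_events) as extra_events by CLIENTID
```

If only the counts are needed, a stats-based version may be enough. Note also that top returns only the 10 most common values by default, so stats count is probably the safer way to list every repeated TRXID:

```spl
index=myindex loglines STATUS_CODE=200
| stats count as eventcount, values(CLIENTID) as CLIENTID by TRXID
| where eventcount > 1
| eval extra_events=eventcount-1
| stats sum(extra_events) as extra_events by CLIENTID
```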
It would be very nice to have an answer to this question, as I've seen similar numbers. As jplumsdaine22 also points out, if this is a dead thread a pointer to a better dupe search resource would be valuable. Thanks!