I've been looking into ways to remove duplicate events using a search. Finding them is not an issue; there are various methods, such as the ones noted here:
http://splunk-base.splunk.com/answers/432/how-do-i-find-all-duplicate-events
Great, this is fantastic information. However, |delete won't work after a non-streaming command, so actually removing the events as described won't work.
Yes, dedup works fine, except when you have an oops and all of the data in a particular index is suddenly duplicated (or in some cases triplicated, resulting in a multitude of dupes across a multitude of events). Now any dashboard that relies on historical data has to be edited to include dedup for clarity. Our goal is to use |delete on the duplicate events so they are, by default, not searched.
The whys and wherefores of HOW the data came to be duplicated are not of interest here.
Does anyone have a straightforward search for removing duplicates (using |delete) that can take a large number of duplicate records and leave only one?
There is no straightforward, easy way to do certain things efficiently in a distributed environment. This is especially true of deduping, which by its very nature can be complex and expensive. For precisely this reason, the why and the how should definitely be discussed, so that the problem gets solved instead of the symptom; attacking the root cause may prevent this from happening again. Having said that, here are your options:
a. modify your searches to include a deduplication pipeline (perhaps a transaction
with maxspan=1s
) before your stats or other reporting commands. Wrap that in a macro and inject it in every search that operates on said data.
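As a sketch of option (a), assuming a macro named dedup_raw (the macro name and the exact transaction options are hypothetical; adjust them to your data):

# macros.conf (sketch; verify stanza syntax against your version's docs)
[dedup_raw]
definition = transaction _raw maxspan=1s keepevicted=true

Every report can then call it before its reporting commands, e.g. index=myindex `dedup_raw` | stats count by host.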
b. dedup your data and, while at it, create a summary index using only the fields you're interested in. Run your searches on the summary.
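One way to sketch option (b) is with the collect command, writing deduped events into a summary index. The index name summary_clean and the field list are placeholders, and the target index must already exist:

index=myindex
| dedup _raw
| fields host, source, status
| collect index=summary_clean

Dashboards and reports then search index=summary_clean instead of the original data.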
c. Use | delete
in the long and non-trivial way:
Each event in any Splunk deployment, distributed or not, can be uniquely identified by a combination (e.g. a concatenation) of the following fields: index, splunk_server and _cd. Let's call this field id. The plan: group the duplicate events (transaction works for this, or your own preferred method), keep each group's ids as a multivalued list (the mvlist option in transaction), select every id but the first with mvindex, and pipe those events to delete. Sample search to build the lookup table; modify transaction options as necessary.
source="/tmp/dupes.txt"
| eval id=_cd."|".index."|".splunk_server
| transaction _raw maxspan=1s keepevicted=true mvlist=t
| search eventcount>1
| eval delete_id=mvindex(id, 1, -1)
| stats count by delete_id
| fields - count
| outputlookup dupes
Proof of concept:
# /opt/splunk/bin/splunk add oneshot /tmp/fullODupes.txt
source="/tmp/fullODupes.txt"
| eval id=_cd."|".index."|".splunk_server
| search
[ search source="/tmp/fullODupes.txt"
| eval id=_cd."|".index."|".splunk_server
| transaction _raw maxspan=1s keepevicted=true mvlist=t
| search eventcount>1
| eval delete_id=mvindex(id, 1, -1)
| stats count by delete_id
| fields - count
| return 20 id=delete_id]
| delete
Note that this search has not been tested with a large number of events; it may run into stats or return limits. A lookup table may be the best way to go about it.
This solution should delete every duplicate event. You will need one unique field in your index of events.
My solution was a subsearch that returns a deduped list of events, where the returned value is a unique field. Remember that you need a user role with the delete capability to run the delete, so check your capabilities before attempting this.
1) Run the search index=<your_index>
Record the number of events returned by the search. This is the count of events including the duplicates. Also note a day and time where you see duplicates.
2) Next run the search index=<your_index> ([search index=<your_index> | dedup <field_with_duplicates> | fields <unique_field>])
This will return your index less the duplicates. Check the results and make sure there are no duplicates where you noticed duplicates in Step 1. Record the number of events returned by the search.
If Step 2 returns an error because the subsearch failed, reduce the search results by narrowing the time range to a single day, or some other reduction that avoids the error. Repeat Step 1 with that reduction.
3) Subtract the count from Step 2 from the count from Step 1. Record this number. This is the number of duplicates.
4) Run the search from Step 2 with a VERY IMPORTANT addition (the NOT): index=<your_index> NOT([search index=<your_index> | dedup <field_with_duplicates> | fields <unique_field>])
Check the number of events; this search should return the same number of events as the number you recorded in Step 3.
5) If the event count of Step 4 isn't the same as the count you recorded in Step 3, STOP: you've done something wrong.
6) If the event count of Step 4 IS THE SAME EVENT COUNT YOU RECORDED IN Step 3, then you are ready to delete.
7) Final Step... index=<your_index> NOT([search index=<your_index> | dedup <field_with_duplicates> | fields <unique_field>]) | delete
8) Carefully clear this search once it runs; leaving a search with |delete in the search box is super dangerous. This should return a deleted-event count equal to the count from Step 3, the count of duplicates. That is what you deleted.
9) Final check... Run your original search index=<your_index>. The returned event count should be the same as the count from Step 2. These should be the deduped events.
I found a great, simple way to do it using streamstats and _cd to get only the correct events. The trick is to first find the events, then use that as a subsearch to select them, since you can't delete events after using streamstats.
Use at your own discretion. You should run the search without |delete first to verify you get only what you want to delete.
Example:
<your search for relevant events> | eval eid=_cd | search [<same search for relevant events> | streamstats count by _raw | search count>1 | eval eid=_cd | fields eid] | delete
@drippler Your search worked for me as well. It's pretty straightforward and simple. Thanks!
Note: it deletes only 10k events at a time, so you may have to run the search multiple times.
index="Indexname" sourcetype="sourcetype" | eval eid=_cd | search [search index="Indexname" sourcetype="sourcetype" | streamstats count by _raw | search count>1 | eval eid=_cd | fields eid] |delete
Works like a charm. Nice hack, using streamstats and then search count>1.
Thanks for the share!
It is, in fact, a bit expensive because of the subsearch, but it's probably the most straightforward way.
Thanks drippler, worked great for my situation.
Also worked for me, but note that the subsearch only allows a max of 10,000 returned events. I had to run it a couple of times for my case.
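That 10,000-event cap is the default subsearch result limit, which can be raised in limits.conf. Raising it has memory and performance costs, so treat this as a sketch and verify the setting against your version's documentation:

# limits.conf
[subsearch]
maxout = 50000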
Hmm, I tried this and it's not working. Something wrong with my syntax? I did have to put a second "search" after the "[" for the subquery to get past a Splunk error ("Unknown search command 'index'").
index=myindex source=mysource Description=mydescription | eval eid=_cd | search [ search index=myindex source=mysource Description=mydescription | streamstats count by _raw | search count>1 | eval eid=_cd | fields eid]
Thank you!
Chris
You're right about the inner "search" command. What didn't work for you? Are you sure your events are duplicates, including their timestamps?
Perfect! Works for me. Thanks
Thanks Brian! Is "fields | format" more efficient than "table"? I mean, why not just "table delete_id" in the subsearch? BTW, I made macros of this, but I'm pondering an external program that would display an overview of the search, ask for confirmation, then do the delete. Useful?
5) remove delete capability (through UI or CLI) to prevent unauthorized or accidental deletions
2) review what we're going to delete
| inputlookup delete_these.csv
3) enable delete capability (through UI or CLI)
4) perform the delete by searching against the original data for only the items we want to delete (using our list of unique event identifiers) and piping the result to the delete command. Hint: try it without the delete command first!
index=* sourcetype=dupes
| eval delete_id=_cd."|".index."|".splunk_server
| search [|inputlookup delete_these.csv | fields delete_id | format "(" "(" "OR" ")" "OR" ")"]
| delete
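For the enable/disable steps, here is a sketch of how the capability looks in authorize.conf. In recent Splunk versions the built-in can_delete role already carries the delete_by_keyword capability, so in practice you would temporarily add your user to can_delete through the UI, run the delete, then remove the role; the stanza below is for illustration only and should be verified against your version's docs:

# authorize.conf
[role_can_delete]
delete_by_keyword = enabled
importRoles = user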
Well, d already explained it completely, so why show off further? Almost every Splunk question goes unanswered or poorly answered. Since you seem to know a lot, can you solve this one?
http://answers.splunk.com/answers/236068/duplicate-logs-are-coming-after-a-time-period.html
1) mark for delete
note that we create a tuple that uniquely identifies the event with eval id=...
we then group duplicate events with the transaction command and mark every copy except the first for deletion
...because we want to keep one of the events
finally, we record the ones we've marked for deletion in a csv file
index=* sourcetype=dupes
| eval id=_cd."|".index."|".splunk_server
| transaction _raw maxspan=1s keepevicted=true mvlist=t
| search eventcount>1
| eval delete_id=mvindex(id, 1, -1)
| stats c by delete_id
| outputlookup delete_these.csv
d and watsm10 have it right. We must first build a list of events to delete, then delete them separately. This is for a few reasons:
A) We want to verify what we're going to delete first
B) We can't delete after a non-streaming command (such as a transforming command)
So we want to take these high level steps:
1) mark for delete
2) review what we're going to delete
3) enable delete capability
4) delete
5) remove delete capability
I would like to say...
Thank you so much for this answer!!!
I used the following search to work out the duplicates:
index=
and the following search to delete them:
index=