I've been looking into ways to remove duplicate events using a search. Finding them is not an issue; there are various methods, such as the ones noted here:
http://splunk-base.splunk.com/answers/432/how-do-i-find-all-duplicate-events
Great, this is fantastic information. However, |delete won't work after a non-streaming command, so actually removing the events as described won't work.
Yes, dedup works fine, except when you have an oops and all of the data in a particular index is suddenly duplicated (or in some cases triplicated, resulting in a multitude of dupes across a multitude of events). Now any dashboard that relies on historical data has to be edited to include dedup for clarity. Our goal is to use |delete on the duplicate events so they are, by default, not searched.
The whys and wherefores of HOW the data came to be duplicated are not of interest here.
Does anyone have a straightforward search for removing duplicates (using |delete) that can take a large number of duplicate records and leave only one?
There is no straightforward, easy way to do certain things efficiently in a distributed environment. This is especially true of deduping, which by its very nature can be complex and expensive. For precisely this reason, the why and the how should definitely be discussed, so that the problem gets solved instead of the symptom; attacking the root cause may prevent this from happening again. Having said that, here are your options:
a. modify your searches to include a deduplication pipeline (perhaps a transaction
with maxspan=1s
) before your stats or other reporting commands. Wrap that in a macro and inject it in every search that operates on said data.
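As a sketch of option (a), assuming a macro named dedup_raw (the macro name and the exact transaction options are hypothetical; adjust them to your data):

# macros.conf (sketch; verify stanza syntax against your version's docs)
[dedup_raw]
definition = transaction _raw maxspan=1s keepevicted=true

Every report can then call it before its reporting commands, e.g. index=myindex `dedup_raw` | stats count by host.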
b. dedup your data and, while at it, create a summary index using only the fields you're interested in. Run your searches on the summary.
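One way to sketch option (b) is with the collect command, writing deduped events into a summary index. The index name summary_clean and the field list are placeholders, and the target index must already exist:

index=myindex
| dedup _raw
| fields host, source, status
| collect index=summary_clean

Dashboards and reports then search index=summary_clean instead of the original data.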
c. Use | delete
in the long and non-trivial way:
Each event in any Splunk deployment, distributed or not, can be uniquely identified by a combination (e.g. a concatenation) of the following fields: index, splunk_server and _cd. Let's call this field id. The plan: group the duplicate events (transaction works for this, or your own preferred method), keep each group's ids as a multivalued list (the mvlist option in transaction), select every id but the first with mvindex, and pipe those events to delete. Sample search to build the lookup table; modify transaction options as necessary.
source="/tmp/dupes.txt"
| eval id=_cd."|".index."|".splunk_server
| transaction _raw maxspan=1s keepevicted=true mvlist=t
| search eventcount>1
| eval delete_id=mvindex(id, 1, -1)
| stats count by delete_id
| fields - count
| outputlookup dupes
Proof of concept:
# /opt/splunk/bin/splunk add oneshot /tmp/fullODupes.txt
source="/tmp/fullODupes.txt"
| eval id=_cd."|".index."|".splunk_server
| search
[ search source="/tmp/fullODupes.txt"
| eval id=_cd."|".index."|".splunk_server
| transaction _raw maxspan=1s keepevicted=true mvlist=t
| search eventcount>1
| eval delete_id=mvindex(id, 1, -1)
| stats count by delete_id
| fields - count
| return 20 id=delete_id]
| delete
Note that this search has not been tested with a large number of events; it may run into stats or return limits. A lookup table may be the best way to go about it.
This solution should delete every duplicate event. You will need one unique field in your index of events.
My solution was a subsearch that returns a deduped list of events, where the returned value is a unique field. Remember that you need a user role with the delete capability to run the delete, so check your capabilities before attempting this.
1) Run the search index=<your_index>
Record the number of events returned by the search. This is the count of events including the duplicates. Also note a day and time where you see duplicates.
2) Next run the search index=<your_index> ([search index=<your_index> | dedup <field_with_duplicates> | fields <unique_field>])
This will return your index less the duplicates. Check the results and make sure there are no duplicates where you noticed duplicates in Step 1. Record the number of events returned by the search.
If Step 2 returns an error because the subsearch failed, reduce the search results by narrowing the time range to a single day, or some other reduction that avoids the error. Repeat Step 1 with that reduction.
3) Subtract the count from Step 2 from the count from Step 1. Record this number. This is the number of duplicates.
4) Run the search from Step 2 with a VERY IMPORTANT addition (the NOT): index=<your_index> NOT([search index=<your_index> | dedup <field_with_duplicates> | fields <unique_field>])
Check the number of events; this search should return the same number of events as the number you recorded in Step 3.
5) If the event count of Step 4 isn't the same as the count you recorded in Step 3, STOP: you've done something wrong.
6) If the event count of Step 4 IS THE SAME EVENT COUNT YOU RECORDED IN Step 3, then you are ready to delete.
7) Final Step... index=<your_index> NOT([search index=<your_index> | dedup <field_with_duplicates> | fields <unique_field>]) | delete
8) Carefully clear this search once it runs; leaving a search with |delete in the search box is super dangerous. This should return a deleted-event count equal to the count from Step 3, the count of duplicates. That is what you deleted.
9) Final check... Run your original search index=<your_index>. The returned event count should be the same as the count from Step 2. These should be the deduped events.
I found a great, simple way to do it using streamstats and _cd to get only the correct events. The trick is to first find the events, then use that as a subsearch to select them, since you can't delete events after using streamstats.
Use at your own discretion. You should run the search without |delete first to verify you get only what you want to delete.
Example:
<your search for relevant events> | eval eid=_cd | search [<same search for relevant events> | streamstats count by _raw | search count>1 | eval eid=_cd | fields eid] | delete
@drippler Your search worked for me as well. It's pretty straightforward and simple. Thanks!
Note: it deletes only 10k events at a time, so you may have to run the search multiple times.
index="Indexname" sourcetype="sourcetype" | eval eid=_cd | search [search index="Indexname" sourcetype="sourcetype" | streamstats count by _raw | search count>1 | eval eid=_cd | fields eid] |delete
Works like a charm. Nice hack, using streamstats and then search count>1.
Thanks for the share!
It is, in fact, a bit expensive because of the subsearch, but it's probably the most straightforward way.
Thanks drippler, worked great for my situation.
Also worked for me, but note that the subsearch only allows a max of 10,000 returned events. I had to run it a couple of times for my case.
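That 10,000-event cap is the default subsearch result limit, which can be raised in limits.conf. Raising it has memory and performance costs, so treat this as a sketch and verify the setting against your version's documentation:

# limits.conf
[subsearch]
maxout = 50000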
Hmm, I tried this and it's not working. Something wrong with my syntax? I did have to put a second "search" after the "[" for the subquery to get past a Splunk error ("Unknown search command 'index'").
index=myindex source=mysource Description=mydescription | eval eid=_cd | search [ search index=myindex source=mysource Description=mydescription | streamstats count by _raw | search count>1 | eval eid=_cd | fields eid]
Thank you!
Chris
You're right about the inner "search" command. What didn't work for you? Are you sure your events are duplicates, including their timestamps?
Perfect! Works for me. Thanks
Thanks Brian! Is "fields | format" more efficient than "table"? I mean, why not just "table delete_id" in the subsearch? BTW, I made macros of this, but I'm pondering an external program that would display an overview of the search, ask for confirmation, then do the delete. Useful?
5) remove delete capability (through UI or CLI) to prevent unauthorized or accidental deletions
2) review what we're going to delete
| inputlookup delete_these.csv
3) enable delete capability (through UI or CLI)
4) perform the delete by searching against the original data for only the items we want to delete (using our list of unique event identifiers) and piping the result to the delete command. Hint: try it without the delete command first!
index=* sourcetype=dupes
| eval delete_id=_cd."|".index."|".splunk_server
| search [|inputlookup delete_these.csv | fields delete_id | format "(" "(" "OR" ")" "OR" ")"]
| delete
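For the enable/disable steps, here is a sketch of how the capability looks in authorize.conf. In recent Splunk versions the built-in can_delete role already carries the delete_by_keyword capability, so in practice you would temporarily add your user to can_delete through the UI, run the delete, then remove the role; the stanza below is for illustration only and should be verified against your version's docs:

# authorize.conf
[role_can_delete]
delete_by_keyword = enabled
importRoles = user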
Well, d already explained it completely, so why show off further? Almost every Splunk question goes unanswered or poorly answered. Since you seem to know a lot, can you solve this one?
http://answers.splunk.com/answers/236068/duplicate-logs-are-coming-after-a-time-period.html
1) mark for delete
note that we create a tuple that uniquely identifies the event with eval id=...
we then group duplicate events with the transaction command and mark every copy except the first for deletion
...because we want to keep one of the events
finally, we record the ones we've marked for deletion in a csv file
index=* sourcetype=dupes
| eval id=_cd."|".index."|".splunk_server
| transaction _raw maxspan=1s keepevicted=true mvlist=t
| search eventcount>1
| eval delete_id=mvindex(id, 1, -1)
| stats c by delete_id
| outputlookup delete_these.csv
d and watsm10 have it right. We must first build a list of events to delete, then delete them separately. This is for a few reasons:
A) We want to verify what we're going to delete first
B) We can't delete after a non-streaming command (such as a transforming command)
So we want to take these high level steps:
1) mark for delete
2) review what we're going to delete
3) enable delete capability
4) delete
5) remove delete capability
I would like to say...
Thank you so much for this answer!!!
I used the following search to work out the duplicates:
index=
and the following search to delete them:
index=