Splunk Search

How to delete duplicate events?

Flynt
Splunk Employee

I've been looking into some ways to remove duplicate events using a search. Finding them is not an issue. We can use various methods such as noted here -

http://splunk-base.splunk.com/answers/67033/how-to-remove-duplicate-events-in-search-results-without...

Or

http://splunk-base.splunk.com/answers/432/how-do-i-find-all-duplicate-events

Great, this is fantastic information. However, |delete won't work after a non-streaming command, so actually removing the events as described won't work.

Yes, dedup works fine… except in cases where you have an oops and all of your data in a particular index is now duplicated (or in some cases triplicated, resulting in a multitude of dupes across a multitude of events). Now any dashboards that rely on historical data need to be edited to include dedup for clarity. Our goal is to use |delete on the events so they are not searched by default.

The whys and wherefores of how the data came to be duplicated are not of interest here.

Does anyone have a straightforward search for removing duplicates (using |delete) that can take a large number of duplicate records and leave only one?

1 Solution

_d_
Splunk Employee

There is no straightforward and easy way to do certain things efficiently in a distributed environment. This is especially true for deduping, which by its very nature can prove complex and expensive. And it is precisely for this reason that the why and the how should definitely be discussed, so that the problem is solved rather than the symptom. Attacking the root cause may prevent this from happening again in the future. Having said that, here are your options:

a. Modify your searches to include a deduplication pipeline (perhaps a transaction with maxspan=1s) before your stats or other reporting commands. Wrap that in a macro and inject it into every search that operates on said data.
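A minimal sketch of option (a), using a hypothetical macro name (dedup_pipeline) and the transaction options suggested above:

```
# macros.conf (the macro name is illustrative)
[dedup_pipeline]
definition = transaction _raw maxspan=1s keepevicted=true

# injected into every search that operates on the affected data:
index=myindex sourcetype=mysourcetype `dedup_pipeline` | stats count by host
```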

b. Dedup your data and, while at it, create a summary index using only the fields that you're interested in. Run your searches on the summary.
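Option (b) might look something like the following, where dedup_summary and the listed fields are placeholders for your own:

```
index=myindex
| dedup _raw
| fields _time host status
| collect index=dedup_summary
```

Dashboards and reports would then search index=dedup_summary instead of the original index.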

c. Use | delete in the long and non-trivial way:
Each event in any Splunk deployment, distributed or not, can be uniquely identified by a combination (e.g., a concatenation) of the following fields: index, splunk_server and _cd. Let's call this field id.

  • Run a search that identifies all dupes and their respective ids. You can use transaction for this (or your own preferred method).
  • Put the ids in a multivalued field (the mvlist option in transaction).
  • For each transacted event, create a new field called delete_id which will contain the id values of all events to be deleted. This is every value of the field id except for one, and can be achieved by using mvindex.
  • Create a lookup table out of all delete_ids.
  • Run a new search over your data. Look for events whose ids match the lookup table's delete_ids and pipe them through delete.

Sample search to build the lookup table. Modify transaction options as necessary.

source="/tmp/dupes.txt"
| eval id=_cd."|".index."|".splunk_server
| transaction _raw maxspan=1s keepevicted=true mvlist=t
| search eventcount>1
| eval delete_id=mvindex(id, 1, -1)
| stats count by delete_id
| fields - count
| outputlookup dupes

Proof of concept:

  • oneshot a file full o'dupes:

# /opt/splunk/bin/splunk add oneshot /tmp/fullODupes.txt

  • run a search to both find and delete dupes:

source="/tmp/fullODupes.txt"
| eval id=_cd."|".index."|".splunk_server
| search
[ search source="/tmp/fullODupes.txt"
| eval id=_cd."|".index."|".splunk_server
| transaction _raw maxspan=1s keepevicted=true mvlist=t
| search eventcount>1
| eval delete_id=mvindex(id, 1, -1)
| stats count by delete_id
| fields - count
| return 20 id=delete_id]
| delete

Note that the search has not been tested with a large number of events. It may be susceptible to stats or return limits. A lookup table may be the best way to go about it.


efullman
Path Finder

This solution should delete every duplicate event. You will need one unique field in your index of events.

My solution was to do a subsearch that returns a deduped list of events, where the returned value is a unique field. Remember that you will need a user role with the delete capability to run the delete. Check your capabilities before you attempt this.

1) Run the search index=<your_index>. Record the number of events returned by the search. This is the count of events including the duplicates. Also note a day and time where you see duplicates.

2) Next run the search index=<your_index> ([search index=<your_index> | dedup <field_with_duplicates> | fields <unique_field>]). This will return your index less the duplicates. Check the results and make sure there are no duplicates where you noticed duplicates in Step 1. Record the number of events returned by the search.

If Step 2 returns an error because the subsearch failed, reduce the search results by changing the time range to a single day or some other reduction that doesn't cause an error. Repeat Step 1 with that reduction.

3) Subtract the count from Step 2 from the count from Step 1. Record this number. This is the number of duplicates.

4) Run the search from Step 2 with a VERY IMPORTANT addition (the NOT): index=<your_index> NOT ([search index=<your_index> | dedup <field_with_duplicates> | fields <unique_field>]). Check the number of events; this search should return the same number of events as the number you recorded in Step 3.

5) If the event count of Step 4 wasn't the same event count you recorded in Step 3, then STOP, you've done something wrong.

6) If the event count of Step 4 IS THE SAME EVENT COUNT YOU RECORDED IN Step 3, then you are ready to delete.

7) Final step... index=<your_index> NOT ([search index=<your_index> | dedup <field_with_duplicates> | fields <unique_field>]) | delete

8) Carefully clear this search once it runs. Leaving a search with |delete in the search box is super dangerous. This should report a deleted event count the same as the count from Step 3, the count of duplicates. That is what you deleted.

9) Final check... Run your original search index=<your_index>. The returned event count should be the same as the count from Step 2. This should be the deduped events.
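Putting the steps together with hypothetical names (web_logs for the index, request_id for the field with duplicates, event_guid for the unique field), the verification and deletion searches would look like:

```
Step 2 (verify, keeps only deduped events):
index=web_logs ([search index=web_logs | dedup request_id | fields event_guid])

Step 4 (shows only the duplicates that will be deleted):
index=web_logs NOT ([search index=web_logs | dedup request_id | fields event_guid])

Step 7 (the actual delete):
index=web_logs NOT ([search index=web_logs | dedup request_id | fields event_guid]) | delete
```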

drippler
Explorer

I found a great, simple way to do it using streamstats and _cd to get only the correct events. The trick is to first find the events and then use that as a subsearch to match those events, since you can't delete events after using streamstats.
Use at your own discretion. You should run the search without |delete first to verify that you get only what you want to delete.

Example:

<your search for relevant events> | eval eid=_cd | search [search <same search for relevant events> | streamstats count by _raw | search count>1 | eval eid=_cd | fields eid] | delete

Mayana
Engager

@drippler Your search worked for me as well. It's a pretty straightforward and simple one. Thanks!

Note: it deletes only 10k events at a time, so you may have to run the search multiple times.

 

index="Indexname" sourcetype="sourcetype" | eval eid=_cd | search [search index="Indexname" sourcetype="sourcetype"  | streamstats count by _raw  | search count>1 | eval eid=_cd | fields eid] |delete
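The 10k ceiling comes from the default subsearch result limit, which is configurable in limits.conf (raise it with care, as larger subsearches cost memory and time):

```
# limits.conf (defaults shown)
[subsearch]
maxout = 10000
maxtime = 60
```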

 


luhadia_aditya
Path Finder

Works like a charm. That's a cool hack, using streamstats and then search count>1.
Thanks for sharing!
Thanks for the share !

It's, in fact, a bit expensive because of the sub-search, but this is possibly the most straightforward way.


gisnetsec
Explorer

Thanks drippler, worked great for my situation.


rjthibod
Champion

Also worked for me, but I must note that the sub-search only allows a max of 10,000 returned events. Had to run it a couple of times in my case.


chrisboy68
Contributor

Hmm, I tried this and it's not working. Something wrong with my syntax? I did have to put in a second "search" after the "[" for the subquery to get past a Splunk error that was thrown ("Unknown search command index").

    index=myindex source=mysource Description=mydescription | eval eid=_cd | search [ search index=myindex source=mysource Description=mydescription  | streamstats count by _raw  | search count>1 | eval eid=_cd | fields eid]

Thank you!

Chris

drippler
Explorer

You're right about the inner "search" command. What did not work for you? Are you sure your events are duplicates, including their timestamp?


Deepdive
Engager

Perfect! Works for me. Thanks



narwhal
Splunk Employee

Thanks Brian! Is "fields | format" more efficient than "table"? I mean, why not just "table delete_id" in the subsearch? BTW, I made macros of this, but I'm pondering making an external program that would do it: display an overview of the search, ask for confirmation, then do the delete. Useful?
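For what it's worth, a subsearch implicitly applies format to its results, so ending the subsearch with table delete_id or with fields delete_id | format should produce an equivalent OR filter. For example:

```
| inputlookup dupes
| fields delete_id
| format
```

returns a single result whose search field looks like ( ( delete_id="..." ) OR ( delete_id="..." ) ).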


bwooden
Splunk Employee

5) remove delete capability (through UI or CLI) to prevent unauthorized or accidental deletions

bwooden
Splunk Employee

2) review what we're going to delete

| inputlookup delete_these.csv

3) enable delete capability (through UI or CLI)

4) Perform the delete by searching against the original data and eliminating the things we want to keep (i.e., search for only the items we want to delete using our list of unique event identifiers), then pipe that list to the delete command. Hint: try it without the delete command first!

index=* sourcetype=dupes
| eval delete_id=_cd."|".index."|".splunk_server
| search [|inputlookup delete_these.csv | fields delete_id | format "(" "(" "OR" ")" "OR" ")"]
| delete

knoldus001
New Member

Well, when _d_ explained it completely, why show it again? Almost every Splunk question here is not answered well. I think you know a lot; can you solve this one?
http://answers.splunk.com/answers/236068/duplicate-logs-are-coming-after-a-time-period.html


bwooden
Splunk Employee

1) Mark for delete.
Note that we create a tuple that uniquely identifies the event with eval id=...
We then group duplicate events with the transaction command and mark all but the first for deletion,
...because we want to keep one of the events.
Finally, we record the ones we've marked for deletion in a CSV file.

index=* sourcetype=dupes
| eval id=_cd."|".index."|".splunk_server
| transaction _raw maxspan=1s keepevicted=true mvlist=t
| search eventcount>1
| eval delete_id=mvindex(id, 1, -1)
| stats c by delete_id
| outputlookup delete_these.csv

bwooden
Splunk Employee

_d_ and watsm10 have it right. We must first build a list of events to delete and separately delete them. This is for a few reasons.

A) We want to verify what we're going to delete first
B) We can't delete after a non-streaming command (such as transforming commands)

So we want to take these high level steps:
1) mark for delete
2) review what we're going to delete
3) enable delete capability
4) delete
5) remove delete capability
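For reference, the capability in question is delete_by_keyword, which the built-in can_delete role grants; enabling and removing it amounts to assigning or unassigning that role. A sketch of the relevant authorize.conf entry:

```
# authorize.conf -- the built-in can_delete role grants delete_by_keyword
[role_can_delete]
delete_by_keyword = enabled
```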

watsm10
Communicator

I would like to say...
Thank you so much for this answer!!!

I used the following search to work out the duplicates:

index=
| streamstats dc(info_search_time) as count by _time
| where count!=1
| eval delete_id=_cd."|".index."|".splunk_server
| stats count by delete_id
| fields - count
| outputcsv dupes.csv

and the following search to delete them:

index=
| eval delete_id=_cd."|".index."|".splunk_server
| search [| inputcsv dupes.csv | format "(" "(" "OR" ")" "OR" ")"]
| delete
