Hello all,
I need to delete duplicated events, because one of my data sources sends duplicates. Each event has an "id" field and a "version" field, so I can identify the latest one in order to keep it and delete the others. I need this process to run automatically, for example every hour. Any suggestions?
Thanks in advance
I don't know if that would be wise. It certainly might have unintended consequences.
But, there are solutions.
Is there any way to fix the sending device to have it not do duplicates?
If that's not possible, then to "delete" the others, you could do one of a few things.
You could set up a summary index, possibly, and send only ... no, because they'd still be duplicated - you'd have to clean out that summary index regularly. Hmm.
You could build a lookup out of those, if they're not too big, and just overwrite it with only the 'best current values' every hour as a scheduled report doing an outputlookup at the end.
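A rough sketch of that lookup idea, assuming the index and sourcetype from the example further down and a made-up lookup file name (latest_versions.csv) and marker field (is_latest). The scheduled report would keep one row per id with its highest version and overwrite the lookup on each run:

```
index=foo sourcetype=bar
| stats max(version) as version by id
| eval is_latest=1
| outputlookup latest_versions.csv
```

Other searches could then keep only events whose id and version both match a row in the lookup:

```
index=foo sourcetype=bar
| lookup latest_versions.csv id version OUTPUT is_latest
| where is_latest=1
```

Keep in mind that outputlookup replaces the whole file, so the report's time range has to cover every id you still care about, not just the last hour.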
But the easiest option is probably to just work around it in your SPL, perhaps with a subsearch.
index=foo sourcetype=bar
[ search index=foo sourcetype=bar
| stats max(version) as version by id
| fields version id ]
If you haven't dealt with subsearches before, ... well, they're pretty useful at times.
The subsearch is inside the [] brackets, and *it runs first*. Once it completes, it returns its results back into the main search (formatted by default with () and AND and OR and whatnot). Then it's part of the main search's search terms.
Like this example for my little firewall. My APs run different versions of software (because I upgrade one of the two, and usually wait a few days before upgrading the other). If I wanted to only get records where each host was on its latest version, I could do the following:
index=fw
[ search index=fw
| stats max(host_version) as host_version by host
| fields host_version host ]
The subsearch runs and ends up returning a list like
( ( host="AP_Downstairs" AND host_version="v4.3.21.11325" ) OR ( host="AP_Upstairs" AND host_version="v4.3.21.11325" ) OR ( host="curie" ) )
THAT search then gets appended right into the main search, so your full search resolves down to
index=fw ( ( host="AP_Downstairs" AND host_version="v4.3.21.11325" ) OR ( host="AP_Upstairs" AND host_version="v4.3.21.11325" ) OR ( host="curie" ) )
If you ever need a different set of AND/OR/() or things grouped differently, there's a 'format' command you can use. It's a little obscure, but look at the examples: https://docs.splunk.com/Documentation/Splunk/8.0.6/SearchReference/Format
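To make that a bit more concrete, here is a minimal sketch of the same firewall subsearch with 'format' spelled out explicitly. The six quoted strings are, in order, the row prefix, column prefix, column separator, column end, row separator, and row end. The values shown are the defaults, so this should produce the same kind of output as the plain subsearch; change them only if you need a different grouping.

```
index=fw
    [ search index=fw
    | stats max(host_version) as host_version by host
    | fields host_version host
    | format "(" "(" "AND" ")" "OR" ")" ]
```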
If you search for Splunk subsearches, you'll find all sorts of help on them. Here's a good set of starting points:
The search tutorial's examples: https://docs.splunk.com/Documentation/Splunk/8.0.6/SearchTutorial/Useasubsearch
And about subsearches: https://docs.splunk.com/Documentation/Splunk/8.0.6/Search/Aboutsubsearches
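One caveat: subsearches are subject to result-count and runtime limits, so with a very large number of distinct ids the list that comes back can get truncated. A possible alternative that avoids the subsearch entirely is to tag every event with the highest version seen for its id and filter on that. This is only a sketch; latest_version is a field name made up for the example:

```
index=foo sourcetype=bar
| eventstats max(version) as latest_version by id
| where version=latest_version
```

Unlike stats, eventstats keeps the original events and just adds the new field to each one, which is why the where clause can still filter event by event.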
happy splunking!
-Rich
Are you saying the events for a given "id" field have different "version" field values? If so, then they are not really duplicate events.
If you still want to get rid of them then there may be other ways to do so besides | delete (which doesn't actually delete anything: it only marks the events so they no longer show up in searches, and it frees no disk space).
If there is a way to use a regular expression to identify the duplicate events then using a transform to send the unwanted events to nullQueue is better because you are not using license quota for events that will never be seen.
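A minimal sketch of what such a transform might look like, assuming the sourcetype is "bar" as in the earlier example. The stanza name and the REGEX value are placeholders made up for illustration; the real pattern depends entirely on what marks an event as unwanted. These settings belong on the instance that parses the data (an indexer or heavy forwarder).

```
# props.conf
[bar]
TRANSFORMS-drop_unwanted = drop_unwanted_events

# transforms.conf
[drop_unwanted_events]
# Hypothetical pattern: replace with a regex that matches only the unwanted events
REGEX = duplicate=true
DEST_KEY = queue
FORMAT = nullQueue
```

Note that a transform looks at each event in isolation, so this only helps if an unwanted event can be recognized from its own text. It cannot compare the "version" of one event against another.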
Failing that, then using delete may be the final option. Create a scheduled search that runs every few minutes, looks at the previous few minutes for duplicates, and deletes them. The scheduled search must be owned by a user that has the "can_delete" role. Do NOT use this user for any other activity or you risk other data being deleted. CAUTION: Here There Be Dragons. I doubt an auditor will approve of this procedure so use it only if you are not subject to audits.
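As a rough idea of what that scheduled search could look like, assuming the index and sourcetype from the earlier example, a 15-minute window, and a made-up field name (latest_version). Run it WITHOUT the final | delete first and check that the results are exactly the events you want gone:

```
index=foo sourcetype=bar earliest=-15m@m latest=@m
| eventstats max(version) as latest_version by id
| where version!=latest_version
| delete
```

This only catches older versions that fall inside the same time window as the newest one. If a newer version can arrive hours after the original, the window has to be wide enough to contain both events.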