Some things are simply hard to do efficiently in a distributed environment, and deduping, by its very nature, is one of them: it can prove complex and expensive. That is precisely why the how and the why of the duplicates should be discussed first, so that the problem gets solved instead of the symptom; attacking the root cause may prevent this from happening again. Having said that, here are your options:
a. Modify your searches to include a deduplication pipeline (perhaps a transaction with maxspan=1s) before your stats or other reporting commands. Wrap that in a macro and inject it into every search that operates on said data; a minimal sketch follows.
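For illustration, a macros.conf stanza along these lines would do (the macro name dedup_raw is made up; the transaction options mirror the sample search further down):

[dedup_raw]
definition = transaction _raw maxspan=1s keepevicted=true

A search would then read something like:

source="/tmp/dupes.txt" `dedup_raw` | stats count by host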
b. Dedup your data and, while at it, create a summary index using only the fields that you're interested in, then run your searches on the summary. A rough sketch is shown below.
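Something like this could populate the summary (untested; the index name dupes_summary must already exist, and field_a/field_b stand in for whatever fields you actually care about):

source="/tmp/dupes.txt"
| dedup _raw
| table _time host source field_a field_b
| collect index=dupes_summary

Your reporting searches would then run against index=dupes_summary.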
c. Use | delete in the long and non-trivial way:
Each event in any Splunk deployment, distributed or not, can be uniquely identified by a combination (e.g. a concatenation) of the following fields: index, splunk_server and _cd. Let's call this field id.
Run a search that identifies all dupes and their respective ids. You can use transaction for this (or your own preferred method).
Put the ids in a multivalued field (the mvlist option in transaction).
For each transacted event, create a new field called delete_id holding the id values of all events to be deleted, i.e. every value of the field id except for one; mvindex can do that.
Create a lookup table out of all delete_ids.
Run a new search over your data. Look for events whose ids match the lookup table's delete_ids and pipe them through delete.
Sample search to build the lookup table. Modify transaction options as necessary.
source="/tmp/dupes.txt"
| eval id=_cd."|".index."|".splunk_server
| transaction _raw maxspan=1s keepevicted=true mvlist=t
| search eventcount>1
| eval delete_id=mvindex(id, 1, -1)
| stats count by delete_id
| fields - count
| outputlookup dupes
Proof of concept:
oneshot a file full o'dupes:
# /opt/splunk/bin/splunk add oneshot /tmp/fullODupes.txt
run a search to both find and delete the dupes:
source="/tmp/fullODupes.txt"
| eval id=_cd."|".index."|".splunk_server
| search
[ search source="/tmp/fullODupes.txt"
| eval id=_cd."|".index."|".splunk_server
| transaction _raw maxspan=1s keepevicted=true mvlist=t
| search eventcount>1
| eval delete_id=mvindex(id, 1, -1)
| stats count by delete_id
| fields - count
| return 20 id=delete_id]
| delete
Note that the search has not been tested with a large number of events. It may be susceptible to stats or return limits. A lookup table may be the best way to go about it; a sketch of that variant follows.
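For completeness, the lookup-driven variant could look something like this (equally untested; it reads the dupes lookup built earlier, and remember that | delete requires the can_delete role):

source="/tmp/dupes.txt"
| eval id=_cd."|".index."|".splunk_server
| search [ | inputlookup dupes | rename delete_id AS id | fields id ]
| delete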