I have a lot of data, including duplicates, and I want to remove the duplicate data from the index itself, without using the `dedup` command, since that only removes events at SEARCH time, not from the INDEX. Can somebody help me?
@jadengoho, are these duplicates old data, or will your data keep having duplicates in the future as well? If duplicates will keep arriving, what is the source/cause/frequency of the duplicate data?
It is daily log data, so the duplicates are a problem because they just keep stacking up.
If you can fix the data at ingestion time, that would be best. Otherwise, you can run a daily scheduled search (timed to run after the data is ingested) that lists all of the day's data with `dedup` and pushes it to a separate index.
Refer to Splunk Documentation: https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Collect#Moving_events_to_a_diffe...
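As a rough sketch of that scheduled search, something like the following could work. The index names here (`my_index`, `clean_index`) are placeholders, and deduplicating on `_raw` assumes the duplicates are byte-for-byte identical events; adjust the `dedup` fields to whatever actually identifies a duplicate in your data.

```
index=my_index earliest=-1d@d latest=@d
| dedup _raw
| collect index=clean_index
```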
PS:
You can use the `collect` command to do this; however, it seems like overhead to me unless the data is fixed prior to indexing.
You can also consider a scripted input to do this, in case there are no other means of preventing duplicate events from being indexed.
Note that with the `collect` command, if you define a sourcetype other than `stash`, it will count against your license.
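To illustrate the scripted-input idea, here is a minimal Python sketch that drops lines it has already emitted, so duplicates never reach the indexer. This is an assumption about how you might structure it, not a standard Splunk utility; a real scripted input would also persist the `seen` set to a state file between runs.

```python
import hashlib

def dedupe_lines(lines, seen=None):
    """Emit only lines whose hash has not been seen before.

    `seen` holds hashes of previously emitted lines; in a real
    scripted input you would load/save it between invocations.
    """
    if seen is None:
        seen = set()
    out = []
    for line in lines:
        digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(line)
    return out

# Duplicate log lines are dropped before they reach the indexer.
events = ["error A", "error B", "error A"]
print(dedupe_lines(events))  # ['error A', 'error B']
```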
I have the same problem. Do I need to use a script to fix this issue? If so, what kind of script should I use?
You will need to create a search which finds your duplicated data and returns all but the last copy (or the first, depending on your needs).
Once you are happy that your search correctly identifies ONLY the duplicated events, you can pipe the results to `| delete`, which will remove the data from the indexes.
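As a sketch of such a search (the index name is a placeholder, and grouping by `_raw` assumes exact duplicates), run it first WITHOUT the final `| delete` on a narrow time range to verify it matches only the copies you want gone:

```
index=my_index
| streamstats count by _raw
| where count > 1
| delete
```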
You will need to be a user with the 'can_delete' role - no user has this by default (not even admin), so you may need to add it to your user first. It's also a good idea to remove the capability when you have finished, to prevent accidents! (been there)
It's worth noting that this will not remove the data from disk - it simply marks the events as deleted in the buckets, so they won't be returned in future searches.