I have a lot of data, including duplicates, and I want to remove the duplicate data from the index itself, without using the `dedup` command, since that only removes events at SEARCH time, not from the INDEX. Can somebody help me?
@jadengoho, are these duplicates old data, or will your data keep having duplicates in the future as well? If duplicates will keep arriving, what is the source/cause/frequency of the duplicate data?
It is daily log data, so the duplicates are a problem because they just keep stacking up.
If you can fix the data at ingestion time, that would be best. Otherwise, you can run a daily scheduled search (timed to run after the data is ingested) that lists all of the day's data with `dedup` and pushes it to a separate index.
Refer to Splunk Documentation: https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Collect#Moving_events_to_a_diffe...
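As a rough sketch of that scheduled search, something like the following could work. The index names here (`my_index`, `clean_index`) are placeholders, and deduplicating on `_raw` assumes the duplicates are byte-for-byte identical events; adjust the `dedup` fields to whatever actually identifies a duplicate in your data.

```
index=my_index earliest=-1d@d latest=@d
| dedup _raw
| collect index=clean_index
```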
PS:
You can use the `collect` command to do this; however, it seems like overhead to me unless the data is fixed prior to indexing.
You can also consider a scripted input to do this, in case there are no other means of preventing duplicate events from being indexed.
Note that with the `collect` command, if you define a sourcetype other than `stash`, it will count against your license.
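To illustrate the scripted-input idea, here is a minimal Python sketch that drops lines it has already emitted, so duplicates never reach the indexer. This is an assumption about how you might structure it, not a standard Splunk utility; a real scripted input would also persist the `seen` set to a state file between runs.

```python
import hashlib

def dedupe_lines(lines, seen=None):
    """Emit only lines whose hash has not been seen before.

    `seen` holds hashes of previously emitted lines; in a real
    scripted input you would load/save it between invocations.
    """
    if seen is None:
        seen = set()
    out = []
    for line in lines:
        digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(line)
    return out

# Duplicate log lines are dropped before they reach the indexer.
events = ["error A", "error B", "error A"]
print(dedupe_lines(events))  # ['error A', 'error B']
```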
I have the same problem. Do I need to use a script to fix this issue? If so, what kind of script should I use?
You will need to create a search which finds your duplicated data and returns all but the last copy (or the first, depending on your needs).
Once you are happy that your search correctly identifies ONLY the duplicated events, you can pipe the results to `| delete`, which will remove the data from the indexes.
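As a sketch of such a search (the index name is a placeholder, and grouping by `_raw` assumes exact duplicates), run it first WITHOUT the final `| delete` on a narrow time range to verify it matches only the copies you want gone:

```
index=my_index
| streamstats count by _raw
| where count > 1
| delete
```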
You will need to be a user with the 'can_delete' role - no user has this by default (not even admin), so you may need to add it to your user first. It's also a good idea to remove the capability when you have finished, to prevent accidents! (been there)
It's worth noting that this will not remove the data from disk - it simply marks the events as deleted in the buckets, so they won't be returned in future searches.