My team has duplicate events in our index (~600 GB). We have fixed the duplicate source and now need to remove the existing duplicates from the index.
What are the best practices for managing duplicates over a large index?
So far we've explored two options:
- Create a summary index with duplicates removed
  - It's a large compute load to run this deduplication job and populate a new index all at once. How can we do this efficiently and prevent our job from auto-cancelling?
  - We would like to be able to update the new index from the one containing duplicates on ingest. Are there best practices for doing this reliably?
- Delete duplicate events from the current index
  - This is less attractive, due to permanent deletion.
Hi @brayps,
in Splunk you cannot physically delete an event, only logically, using the delete command.
When an event is marked as deleted it isn't possible to access it again, but it remains in the index until the bucket is removed at the end of its life cycle.
So my hint is to identify the source of the duplicated events and stop the duplication at its origin.
Then you could save the non-duplicated events in a summary index and use it for your searches (cleaning the original index), but this is a long job because you have to modify all your dashboards to use the new index instead of the old one.
It could be a faster job if you used eventtypes instead of indexes in the main search.
So in my opinion, the easiest way is to remove the duplicated source as soon as possible and leave the duplicated data (possibly marked as deleted) in the buckets until the retention period is over.
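As a starting point for "identify the source of the duplicated events", a search like the following can show which sources contain exact duplicates. This is a sketch; `index=main` is a placeholder for your actual index:

```spl
index=main
| eval event_hash=md5(_raw)
| stats count, earliest(_time) as first_seen by event_hash, source
| where count > 1
```

Hashing `_raw` with `md5` groups byte-identical events, so any `event_hash` with `count > 1` is a duplicated event, and the `source` column points at where the duplication comes from.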
About the other questions: in my opinion there's no reason to "Create a summary index with duplicates removed", because if the events are duplicated, you don't need the extra copies.
For the second question: when you run the search that populates the summary index, send it to run in background mode so it will not be auto-cancelled.
About best practices for managing duplicates, there aren't any: the real best practice is not to create them.
Ciao.
Giuseppe
I know it's a bit of an "I told you so" type of remark, but that's why you should perform your onboarding process properly: if possible, test in a dev instance, ingest into a temporary test index, and only at the end, when everything works OK, create the final configuration ingesting the data into the proper destination index.
So the best practice is not for "managing duplicates" but the best practice is to not let the duplicates appear in the first place.
But anyway.
As @gcusello already mentioned, there is no way to physically delete data from the index. You could use the delete command to mark selected events as inaccessible (so that they will not show in search results), but it will not affect your storage in any way. Also, the delete command is considered risky (and rightly so), so it requires a special capability to execute.
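For illustration, marking events as deleted looks like the sketch below; the source path is hypothetical, and the search must be run by a user with the `can_delete` capability. Note that `delete` marks every event the search returns, so there is no easy way to keep one copy of each duplicate pair with this command alone:

```spl
index=main source="/var/log/duplicated_feed.log" earliest=-30d@d latest=now
| delete
```

Run the search without `| delete` first and verify it matches only the events you intend to hide, because the operation cannot be undone.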
Of course you can create a summary index with just a single copy of the duplicated events, but be aware that it will break your searches pointing to the original index, and possibly datamodel definitions and many other things (ok, maybe not "break", but render them useless). If your search times out, you can divide it into smaller chunks by limiting the time range. Also remember that your summary index events will have the "stash" sourcetype, or you will incur additional license usage.
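The "divide it into smaller chunks" approach can be sketched as one `collect` search per time window; the index and sourcetype names here are placeholders, and the target summary index must already exist:

```spl
index=main sourcetype=my_logs earliest=-30d@d latest=-29d@d
| dedup _raw
| collect index=deduped_summary
```

Advance `earliest`/`latest` one day at a time (manually or via a scheduled backfill) until the whole 600 GB range is covered; `collect` writes with the "stash" sourcetype by default, which is what keeps it from counting against license.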
But the question is if you should bother at all. Maybe it would be easier to just account for the fact that you have "double data" for some time period and just adjust your results accordingly until the data rolls to frozen. Or maybe it would be easier to re-ingest the data if you have the "source material" available.