My team has duplicate events in our index (~600 GB). We have fixed the source of the duplicates and now need to remove the existing duplicates from the index.
What are the best practices for managing duplicates over a large index? So far we've explored two options:

- Create a summary index with duplicates removed. It's a large compute load to run this deduplication job and populate a new index all at once; how can we do this efficiently and prevent our job from auto-cancelling (see the first sketch below)?
  - We would also like to be able to update the new index from the one containing duplicates on ingest. Are there best practices for doing this reliably (second sketch below)?
- Delete duplicate events from the current index. This is less attractive due to the permanent deletion (third sketch below).
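For the summary-index route (assuming Splunk, since "summary index" is Splunk terminology), the usual way to avoid one monster job is to backfill in small time windows, e.g. a day or an hour at a time, so no single search runs long enough to hit search quotas or get auto-cancelled. A minimal sketch; index=main, the dedup key _raw, and the target index main_dedup are illustrative assumptions:

    index=main earliest=-90d@d latest=-89d@d  ``` one-day backfill window ```
    | dedup _raw                              ``` keep one copy of each identical raw event ```
    | collect index=main_dedup                ``` write the survivors to the new index ```

Splunk ships a backfill script ($SPLUNK_HOME/bin/fill_summary_index.py) that can drive a saved search like this across a date range in exactly these chunks. If your duplicates are not byte-identical, dedup on the fields that define "the same event" rather than _raw. Note that events written via collect arrive with sourcetype=stash by default and typically do not count against license.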
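For keeping the new index current on ingest, a common pattern is a scheduled saved search that dedups a recent, non-overlapping window and collects it into the new index. A sketch assuming a 5-minute schedule with a 10-minute lag for late-arriving events (again, the names and window sizes are illustrative):

    index=main earliest=-15m@m latest=-10m@m  ``` fixed 5-minute slice, lagged 10 minutes ```
    | dedup _raw
    | collect index=main_dedup

Reliability here comes from the windows tiling exactly (no gaps, no overlap) and the lag being longer than your worst-case indexing delay; anything that arrives later than the lag will be missed, so measure indexing latency before picking the value.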
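If you do decide to delete in place, be aware that Splunk's delete command only masks events from search; it does not reclaim disk space. A community-circulated sketch that keeps the first copy of each event and deletes the rest (requires the can_delete role; run it chunked by time, since streamstats over ~600 GB in a single search is itself a heavy job):

    index=main
    | streamstats count AS dupcount BY _raw   ``` number each repeat of an identical event ```
    | search dupcount > 1                     ``` keep only the 2nd, 3rd, ... copies ```
    | delete                                  ``` mark those copies unsearchable ```

Run the search without the final | delete first to confirm it matches only the duplicates you expect.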