The issue: a file that is being monitored was ingested again via batch. The back story is not critical; we know what happened, and it shouldn't happen again. We now have duplicates in our index going back to 03/16/2021. The user of this data wants the duplicates removed. I have looked at solutions for removing duplicates, and with the amount of data involved, they would be very time consuming.
The user asks the question: can we remove all the data based on index, host, sourcetype and source and then reload the data?
My process would be (for each file being monitored)
1) Turn off monitoring of the file
2) Remove the matching data.
3) Turn back on monitoring of the file.
When monitoring is turned back on, will it ingest the entire file the first time it is updated?
I am open to other solutions to this as well.
Thank you!
Hi @jpashak,
first of all, when you say remove data, do you mean logically or physically deleting?
If a logical deletion is sufficient for you, you could run a simple search to find duplicates and use the delete command at the end.
In this way you logically (not physically) delete data from your index.
In other words, these events are marked as deleted and are no longer visible, but they remain in the index.
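As a minimal sketch (the index name and time range are placeholders; run it without the final delete first, to verify that only duplicates are returned):

index=your_index earliest=03/16/2021:00:00:00
| streamstats count AS dup_count BY _raw
| where dup_count > 1
| delete

This keeps the first copy of each _raw returned by the search and marks the rest as deleted; adjust the BY clause if your events are not exact duplicates.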
If you want a physical deletion, you would have to delete all the data and manually reindex all of it (I don't recommend this solution!).
About reindexing deleted data: by default Splunk doesn't permit reindexing already-indexed data, so you would have to index it manually or create a different temporary input.
In conclusion, you have three choices: logical deletion with the delete command, physical deletion with manual reindexing, or a temporary input to reindex the data.
In my opinion the best solution is the first approach.
Tell me if you need help with deleting the duplicated events.
Ciao.
Giuseppe
The delete command would be sufficient for what we are doing. Marked as deleted but not removed from the index would be perfect.
I have found a couple different solutions for deleting duplicate events.
The one I've used for a test uses the streamstats command, which has a limit of 10,000 events, so it would need multiple runs to clear all the data I need "deleted".
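For reference, I believe that 10,000-event window comes from the max_stream_window setting in limits.conf, which an admin could raise (the value below is just an example, not a recommendation):

[streamstats]
max_stream_window = 50000

Raising it trades memory for fewer passes, so it may still be safer to run the deletion in chunks.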
I would be interested in getting help with deleting the duplicated events.
Thank you.
Jeff
Hi @jpashak,
the main work is to identify the duplicated content: if the whole event is duplicated, you can simply run something like this:
index=your_index
| stats earliest(_time) AS earliest latest(_time) AS latest count BY _raw
| where count>1
in this way you can identify the exact period with duplicated events and then run the delete command.
Otherwise, if e.g. the whole event is duplicated except the timestamp, you have to identify the fields to use to find the duplicated events.
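For example, assuming the events differ only in _time and that host, source, and message together identify an event (message here is a placeholder for whatever fields fit your data), you could adapt the search above like this:

index=your_index
| stats earliest(_time) AS earliest latest(_time) AS latest count BY host source message
| where count>1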
Ciao.
Giuseppe
It depends on the nature of the data, which you know best. If the duplicate events were ingested after a certain time then consider running a search that finds all events from that source and filters out the "old" ones.
index=foo source=bar
| eval indextime=_indextime
| where indextime > <<time of accidental ingest>>
This could be a job for the delete command. It depends on if you need the duplicate data physically removed or just hidden from searches.
To hide the data, run a search to locate *only* the duplicate data. When you're satisfied that only dups are in the results, append | delete to the search to prevent those events from appearing in search results in the future. An admin will have to give you the can_delete role so you can perform that command - it is not available otherwise.
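Putting those steps together (the index, source, and epoch time below are placeholder values, not the real ones; run everything before the final command on its own first to confirm only duplicates come back):

index=foo source=bar
| eval indextime=_indextime
| where indextime > 1616000000
| delete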
Physically removing the data is another matter. The only way to do that is by index: delete the index, then re-onboard the data, assuming it's still available. Note that turning off monitoring of a file will not cause it to be re-indexed; monitoring will continue from where it left off. You can delete the fishbucket to force a file to be re-indexed, but that may not get you data from 2021.
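If you do re-onboard, one option is a one-shot ingest from the CLI rather than re-enabling the monitor input (the path, sourcetype, and index below are placeholders; verify the syntax against your Splunk version):

splunk add oneshot /path/to/file.log -sourcetype your_sourcetype -index your_index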