The issue: a file that is being monitored was ingested again via batch. The back story is not critical; we know what happened, and it shouldn't happen again. We now have duplicates in our index going back to 03/16/2021. The user of this data wants the duplicates removed. I have looked at solutions for removing duplicates, and with the amount of data involved, they would be very time consuming.
The user asks the question: can we remove all the data based on index, host, sourcetype and source and then reload the data?
My process would be (for each file being monitored)
1) Turn off monitoring of the file
2) Remove the matching data.
3) Turn back on monitoring of the file.
When monitoring is turned back on, will it ingest the entire file the first time it is updated?
I am open to other solutions to this as well.
Thank you!
Hi @jpashak,
first of all, when you say remove data, do you mean logically or physically deleting?
If a logical deletion is sufficient for you, you could run a simple search to find duplicates and use the delete command at the end.
In this way you logically (not physically) delete data from your index.
In other words, these events are marked as deleted and are no longer visible, but they remain in the index.
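As a minimal sketch (the index name and time range are placeholders; run it without the final delete first, to verify that only duplicates are returned):

index=your_index earliest=03/16/2021:00:00:00
| streamstats count AS dup_count BY _raw
| where dup_count > 1
| delete

This keeps the first copy of each _raw returned by the search and marks the rest as deleted; adjust the BY clause if your events are not exact duplicates.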
If you want a physical deletion, you would have to delete all the data and manually reindex all of it (I don't recommend this solution!).
About reindexing deleted data: by default Splunk doesn't permit reindexing already-indexed data, so you would have to index it manually or create a different temporary input.
In conclusion, you have three choices: logical deletion with the delete command, physical deletion with manual reindexing, or a temporary input to reindex the data.
In my opinion the best solution is the first approach.
Tell me if you need help with deleting the duplicated events.
Ciao.
Giuseppe
The delete command would be sufficient for what we are doing. Marked as deleted but not removed from the index would be perfect.
I have found a couple different solutions for deleting duplicate events.
The one I've used for a test uses the streamstats command, which has a limit of 10,000 events, so it would need multiple runs to clear all the data I need "deleted".
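For reference, I believe that 10,000-event window comes from the max_stream_window setting in limits.conf, which an admin could raise (the value below is just an example, not a recommendation):

[streamstats]
max_stream_window = 50000

Raising it trades memory for fewer passes, so it may still be safer to run the deletion in chunks.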
I would be interested in getting help with deleting the duplicated events.
Thank you.
Jeff
Hi @jpashak,
the main work is to identify the duplicated content: if the whole event is duplicated, you can simply run something like this:
index=your_index
| stats earliest(_time) AS earliest latest(_time) AS latest count BY _raw
| where count>1
in this way you can identify the exact period with duplicated events and then run the delete command.
Otherwise, if e.g. the whole event is duplicated except the timestamp, you have to identify the fields to use to find the duplicated events.
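For example, assuming the events differ only in _time and that host, source, and message together identify an event (message here is a placeholder for whatever fields fit your data), you could adapt the search above like this:

index=your_index
| stats earliest(_time) AS earliest latest(_time) AS latest count BY host source message
| where count>1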
Ciao.
Giuseppe
It depends on the nature of the data, which you know best. If the duplicate events were ingested after a certain time then consider running a search that finds all events from that source and filters out the "old" ones.
index=foo source=bar
| eval indextime=_indextime
| where indextime > <<time of accidental ingest>>
This could be a job for the delete command. It depends on if you need the duplicate data physically removed or just hidden from searches.
To hide the data, run a search to locate *only* the duplicate data. When you're satisfied that only dups are in the results, append | delete to the search to prevent those events from appearing in search results in the future. An admin will have to give you the can_delete role so you can perform that command - it is not available otherwise.
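Putting those steps together (the index, source, and epoch time below are placeholder values, not the real ones; run everything before the final command on its own first to confirm only duplicates come back):

index=foo source=bar
| eval indextime=_indextime
| where indextime > 1616000000
| delete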
Physically removing the data is another matter. The only way to do that is by index: delete the index, then re-onboard the data, assuming it's still available. Note that turning off monitoring of a file will not cause it to be re-indexed; monitoring will continue from where it left off. You can delete the fishbucket to force a file to be re-indexed, but that may not get you data from 2021.
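If you do re-onboard, one option is a one-shot ingest from the CLI rather than re-enabling the monitor input (the path, sourcetype, and index below are placeholders; verify the syntax against your Splunk version):

splunk add oneshot /path/to/file.log -sourcetype your_sourcetype -index your_index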