I have found duplicates in the search results: identical events from the same host and the same source (file) with exactly the same timestamp. Sometimes there are even more copies, as many as five.
This is extremely annoying and skews the statistics we gather on how many times certain functionality gets invoked in our environment.
I have seen the problem on multiple sources (files) and hosts.
There is a great app/python script named remove-duplicate-event-data-from-index
by zpavic that identifies and removes duplicate events -
it helped me clean my indexes.
You can also see this answer on how to use the script for specific dates:
To find these events, you can run the following search
...|eventstats count as duplicate by _raw host _time | where duplicate>1
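Conceptually, that eventstats counts group sizes over the (_raw, host, _time) triple and annotates every event with its group's count. A minimal Python sketch of the semantics (sample events are hypothetical, not Splunk internals):

```python
from collections import Counter

# Hypothetical sample events; in Splunk each event carries _raw, host, _time.
events = [
    {"_raw": "GET /api/foo 200", "host": "web01", "_time": 1700000000},
    {"_raw": "GET /api/foo 200", "host": "web01", "_time": 1700000000},  # duplicate
    {"_raw": "GET /api/bar 500", "host": "web02", "_time": 1700000001},
]

# eventstats count as duplicate by _raw host _time: annotate each event
# with the size of its (_raw, host, _time) group.
counts = Counter((e["_raw"], e["host"], e["_time"]) for e in events)
for e in events:
    e["duplicate"] = counts[(e["_raw"], e["host"], e["_time"])]

# where duplicate>1: keep only events that have at least one extra copy.
dupes = [e for e in events if e["duplicate"] > 1]
print(len(dupes))  # 2
```

Unlike stats, eventstats keeps every event and just adds the count as a new field, which is why the where clause can then filter on it.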
As a temporary measure you can remove the duplicates from each search with the dedup command
...| dedup _raw host _time
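dedup keeps only the first event it encounters for each combination of the listed fields, in the search's result order. A hedged Python sketch of that behaviour (field names and events are illustrative):

```python
def dedup(events, keys):
    """Keep only the first event seen for each combination of key fields,
    mirroring the semantics of Splunk's dedup command."""
    seen = set()
    out = []
    for e in events:
        k = tuple(e[f] for f in keys)
        if k not in seen:
            seen.add(k)
            out.append(e)
    return out

events = [
    {"_raw": "login ok", "host": "app01", "_time": 1700000100},
    {"_raw": "login ok", "host": "app01", "_time": 1700000100},  # exact copy
    {"_raw": "login ok", "host": "app02", "_time": 1700000100},  # different host, kept
]
print(len(dedup(events, ["_raw", "host", "_time"])))  # 2
```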
BUT this is inefficient, so you need to find and fix the root cause of the duplicates. If you have multiple indexers, check whether the same data is being forwarded to more than one; look for near-duplicate files; avoid using crcSalt in inputs.conf, etc.
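On the crcSalt point: setting `crcSalt = <SOURCE>` makes Splunk treat a renamed or rotated copy of a file as a brand-new file and re-index it in full, which is a common source of duplicates. A sketch of a monitor stanza (paths and names are hypothetical):

```ini
# Hypothetical monitor stanza in inputs.conf.
[monitor:///var/log/myapp/app.log]
index = main
sourcetype = myapp
# crcSalt = <SOURCE>   <- leave this out unless you have many small files
#                         with identical headers; with it, rotated copies
#                         get re-indexed in full, producing duplicates.
```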
Once you have eliminated the cause, get rid of the existing duplicates using the following search
* | eventstats count as duplicates first(_cd) as cd by _raw host _time | where cd!=_cd
I have deliberately not appended the delete to the above search, as it is good practice to check the data before deleting it. Confirm it is only bringing back the duplicates and not the originals, then pipe to delete. You will need to temporarily add the can_delete role to your account for this to work.
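The _cd field is Splunk's internal address for an event's stored copy, so duplicates share _raw/host/_time but carry different _cd values; keeping first(_cd) per group and discarding everything else selects every copy except one. The selection logic, sketched in Python (events and _cd values are made up for illustration):

```python
# Duplicates share _raw/host/_time but have distinct _cd values.
events = [
    {"_raw": "job done", "host": "w1", "_time": 1, "_cd": "2:100"},
    {"_raw": "job done", "host": "w1", "_time": 1, "_cd": "2:101"},  # copy
    {"_raw": "job done", "host": "w1", "_time": 1, "_cd": "2:102"},  # copy
    {"_raw": "other",    "host": "w2", "_time": 2, "_cd": "2:103"},
]

# first(_cd) by _raw host _time: remember the first _cd seen per group.
first_cd = {}
for e in events:
    key = (e["_raw"], e["host"], e["_time"])
    first_cd.setdefault(key, e["_cd"])

# where cd!=_cd: anything that is not the first copy is a deletion candidate.
to_delete = [e for e in events
             if e["_cd"] != first_cd[(e["_raw"], e["host"], e["_time"])]]
print(len(to_delete))  # 2
```

Note how the original event in each group survives untouched; only the extra copies end up in the deletion set.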
Sounds like there's a problem with a data input, possibly one that is monitoring a file or directory and believes, for some reason, that the entire file has changed. Is there any common pattern to the files or directories where it's occurring?
As simeon says, you can use the dedup command to mask the symptom, but the root cause should be fixable.