I'm trying to validate if we have a large amount of data duplication.
Whenever I run the dedup _raw command, the number of results is typically halved, but based on our on-prem setup I don't see how the data could be ingested twice (our typical data flow is syslog to a HF, which then forwards to an indexer cluster). If I run index=sample | transaction _raw, I only get a small number of results where the raw events are combined.
How can I confirm how big my data duplication problem might be?
Hi @Dmikos1271,
I'd run a simple search like the following:
index=*
| stats values(index) AS index values(sourcetype) AS sourcetype values(host) AS host values(source) AS source count BY _raw
| where count>1
This way you can find the scope of your issue (which indexes, sourcetypes, hosts and sources contain events with identical _raw) and then analyze your configuration.
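To put a number on it, you could also try something like the following. It's only a sketch: index=sample is taken from your question, and raw_hash, total_events, unique_events, duplicate_events and duplicate_pct are just names I made up for illustration. It hashes _raw with md5() so that stats doesn't have to group on the full event text, then compares the total event count with the number of distinct events:

```
index=sample
| eval raw_hash=md5(_raw)
| stats count BY raw_hash
| stats sum(count) AS total_events count AS unique_events
| eval duplicate_events=total_events-unique_events
| eval duplicate_pct=round(100*duplicate_events/total_events, 2)
```

Run it over a limited time range first, because grouping by every distinct event can be expensive. Also keep in mind that identical _raw doesn't always mean duplicated ingestion: some sources legitimately send the same message more than once within the same second.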
Usually this issue is caused by one of these misconfigurations:
Ciao.
Giuseppe