Duplicate Data Being indexed- How can I confirm how big my data duplication problem might be?

Dmikos1271 · 04-26-2023

I'm trying to validate if we have a large amount of data duplication.

Whenever I run dedup _raw, the number of results is typically halved, but given our on-prem setup I don't see how the data could be ingested twice (our typical data flow is syslog to a heavy forwarder (HF), which then sends to an indexer cluster). If I run index=sample | transaction _raw, I only get a small number of results where the raw events are combined.

How can I confirm how big my data duplication problem might be?

gcusello · 04-27-2023

Hi @Dmikos1271,

I'd run a simple search like the following:

index=*
| stats values(index) AS index values(sourcetype) AS sourcetype values(host) AS host values(source) AS source count BY _raw
| where count>1

This way you can find the scope of your issue and then review your configuration.
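To put a number on the duplication, a variation along these lines might help. It is only a sketch: it assumes the index=sample from the question and a bounded time range, since grouping by _raw is expensive on large indexes. The field names total_events, distinct_events, duplicate_events and duplicate_pct are made up for illustration.

```
index=sample
| stats count BY _raw
| stats sum(count) AS total_events count AS distinct_events
| eval duplicate_events = total_events - distinct_events
| eval duplicate_pct = round(100 * duplicate_events / total_events, 2)
```

The first stats collapses identical raw events into one row each, and the second compares the number of rows with the original event total. Events that are legitimately identical (for example, repeated messages with the same timestamp) are counted as duplicates too, so the result is best read as an upper bound.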

Usually this issue is caused by one of these misconfigurations:

a Universal Forwarder is installed on both nodes of an active/active cluster and both send the same data,
the logs are in rotating files and the input has the option "crcSalt = <SOURCE>",
a syslog data flow is sent to more than one receiver without using a load balancer.
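To narrow down which of these cases applies, one option is to check how many hosts, sources and indexers each duplicated raw event comes from. The following is a rough sketch, again assuming index=sample; the output field names (copies, hosts, sources, indexers, duplicated_raws) are illustrative only.

```
index=sample
| stats count AS copies dc(host) AS hosts dc(source) AS sources dc(splunk_server) AS indexers BY _raw
| where copies>1
| stats count AS duplicated_raws BY hosts sources indexers
```

As a rough guide, duplicates that share the same host but have more than one source may point at rotated files being re-read, while more than one host for the same raw event may point at two forwarders or two syslog receivers sending the same data. Treat this as a starting hint to check against the actual inputs, not as a definitive diagnosis.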

Ciao.

Giuseppe
