Getting Data In

Duplicate data being indexed: how can I confirm how big my data duplication problem might be?

Dmikos1271
Explorer

I'm trying to validate if we have a large amount of data duplication.

Whenever I run the dedup _raw command, the number of results is typically halved, but based on our on-prem setup I don't see how the data could be ingested twice (our typical data flow is syslog to a HF, which then forwards to an indexer cluster). If I run index=sample | transaction _raw, I only get a small number of results where the raw events are combined.

How can I confirm how big my data duplication problem might be?


gcusello
SplunkTrust

Hi @Dmikos1271,

I'd run a simple search like the following:

index=*
| stats values(index) AS index values(sourcetype) AS sourcetype values(host) AS host values(source) AS source count BY _raw
| where count>1

This way you can find the scope of your issue and then analyze your configuration.
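To put a number on the duplication, a minimal sketch along the following lines may help. It assumes the index name sample from the question and a time range small enough that grouping by _raw is affordable. The output field names (total_events, unique_events, duplicate_events, duplicate_pct) are only illustrative.

```
index=sample
| stats count BY _raw
| stats sum(count) AS total_events count AS unique_events
| eval duplicate_events = total_events - unique_events
| eval duplicate_pct = round(100 * duplicate_events / total_events, 2)
```

If duplicate_pct comes out close to 50, that would match the halving seen with dedup _raw and suggest that most events are indexed exactly twice.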

Usually this issue is caused by one of these misconfigurations:

  • a Universal Forwarder is installed on both nodes of an active/active cluster and both send the same data,
  • logs are in rotating files and the input has the option "crcSalt = <SOURCE>",
  • a syslog data flow is sent to more than one receiver without using a load balancer.
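To narrow down which of these causes applies, it may help to check where and when the copies of each event were indexed. The search below is only a sketch. It assumes the index name sample from the question, and the names indextime, sources, hosts, indexers, and index_times are illustrative.

```
index=sample
| eval indextime=_indextime
| stats count dc(source) AS sources dc(host) AS hosts dc(splunk_server) AS indexers dc(indextime) AS index_times values(source) AS source values(host) AS host BY _raw
| where count>1
| sort - count
```

As a rough guide, copies that differ in source could point to rotated files being re-read, while copies with the same source but different index times or indexers could point to the same stream arriving through more than one path. Treat these as hints to check against your inputs and outputs configuration rather than as a diagnosis.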

Ciao.

Giuseppe
