Duplicate data being indexed: how can I confirm how big my data duplication problem might be?

Dmikos1271
Explorer

I'm trying to validate whether we have a large amount of data duplication.

Whenever I run dedup _raw, the number of results is typically halved, but given our on-prem setup I don't see how the data could be ingested twice (our typical data flow is syslog to a HF, which forwards to an indexer cluster). If I run index=sample | transaction _raw, I only get a small number of results where the raw events are combined.

How can I confirm how big my data duplication problem might be?


gcusello
SplunkTrust

Hi @Dmikos1271,

I'd run a simple search like the following:

index=*
| stats values(index) AS index values(sourcetype) AS sourcetype values(host) AS host values(source) AS source count BY _raw
| where count>1

In this way you can find the scope of your issue (which indexes, sourcetypes, hosts, and sources are affected) and then analyze your configuration.
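To put a number on the duplication, a variation along these lines might help. It is only a sketch: index=sample is taken from the question, and the field names total_events, distinct_events, duplicate_events, and duplicate_pct are made up for illustration. It treats any two events with an identical _raw as duplicates, so legitimately repeated messages would be counted too.

```spl
index=sample
| stats count BY _raw
| stats sum(count) AS total_events count AS distinct_events
| eval duplicate_events = total_events - distinct_events
| eval duplicate_pct = round(100 * duplicate_events / total_events, 2)
```

Running it over a short, fixed time range first keeps the grouping by _raw manageable.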

Usually this issue is caused by one of these misconfigurations:

  • a Universal Forwarder is installed on both nodes of an active/active cluster and both send the same data,
  • the logs are in rotating files and the input uses the option "crcSalt = <SOURCE>",
  • a syslog data flow is sent to more than one receiver without using a load balancer.
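To narrow down which of these causes applies, one option is to check whether the copies of a duplicated event differ in host or source. This is a rough sketch, not a definitive diagnosis: index=sample comes from the question, and the names index_time, hosts, sources, index_times, pattern, and duplicated_raw_values are hypothetical. The mapping from pattern to cause is an assumption to verify against your own inputs.

```spl
index=sample
| eval index_time = _indextime
| stats count dc(host) AS hosts dc(source) AS sources dc(index_time) AS index_times BY _raw
| where count > 1
| eval pattern = case(sources > 1, "different sources", hosts > 1, "different hosts", true(), "same host and source")
| stats count AS duplicated_raw_values BY pattern
```

Copies arriving from different sources would point toward rotated files being re-read, while copies with the same host and source are more consistent with the same stream reaching more than one receiver.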

Ciao.

Giuseppe
