Duplicate data being indexed: how can I confirm how big my data duplication problem might be?

Dmikos1271
Explorer

I'm trying to validate whether we have a large amount of data duplication.

Whenever I run the dedup _raw command, the number of results is typically halved, but based on our on-prem setup I don't see how the data could be ingested twice (our typical data flow is syslog to a heavy forwarder (HF), which then sends to an indexer cluster). If I run index=sample | transaction _raw, I only get a small number of results where the raw events are combined.

How can I confirm how big my data duplication problem might be?

gcusello
SplunkTrust

Hi @Dmikos1271,

I'd run a simple search like the following:

index=*
| stats values(index) AS index values(sourcetype) AS sourcetype values(host) AS host values(source) AS source count BY _raw
| where count>1

This way you can find the scope of your issue and analyze your configuration.
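
To put a number on the duplication, one option is to count identical raw events and compare the total with the distinct count. The search below is only a sketch: it assumes the index name sample from the question and a one-hour window (both are placeholders to adjust), and it hashes _raw with md5() so the grouping key stays small. The field names raw_hash, copies, distinct_events, total_events, duplicate_events and duplicate_pct are made up for the example.

```spl
index=sample earliest=-1h
| eval raw_hash=md5(_raw)
| stats count AS copies BY raw_hash
| stats count AS distinct_events sum(copies) AS total_events
| eval duplicate_events=total_events-distinct_events
| eval duplicate_pct=round(100*duplicate_events/total_events, 2)
```

A duplicate_pct close to 50 would match the halving seen with dedup _raw. Keep in mind that identical _raw text is not always a true duplicate: short syslog messages with second-level timestamps can legitimately repeat.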

Usually this issue is caused by one of these misconfigurations:

  • a Universal Forwarder is installed on both nodes of an active/active cluster and both send the same data,
  • there are logs in rotating files and the input has the option "crcSalt = <SOURCE>",
  • there's a syslog data flow sent to more than one receiver without using a load balancer.
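
To get a hint about which of these cases applies, you could group the repeated events by where they came from. Again only a sketch, assuming the index sample from the question and a one-hour window; raw_hash, copies, hosts, sources, indexers and duplicated_events are illustrative field names.

```spl
index=sample earliest=-1h
| eval raw_hash=md5(_raw)
| stats count AS copies dc(host) AS hosts dc(source) AS sources dc(splunk_server) AS indexers BY raw_hash
| where copies>1
| stats count AS duplicated_events BY hosts sources indexers
```

As a rough reading: repeated events with more than one source value may point to rotated files being read again (the crcSalt case), while repeated events with the same host and source are more likely to come from two forwarders or two syslog receivers sending the same stream. Treat this as a starting point for checking the configuration, not as proof.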

Ciao.

Giuseppe
