Getting Data In

Duplicate Data Being Indexed - How can I confirm how big my data duplication problem might be?

Dmikos1271
Explorer

I'm trying to validate if we have a large amount of data duplication.

Whenever I run the dedup _raw command, the number of results is typically halved, but based on our on-prem setup I don't see how the data could be ingested twice (our typical data flow is syslog to a HF that is then ingested by an indexer cluster). If I run index=sample | transaction _raw, I only get a small number of results where the raw events are combined.

How can I confirm how big my data duplication problem might be?
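For example, a rough way to turn that dedup observation into numbers (a sketch only, treating any events with identical _raw as duplicates, and using the same index=sample as above) might be something like:

index=sample
| stats count AS total_events dc(_raw) AS unique_events
| eval possible_duplicates=total_events-unique_events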


gcusello
SplunkTrust

Hi @Dmikos1271,

I'd run a simple search like the following:

index=*
| stats values(index) AS index values(sourcetype) AS sourcetype values(host) AS host values(source) AS source count BY _raw
| where count>1

This way you can determine the scope of your issue and analyze your configuration.
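If you want to express the result as a percentage rather than a list, a rough extension of the same idea (still keying on identical _raw, and still scoped to index=* as above, though you would normally narrow it to the affected index) could be:

index=*
| stats count BY _raw
| stats sum(count) AS total_events sum(eval(count-1)) AS extra_copies
| eval duplicate_pct=round(100*extra_copies/total_events,2)

Here extra_copies counts how many events beyond the first share the same _raw, so duplicate_pct is only an estimate of the duplication rate.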

Usually this issue is caused by one of these misconfigurations (a quick check to tell them apart is sketched below the list):

  • a Universal Forwarder installed on an active/active cluster where both nodes send the same data,
  • logs in rotating files where the input has the option "crcSalt = <SOURCE>" set,
  • a syslog data flow sent to more than one receiver without using a Load Balancer.
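To help tell these cases apart, a variation of the search above (index=sample is just a placeholder for the affected index) that shows how many distinct hosts and sources each duplicated event arrives from might look like:

index=sample
| stats dc(host) AS host_count dc(source) AS source_count count BY _raw
| where count>1

Duplicates spread across more than one host usually point to the first or third case (two forwarders or two syslog receivers sending the same data), while duplicates from a single host but multiple sources are more typical of rotated files being re-read.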

Ciao.

Giuseppe
