Getting Data In

How to avoid indexing duplicates?

subasm
Loves-to-Learn

We are trying to ingest large (peta bytes) information into Splunk. 

The Events are in JSON file structure like - 'audit_events_ip-10-23-186-200_1.1512077259453.json'

The pipeline is like - 

JSON files > Folder > UF > HF Cluster > Indexer Cluster

 

~ UF - inputs.conf

[batch:///folder]

_TCP_ROUTING = p2s_au_hf

crcSalt = <SOURCE>

disabled = false

move_policy = sinkhole

recursive = false

whitelist = \.json$

 

We are seeing the events from specific files (NOT all) are getting duplicated. It indexes from some file 2 times exactly. 

As it is [batch:///] which suppose to delete the file after reading it & crcSalt=<SOURCE>, we are NOT able to figure out why & what creates the duplicates. 

Would appreciate any help, reference or pointers. Thanks in advance!!!

Labels (1)
0 Karma

subasm
Loves-to-Learn

Apparently the source files transfer to folder is in our control - it is verified that the data is NOT duplicates. 

It seems to me there are issues while the data is inflight UF -> HF -> Indexers.

Not sure how the ACK works in this set up.  

0 Karma

gcusello
SplunkTrust
SplunkTrust

Hi @subasm,

I'm quite sure that the issue is in the data.

Open a case to Splunk Support to be sure.

Ciao.

Giuseppe

0 Karma

gcusello
SplunkTrust
SplunkTrust

Hi @subasm,

probably your logs are rotated in a different file at midnight, so the crcSal option duplicates your indexed data, did you tried without this option?

Ciao.

Giuseppe

0 Karma

subasm
Loves-to-Learn

We are manually copying the files to the <DIR> and from there onwards UF is supposed to pick up.

So I don't think there is rolling over of the same files at midnight. 

0 Karma

gcusello
SplunkTrust
SplunkTrust

Hi @subasm,

if there isn't a rotation, the data are duplicatd at the origin, anyway, if you don't use crcSalt option you have sure to avoid duplicates because Splunk uses its archive (_fishbuckets) to store the already ingested data.

Ciao.

Giuseppe

0 Karma
Get Updates on the Splunk Community!

Join Us for Splunk University and Get Your Bootcamp Game On!

If you know, you know! Splunk University is the vibe this summer so register today for bootcamps galore ...

.conf24 | Learning Tracks for Security, Observability, Platform, and Developers!

.conf24 is taking place at The Venetian in Las Vegas from June 11 - 14. Continue reading to learn about the ...

Announcing Scheduled Export GA for Dashboard Studio

We're excited to announce the general availability of Scheduled Export for Dashboard Studio. Starting in ...