We are trying to ingest a large volume (petabytes) of data into Splunk.
The events are in JSON files named like 'audit_events_ip-10-23-186-200_1.1512077259453.json'.
The pipeline looks like this:
JSON files > Folder > UF > HF Cluster > Indexer Cluster
~ UF - inputs.conf
[batch://<DIR>]
_TCP_ROUTING = p2s_au_hf
crcSalt = <SOURCE>
disabled = false
move_policy = sinkhole
recursive = false
whitelist = \.json$
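For completeness, the _TCP_ROUTING target is defined in outputs.conf on the UF roughly as below; the HF host names and ports here are placeholders, not our real values.
~ UF - outputs.conf (rough sketch; HF hosts/ports are placeholders)
[tcpout:p2s_au_hf]
# target group referenced by _TCP_ROUTING in inputs.conf
server = <hf_host_1>:9997, <hf_host_2>:9997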
We are seeing that events from specific files (NOT all) are getting duplicated; some files are indexed exactly twice.
Since the input is a [batch://...] stanza, which is supposed to delete each file after reading it, and crcSalt = <SOURCE> is set, we are NOT able to figure out what is creating the duplicates.
Would appreciate any help, references or pointers. Thanks in advance!
The transfer of the source files into the folder is under our control, and we have verified that the data itself is NOT duplicated.
It seems to me there is an issue while the data is in flight from UF -> HF -> Indexers.
I'm not sure how acknowledgement (ACK) works in this setup.
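For context, indexer acknowledgement only applies if useACK is enabled in outputs.conf on each forwarding tier (UF -> HF and HF -> indexers); a minimal sketch, reusing the group name from above and assuming everything else:
~ UF and HF - outputs.conf (sketch; only the ACK setting shown)
[tcpout:p2s_au_hf]
# receiver must confirm the data before it is removed from the forwarder's wait queue
useACK = true
Note that acknowledgement gives at-least-once delivery: if an ACK is lost (for example, the connection drops after the data was written but before the ACK comes back), the forwarder resends that block, which can itself produce exact duplicates.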
Hi @subasm,
I'm quite sure that the issue is in the data itself.
Open a case with Splunk Support to be sure.
Ciao.
Giuseppe
Hi @subasm,
probably your logs are rotated into a different file at midnight, so the crcSalt option duplicates your indexed data. Did you try without this option?
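For example, the batch stanza without crcSalt could look like this (the directory is just a placeholder):
~ UF - inputs.conf (sketch without crcSalt)
[batch://<DIR>]
_TCP_ROUTING = p2s_au_hf
disabled = false
move_policy = sinkhole
recursive = false
whitelist = \.json$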
Ciao.
Giuseppe
We are manually copying the files to <DIR>, and from there the UF is supposed to pick them up.
So I don't think the same files are being rolled over at midnight.
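To narrow this down, a rough check (index and sourcetype are placeholders) is to look for identical raw events indexed more than once from the same source file; the UF's splunkd.log in _internal may also show whether the batch input processed the affected filename twice:
index=<your_index> sourcetype=<your_sourcetype>
| eval raw_hash=md5(_raw)
| stats count by source, raw_hash
| where count > 1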
Hi @subasm,
if there isn't a rotation, the data is duplicated at the origin. Anyway, if you don't use the crcSalt option you can be sure to avoid duplicates, because Splunk uses its own archive (the fishbucket) to keep track of data it has already ingested.
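If you want to see what the fishbucket has recorded for one of the duplicated files, you can probe it on the UF with btprobe; a sketch (the file path is a placeholder and the exact options can vary by version, so check the docs for your release):
$SPLUNK_HOME/bin/splunk cmd btprobe -d $SPLUNK_HOME/var/lib/splunk/fishbucket/splunk_private_db --file <path_to_one_duplicated_file> --validate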
Ciao.
Giuseppe