<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How to avoid indexing duplicates? (Getting Data In)</title>
    <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672244#M112615</link>
    <description>&lt;P&gt;We are trying to ingest a large volume (petabytes) of data into Splunk.&lt;/P&gt;&lt;P&gt;The events are in JSON files named like 'audit_events_ip-10-23-186-200_1.1512077259453.json'.&lt;/P&gt;&lt;P&gt;The pipeline is:&lt;/P&gt;&lt;P&gt;JSON files &amp;gt; Folder &amp;gt; UF &amp;gt; HF Cluster &amp;gt; Indexer Cluster&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;~ UF - inputs.conf&lt;/P&gt;&lt;P class=""&gt;[batch:///folder]&lt;/P&gt;&lt;P class=""&gt;_TCP_ROUTING = p2s_au_hf&lt;/P&gt;&lt;P class=""&gt;crcSalt = &amp;lt;SOURCE&amp;gt;&lt;/P&gt;&lt;P class=""&gt;disabled = false&lt;/P&gt;&lt;P class=""&gt;move_policy = sinkhole&lt;/P&gt;&lt;P class=""&gt;recursive = false&lt;/P&gt;&lt;P class=""&gt;whitelist = \.json$&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;We are seeing that events from specific files (NOT all of them) are getting duplicated: some files are indexed exactly twice.&lt;/P&gt;&lt;P class=""&gt;Since this is a [batch://] input, which is supposed to delete each file after reading it, and crcSalt = &amp;lt;SOURCE&amp;gt; is set, we are NOT able to figure out why &amp;amp; what creates the duplicates.&lt;/P&gt;&lt;P class=""&gt;Would appreciate any help, references, or pointers. Thanks in advance!&lt;/P&gt;</description>
    <pubDate>Tue, 19 Dec 2023 04:42:06 GMT</pubDate>
    <dc:creator>subasm</dc:creator>
    <dc:date>2023-12-19T04:42:06Z</dc:date>
    <item>
      <title>How to avoid indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672244#M112615</link>
      <description>&lt;P&gt;We are trying to ingest a large volume (petabytes) of data into Splunk.&lt;/P&gt;&lt;P&gt;The events are in JSON files named like 'audit_events_ip-10-23-186-200_1.1512077259453.json'.&lt;/P&gt;&lt;P&gt;The pipeline is:&lt;/P&gt;&lt;P&gt;JSON files &amp;gt; Folder &amp;gt; UF &amp;gt; HF Cluster &amp;gt; Indexer Cluster&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;~ UF - inputs.conf&lt;/P&gt;&lt;P class=""&gt;[batch:///folder]&lt;/P&gt;&lt;P class=""&gt;_TCP_ROUTING = p2s_au_hf&lt;/P&gt;&lt;P class=""&gt;crcSalt = &amp;lt;SOURCE&amp;gt;&lt;/P&gt;&lt;P class=""&gt;disabled = false&lt;/P&gt;&lt;P class=""&gt;move_policy = sinkhole&lt;/P&gt;&lt;P class=""&gt;recursive = false&lt;/P&gt;&lt;P class=""&gt;whitelist = \.json$&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;We are seeing that events from specific files (NOT all of them) are getting duplicated: some files are indexed exactly twice.&lt;/P&gt;&lt;P class=""&gt;Since this is a [batch://] input, which is supposed to delete each file after reading it, and crcSalt = &amp;lt;SOURCE&amp;gt; is set, we are NOT able to figure out why &amp;amp; what creates the duplicates.&lt;/P&gt;&lt;P class=""&gt;Would appreciate any help, references, or pointers. Thanks in advance!&lt;/P&gt;</description>
      <pubDate>Tue, 19 Dec 2023 04:42:06 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672244#M112615</guid>
      <dc:creator>subasm</dc:creator>
      <dc:date>2023-12-19T04:42:06Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672254#M112616</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/263439"&gt;@subasm&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;probably your logs are rotated into a different file at midnight, so the crcSalt option duplicates your indexed data. Have you tried without this option?&lt;/P&gt;&lt;P&gt;Ciao.&lt;/P&gt;&lt;P&gt;Giuseppe&lt;/P&gt;</description>
      <pubDate>Tue, 19 Dec 2023 07:10:02 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672254#M112616</guid>
      <dc:creator>gcusello</dc:creator>
      <dc:date>2023-12-19T07:10:02Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672257#M112617</link>
      <description>&lt;P&gt;We are manually copying the files to the &amp;lt;DIR&amp;gt;, and from there the UF is supposed to pick them up.&lt;/P&gt;&lt;P&gt;So I don't think the same files are being rolled over at midnight.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Dec 2023 07:33:15 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672257#M112617</guid>
      <dc:creator>subasm</dc:creator>
      <dc:date>2023-12-19T07:33:15Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672261#M112618</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/263439"&gt;@subasm&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;if there isn't a rotation, the data are duplicated at the origin. Anyway, if you don't use the crcSalt option you can be sure to avoid duplicates, because Splunk uses its internal archive (the fishbucket) to track data it has already ingested.&lt;/P&gt;&lt;P&gt;Ciao.&lt;/P&gt;&lt;P&gt;Giuseppe&lt;/P&gt;</description>
      <pubDate>Tue, 19 Dec 2023 07:38:07 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672261#M112618</guid>
      <dc:creator>gcusello</dc:creator>
      <dc:date>2023-12-19T07:38:07Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672306#M112622</link>
      <description>&lt;P&gt;The transfer of the source files to the folder is under our control - we have verified that the source data is NOT duplicated.&lt;/P&gt;&lt;P&gt;It seems to me there are issues while the data is in flight: UF -&amp;gt; HF -&amp;gt; Indexers.&lt;/P&gt;&lt;P&gt;Not sure how the ACK works in this setup.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Dec 2023 13:47:10 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672306#M112622</guid>
      <dc:creator>subasm</dc:creator>
      <dc:date>2023-12-19T13:47:10Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672309#M112625</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/263439"&gt;@subasm&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I'm quite sure that the issue is in the data.&lt;/P&gt;&lt;P&gt;Open a case with Splunk Support to be sure.&lt;/P&gt;&lt;P&gt;Ciao.&lt;/P&gt;&lt;P&gt;Giuseppe&lt;/P&gt;</description>
      <pubDate>Tue, 19 Dec 2023 14:05:30 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672309#M112625</guid>
      <dc:creator>gcusello</dc:creator>
      <dc:date>2023-12-19T14:05:30Z</dc:date>
    </item>
  </channel>
</rss>