We are trying to ingest a large volume (petabytes) of data into Splunk.
The events are in JSON files named like 'audit_events_ip-10-23-186-200_1.1512077259453.json'.
The pipeline looks like this:
JSON files > Folder > UF > HF Cluster > Indexer Cluster
~ UF - inputs.conf
[batch://<DIR>]
_TCP_ROUTING = p2s_au_hf
crcSalt = <SOURCE>
disabled = false
move_policy = sinkhole
recursive = false
whitelist = \.json$
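For completeness, the _TCP_ROUTING target is defined in outputs.conf on the UF roughly as below; the HF host names and ports here are placeholders, not our real values.
~ UF - outputs.conf (rough sketch; HF hosts/ports are placeholders)
[tcpout:p2s_au_hf]
# target group referenced by _TCP_ROUTING in inputs.conf
server = <hf_host_1>:9997, <hf_host_2>:9997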
We are seeing that events from specific files (NOT all) are getting duplicated; some files are indexed exactly twice.
Since the input is a [batch://...] stanza, which is supposed to delete each file after reading it, and crcSalt = <SOURCE> is set, we are NOT able to figure out what is creating the duplicates.
Would appreciate any help, references or pointers. Thanks in advance!
The transfer of the source files into the folder is under our control, and we have verified that the data itself is NOT duplicated.
It seems to me there is an issue while the data is in flight from UF -> HF -> Indexers.
I'm not sure how acknowledgement (ACK) works in this setup.
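For context, indexer acknowledgement only applies if useACK is enabled in outputs.conf on each forwarding tier (UF -> HF and HF -> indexers); a minimal sketch, reusing the group name from above and assuming everything else:
~ UF and HF - outputs.conf (sketch; only the ACK setting shown)
[tcpout:p2s_au_hf]
# receiver must confirm the data before it is removed from the forwarder's wait queue
useACK = true
Note that acknowledgement gives at-least-once delivery: if an ACK is lost (for example, the connection drops after the data was written but before the ACK comes back), the forwarder resends that block, which can itself produce exact duplicates.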
Hi @subasm,
I'm quite sure that the issue is in the data itself.
Open a case with Splunk Support to be sure.
Ciao.
Giuseppe
Hi @subasm,
probably your logs are rotated into a different file at midnight, so the crcSalt option duplicates your indexed data. Did you try without this option?
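For example, the batch stanza without crcSalt could look like this (the directory is just a placeholder):
~ UF - inputs.conf (sketch without crcSalt)
[batch://<DIR>]
_TCP_ROUTING = p2s_au_hf
disabled = false
move_policy = sinkhole
recursive = false
whitelist = \.json$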
Ciao.
Giuseppe
We are manually copying the files to <DIR>, and from there the UF is supposed to pick them up.
So I don't think the same files are being rolled over at midnight.
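To narrow this down, a rough check (index and sourcetype are placeholders) is to look for identical raw events indexed more than once from the same source file; the UF's splunkd.log in _internal may also show whether the batch input processed the affected filename twice:
index=<your_index> sourcetype=<your_sourcetype>
| eval raw_hash=md5(_raw)
| stats count by source, raw_hash
| where count > 1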
Hi @subasm,
if there isn't a rotation, the data is duplicated at the origin. Anyway, if you don't use the crcSalt option you can be sure to avoid duplicates, because Splunk uses its own archive (the fishbucket) to keep track of data it has already ingested.
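If you want to see what the fishbucket has recorded for one of the duplicated files, you can probe it on the UF with btprobe; a sketch (the file path is a placeholder and the exact options can vary by version, so check the docs for your release):
$SPLUNK_HOME/bin/splunk cmd btprobe -d $SPLUNK_HOME/var/lib/splunk/fishbucket/splunk_private_db --file <path_to_one_duplicated_file> --validate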
Ciao.
Giuseppe