Getting Data In

How to prevent duplicates in batch mode?

trenin
Explorer

I have a number of small files, each of which maps to a single event. Since these files aren't actively added to (one-shot deals), I have been using batch mode to ingest them.

I recently saw a problem during some testing. If I added the same file to the monitor directory a second time, the event in the file was ingested again and the search now shows multiple instances of the same event. I thought that Splunk was able to avoid duplicate entries by hashing the file and doing a CRC check. However, that does not seem to be working. The file that I have added has the same contents and the same name. The only thing different is the timestamp of the file since it was recently copied into the monitor directory.

Here are the contents of my inputs.conf.

[batch:///data/metadata]
disabled = false
move_policy = sinkhole
sourcetype = meta

I have tried with and without

 crcSalt = <SOURCE>

but it made no difference; the file is re-ingested every time I copy it into the batch monitor directory, creating another duplicate event.

Is there a way to avoid re-ingestion? Failing that, can I do something in search so that everyone only sees the most recent instance of the event?

Thanks!

meleperuma
Explorer

Hi @trenin, I know this might be too late for a reply, but letting you know anyway.

I was having the same issue and came across your question while searching for an answer.

Are you using any REGEXes for field extractions or timestamp prefixes in your sourcetype?

My problem was that one of the (lengthy) REGEXes I was using in one of the sourcetype definitions was split in the middle and carried over to the line below, so Splunk was only considering the first line (a malformed REGEX). I spotted it when I was restarting Splunk and it warned me about it during the validation of the .conf files.

After I fixed it, I had no duplication. So, for me, the duplication occurred only for the inputs that were using that particular sourcetype.

amitm05
Builder

Can you make sure the files at the monitor location aren't being renamed or edited multiple times? It may also be worth checking whether any log rotation policy is defined on these files.

Also, can you check the index times and indexing servers for the individual entries of these duplicated events? This will help troubleshoot at what time your events are getting duplicated and whether they are being sent to different indexers.

trenin
Explorer

I am manually copying the file into the monitor directory, watching it get ingested, and then manually copying the file a second time and watching it get ingested a second time. There are now two events in the Splunk UI for the exact same file. I know that this scenario will occur, so I am testing it. I was hoping Splunk would be able to detect duplicates with the hash and not re-ingest. There is only one indexer.

amitm05
Builder

Have you tried

crcSalt = <string>

Also, take a look at -
From: http://docs.splunk.com/Documentation/Splunk/latest/Admin/Inputsconf
initCrcLength = <integer>
How much of a file, in bytes, that the input reads before trying to
identify whether it is a file that has already been seen. You might want to
adjust this if you have many files with common headers (comment headers,
long CSV headers, etc) and recurring filenames.
Cannot be less than 256 or more than 1048576.
CAUTION: Improper use of this setting will cause data to be re-indexed. You
might want to consult with Splunk Support before adjusting this value - the
default is fine for most installations.
Default: 256 (bytes).
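
To illustrate the idea behind these two settings, here is a conceptual Python sketch. It is not Splunk's actual implementation: the function name `head_fingerprint`, the use of `zlib.crc32`, and the way the salt is mixed in are all assumptions made for illustration. It only shows why two files with the same leading bytes can look identical to a check that reads a fixed-length prefix, and how a salt changes the result.

```python
import zlib


def head_fingerprint(data: bytes, init_crc_length: int = 256, salt: str = "") -> int:
    """Checksum of the salt plus the first init_crc_length bytes of data.

    Hypothetical helper for illustration only; Splunk's real check differs.
    """
    head = data[:init_crc_length]
    return zlib.crc32(salt.encode("utf-8") + head)


# Two files that share a long common header but differ afterwards.
header = b"#" * 300
file_a = header + b"event A"
file_b = header + b"event B"

# With a 256-byte prefix, only the shared header is read.
same_at_256 = head_fingerprint(file_a) == head_fingerprint(file_b)

# Reading up to 1024 bytes covers the part where the files differ.
same_at_1024 = head_fingerprint(file_a, 1024) == head_fingerprint(file_b, 1024)

# A salt (for example the source path) makes identical content look different.
salted_differs = (
    head_fingerprint(file_a, salt="/data/metadata/a.txt")
    != head_fingerprint(file_a, salt="/data/metadata/b.txt")
)

print(same_at_256, same_at_1024, salted_differs)
```

Note that in this sketch a longer prefix or a salt can only make more files look distinct, never fewer, so neither knob would by itself stop an identical file with an identical name from matching its earlier copy.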

Let me know if that resolves your problem.
Otherwise, we can also look at using monitor:// instead of batch://

trenin
Explorer

I tried

crsSalt = ConstantString

It made no difference. I also tried setting initCrcLength to 1024, which made no difference either.

I am using batch instead of monitor because the files are one-offs, each containing a single event. Once ingested, the file is no longer necessary. I will be getting tens of millions of files and don't want them to hang around in the monitor directory.
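
Since the files are one-offs, one possible workaround that does not depend on Splunk's own duplicate detection is to filter duplicates before they ever reach the batch directory. Below is a minimal Python sketch of that idea, assuming a plain-text ledger of SHA-256 digests. The names `file_digest`, `deliver_once`, `seen.txt`, and `event1.meta` are hypothetical, and nothing here is a Splunk API.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the whole file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def deliver_once(src: Path, batch_dir: Path, seen_file: Path) -> bool:
    """Copy src into batch_dir unless identical content was delivered before.

    seen_file is a plain-text ledger with one digest per line (an assumption
    of this sketch; a database would suit tens of millions of entries better).
    Returns True if the file was copied, False if it was skipped.
    """
    seen = set(seen_file.read_text().split()) if seen_file.exists() else set()
    digest = file_digest(src)
    if digest in seen:
        return False
    shutil.copy2(src, batch_dir / src.name)
    with seen_file.open("a") as fh:
        fh.write(digest + "\n")
    return True


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    batch_dir = root / "batch"
    batch_dir.mkdir()
    ledger = root / "seen.txt"
    event_file = root / "event1.meta"
    event_file.write_text("single event payload\n")

    first = deliver_once(event_file, batch_dir, ledger)
    # Simulate the sinkhole policy removing the file after ingestion.
    (batch_dir / "event1.meta").unlink()
    second = deliver_once(event_file, batch_dir, ledger)
    delivered_again = (batch_dir / "event1.meta").exists()

print(first, second, delivered_again)
```

The ledger lives outside the batch directory, so it survives the sinkhole deletion. Hashing the full content means a renamed or re-copied file with the same payload is still skipped.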

amitm05
Builder

No, you have to write it exactly like this:

crcSalt = <string>

I don't think crsSalt = ConstantString means anything.
