I have a number of small files, each of which maps to a single event. Since these files aren't actively added to (one-shot deals), I have been using batch mode to ingest them.
I recently saw a problem during some testing. If I added the same file to the monitor directory a second time, the event in the file was ingested again and the search now shows multiple instances of the same event. I thought that Splunk was able to avoid duplicate entries by hashing the file and doing a CRC check. However, that does not seem to be working. The file that I have added has the same contents and the same name. The only thing different is the timestamp of the file since it was recently copied into the monitor directory.
Here are the contents of my inputs.conf.
[batch:///data/metadata]
disabled = false
move_policy = sinkhole
sourcetype = meta
I have tried with and without
crcSalt = <SOURCE>
but it made no difference; the file is re-ingested every time I copy it into the batch monitor directory, creating another duplicate event.
Is there a way to avoid re-ingestion? Failing that, can I do something in search so that everyone only sees the most recent instance of the event?
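(If re-ingestion can't be prevented, one search-time workaround is to dedup on fields that identify the event. This is only a sketch; it assumes the file path in source plus the raw event text uniquely identify an event, and that "most recent" means highest _indextime:)

```
sourcetype=meta
| sort 0 - _indextime
| dedup source _raw
```

Sorting by _indextime descending first means dedup keeps the most recently indexed copy of each event.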
Hi @trenin, I know this might be too late for a reply, but letting you know anyway.
I was having the same issue and came across your question while searching for an answer.
Are you using any REGEXes for field extractions or timestamp prefixes in your sourcetype?
My problem was that one of the (lengthy) REGEX values I was using in one of the sourcetype definitions was split in the middle and carried over to the line below, so Splunk was only considering the first line (a malformed REGEX). I spotted it when I was restarting Splunk and it warned me about it during validation of the .conf files.
After I fixed that, I had no duplication. So, for me, the duplication occurred only for the inputs that were using that particular sourcetype.
Can you make sure the files at the monitor location aren't being renamed or edited multiple times? It may also be worth checking whether there is a log rotation policy defined for these files.
Also, can you check the index times and indexing servers for the individual copies of these duplicated events? This will help you determine at what time your events are getting duplicated and whether they are being sent to different indexers.
I am manually copying the file into the monitor directory, watching it get ingested, then manually copying the file a second time and watching it get ingested again. There are now two events in the Splunk UI for the exact same file. I know this scenario will occur, so I am testing it. I thought/was hoping Splunk was able to detect duplicates with the hash and not re-ingest. There is only one indexer.
Have you tried
crcSalt = <string>
Also, take a look at initCrcLength in inputs.conf:

initCrcLength = <integer>
* How much of a file, in bytes, that the input reads before trying to
  identify whether it is a file that has already been seen. You might want to
  adjust this if you have many files with common headers (comment headers,
  long CSV headers, etc.) and recurring filenames.
* Cannot be less than 256 or more than 1048576.
* CAUTION: Improper use of this setting will cause data to be re-indexed. You
  might want to consult with Splunk Support before adjusting this value; the
  default is fine for most installations.
* Default: 256 (bytes).
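(If the first 256 bytes of your files are often identical, raising this in your batch stanza could look like the following. This is a sketch based on the stanza you posted; the value 1024 is illustrative:)

```
[batch:///data/metadata]
disabled = false
move_policy = sinkhole
sourcetype = meta
initCrcLength = 1024
```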
Let me know if that resolves your problem. Otherwise, we can also start talking about using monitor:// instead of batch://.
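(For reference, a monitor equivalent of the batch stanza might look like the sketch below. Note that monitor does not delete files after ingestion, which matters if the files are meant to be one-shot:)

```
[monitor:///data/metadata]
disabled = false
sourcetype = meta
```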
crcSalt = ConstantString

Made no difference. I also tried setting initCrcLength to 1024. No difference.
I am using batch instead of monitor because the files are one-offs, each containing a single event. Once ingested, a file is no longer necessary. I will be getting tens of millions of files and don't want them to hang around in the monitor directory.