I am using the directory monitoring feature to index files below a specific path. The stanza in inputs.conf looks like this:
[monitor://E:\Logs\UTC]
disabled = false
host_regex = \.?(?<host>[A-Za-z_]*)_[0-9]
sourcetype = tsv
Looking at the data in Splunk, though, I occasionally see that files placed into that directory do not get indexed. I can index each of these files manually with the oneshot CLI command, but I was hoping to figure out why they were skipped in the first place. Has anyone seen this before?
Any assistance would be appreciated.
This is a common question; it comes up here often.
Most of the time this is caused by files with large headers. I see your input is a TSV, so this may well apply to your problem.
Splunk uses a hash value to determine whether a file has already been indexed. To calculate the hash, the first few lines or characters of a file are read (how much to read is configurable) and hashed. If your files share a large header, there is the possibility that the hash is equal for several files.
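To make the failure mode concrete, here is a minimal sketch of that kind of check (not Splunk's actual code; the file contents and the 256-byte read length are illustrative assumptions):

```python
import zlib

# Two hypothetical TSV files that share an identical, oversized header block.
header = b"# exported: host_a_1\tcolumns: time\tvalue\n" * 8
file_a = header + b"2024-01-01T00:00:00\t1\n"
file_b = header + b"2024-01-01T00:00:00\t2\n"

N = 256  # illustrative read length; Splunk's is governed by initCrcLength

crc_a = zlib.crc32(file_a[:N])
crc_b = zlib.crc32(file_b[:N])

# The header alone fills the first N bytes, so both checksums collide and a
# "have I seen this file before?" check based on them treats file_b as a
# duplicate of file_a, even though the payload lines differ.
print(crc_a == crc_b)
```

Reading more bytes (so the differing payload lines contribute to the hash) or mixing in something unique per file, such as its path, breaks the collision, which is exactly what the two options below do.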
You have two options:
1) Increase the number of lines or characters Splunk reads to calculate the hash. Details are in the inputs.conf documentation; search for "initCrcLength".
2) Add a salt to the lines or characters that are read.
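For example, both options could be applied to your stanza roughly like this (the initCrcLength value is illustrative; check the inputs.conf spec for the valid range in your version):

```ini
[monitor://E:\Logs\UTC]
disabled = false
host_regex = \.?(?<host>[A-Za-z_]*)_[0-9]
sourcetype = tsv
# Option 1: read more bytes when computing the file checksum,
# so the differing content past the header is included.
initCrcLength = 1024
# Option 2: mix the full source path into the checksum, so files
# with identical headers in the same directory no longer collide.
crcSalt = <SOURCE>
```

Note that changing either setting alters the checksums Splunk computes, so files it has already indexed may be picked up and indexed a second time after the change.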
Your theory about the header was right. I ended up fixing the issue with a setting in inputs.conf. It did result in my files being double-indexed after restarting the Splunk server. That was a bit of a pain to resolve, but once I did, the issue appears to be fixed. Thanks for the tip!