We want Splunk to index some CSV data that we put in a folder on the search head, so we set up the Files & Directories monitor. But some of the files aren't indexed at all. I've seen this issue pop up before on Answers, but I'm still not sure if there's a solution to fix it.
How can I force Splunk to index all complete files within the in folder? Is this a bug with the file & directories monitor, or Splunk automagically trying to figure out what I want to actually index? And if this is Splunk trying to outsmart me (e.g., not reading a file if the first xyz characters are repeated from previous files), how can I get Splunk to do what I want and index everything?
In Splunk settings, it shows that it is monitoring the correct 'number of files' that should be monitored. But when I look at the source files in the index, a lot of them are missing. It's too many files to manually upload, and we haven't gone the route of sending the files with a forwarder instead because this is a dev demo box that we need to get working.
Thanks
It's likely the "Splunk is trying to outsmart me" case. To avoid reindexing the same data (that'd be pretty churlish, right, charging for ingest but reindexing the same stuff over and over?!), Splunk computes a CRC checksum over the start of each file and skips any file whose checksum it has already seen. It looks at the first 256 bytes (by default, configurable as initCrcLength) and a few other factors. If the headers or preambles of your files are identical (common with JVM sources, and with CSV files that share a header row), raising the length can be enough. For your case, there's another route: tell Splunk to include the full path of the file in its checksum input. Within the inputs.conf stanza that says "monitor this directory", add this setting: crcSalt = <SOURCE> (with the literal < > there).
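For reference, a minimal sketch of what that monitor stanza might look like. The directory path is a placeholder I made up, so substitute whatever path your existing stanza already uses; the commented-out initCrcLength value is only an illustrative number, not a recommendation.

```
# inputs.conf (sketch; the monitored path below is a hypothetical placeholder)
[monitor:///path/to/your/csv/folder]
# Mix the full source path into the checksum, so files with
# identical leading bytes are still treated as distinct files.
crcSalt = <SOURCE>

# Alternative route: read more than the default 256 bytes when
# computing the initial checksum (example value, adjust as needed).
# initCrcLength = 1024
```

One thing to keep in mind: because crcSalt = <SOURCE> ties the checksum to the path, a file that gets renamed or moved will look new to Splunk and be indexed again. For a static folder of CSV files on a demo box that is usually fine.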
If you have files that are copies of others (same first 256 characters, the default initCrcLength), you cannot index them without adding crcSalt = <SOURCE> to inputs.conf
Bye.
Giuseppe
Ok thanks, it sounds like crcSalt = <SOURCE> is exactly what I'm looking for. I just added this to the inputs.conf. Will it add all non-indexed files retroactively now that the inputs.conf is updated?
It should, yes, given time. You'll need to restart Splunk to pick up the changes to inputs.conf.
Ok thanks for the help!