So I've been unable to understand how Splunk's folder/file monitoring handles ingestion of a file that has already been ingested but has changed since.
A basic example is a security log: Splunk identifies it, then ingests and indexes it.
A new entry is added to that security log.
What happens at that point?
Does it reingest the log, duplicating the old data?
Does it not reingest since it has already done so once?
Is it ridiculously smart and just reingests the new data?
Thanks for the help!
Typically, log files are written to in an appending manner. By default, Splunk keeps track of details about each file it monitors (including observed size, modtime, bytes read, and checksums) in a data structure known as "the fishbucket." So in your scenario, Splunk has already indexed the file, a new entry is appended to the end, and when Splunk reads the file again it sends only the new entry to be indexed.
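For reference, a basic monitor stanza looks something like this (the path, index, and sourcetype here are just examples, not from your setup):

```ini
# inputs.conf -- example monitor stanza; path/index/sourcetype are illustrative
[monitor:///var/log/secure]
index = os_logs
sourcetype = linux_secure
```

With a stanza like this, Splunk tails the file and uses the fishbucket to remember how far it has already read, so appended entries are picked up without re-sending the old ones.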
Entries in the fishbucket are keyed by a checksum of the beginning of the file (so that when log files are rolled, you don't wind up with duplicates just because the log file now has a different file name). You can still get duplicate indexing if your log rolling also compresses files and you haven't set up Splunk to ignore the compressed copies in your monitor stanza (since the checksum of compressed bytes won't usually match the checksum of the uncompressed bytes).
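A common way to avoid that is a `blacklist` on the monitor stanza, which is a regex matched against the file path (again, the path here is just an example):

```ini
# inputs.conf -- skip compressed rolled copies so they aren't indexed twice
[monitor:///var/log/myapp]
blacklist = \.(gz|bz2|zip)$
```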
There are a lot of settings in inputs.conf and props.conf to control this behavior. In fact, in cases where the file you're monitoring is not a log file and you actually do want to reindex the whole file whenever it changes, you can configure Splunk via props.conf to check only the modtime, or a checksum of the entire file, and resend everything.
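That whole-file behavior is controlled by `CHECK_METHOD` in props.conf. A minimal sketch (the source path is just an example):

```ini
# props.conf -- reindex the entire file whenever its modification time changes
# (entire_md5 checksums the full file instead; the default, endpoint_md5,
# gives the append-only tailing behavior described above)
[source::/opt/app/config/settings.xml]
CHECK_METHOD = modtime
```

Note that with `modtime` or `entire_md5` the whole file is re-sent on every change, so you'd only want this for files that are rewritten in place, not for logs that are appended to.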