The documentation says Splunk creates a CRC hash of the first and last 256 bytes of a file in order to detect whether the file's content has already been processed (e.g. log file rotation). Is this true? Recent observations lead me to believe that only the first 256 bytes and the file size are relevant. How exactly does this similar-file detection work?
What are the options to override or tune this behavior other than crcSalt=<SOURCE>? Is there a way to increase this 256-byte window (e.g. let Splunk use the first 512 bytes to detect similar files)?
Here is an example, to illustrate what I mean:
The first 256 bytes of every file in the directory are the same:
sp@locutus:test_input$ for f in $(ls -1 .); do echo "head -c 256 $f | md5 = $(head -c 256 $f | md5)"; done
head -c 256 timings1_0.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_1.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_2.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_3.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_4.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_5.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings2_0.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings2_1.csv | md5 = e665ba09f505913aa5fe05d603fde49a
sp@locutus:test_input$ for f in $(ls -1 .); do echo "du -h $f $(du -h $f)"; done
du -h timings1_0.csv 2,3M timings1_0.csv
du -h timings1_1.csv 8,6M timings1_1.csv
du -h timings1_2.csv 3,4M timings1_2.csv
du -h timings1_3.csv 3,1M timings1_3.csv
du -h timings1_4.csv 2,8M timings1_4.csv
du -h timings1_5.csv 2,8M timings1_5.csv
du -h timings2_0.csv 2,3M timings2_0.csv
du -h timings2_1.csv 7,3M timings2_1.csv
Added to Splunk (it hasn't been on this instance before) into an empty index "test":
sp@locutus:test_input$ splunk add monitor . -index test -sourcetype splunk_dup_test
Your session is invalid. Please login.
Splunk username: admin
Added monitor of '/Users/sp/temp/test_input'.
Waited a fair amount of time (Splunk finished indexing):
It uses the first 256 bytes and the last 256 bytes by default. There are two other available methods. Adding crcSalt=<SOURCE> simply adds the file path to the hash, so if the file path is invariant, this doesn't actually change things.
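To check whether files share both endpoints, the first and last 256 bytes can be checksummed separately. The following is a minimal sketch only: endpoint_sums is a hypothetical helper name, and cksum is a stand-in for whatever checksum Splunk computes internally, so the values will not match Splunk's own. Equal values across files only indicate that those byte ranges are identical.

```shell
#!/bin/sh
# Hypothetical helper: print a checksum of the first and the last 256 bytes
# of each file passed as an argument.
# Assumption: cksum (POSIX CRC) stands in for Splunk's internal checksum.
endpoint_sums() {
    for f in "$@"; do
        # cksum prints "<crc> <byte count>"; keep only the CRC field.
        head_crc=$(head -c 256 "$f" | cksum | cut -d ' ' -f 1)
        tail_crc=$(tail -c 256 "$f" | cksum | cut -d ' ' -f 1)
        printf 'head=%s tail=%s %s\n' "$head_crc" "$tail_crc" "$f"
    done
}
```

For the example directory this would be called as endpoint_sums timings*.csv. Files that show the same head value but different tail values differ at the end even though their first 256 bytes match.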
You can use the CHECK_METHOD parameter in props.conf to select one of the other methods. You would most likely specify this in a [source::] stanza on the forwarder. From props.conf.spec:
CHECK_METHOD = endpoint_md5 | entire_md5 | modtime
* Set to 'endpoint_md5' to have Splunk checksum the first and last 256 bytes of a file. When matches are found, Splunk lists the file as already indexed and indexes only new data, or ignores it if there is no new data.
* Set this to "entire_md5" to use the checksum of the entire file.
* Alternatively, set this to "modtime" to check only the modification time of the file.
* Settings other than endpoint_md5 will cause splunk to index the entire file for each detected change.
* Defaults to endpoint_md5.
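As an illustration of the spec excerpt above, a stanza in props.conf on the forwarder might look like the sketch below. The source pattern is an assumption based on the monitored directory from the question, and entire_md5 is just one of the listed values. Per the spec text, any setting other than endpoint_md5 causes the whole file to be indexed again on each detected change.

```ini
# props.conf (sketch; the source path pattern is an assumption)
[source::/Users/sp/temp/test_input/*.csv]
CHECK_METHOD = entire_md5
```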
So, if Splunk uses the first and last 256 bytes of the files, why isn't it indexing more than one file from the example (added to the question)? I've also added CHECK_METHOD to props.conf, but it didn't make a difference.