
Splunk's mechanism to detect files with the same content

Influencer

The documentation says Splunk creates a CRC hash of the first and last 256 bytes of a file in order to detect whether the file's content has already been processed (e.g. after log file rotation). Is this true? Recent observations made me believe that only the first 256 bytes and the file size are relevant. How exactly does this detection of similar files work?

What are the options to override or tune this behavior other than crcSalt=<SOURCE>? Is there a way to increase this 256-byte window (e.g. let Splunk use the first 512 bytes to detect similar files)?

EDIT:

Here is an example, to illustrate what I mean:

The first 256 bytes of every file in the directory are the same:

sp@locutus:test_input$ for f in $(ls -1 .); do echo "head -c 256 $f | md5 = $(head -c 256 $f | md5)"; done
head -c 256 timings1_0.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_1.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_2.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_3.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_4.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_5.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings2_0.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings2_1.csv | md5 = e665ba09f505913aa5fe05d603fde49a
...

The last 256 bytes are different:

sp@locutus:test_input$ for f in $(ls -1 .); do echo "tail -c 256 $f | md5 = $(tail -c 256 $f | md5)"; done
tail -c 256 timings1_0.csv | md5 = de07cfe6f9b7209cbfdc3c63b5e45f66
tail -c 256 timings1_1.csv | md5 = b17470e217afcb23017596a569ce759a
tail -c 256 timings1_2.csv | md5 = 3aa94dfeb5014537e33bdd67ab7d16d0
tail -c 256 timings1_3.csv | md5 = 290d8c33f80a79a83bd02d10417ee8af
tail -c 256 timings1_4.csv | md5 = 292a292f17b01a4d4483712b70eddc68
tail -c 256 timings1_5.csv | md5 = 102566f80f0fb29a1ed8d5db5b26cce6
tail -c 256 timings2_0.csv | md5 = 61caa775c378b1c8887f2a442b546758
tail -c 256 timings2_1.csv | md5 = fd097acdbbb32391a4e0d9bccc37bc68
...

The file sizes differ as well:

sp@locutus:test_input$ for f in $(ls -1 .); do echo "du -h $f $(du -h $f)"; done
du -h timings1_0.csv 2,3M   timings1_0.csv
du -h timings1_1.csv 8,6M   timings1_1.csv
du -h timings1_2.csv 3,4M   timings1_2.csv
du -h timings1_3.csv 3,1M   timings1_3.csv
du -h timings1_4.csv 2,8M   timings1_4.csv
du -h timings1_5.csv 2,8M   timings1_5.csv
du -h timings2_0.csv 2,3M   timings2_0.csv
du -h timings2_1.csv 7,3M   timings2_1.csv
...

I added the directory to Splunk (these files had never been on this instance before), targeting an empty index "test":

sp@locutus:test_input$ splunk add monitor . -index test -sourcetype splunk_dup_test
Your session is invalid.  Please login.
Splunk username: admin
Password: 
Added monitor of '/Users/sp/temp/test_input'.

I waited a fair amount of time, until Splunk had finished indexing, then searched:

splunk search "index=test | stats count by source"

                 source                  count
---------------------------------------- -----
/Users/sp/temp/test_input/timings1_0.csv 11662

(Only one file got indexed.)


Re: Splunk's mechanism to detect files with the same content

Splunk Employee

It uses the first 256 bytes and the last 256 bytes by default. There are two other available methods. Adding crcSalt=<SOURCE> simply adds the file path to the hash, so if the file path is invariant, this doesn't actually change things.
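
To make the crcSalt=<SOURCE> point concrete, here is a minimal shell sketch. It is not Splunk's actual algorithm: cksum (the POSIX CRC) stands in for the internal checksum, the way the path is mixed into the hashed bytes is an assumption, and the helper names head_crc and salted_head_crc are made up for illustration.

```shell
#!/bin/sh
# Illustration only: cksum stands in for Splunk's internal CRC, and the
# layout of the salted input (path bytes followed by file bytes) is an
# assumption, not the documented implementation.

# Checksum of the first N bytes of a file (N defaults to 256).
head_crc() {
    head -c "${2:-256}" "$1" | cksum | cut -d ' ' -f 1
}

# Same checksum, but with the file path mixed into the hashed bytes,
# which is the effect described for crcSalt=<SOURCE>.
salted_head_crc() {
    { printf '%s' "$1"; head -c "${2:-256}" "$1"; } | cksum | cut -d ' ' -f 1
}

# Print both values for every file given on the command line.
for f in "$@"; do
    printf '%s plain=%s salted=%s\n' "$f" "$(head_crc "$f")" "$(salted_head_crc "$f")"
done
```

Run against files like the ones in the question, the plain value should be identical for all of them, while the salted value differs per path. It also shows the caveat: a file that keeps the same path keeps the same salt.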

You can use the CHECK_METHOD parameter in props.conf to select one of the other methods. You would most likely specify this in a [source::] stanza on the forwarder. From props.conf.spec:

CHECK_METHOD = endpoint_md5 | entire_md5 | modtime
* Set to 'endpoint_md5' to have Splunk checksum of the first and last 256 bytes of a file.  When matches are found, Splunk lists the file as already indexed and indexes only new data, or ignores it if there is no new data.
* Set this to "entire_md5" to use the checksum of the entire file.
* Alternatively, set this to "modtime" to check only the modification time of the file.
* Settings other than endpoint_md5 will cause splunk to index the entire file for each detected change.
* Defaults to endpoint_md5.
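
As a rough illustration of how the first two methods differ, the sketch below computes an "endpoint"-style and an "entire"-style checksum with standard tools. It only mirrors the wording of the spec text above. How Splunk actually combines the first and last 256 bytes is an assumption here, and the function names are hypothetical.

```shell
#!/bin/sh
# Illustration only: mirrors the spec wording, not Splunk's real code.

# MD5 of stdin; uses md5sum where available, otherwise the BSD/macOS md5.
md5_of_stdin() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum | cut -d ' ' -f 1
    else
        md5 -q
    fi
}

# "endpoint" style: hash the first 256 bytes followed by the last 256 bytes.
# (Concatenating the two windows is an assumption.)
endpoint_sig() {
    { head -c 256 "$1"; tail -c 256 "$1"; } | md5_of_stdin
}

# "entire" style: hash the whole file.
entire_sig() {
    md5_of_stdin < "$1"
}

# Print both signatures for every file given on the command line.
for f in "$@"; do
    printf '%s endpoint=%s entire=%s\n' "$f" "$(endpoint_sig "$f")" "$(entire_sig "$f")"
done
```

Under these assumptions, two files that share their first and last 256 bytes but differ in the middle get the same endpoint signature and different entire signatures. Files with a common header but different endings get different endpoint signatures.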

Re: Splunk's mechanism to detect files with the same content

Influencer

So if Splunk uses the first and last 256 bytes of each file, why does it index only one file from the example I added to the question? I also added CHECK_METHOD to props.conf, but it made no difference.


Re: Splunk's mechanism to detect files with the same content

Splunk Employee

Since 5.0 you can also set the initCrcLength parameter in inputs.conf (default is 256):
http://docs.splunk.com/Documentation/Splunk/latest/Admin/Inputsconf
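
To pick a value for initCrcLength, it helps to know where two files with a shared header first diverge. The sketch below finds that offset with cmp. It assumes the initial checksum covers exactly the first initCrcLength bytes, so the value would need to be at least the reported offset. The function name first_diff_offset and the stanza in the comment are illustrative, not taken from this thread.

```shell
#!/bin/sh
# Print the 1-based offset of the first byte at which two files differ.
# Prints nothing if no differing byte is found (for example, when one
# file is a prefix of the other).
first_diff_offset() {
    cmp -l "$1" "$2" 2>/dev/null | head -n 1 | awk '{ print $1 }'
}

# Compare the first file given against each of the others.
if [ "$#" -ge 2 ]; then
    ref=$1
    shift
    for f in "$@"; do
        printf '%s vs %s: first difference at byte %s\n' \
            "$ref" "$f" "$(first_diff_offset "$ref" "$f")"
    done
fi

# Illustrative inputs.conf stanza (path and value are examples only):
#   [monitor:///Users/sp/temp/test_input]
#   initCrcLength = 1024
```

For example, saved under a name of your choice and run as `sh first_diff.sh timings1_0.csv timings1_1.csv timings1_2.csv`, it prints one line per comparison. The largest offset reported is a lower bound for the setting.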