Splunk as CMDB - initCrcLength set to max creates duplicates

f1dot4 — Fri, 29 Jan 2016 10:33:23 GMT

Hi,
I want to use Splunk as the GUI for a CMDB. I know that's not the default use case, but Splunk is already in place and I like its visualization options.

I'm indexing text files that contain host metadata; the filename contains the timestamp and hostname (this part already works). These text files are created every day, and most of the time nothing changes from one day to the next (no software or hardware change, etc.), so there is no need to index them again.

According to the docs, Splunk computes a CRC over the first 256 bytes of a file (initCrcLength) to decide whether the file has already been indexed, which is how it handles log rotation. Since my files are not normal log files, the important change can also occur at the end of the file. To make sure I don't miss a change there, I increased initCrcLength to its maximum of 1048576 bytes.
My files are smaller than 1048576 bytes, so from my point of view Splunk should not index files with the same content (checked from the beginning to the end of the file).
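
To illustrate that reasoning, here is a minimal Python sketch. It uses zlib.crc32 as a stand-in checksum (an assumption; Splunk's actual CRC routine is internal and not reproduced here), and the function name head_checksum and the sample data are hypothetical.

```python
import zlib


def head_checksum(data: bytes, length: int = 256) -> int:
    """CRC32 of the first `length` bytes (illustrative stand-in, not Splunk's code)."""
    return zlib.crc32(data[:length])


# Two hypothetical daily snapshots: identical for the first 1000 bytes,
# with the only difference at the very end of the file.
day1 = b"x" * 1000 + b"software=1.0\n"
day2 = b"x" * 1000 + b"software=1.1\n"

# With a 256-byte window the change at the end is invisible.
same_with_default = head_checksum(day1) == head_checksum(day2)

# With a window larger than the file, the whole content is covered.
same_with_max = head_checksum(day1, 1048576) == head_checksum(day2, 1048576)

print("equal with 256 bytes:", same_with_default)
print("equal with 1048576 bytes:", same_with_max)
```

With the short window both snapshots look identical, while the large window tells them apart, which is the behaviour I expected from raising initCrcLength.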

My problem with this configuration is that exactly this is happening (duplicates are being indexed). My inputs.conf:

[monitor://D:\CMDB\lokal\test\*.txt]
host_regex = test\\(.*?)_
initCrcLength = 1048576
disabled = false
index = cmdb
sourcetype = cmdb

Any ideas?
BR, Lukas

Re: Splunk as CMDB - initCrcLength set to max creates duplicates

jplumsdaine22 — Fri, 29 Jan 2016 11:56:06 GMT

What error are you getting that shows the files are being reindexed?

If you search index=_internal, you should see why the reindexing is occurring.

Re: Splunk as CMDB - initCrcLength set to max creates duplicates

f1dot4 — Fri, 29 Jan 2016 12:07:22 GMT

There is no error about reindexing, but when I have two files with different filenames and exactly the same content, both are indexed. Since most of the files are identical, the index is filling up with lots of duplicate entries, which is what I'm trying to avoid with the initCrcLength setting.

Re: Splunk as CMDB - initCrcLength set to max creates duplicates

jplumsdaine22 — Fri, 29 Jan 2016 15:21:29 GMT

If you run

$ head -c 1048576 <filename>.txt | md5sum

against all those files, do you get the same hash?

Re: Splunk as CMDB - initCrcLength set to max creates duplicates

dwaddle — Fri, 29 Jan 2016 15:35:34 GMT

Let's try this differently. Leave initCrcLength alone and set in props.conf:

[source::D:\CMDB\lokal\test\*.txt]
CHECK_METHOD=entire_md5

Re: Splunk as CMDB - initCrcLength set to max creates duplicates

jplumsdaine22 — Fri, 29 Jan 2016 15:36:19 GMT

You might check whether they're the same. I think something similar to this Python is happening under the hood:

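import glob
import hashlib
import os

os.chdir("D:/CMDB/lokal/test/")

for name in glob.glob("*.txt"):
    # Hash only the first 1048576 bytes, mirroring the initCrcLength setting
    with open(name, "rb") as current_file:
        digest = hashlib.md5(current_file.read(1048576))
    # Print the filename next to its hash so identical files are easy to spot
    print(name, digest.hexdigest())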

Re: Splunk as CMDB - initCrcLength set to max creates duplicates

jplumsdaine22 — Fri, 29 Jan 2016 15:43:53 GMT

Looks good to me, although I'm not sure whether Splunk considers the hostname when comparing CRCs.

Re: Splunk as CMDB - initCrcLength set to max creates duplicates

f1dot4 — Tue, 29 Sep 2020 08:35:05 GMT

Hi, in theory this sounds good. I removed the initCrcLength parameter from inputs.conf and added CHECK_METHOD = entire_md5 to props.conf. With btool, I checked that this config is active.

When I copy the txt file, the md5sum is equal, but the copy is indexed again.

props.conf:

[source::D:\CMDB\lokal\test\*.txt]
CHECK_METHOD = entire_md5

inputs.conf:

[monitor://D:\CMDB\lokal\test\*.txt]
host_regex = test\\(.*?)_
disabled = false
index = cmdb
sourcetype = cmdb

md5 check:

md5sum "AUD-S-K001-01__10.txt"
768b4568fa45d8d6771d5ca8160dc483 *AUD-S-K001-01__10.txt

md5sum "AUD-S-K001-01__11.txt"
768b4568fa45d8d6771d5ca8160dc483 *AUD-S-K001-01__11.txt
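
The same comparison can be run over a whole folder at once. This is a minimal sketch, assuming Python 3.9+; the helper names file_md5 and group_by_md5 and the sample file names are hypothetical. It only confirms which files are byte-identical and does not reproduce how Splunk evaluates CHECK_METHOD.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 65536) -> str:
    """MD5 of the entire file, read in chunks to keep memory use small."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_md5(directory: Path, pattern: str = "*.txt") -> dict[str, list[str]]:
    """Map each full-file MD5 to the names of the files that share it."""
    groups = defaultdict(list)
    for path in sorted(directory.glob(pattern)):
        groups[file_md5(path)].append(path.name)
    return dict(groups)


# Demo on throwaway files; point group_by_md5 at the monitored folder instead.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "hostA__10.txt").write_bytes(b"cpu=4\nram=16\n")
    (folder / "hostA__11.txt").write_bytes(b"cpu=4\nram=16\n")
    (folder / "hostA__12.txt").write_bytes(b"cpu=4\nram=32\n")
    groups = group_by_md5(folder)

for digest, names in groups.items():
    print(digest, names)
```

Any hash that lists more than one file name marks a set of identical files, i.e. the ones I would expect Splunk to skip.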