Getting Data In

CSV Monitoring issues

Contributor

[monitor:///home/paul/trainingstatus/]
whitelist = (.csv$|.CSV$)
blacklist = .filepart$
index=training
sourcetype=training_status
crcSalt = <SOURCE>

The file is updated once per week. In many cases it is not fully consumed. The most recent update missed 19 records (which were consumed the last time the file was updated).

Splunkd.log shows:

04-06-2017 07:39:19.584 -0700 INFO WatchedFile - Will begin reading at offset=4234 for file='/home/paul/training_status/filename.csv

So, my uneducated guess would be that splunkd is seeing data that it's already consumed and thus ignoring those 19 records before it starts ingesting.

How do I prevent this? I thought setting crcSalt=<SOURCE> was supposed to handle this.

Thank you.

1 Solution

SplunkTrust

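crcSalt=<SOURCE> tells Splunk to add the full file path and name to the CRC it computes from the first 256 bytes of the file, and it uses that CRC to decide whether it has already indexed the file. If the filename does not change, Splunk recognizes the file and resumes reading from the offset where it left off, which is what the "Will begin reading at offset=4234" line in your splunkd.log shows. Anything before that offset is treated as already indexed.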

If you want the same records to be consumed again each time the file is updated, the easy ways are (A) put a timestamp in the file name, (B) add an update timestamp column to each row of the CSV, or (C) add a timestamp to the header in the file.

Alternatively, assuming the "source of record" for the file is someplace safe, you could have Splunk delete the file when it has finished indexing, so that any file found will be "new".
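For the delete-after-indexing approach, one possible sketch is a batch input in inputs.conf with move_policy = sinkhole, which tells Splunk to remove each file once it has been read. The path, whitelist, blacklist, index, and sourcetype below are copied from the question and are assumptions about your setup. Whether previously seen content is re-indexed still depends on Splunk's CRC check, so treat this as a starting point rather than a guaranteed fix.

```
# Sketch only: would replace the [monitor://...] stanza for this directory.
# Destructive: Splunk deletes each matching file after indexing it,
# so point this only at a copy, never at the source of record.
[batch:///home/paul/trainingstatus/]
move_policy = sinkhole
whitelist = (.csv$|.CSV$)
blacklist = .filepart$
index = training
sourcetype = training_status
```

Try it against a non-production index first, because files are removed as soon as they are read.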

Contributor

Thank you. This makes things very clear. - Cheers