
CSV Monitoring issues

pkeller
Contributor

[monitor:///home/paul/training_status/]
whitelist = (.csv$|.CSV$)
blacklist = .filepart$
index=training_index
sourcetype=training_status
crcSalt = <SOURCE>

The file gets updated once per week. In many cases, the file is not fully consumed. The most recent update missed 19 records (which were consumed the last time the file was updated).

splunkd.log shows:

04-06-2017 07:39:19.584 -0700 INFO WatchedFile - Will begin reading at offset=4234 for file='/home/paul/training_status/filename.csv

So, my uneducated guess would be that splunkd is seeing data that it's already consumed and thus ignoring those 19 records before it starts ingesting.

How do I prevent this? I thought setting crcSalt=<SOURCE> was supposed to handle this.

Thank you.

1 Solution

DalJeanis
Legend

crcSalt=<SOURCE> instructs Splunk to use the full file path and name, in addition to the first 256 bytes of the file, to determine whether it has already indexed a file. If you are not changing the filename, Splunk will resume indexing wherever it left off (or wherever the data is changed).
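To picture the check described above, here is a minimal Python sketch. It is a simplified model, not Splunk's actual code: the `file_key` helper, the sample CSV rows, and the timestamped filename are made up for illustration, and the real check also tracks a seek offset that this sketch ignores. Only the 256-byte window and the monitored path come from this thread.

```python
import zlib

INIT_CRC_LENGTH = 256  # the "first 256 bytes" mentioned above


def file_key(content: bytes, source_path: str, salt_with_source: bool) -> int:
    """Hypothetical stand-in for the 'have I seen this file?' fingerprint.

    Assumption: the fingerprint is a CRC over the first 256 bytes, with the
    source path mixed in when crcSalt = <SOURCE> is set.
    """
    head = content[:INIT_CRC_LENGTH]
    salt = source_path.encode("utf-8") if salt_with_source else b""
    return zlib.crc32(head + salt)


# Made-up CSV content: the first rows are long enough to fill the 256-byte window.
first_rows = b"name,course,status\n" + b"alice,intro,done\n" * 20
week1 = first_rows
week2 = first_rows + b"bob,intro,pending\n"  # weekly update appended rows

path = "/home/paul/training_status/filename.csv"
stamped_path = "/home/paul/training_status/filename_20170406.csv"  # hypothetical name

# Same name, same first 256 bytes: same key, so the file looks already seen.
same_name_same_key = file_key(week1, path, True) == file_key(week2, path, True)

# New name with the salt enabled: different key, so the file looks new.
new_name_same_key = file_key(week1, path, True) == file_key(week2, stamped_path, True)

print(same_name_same_key, new_name_same_key)
```

In this model the salt only helps when the path changes, which is why an unchanged filename resumes from the stored offset instead of being read again from the top.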

If you want the same records to be consumed again each time the file is updated, the easy options are (A) put a timestamp in the file name, (B) add an update-timestamp column to each row of the CSV, or (C) add a timestamp to the header in the file.
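Option (A) could look something like the Python sketch below, run by whatever job produces the weekly file. The function name, the timestamp format, and the idea of staging the copy under a `.filepart` name (so the blacklist in the question skips it until it is complete) are my assumptions, not something from your setup.

```python
import shutil
from datetime import datetime
from pathlib import Path


def publish_with_timestamp(src: Path, monitored_dir: Path, now: datetime) -> Path:
    """Copy src into monitored_dir under a timestamped name (hypothetical helper)."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    final = monitored_dir / f"{src.stem}_{stamp}{src.suffix}"

    # Stage the copy under a .filepart name so a half-written file is not
    # picked up, then rename it into place once the copy has finished.
    partial = final.with_name(final.name + ".filepart")
    shutil.copyfile(src, partial)
    partial.replace(final)
    return final
```

Calling `publish_with_timestamp(Path("filename.csv"), Path("/home/paul/training_status"), datetime.now())` after each weekly export would give every update a new path, so with crcSalt=<SOURCE> each one is treated as a new file. Old copies would need to be cleaned up separately.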

Alternatively, assuming the "source of record" for the file is someplace safe, you could have Splunk delete the file when it has finished indexing it (a batch input with move_policy = sinkhole does this), so that any file found will be "new".

pkeller
Contributor

Thank you. This makes things very clear. - Cheers
