CSV Monitoring issues

pkeller
Contributor

[monitor:///home/paul/training_status/]
whitelist = (.csv$|.CSV$)
blacklist = .filepart$
index=training_index
sourcetype=training_status
crcSalt = <SOURCE>

The file gets updated once per week. In many cases, the file is not being fully consumed. The most recent update missed 19 records (which were consumed the last time the file was updated).

Splunkd.log shows:

04-06-2017 07:39:19.584 -0700 INFO WatchedFile - Will begin reading at offset=4234 for file='/home/paul/training_status/filename.csv

So, my uneducated guess would be that splunkd is seeing data that it's already consumed and thus ignoring those 19 records before it starts ingesting.

How do I prevent this? I thought setting crcSalt=<SOURCE> was supposed to handle this.

Thank you.

DalJeanis
Legend

crcSalt=<SOURCE> instructs Splunk to use the entire file path and name, in addition to the first 256 bytes of the file, to determine whether it has already indexed a file. If you are not changing the filename, then Splunk will start indexing wherever it left off (or wherever the data is changed).
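As a rough mental model of that check (not Splunk's actual implementation), the sketch below uses Python's zlib.crc32 to build a "have I seen this file?" key from the first 256 bytes, optionally mixing in the full path the way crcSalt=<SOURCE> is described above. The names file_key and demo, the sample file names, and the dictionary keys are all hypothetical and chosen only for illustration.

```python
import tempfile
import zlib
from pathlib import Path

INIT_CRC_LENGTH = 256  # number of leading bytes that feed the checksum


def file_key(path: Path, salt_with_source: bool) -> int:
    """Hypothetical stand-in for the key used to recognise a known file."""
    with path.open("rb") as fh:
        head = fh.read(INIT_CRC_LENGTH)
    crc = zlib.crc32(head)
    if salt_with_source:
        # Model of crcSalt=<SOURCE>: mix the full path into the checksum.
        crc = zlib.crc32(str(path).encode("utf-8"), crc)
    return crc


def demo() -> dict:
    """Compare keys for a weekly overwrite versus a renamed copy."""
    with tempfile.TemporaryDirectory() as tmp:
        # A head longer than 256 bytes, so later rows never affect the key.
        head = b"name,course,status\n" + b"x" * 300 + b"\n"

        weekly = Path(tmp) / "filename.csv"
        weekly.write_bytes(head + b"alice,101,done\n")
        key_week1 = file_key(weekly, salt_with_source=True)

        # Week 2: same name, same first 256 bytes, different rows further down.
        weekly.write_bytes(head + b"bob,102,done\n")
        key_week2 = file_key(weekly, salt_with_source=True)

        # The same content published under a timestamped name.
        dated = Path(tmp) / "filename_2017-04-06.csv"
        dated.write_bytes(head + b"bob,102,done\n")
        key_dated = file_key(dated, salt_with_source=True)

        unsalted_weekly = file_key(weekly, salt_with_source=False)
        unsalted_dated = file_key(dated, salt_with_source=False)

    return {
        "same_name_same_head": key_week1 == key_week2,
        "renamed_salted_differs": key_dated != key_week2,
        "renamed_unsalted_same": unsalted_dated == unsalted_weekly,
    }


if __name__ == "__main__":
    for label, value in demo().items():
        print(label, value)
```

In this model, overwriting filename.csv in place leaves the key unchanged, so the file is treated as already known and reading resumes at the stored offset. Only a new path (with the salt) or a changed file head produces a new key.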

If you want the same records to be consumed again each time the file is updated, then the easy ways are (A) put a timestamp in the file name, (B) add an update timestamp column to each row of the CSV, or (C) add a timestamp to the header in the file.
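Option (A) could be scripted on the side that produces the weekly file. The following is a minimal sketch, assuming a Python job delivers the CSV into the monitored directory; the function name publish_with_timestamp and the timestamp format are hypothetical. It copies under a temporary .filepart name first (which the blacklist in the question already excludes) and renames only once the copy is complete.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def publish_with_timestamp(
    src: Path, watch_dir: Path, now: Optional[datetime] = None
) -> Path:
    """Copy src into watch_dir under a name that embeds a timestamp.

    Hypothetical helper: the naming scheme is an assumption, not a
    Splunk requirement. Any name that differs per delivery would do.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    dest = watch_dir / f"{src.stem}_{stamp}{src.suffix}"

    # Copy under a ".filepart" name so a half-written file is never
    # picked up, then rename it into place once the copy has finished.
    partial = dest.with_name(dest.name + ".filepart")
    shutil.copyfile(src, partial)
    partial.rename(dest)
    return dest


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "filename.csv"
        source.write_text("name,course,status\nalice,101,done\n")
        watched = Path(tmp) / "training_status"
        watched.mkdir()
        print(publish_with_timestamp(source, watched))
```

Because every delivery gets a new path, each file is read from the start, so rows that repeat from week to week will be indexed again. That matches the "consumed again each time" behaviour described above, but searches may then need to deduplicate.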

Alternatively, assuming the "source of record" for the file is someplace safe, you could have Splunk delete the file when it is finished indexing (for example, with a [batch://...] input and move_policy = sinkhole instead of a monitor input), so that any file found will be "new".

pkeller
Contributor

Thank you. This makes things very clear. - Cheers
