
CSV file monitoring imports corrupted data: end-of-line not respected

Hi,

Using Splunk 6.5.1, either with monitoring, indexing, and search all on a single machine,
or with a dedicated forwarder feeding the indexer/search head machine.

I've set up monitoring of a directory where a binary updates a CSV file all day long:
2017.07.06.jobs
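
For reference, the monitor stanza in my inputs.conf looks roughly like this (/data/jobs is a stand-in for the real directory):

[monitor:///data/jobs]
# /data/jobs is a placeholder for the actual monitored directory
sourcetype = csv
# only pick up the daily .jobs files
whitelist = \.jobs$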
That CSV file has 31 fields on each line, like:

FIELDS: ID,PROJECT,USER,OSGROUP,DIR,ENV,TOOL,JOBNAME,PRIORITY,RESOURCES,SUBMITHOST,EXECHOST,SUBMITTIME,STARTTIME,ENDTIME

For the sourcetype, I'm using the built-in "csv", complemented with TIMESTAMP_FIELDS = SUBMITTIME.
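
In props.conf terms that amounts to roughly the following; the INDEXED_EXTRACTIONS line is what the built-in csv sourcetype already ships with:

[csv]
# the built-in csv sourcetype already enables structured-data parsing
INDEXED_EXTRACTIONS = csv
# take each event's timestamp from the SUBMITTIME column
TIMESTAMP_FIELDS = SUBMITTIME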

The data loaded into my index is corrupted: sometimes a line is only half-read, so only the first half of the fields is populated. The second half of the line is then treated as a new line, and its values populate the first half of the fields; i.e. I see EXECHOST name values in the PROJECT field.

I cannot find any warning of interest in splunkd.log,
except perhaps for:
07-06-2017 11:36:53.585 -0700 INFO WatchedFile - Resetting fd to re-extract header.

Any ideas?


SplunkTrust

Since the 'Resetting fd' message is info-level, it's probably not a big deal, but you may want to try putting a FIELD_NAMES attribute in props.conf to see if it keeps Splunk from re-reading the header.
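
Something along these lines, using the field names from your post (trimmed to the ones you listed):

[csv]
INDEXED_EXTRACTIONS = csv
TIMESTAMP_FIELDS = SUBMITTIME
# declare the header explicitly so Splunk doesn't need to re-read it from the file
FIELD_NAMES = ID,PROJECT,USER,OSGROUP,DIR,ENV,TOOL,JOBNAME,PRIORITY,RESOURCES,SUBMITHOST,EXECHOST,SUBMITTIME,STARTTIME,ENDTIME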

As for the partial events, I've seen that happen with multi-line events where the extra lines took a while to write. Adjusting the time_before_close setting in inputs.conf usually helps with that. Hard to believe it would take 3 seconds (the default) for your app to write a single line, though.
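
For example, in inputs.conf (the path is a stand-in for whatever you're monitoring, and 10 is just an arbitrary value above the 3-second default):

[monitor:///data/jobs]
sourcetype = csv
# wait longer before deciding a slow-written file is complete
time_before_close = 10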


Hi, adding the FIELD_NAMES attribute in props.conf did fix that message in splunkd.log.

I've also tried increasing time_before_close up to 65 seconds, and I am still seeing corrupted lines being read.
