Getting Data In

Is there a way to set up Splunk so it doesn't create duplicate events when indexing a CSV file that is overwritten with duplicate data?

dlovett
Path Finder

We have a process that identifies, captures, and writes high-priority/urgent events to a CSV file, overwriting the file every time the process executes. The contents may not change for days, yet Splunk indexes the whole file every time the process runs, even when the file's contents haven't changed.

The program that creates the CSV file calls an external vendor's SOAP web service. I could add logic to the program to persist a timestamp and use it to filter out previously seen service responses, but we prefer to index data as it is received/logged. I'm not sure which CRC values Splunk uses to decide how to read the file; my next step is to see whether those values show up in a debug/log file.

Has anybody run into this and found a solution? (I don't rule out user error; I'm fairly new to Splunk.)

Any help would be appreciated.


dlovett
Path Finder

Problem solved. The issue was a bug in the program: it was opening and closing the file multiple times during the process. I'm guessing that's why the CRC wasn't matching. I refactored the code and all is well.


Drainy
Champion

By default, Splunk builds a CRC from the first 256 bytes of a file, regardless of whether it is a CSV or another file type.
Is it possible that, even though the contents aren't changing, some other header information is changing?

If you have a look at http://docs.splunk.com/Documentation/Splunk/latest/admin/inputsconf, you can change the number of bytes the CRC is built from by setting initCrcLength.
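For example, a monitor stanza in inputs.conf could raise initCrcLength so the CRC covers more of the file than just the first 256 bytes. This is only a sketch; the file path here is hypothetical:

```ini
# inputs.conf -- example monitor stanza (path is hypothetical)
[monitor://D:\fd\myfile.csv]
sourcetype = csv
# Build the CRC from the first 1024 bytes instead of the default 256,
# so small files with similar headers are distinguished correctly.
initCrcLength = 1024
```

A larger initCrcLength helps when many files (or successive versions of one file) share the same opening bytes, which can make Splunk treat distinct content as already seen, or vice versa.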

Otherwise, you may be better off storing the CSV in a lookups folder and searching it as a lookup.
That way you could build a dashboard that uses | inputlookup to pull in the CSV and then search it for specific criteria.
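A minimal sketch of such a search, assuming the CSV has been placed in the app's lookups folder; the lookup file name and the field being filtered are hypothetical:

```
| inputlookup urgent_events.csv
| search priority="urgent"
| table _time host message
```

Because a lookup is read at search time rather than indexed, overwriting the CSV never creates duplicate events; each search simply sees the file's current contents.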

handygecko
Explorer

I'm also fairly new to Splunk and have been searching for an answer to this problem for the last two days. I see several similar questions with no clear answer. I am also using CSV files that are frequently overwritten with the same or new data, and each time Splunk re-indexes the data, creating duplicates. I'm getting the following in splunkd.log:

WatchedFile - Checksum for seekptr didn't match, will re-read entire file='D:\fd\myfile.csv'.

Not sure if this is related. Any help is appreciated!
