Getting Data In

Is there a way to set up Splunk so it doesn't create duplicate events when indexing a CSV file that is overwritten with duplicate data?

dlovett
Path Finder

We have a process that identifies, captures, and writes high-priority/urgent events to a CSV file, which is overwritten every time the process executes. The contents may not change for days. However, Splunk indexes the whole file every time the process runs, even when the contents haven't changed.

The program that creates the CSV file calls an external vendor SOAP web service. I could add a whole bunch of logic to the program to persist a timestamp and use it as a filter for new service responses, but we prefer to index data as it is received/logged. I'm not sure what CRC values Splunk is using to decide how to read the file. My next step is to see whether those values are available in a debug/log file.

Has anybody run into this and found a solution? (I don't rule out user error; I'm fairly new to Splunk.)

Any help would be appreciated.

1 Solution

dlovett
Path Finder

Problem solved. The issue was a bug in the program: it was opening and closing the file multiple times during the process, which I'm guessing is why the CRC wasn't matching. I refactored the code and all is well.



Drainy
Champion

By default, Splunk builds a CRC from the first 256 bytes of a file, regardless of whether it is a CSV or another format.
Is it possible that, even though the contents aren't changing, some other header information is changing?

Have a look at http://docs.splunk.com/Documentation/Splunk/latest/admin/inputsconf: you can change the number of bytes the CRC is built from by setting initCrcLength in inputs.conf.
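As a minimal sketch, a monitor stanza with a larger CRC window might look like the following. The file path here is hypothetical (borrowed from the log line quoted later in this thread); initCrcLength is the real setting, documented in inputs.conf.

```
# inputs.conf (sketch; the monitored path is an assumption for illustration)
[monitor://D:\fd\myfile.csv]
sourcetype = csv
# Build the CRC from the first 1024 bytes instead of the default 256,
# so two files that share only a common header are not treated as identical.
initCrcLength = 1024
```

Restart the forwarder (or reload the input) after editing inputs.conf for the change to take effect.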

Otherwise, you may be better off storing the CSV in a lookups folder and searching it as a lookup.
That way you could build a dashboard that uses | inputlookup to pull in the CSV and then search it for certain criteria.
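A minimal sketch of that approach, assuming the CSV has been placed in an app's lookups folder and has a severity column (both the filename and the field name are hypothetical):

```
| inputlookup urgent_events.csv
| search severity="high"
```

Because a lookup is read at search time rather than indexed, overwriting the file with identical contents never produces duplicate events.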

handygecko
Explorer

I'm also fairly new to Splunk and have been searching for an answer to this problem for the last two days. I see several similar questions with no clear answer. I am also using CSV files that are frequently overwritten with the same or new data, and each time, Splunk re-indexes the data, creating duplicate events. I'm getting the following in splunkd.log:

WatchedFile - Checksum for seekptr didn't match, will re-read entire file='D:\fd\myfile.csv'.

Not sure if this is related. Any help is appreciated!
