I have multiple applications that place login information (Logon Date/Time, Logoff Date/Time, userid, etc.) into existing CSV files (one per application). I am monitoring these files, but each time they are indexed, the old data is reindexed as well, so I have multiple events per logon. This is causing errors in reporting (I shouldn't have to dedup) and is ballooning the size of each index (wasting disk space).
My understanding is that when a file is being monitored, a beginning CRC and an end CRC are generated to fingerprint the file, along with a Seek Address.
Documentation states:
"A matching record for the CRC from the file beginning in the database, the content at the Seek Address location matches the stored CRC for that location in the file, and the size of the file is larger than the Seek Address that Splunk Enterprise stored. While Splunk Enterprise has seen the file before, data has been added since it was last read. Splunk Enterprise opens the file, seeks to Seek Address--the end of the file when Splunk Enterprise last finished with it--and starts reading the new from that point."
I take this to mean that existing events are not added and only new events are indexed. This isn't happening in my case.
I have read the questions concerning "duplicate data", and two settings keep appearing. One is followTail; reading the doc for it, I see "WARNING: Use of followTail should be considered an advanced administrative action." and "DO NOT leave followTail enabled in an ongoing fashion." This doesn't look to be a good fit for my problem.
The second is crcSalt. My question on that setting: if I do set it, does it ignore the Seek Address and cause the entire file to be indexed? That is where I am now.
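For reference, the change I am considering is just adding the setting to the monitor stanza, something like this (the path here is illustrative, not my actual input):

[monitor:///opt/app1/logons.csv]
crcSalt = <SOURCE>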
Thank you in advance for any help that can be provided.
Scott
I opened a support case, and the outcome is that when a new file is created, it is indexed starting at the beginning of the file. The Splunk UF is working as designed.
How does that line up with your comments that it was 1 file per application, with just lines being added? Where does the "new file is created" behavior come in?
Let's start at the basics: what does your inputs.conf look like for this monitor? And do you see any messages in splunkd.log related to these files or to that input stanza that could shed some light on why you're getting duplicates?
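For comparison, a plain CSV monitor stanza usually looks something like this (the path, index, and sourcetype are placeholders, not your actual values):

[monitor:///opt/app1/logons.csv]
index = logons
sourcetype = app1_logon_csv
disabled = false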
Hi @scottrunyon,
When the CSV file is updated by the applications, does it only append data? It looks like the CRC is mismatching when the application updates the existing CSV file, so Splunk thinks it is a new file and reindexes the whole file, which ends up producing the duplicate data.
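One quick way to check this outside of Splunk is to hash the first 256 bytes of the file (the region the beginning CRC covers by default, per initCrcLength) before and after your application writes to it; the path is a placeholder:

head -c 256 /opt/app1/logons.csv | md5sum

If the two hashes differ, the application is rewriting the head of the file rather than purely appending.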
You can use the btprobe command to check the CRC stored in the fishbucket, and you can compute the CRC of the CSV file before and after the CSV is updated.
The command below checks the CRC stored in the Splunk fishbucket:
$SPLUNK_HOME/bin/splunk cmd btprobe -d $SPLUNK_HOME/var/lib/splunk/fishbucket/splunk_private_db/ --file <CSV_FILE_PATH>
And the command below computes the CRC of the CSV file:
$SPLUNK_HOME/bin/splunk cmd btprobe --compute-crc <CSV_FILE_PATH>
Thank you for the troubleshooting tips.
To answer the question about where the files are being appended, new lines are added at the end of the file. I checked this morning and verified that. There are 14 new lines at the end of the file. I ran the btprobe compute command on the file before and after the update. The crc and decimal values are the same.
The key value in the fishbucket is the same as the crc value of the file.
What else do I need to look at?
Regards,
Scott
As you already mentioned, you are appending lines to the existing CSV. Just one more question: are you removing any lines from the CSV?
Additionally, have you checked the CRC in the fishbucket before and after the CSV update, using the command below?
$SPLUNK_HOME/bin/splunk cmd btprobe -d $SPLUNK_HOME/var/lib/splunk/fishbucket/splunk_private_db/ --file <CSV_FILE_PATH>
I ran the commands before and after the csv file was updated. The crc and decimal values of the compute command are the same. The results of the -d command show some differences. The key, fcrc and flen results are the same. The scrc, sptr, mdtm and wrtm values are different. The crc and key values match both times.
In this case it should not pick up the whole file again; it's worth raising a Splunk support case.
A case with Splunk has been opened.
The file that I looked at only adds entries at the end. I don't see any lines being removed from the file.
I ran the command for the fishbucket only after the file was updated. I will have to check tomorrow morning when the file is updated.