Getting Data In

How do I prevent duplicate data from being indexed from CSV files forwarded by a universal forwarder (UF)?

scottrunyon
Contributor

I have multiple applications that write login information (Logon Date/Time, Logoff Date/Time, userid, etc.) into existing CSV files (one per application). I am monitoring these files, but when they are indexed, the old data is reindexed as well, so I end up with multiple events per logon. This causes errors in reporting (I shouldn't have to do a dedup) and balloons the size of each index (wasting disk space).

My understanding is that when a file is being monitored, a beginning and an end CRC are generated to fingerprint the file, along with a Seek Address.

Documentation states:

"A matching record for the CRC from the file beginning in the database, the content at the Seek Address location matches the stored CRC for that location in the file, and the size of the file is larger than the Seek Address that Splunk Enterprise stored. While Splunk Enterprise has seen the file before, data has been added since it was last read. Splunk Enterprise opens the file, seeks to Seek Address--the end of the file when Splunk Enterprise last finished with it--and starts reading the new from that point."

I take this to mean that existing events are not reindexed and only new events are indexed. This isn't happening in my case.

I have read the questions concerning "duplicate data" and two settings keep appearing. One is "followTail"; reading the doc for this, I see "WARNING: Use of followTail should be considered an advanced administrative action." and "DO NOT leave followTail enabled in an ongoing fashion." This doesn't look like a good fit for my problem.

The second is "crcSalt". The question I have on that setting is if I do set it, does that ignore the Seek Address causing the entire file to be indexed, which is where I am now.

Thank you in advance for any help that can be provided.

Scott

1 Solution

scottrunyon
Contributor

I opened a support case and the outcome is that when a new file is created, it is indexed starting from the beginning of the file. The Splunk UF is working as designed.


FrankVl
Ultra Champion

How does that line up with your comments that it was 1 file per application, with just lines being added? Where does the "new file is created" behavior come in?


FrankVl
Ultra Champion

Let's start with the basics: what does your inputs.conf look like for this monitor? And do you see any messages in splunkd.log related to these files or to that input stanza that could shed some light on why you're getting duplicates?
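
For example, something along these lines (the path, index, and sourcetype here are just placeholders):

[monitor:///opt/apps/app1/logon_report.csv]
index = logon_audit
sourcetype = csv
disabled = false

For splunkd.log, grepping the forwarder's log for the file name (again, a placeholder below) and for the tailing components can show whether the file is being re-read from the start:

grep -i "logon_report.csv" $SPLUNK_HOME/var/log/splunk/splunkd.log
grep -iE "TailReader|WatchedFile" $SPLUNK_HOME/var/log/splunk/splunkd.log | tail -50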


harsmarvania57
Ultra Champion

Hi @scottrunyon,

When the CSV file is updated by the applications, is data only appended? It looks like the CRC is mismatching when the application updates the existing CSV file, so Splunk thinks it is a new file and reindexes the whole file, which ends up producing duplicate data.

You can use the btprobe command to check the CRC stored in the fishbucket, and you can compute the CRC of the CSV file before and after it is updated.

The command below checks the CRC stored in the Splunk fishbucket:

$SPLUNK_HOME/bin/splunk cmd btprobe -d $SPLUNK_HOME/var/lib/splunk/fishbucket/splunk_private_db/ --file <CSV_FILE_PATH>

And the command below computes the CRC of the CSV file:

$SPLUNK_HOME/bin/splunk cmd btprobe --compute-crc <CSV_FILE_PATH>
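
For example, you could save that output before and after the application appends its new rows, then compare the two (the CSV path here is just a placeholder):

$SPLUNK_HOME/bin/splunk cmd btprobe -d $SPLUNK_HOME/var/lib/splunk/fishbucket/splunk_private_db/ --file /opt/app1/logon_report.csv > /tmp/fishbucket_before.txt
# ... after the application has appended its new rows ...
$SPLUNK_HOME/bin/splunk cmd btprobe -d $SPLUNK_HOME/var/lib/splunk/fishbucket/splunk_private_db/ --file /opt/app1/logon_report.csv > /tmp/fishbucket_after.txt
diff /tmp/fishbucket_before.txt /tmp/fishbucket_after.txt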

scottrunyon
Contributor

Thank you for the troubleshooting tips.

To answer the question about how the files are updated: new lines are appended at the end of the file. I checked this morning and verified that; there are 14 new lines at the end of the file. I ran the btprobe compute command on the file before and after the update, and the crc and decimal values are the same.
The key value in the fishbucket is the same as the crc value of the file.

What else do I need to look at?

Regards,

Scott


harsmarvania57
Ultra Champion

As you already mentioned that you are appending lines to the existing CSV, just one more question: are you removing any lines from the CSV?

Additionally, have you checked the CRC in the fishbucket before and after the CSV update using the command below?

 $SPLUNK_HOME/bin/splunk cmd btprobe -d $SPLUNK_HOME/var/lib/splunk/fishbucket/splunk_private_db/ --file <CSV_FILE_PATH>

scottrunyon
Contributor

I ran the commands before and after the CSV file was updated. The crc and decimal values from the compute command are the same. The results of the -d command show some differences: the key, fcrc, and flen values are the same, while the scrc, sptr, mdtm, and wrtm values are different. The crc and key values match both times.


harsmarvania57
Ultra Champion

In this case it should not pick up the whole file again; it's worth raising a Splunk support case.


scottrunyon
Contributor

A case with Splunk has been opened.


scottrunyon
Contributor

The file that I looked at is only adding entries at the end of the file. I don't see any lines being removed from the file.

I ran the command for the fishbucket only after the file was updated. I will have to check tomorrow morning when the file is updated.
