Getting Data In

How to index updates from an existing CSV file?

Juan_Leon
Explorer

Hello, I have a CSV file with two fields (field1, field2). The file is monitored and its content is indexed; however, the file is updated on a daily basis and I want to index only the changes.

Example : 

Day 1 

abcd,100122

abde,100122

abcdf,100122

 

Day 2 (where the last 2 lines are new in the CSV file and need to be ingested)

abcd,100122

abde,100122

abcdf,100122

bcda,100222

bcdb,100222

 

 


PickleRick
SplunkTrust

The short answer is: no.

The long answer is that with a file-based input you can either read the lines that are appended to a file (if the file appears to be in a known state) or re-read the complete file (if Splunk decides its contents changed, so it treats it as a new file). Remember that file-based inputs are meant for log-type files, which are appended to over their life cycle.
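(For illustration, a monitor input for such a file could look like the following sketch; the path, index, and sourcetype names are all placeholders:)

```
# inputs.conf on the forwarder (illustrative only)
[monitor:///opt/data/fields.csv]
index = tmp_csv
sourcetype = csv_feed

# props.conf, so that field1/field2 are extracted from the headerless CSV
[csv_feed]
INDEXED_EXTRACTIONS = csv
FIELD_NAMES = field1,field2
```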

So you can't "diff" a file using Splunk's built-in mechanisms. Also remember that after a forwarder reads data and pushes it to an indexer, the input on the forwarder has no knowledge of what happened to the events: which were filtered out or modified during indexing, or which duplicate what is already in the index. (Even the indexer doesn't know that; if it receives the same event twice, it will happily index it twice. There is no deduplication on indexers.) So in order to ingest only the changes from somewhere in the middle of the file, your input would have to store the whole contents of the file from the previous known state to be able to perform such an analysis and find the differences. And of course there are further issues, like "what if the same line was repeatedly added and removed", and so on.

So your task may seem like an easy feat, but it is not. You could, however, monitor the file, reingest it as a whole into a temporary index each time someone updates it, and then run a scheduled search that finds only the entries that do not yet exist and collects the results into a destination index, along the lines of the sketch below. Ugly? Yes. Will it incur additional license usage? Yes (though with typical CSV sizes that might not matter much).
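A rough sketch of that scheduled search, assuming the file is reingested into a temporary index named tmp_csv, the destination index is named dest_csv, and field1/field2 are extracted (all of these names are placeholders):

```
index=tmp_csv
| dedup field1 field2
| search NOT [ search index=dest_csv | dedup field1 field2 | fields field1 field2 ]
| collect index=dest_csv
```

Keep in mind that subsearches are capped (10,000 results by default), so for a large destination index you would want a lookup-based comparison instead.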


gcusello
SplunkTrust

Hi @Juan_Leon,

the only way is to pre-process the CSV file with a script that uses the Splunk REST API to run a search against the already indexed data and drops the records that are already indexed, as in the sketch below.
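A minimal sketch of such a script, assuming the Python requests library, a destination index named csv_idx, and placeholder host and credentials; it writes the not-yet-indexed rows to a second file that Splunk monitors:

```python
# Sketch only: host, credentials, and index name are placeholders.
import csv
import io

import requests

SPLUNK = "https://splunk.example.com:8089"  # management port (assumption)
AUTH = ("admin", "changeme")                # placeholder credentials


def already_indexed():
    """Fetch the (field1, field2) pairs that are already in the index."""
    resp = requests.post(
        f"{SPLUNK}/services/search/jobs/export",
        auth=AUTH,
        verify=False,  # lab only; verify certificates in production
        data={
            "search": "search index=csv_idx | dedup field1 field2 | table field1 field2",
            "output_mode": "csv",
        },
    )
    resp.raise_for_status()
    reader = csv.DictReader(io.StringIO(resp.text))
    return {(row["field1"], row["field2"]) for row in reader}


def write_delta(src_path, dst_path):
    """Write only the rows that are not yet indexed to a file Splunk monitors."""
    seen = already_indexed()
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if tuple(row[:2]) not in seen:
                writer.writerow(row)


if __name__ == "__main__":
    write_delta("data.csv", "delta.csv")
```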

But just one question: why do this?

if you reindex the whole file, you always have the latest situation, and you can trace any updates and display the state at a given point in time. In my opinion that's better!

Ciao.

Giuseppe
