Hello, I have a CSV file with two fields (field1,field2). The file is monitored and its content is indexed; however, the file is updated on a daily basis and I want to index only the changes.
Example:
Day 1
abcd,100122
abde,100122
abcdf,100122
Day 2 (the last two lines are new in the CSV file and need to be ingested)
abcd,100122
abde,100122
abcdf,100122
bcda,100222
bcdb,100222
The short answer is: no.
The long answer is that with a file-based input you can either read the lines that are appended to a file (if the file appears to be in a known state) or re-read the complete file (if Splunk decides its contents changed and treats it as a new file). Remember that file-based inputs are meant for log-type files, which are appended to over their life cycle.
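For context, such a file is typically consumed with a monitor stanza along these lines in inputs.conf (the path, index, and sourcetype here are made up); the input only tracks a read pointer and a checksum of the file's head, not individual lines:

    [monitor:///opt/data/daily.csv]
    index = main
    sourcetype = daily_csv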
So you can't "diff" a file using Splunk's built-in mechanisms. Also remember that after a forwarder reads data and pushes it to an indexer, the input on the forwarder has no knowledge of what happened to the events: which were filtered out or modified during indexing, or which duplicate what is already in the index. (Even the indexer doesn't know that; if it gets the same event twice, it will happily index it twice. There is no deduplication on indexers.) So in order to ingest only changes from somewhere in the middle of the file, your input would have to store the whole contents of the file from the previous known state to be able to perform such an analysis and find the differences. There are of course further issues, such as what happens if the same line is repeatedly added and removed, and so on.
So your task may seem like an easy feat, but it is not. You could, however, monitor the file, reingest it as a whole into a temporary index each time someone updates it, and then run a scheduled search that finds only the not-yet-existing entries and collects the results into a destination index. Ugly - yes. Will it incur additional license usage - yes (but with typical CSV sizes that might not matter much).
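A minimal sketch of such a scheduled search, assuming the file is monitored into a temporary index called csv_staging and the deduplicated events should end up in csv_final (the index names, the sourcetype, and the extracted field names are all assumptions here):

    index=csv_staging sourcetype=daily_csv
    | search NOT [ search index=csv_final | dedup field1 field2 | fields field1 field2 ]
    | collect index=csv_final

The subsearch returns the field1/field2 pairs already present in the destination index, and the NOT drops the matching staging events before collect writes the rest. Keep in mind that subsearches are capped (10,000 results by default), so for a large CSV a lookup-based comparison would be safer.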
Hi @Juan_Leon,
the only way is to pre-process the CSV file with a script that uses the Splunk REST API to run a search over the already indexed data and then removes the already indexed records from the file.
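As a sketch, such a script could run a search like the one below (for example through the /services/search/jobs/export REST endpoint with output_mode=csv) to pull the key pairs that are already indexed, and then drop the matching lines from the file before Splunk picks it up; the index, source path, and field names are assumptions:

    index=main source="/opt/data/daily.csv"
    | stats count by field1, field2
    | fields field1, field2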
But just one question: why do this at all?
If you reindex the whole file, you always have the latest state, you can trace any updates, and you can display the situation as of a defined time; in my opinion that's better!
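For example, with the whole file reindexed every day, a search along these lines shows the most recent state of each record (the index, source path, and field names are assumptions, and this relies on lines only ever being added, never removed); add earliest/latest to see the situation as of a given time:

    index=main source="/opt/data/daily.csv"
    | dedup field1 field2
    | table _time field1 field2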
Ciao.
Giuseppe