Splunk Search

How to extract and report on field values from a large CSV file in a directory on my Splunk server that is updated daily?

apnetmedic
Explorer

I have a bit of a non-traditional application, but one which Splunk is pretty good at 95% of:

There's a big file (call it bigReport.csv), updated daily by a business intelligence system and deposited in a directory on my Splunk server. Let's say it's 25,000 entries showing status of orders. On any given day, lines may be added to the end, changed (order status updated, ship dates changed), or deleted (order is complete and falls out of the search criteria of the BI report). The file has the same name every day.

The mission is to take the data in this file, extract some values from fields, do some lookups against other reports (which have similar ingest problems), and produce some summary data.

I had started by doing a watched directory for this, and ingesting the file. In order to do that, I had to set props.conf CHECK_METHOD = modtime, in case the beginning and the end of the file stayed the same and the CRC's wouldn't show changes.

Pulling only the latest set of data is a challenge too. The file gets updated at approximately the same time every day, but not exactly. So doing earliest=-1d@d might not work, depending on what time of day you access a report.

I came up with:

source=bigReport.csv earliest=-2d | eventstats max(_time) as LatestTime 
| where _time > LatestTime-30 | rest-of-search

This is pretty expensive, though. Especially when multiple files are involved in the same way, requiring subsearches to be performed in the same way.

So the bottom line is.. is there a better way to do this? |inputcsv seems tempting, though that has its own issues in terms of data access (those CSVs are readable by any user with search access who can find the file name)

0 Karma
1 Solution

somesoni2
Revered Legend

For data ingested in Splunk, try this (I would always suggest to provide the index and sourcetype name as well in all your queries)

index=yourindex sourcetype=yoursourcetype source=bigReport.csv [| tstats max(_time) as earliest WHERE index=yourindex sourcetype=yoursourcetype source=bigReport.csv earliest=-2d@d | eval earliest=relative_time(earliest,"@d") ] | rest of the search

The subsearch will get you the date, as earliest, of the latest report. Also, it uses tstats command and works on metadata so is much faster.

You can also explore option of lookup table files OR KV store for this requirement as well.

View solution in original post

somesoni2
Revered Legend

For data ingested in Splunk, try this (I would always suggest to provide the index and sourcetype name as well in all your queries)

index=yourindex sourcetype=yoursourcetype source=bigReport.csv [| tstats max(_time) as earliest WHERE index=yourindex sourcetype=yoursourcetype source=bigReport.csv earliest=-2d@d | eval earliest=relative_time(earliest,"@d") ] | rest of the search

The subsearch will get you the date, as earliest, of the latest report. Also, it uses tstats command and works on metadata so is much faster.

You can also explore option of lookup table files OR KV store for this requirement as well.

apnetmedic
Explorer

I like it! I left out index and sourcetype for brevity in the original question. I've been having other ingest problems with a file this large, which has made me question the whole method. Sounds like I can still make it work.

0 Karma
Get Updates on the Splunk Community!

Enterprise Security Content Update (ESCU) | New Releases

In December, the Splunk Threat Research Team had 1 release of new security content via the Enterprise Security ...

Why am I not seeing the finding in Splunk Enterprise Security Analyst Queue?

(This is the first of a series of 2 blogs). Splunk Enterprise Security is a fantastic tool that offers robust ...

Index This | What are the 12 Days of Splunk-mas?

December 2024 Edition Hayyy Splunk Education Enthusiasts and the Eternally Curious!  We’re back with another ...