I have a bit of a non-traditional application, but one which Splunk is pretty good at 95% of:
There's a big file (call it bigReport.csv), updated daily by a business intelligence system and deposited in a directory on my Splunk server. Let's say it's 25,000 entries showing status of orders. On any given day, lines may be added to the end, changed (order status updated, ship dates changed), or deleted (order is complete and falls out of the search criteria of the BI report). The file has the same name every day.
The mission is to take the data in this file, extract some values from fields, do some lookups against other reports (which have similar ingest problems), and produce some summary data.
I started by setting up a watched directory and ingesting the file. To make that work, I had to set CHECK_METHOD = modtime in props.conf, since if the beginning and end of the file stay the same between updates, the CRC checks won't detect the change.
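For reference, a minimal props.conf stanza for this looks like the following (the source path here is illustrative; adjust it to match your monitor input):

[source::.../bigReport.csv]
CHECK_METHOD = modtime

The ... wildcard in a source:: stanza matches any path segments, so this applies regardless of where the BI system drops the file.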
Pulling only the latest set of data is a challenge too. The file gets updated at approximately the same time every day, but not exactly, so earliest=-1d@d might miss the newest copy or pick up two days' worth, depending on what time of day you run the search.
I came up with:
source=bigReport.csv earliest=-2d | eventstats max(_time) as LatestTime
| where _time > LatestTime-30 | rest-of-search
This is pretty expensive, though, especially when multiple files are involved and each one needs the same eventstats treatment, which turns into a pile of subsearches.
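One cheaper variant I've considered (a sketch, not tested against this data) is to compute the latest timestamp in a subsearch and hand it back as a time bound, so the outer search only scans the final window instead of annotating two days of events with eventstats:

source=bigReport.csv
    [ search source=bigReport.csv earliest=-2d
      | stats max(_time) as maxtime
      | eval earliest=maxtime-30
      | return earliest ]
| rest-of-search

The subsearch still reads two days of events once, but | return earliest passes earliest=&lt;epoch&gt; to the outer search, so the main pipeline is bounded to the last 30 seconds of ingested data.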
So the bottom line is... is there a better way to do this? |inputcsv seems tempting, though it has its own issues in terms of data access (those CSVs are readable by any user with search access who can find the file name).
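For what it's worth, the |inputcsv version would look something like this (field names are hypothetical), with the file dropped into $SPLUNK_HOME/var/run/splunk/csv:

| inputcsv bigReport.csv
| search order_status="OPEN"
| stats count by order_status

Since inputcsv reads the file at search time, you always get the current copy and the time-window gymnastics go away entirely; the trade-off is exactly the access-control issue above, since the CSV sits outside index-level permissions.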
I like it! I left out index and sourcetype for brevity in the original question. I've been having other ingest problems with a file this large, which has made me question the whole method. Sounds like I can still make it work.