I have a bit of a non-traditional application, but one which Splunk is pretty good at 95% of:
There's a big file (call it bigReport.csv), updated daily by a business intelligence system and deposited in a directory on my Splunk server. Let's say it's 25,000 entries showing status of orders. On any given day, lines may be added to the end, changed (order status updated, ship dates changed), or deleted (order is complete and falls out of the search criteria of the BI report). The file has the same name every day.
The mission is to take the data in this file, extract some values from fields, do some lookups against other reports (which have similar ingest problems), and produce some summary data.
I started by setting up a watched directory and ingesting the file. To make that work, I had to set CHECK_METHOD = modtime in props.conf, since if the beginning and end of the file stay the same between updates, the CRC checks won't detect the change.
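For reference, a minimal props.conf stanza for this looks like the following (the source path here is illustrative; adjust it to match your monitor input):

[source::.../bigReport.csv]
CHECK_METHOD = modtime

The ... wildcard in a source:: stanza matches any path segments, so this applies regardless of where the BI system drops the file.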
Pulling only the latest set of data is a challenge too. The file gets updated at approximately the same time every day, but not exactly, so earliest=-1d@d might miss the newest copy or pick up two days' worth, depending on what time of day you run the search.
I came up with:
source=bigReport.csv earliest=-2d | eventstats max(_time) as LatestTime
| where _time > LatestTime-30 | rest-of-search
This is pretty expensive, though, especially when multiple files are involved and each one needs the same eventstats treatment, which turns into a pile of subsearches.
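One cheaper variant I've considered (a sketch, not tested against this data) is to compute the latest timestamp in a subsearch and hand it back as a time bound, so the outer search only scans the final window instead of annotating two days of events with eventstats:

source=bigReport.csv
    [ search source=bigReport.csv earliest=-2d
      | stats max(_time) as maxtime
      | eval earliest=maxtime-30
      | return earliest ]
| rest-of-search

The subsearch still reads two days of events once, but | return earliest passes earliest=&lt;epoch&gt; to the outer search, so the main pipeline is bounded to the last 30 seconds of ingested data.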
So the bottom line is... is there a better way to do this? |inputcsv seems tempting, though it has its own issues in terms of data access (those CSVs are readable by any user with search access who can find the file name).
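For what it's worth, the |inputcsv version would look something like this (field names are hypothetical), with the file dropped into $SPLUNK_HOME/var/run/splunk/csv:

| inputcsv bigReport.csv
| search order_status="OPEN"
| stats count by order_status

Since inputcsv reads the file at search time, you always get the current copy and the time-window gymnastics go away entirely; the trade-off is exactly the access-control issue above, since the CSV sits outside index-level permissions.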
I like it! I left out index and sourcetype for brevity in the original question. I've been having other ingest problems with a file this large, which has made me question the whole method. Sounds like I can still make it work.