I'm having a bit of trouble trying to backfill a couple of days in my summary index from a query using the collect command. Events are returned by the query and placed in the summary index, but for some reason Splunk isn't recognizing any of the fields that are already applied to that sourcetype (even though when you summary index data, it's saved with a sourcetype of stash). If I populate the summary index from a saved search, everything is fine; it's only when I execute a search ad hoc (from the search app, for instance) and use collect to save the data that the fields are lost.
Here's an example of the same query, trying to backfill data from my _raw index for October 1st-2nd:
index="cdn_download_logs" (resource_relative_uri="*.exe" OR resource_relative_uri="*.msi" OR resource_relative_uri="*.dmg") earliest=10/01/2012:00:00:00 latest=10/02/2012:00:00:00
| eval lastFileByte=filesize-1
| eval endByteInt=if(endByte>0, tonumber(endByte,10), lastFileByte)
| eval startByteInt=if(startByte>0, tonumber(startByte,10), 0)
| eval leftToSend=((endByteInt-startByteInt)-sc_bytes)
| eval downloadStatus=if(endByteInt==lastFileByte AND leftToSend<=0, "SUCCESS", "FAILURE")
| search downloadStatus="SUCCESS"
| collect index="summary_download_success_events"
Is there a subtle nuance I'm missing that's causing my field extractions not to be applied? The weirdest part is that data returned from a saved search and added to a summary index works perfectly. I'm not sure whether the eval expressions in my example query above are causing some unwanted behavior (although this is the exact same query I have running in the scheduled search that works).
I also know there's a Python script somewhere in the Splunk directory written to assist in backfilling summary index data. Is this a better option? If so, why?
Any help/feedback is much appreciated.
Are your field extractions tied to sourcetype? If so, did you check which sourcetype you're getting for the events when you've collected them to a summary index?
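For instance, a quick way to verify what sourcetype the collected events actually received (the index name below is the one from the question):

```
index="summary_download_success_events" | stats count by sourcetype
```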
You can also use the fields command to select exactly which fields you wish to carry into the summary data.
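As a sketch, using the field names from the query in the question (the exact field list is just an illustration):

```
... | search downloadStatus="SUCCESS"
| fields resource_relative_uri, sc_bytes, downloadStatus
| collect index="summary_download_success_events"
```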
The Python script is 'fill_summary_index.py' and can be used to backfill summary data. It iterates through the scheduled runtimes of the searches you name and runs them as though they were being run at those historical times. If you've added a new summary indexing search and want to have data available historically, you can use this script.
Yes, they are applied by sourcetype. The weird aspect of that part of the issue is that the sourcetype applied to summary indexed data (stash) is also applied by the scheduled search, which adds data to the summary index with the original fields in place. I originally thought the sourcetype was the issue.
The problem I've been having with the backfill script is that I cannot get it to parse earliest/latest time parameters that are absolute dates, as opposed to dynamic dates like '5d@d'. I get an error when I try to run a command like:
.\splunk cmd python fill_summary_index.py -app search -name "SummaryIndex_DownloadSuccessEvents" -owner admin -et "10/01/2012:00:00:00" -lt "10/02/2012:00:00:00" -dedup true -auth admin:password
because it fails with: "Failed to get list of scheduled times for saved search". I think this is because the scheduled search uses -et -day@day -lt @day.
It turns out that the fields are lost because my field extractions are applied to the sourcetype of the raw indexed data (which I pull from to build the summary index). This is weird, because when the scheduled search I described runs to backfill the summary index each day, the fields are not lost. However, if I run fill_summary_index.py, or run a query in the search app ending in | collect index="summaryIndex", the fields are.
To fix this, I had to recreate my field extraction regexes so that they are also applied to the "stash" sourcetype.
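A minimal sketch of what that duplication might look like in props.conf (the stanza names, extraction name, and regex here are illustrative assumptions, not the poster's actual configuration):

```
# props.conf -- the same inline extraction, applied to both
# the original sourcetype and the summary-index sourcetype

[cdn_download_log]
EXTRACT-byte_range = startByte=(?<startByte>\d+)\s+endByte=(?<endByte>\d+)

[stash]
EXTRACT-byte_range = startByte=(?<startByte>\d+)\s+endByte=(?<endByte>\d+)
```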
Wait, what? The stash sourcetype is specifically for the summary indexed data itself. Typically, when data is summary indexed, the events are written as key=value pairs, so Splunk should be extracting them automatically.
That is also what I would assume; however, Splunk does not seem to be behaving this way.
I'm experiencing the same behaviour in Splunk 6: when using collect, only _raw is included in the summary index.
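A possible workaround sketch, assuming collect serializes the result fields as key=value pairs once _raw is no longer present in the results (which is how stats-based scheduled summary searches end up producing key=value stash events); the table field list below is illustrative:

```
index="cdn_download_logs" ...
| eval downloadStatus=...
| table _time, resource_relative_uri, sc_bytes, downloadStatus
| collect index="summary_download_success_events"
```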