I'm trying to write instructions for some people to set up an app while onsite, and one of the steps involves backfilling a lot of summary index data.
I've followed the documented steps for the script Splunk provides for this (fill_summary_index.py):
http://www.splunk.com/base/Documentation/4.2.1/Knowledge/Managesummaryindexgapsandoverlaps
But this process is incredibly slow, much slower than I would expect. One big 'stats count by foo bar' over my entire test dataset takes only about 30 seconds, but at the current rate running this backfill script against the same data is going to take an hour or more for each saved search, which is crazy. I expected the backfill to take a little longer than one giant search, but not more than a hundred times longer. This is a big problem, because if it takes hours on this tiny dataset it'll take days on bigger data, which isn't OK at all.
So now I'm thinking maybe advanced users aren't supposed to use the Python script? With the old-school collect command, a bit of 'stats count by foo bar', a dash of bin to get the timestamps, and maybe a dash of addinfo to add the search time, I could probably generate the entire run of backfilled events with one long-running, backgrounded search.
http://www.splunk.com/base/Documentation/latest/SearchReference/Collect
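For what it's worth, here is a minimal sketch of the kind of search string that idea would produce, with Python used only to assemble it. The base search, the summary index name, the 1h span, the time range, and the function name build_backfill_search are all placeholders rather than anything from a real app, and the events written this way may not carry the same fields that the scheduled summary search would write.

```python
def build_backfill_search(base_search, by_fields, span="1h",
                          summary_index="summary",
                          earliest="-90d@h", latest="@h"):
    """Assemble one long-running search that buckets events by time,
    counts them per bucket, and writes the rows to a summary index."""
    if not by_fields:
        raise ValueError("need at least one split-by field")
    pipeline = [
        # Base search plus the time range to backfill (placeholder values).
        f"{base_search} earliest={earliest} latest={latest}",
        # Snap each event's _time to the start of its bucket.
        f"bin _time span={span}",
        # One summary row per bucket and field combination.
        f"stats count by _time {' '.join(by_fields)}",
        # Add info_* fields such as the search time and time bounds.
        "addinfo",
        # Write the resulting rows into the summary index.
        f"collect index={summary_index}",
    ]
    return " | ".join(pipeline)


print(build_backfill_search("index=main sourcetype=access_combined",
                            ["foo", "bar"]))
```

With the defaults above this prints: index=main sourcetype=access_combined earliest=-90d@h latest=@h | bin _time span=1h | stats count by _time foo bar | addinfo | collect index=summary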
At this point, though, I'm sure someone's way ahead of me, which is what brings me here. Does anyone have an emerging best practice they care to share? Or have I just completely missed a piece of documentation? Thanks.
Thanks for this. I've always backfilled using the Python script, but it is incredibly buggy and, as you mentioned, also slow. I had never heard of the collect command until now. I'll be using it from now on; it's infinitely better than the script.
Great questions & observations, Nick.
I still use the shipped script and run into the same problems you mention. I get slightly quicker results by making the machine do as much work as it can, setting the concurrency flag (-j, usually to 8). I use a text file of search names (-namefile) when order is important. Most of my summaries have weeklies built on dailies, which are in turn built on hourlies. The dailies and weeklies are quick, but the hourlies take a good deal of time.
I like your approach if it adds speed. I'm trying to think of how deduplication would work with that method, as I rely on the -dedup flag to avoid re-summarizing what has already been summarized.
Basically, I run a command like this...
$SPLUNK_HOME/bin/splunk cmd python $SPLUNK_HOME/bin/fill_summary_index.py -app test_app -namefile $SPLUNK_HOME/etc/apps/test_app/bin/summary.jobs -et -90d -lt now -j 8 -dedup true
where summary.jobs looks like this:
dashboard_a_base_summary-1h
dashboard_b_base_summary-1h
dashboard_c_base_summary-1h
dashboard_d_base_summary-1h
dashboard_a_base_summary-1d
dashboard_b_base_summary-1d
dashboard_c_base_summary-1d
dashboard_d_base_summary-1d
dashboard_a_base_summary-1w
dashboard_b_base_summary-1w
dashboard_c_base_summary-1w
dashboard_d_base_summary-1w
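If the file is maintained by hand it is easy to get the tiers out of order, so here is a small sketch of generating the list with hourlies first, then dailies, then weeklies. It assumes every saved search name ends in a span suffix such as -1h, -1d, or -1w, as in the list above; the helper names span_key and order_jobs are made up for illustration. Whether the script strictly honors file order when run with -j is also an assumption, not something this sketch checks.

```python
import re

# Hourlies feed dailies, and dailies feed weeklies, so sort in that order.
UNIT_RANK = {"h": 0, "d": 1, "w": 2}


def span_key(job_name):
    """Return a sort key taken from the trailing span suffix of a name."""
    match = re.search(r"-(\d+)([hdw])$", job_name.strip())
    if match is None:
        raise ValueError(f"no span suffix like -1h, -1d or -1w in {job_name!r}")
    count, unit = match.groups()
    return (UNIT_RANK[unit], int(count))


def order_jobs(job_names):
    """Sort names by span; sorted() is stable, so names within the same
    tier keep the relative order they were given in."""
    return sorted(job_names, key=span_key)


jobs = [
    "dashboard_a_base_summary-1w",
    "dashboard_a_base_summary-1d",
    "dashboard_b_base_summary-1h",
    "dashboard_a_base_summary-1h",
]
# One name per line, ready to be saved as the namefile.
print("\n".join(order_jobs(jobs)))
```

The output can be written to summary.jobs and passed to the script with -namefile as in the command above.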