At every set interval (30 minutes while testing), a scheduled search is issued to get the min, max, and mean values of some perf counters. Those values are sent to a summary index, and this is where strange things start to happen.
In the live data indexes, those perf counters keep coming in frequently; nothing is missing. If I run the summary search manually, I always get the right data, but when the Splunk scheduler runs it, data lands in the summary index erratically. Here is the disturbing pattern:
- Initially, the summary data was less than a day "late": the last 30-minute samples would show up in the summary index around 18 hours behind.
- A few days later, the summary index was around two days behind.
- Still later, it was four days behind.
- All of a sudden, with no configuration change, most data was 4+ days late, with an isolated "peak" that was only around 15 hours late.
- Now it is again, at best, two days late.
When I set this up in the lab environment, I had no issues, and it ran just as it was supposed to. However, with the exact same mechanism set up in production, we get this strange behavior.
Here, it's not a matter of backfilling older data; the "live" data is available.
I've seen several similar issues reported, including one that recommends deleting the summary data for the time frame and then using backfill instead.
One last point: the scheduler itself also behaves erratically and does not respect the configured schedule, neither the frequency nor the time frame in which the search should run. Even when it does run almost on schedule, the summarized data is still inserted way late.
It would be helpful to see what settings you are using with your summary search. How long does the search usually take to run? Did you set a schedule window for the search? Is this the only summary search you have that is displaying this behavior?
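For reference, a summary-indexing saved search is typically defined along these lines in savedsearches.conf. This is only a sketch: the stanza name, cron schedule, target index, and search string here are placeholders, not your actual configuration:

```
[perf_counter_summary]
enableSched = 1
cron_schedule = */30 * * * *
# Lets the scheduler defer the run by up to 10 minutes instead of skipping it
schedule_window = 10
action.summary_index = 1
action.summary_index._name = perf_summary
search = index=perf earliest=-30m@m latest=@m | stats min(Value) AS min max(Value) AS max avg(Value) AS mean BY counter
```

A schedule_window is particularly relevant here: without one, a busy scheduler may skip runs outright rather than delaying them.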
There are a couple of places to look, if you haven't already.
Check the spool directory ($SPLUNK_HOME/var/spool/splunk) on the server executing the summary search (I assume it's a search head). This is where the summarized data sits before it is forwarded to the indexers. If files are backing up here, that could indicate a bottleneck in moving the data over. If that is the case, check how much data the search head is forwarding, and then check the pipelines on the indexers (the DMC has good dashboards for this).
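To gauge how the search head's forwarding is keeping up, you can look at its tcpout metrics in metrics.log. This is a sketch; the host filter is a placeholder for your search head's name, and the exact metric field names can vary by Splunk version:

```
index=_internal host=<your_search_head> source=*metrics.log* group=tcpout_connections
| timechart span=5m avg(_tcp_KBps) by destIp
```

If throughput flatlines while the spool keeps growing, the bottleneck is on the forwarding path rather than in the search itself.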
Check the internal logs for errors associated with your summary search (index=_internal sourcetype=scheduler). Look at the status field of the search (success, continued, skipped, etc.). You can also search (index=_internal sourcetype=splunkd (searchscheduler OR dispatchmanager)), which should show you any errors related to scheduled searches, and you can review the search scheduler metrics (group=searchscheduler). You might be hitting either a per-user or a per-system concurrent search limit, and that would cause erratic behavior in your summary search performance.
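Concretely, something like the following should show how the scheduler is treating the search over time (substitute your saved search's actual name):

```
index=_internal sourcetype=scheduler savedsearch_name="<your summary search>"
| timechart span=1h count by status
```

And to see overall scheduler pressure, this sketch assumes the dispatched and skipped fields emitted in the searchscheduler metrics group:

```
index=_internal source=*metrics.log* group=searchscheduler
| timechart span=30m sum(dispatched) AS dispatched sum(skipped) AS skipped
```

A rising skipped count alongside erratic summary lag is a strong hint that you are hitting search concurrency limits.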