We use summary indexing to improve search performance and to avoid unnecessary lookups and field extractions. It is supposed to run every 5 minutes and summarize the previous 5 minute window.
We schedule the saved search with these values:
earliest = -10m@m
latest = -5m@m
frequency = every 5 minutes
When investigating
index="_internal" sourcetype="scheduler"
it becomes apparent that the scheduler is not firing our saved searches reliably every 5 minutes. Sometimes a search only starts 6 or 7 minutes after the previous one. This creates small gaps in the data (of 1 or 2 minutes) that are impossible to backfill with the backfill script provided, and it renders the summary index useless.
Is there a way to snap to a more accurate 5 minute window? Or a way to force the scheduler to run more reliably?
What's your setting for realtime_schedule in your savedsearches.conf entry?
I think in more recent releases, creating a new summary-indexing scheduled saved search now causes realtime_schedule to be set to 0. Generally this is what you want, since it means that any missed runs get executed later (for example, after a splunkd restart). It also means that these saved searches can be delayed; however, this should not result in gaps in your summary index. If anything, it should help prevent them.
If you search your summary index for the summary events in question, you should see that search_now always reflects the precise 5 minute interval you have scheduled your searches for, whereas info_search_time reflects the real (wall clock) time at which the search was actually kicked off. So basically, even though your search was delayed by a minute or two (which does seem high), you shouldn't be losing any data, because each search should still cover the originally designated window.
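As a rough sketch of that check (the index name "summary" and the search name "my_summary_search" are placeholders, so substitute your own), something like this should show how far each run's wall clock start drifted from its scheduled time:

```spl
index=summary search_name="my_summary_search"
| eval delay_sec = info_search_time - search_now
| stats max(delay_sec) AS max_delay_sec BY search_now
| sort search_now
```

If the search_now values step forward in even 5 minute increments, the windows themselves are contiguous, regardless of how large max_delay_sec gets.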
You may also want to look into your limits.conf settings as far as concurrency of saved searches and all that. (I think there are some questions about that floating around on this site already.)
BTW, are you seeing your saved search show up as being "skipped"? If so, I would expect to see events being dropped. You can search with:
index="_internal" sourcetype="scheduler" status=skipped
Another thing to consider: Is it possible that you simply don't have any events to summarize for the 5 minute window in question? If this happens, you will see no new events in the summary index (which looks like a "gap"). This may or may not be likely based on your event data, but you should be able to confirm this very quickly with the search:
index="_internal" sourcetype="scheduler" result_count=0
Of course, if you have some sort of conditional logic, then perhaps this would be a better search:
index="_internal" sourcetype="scheduler" NOT alert_actions="*summary_index*"
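To look for missing runs directly, here is a minimal sketch that lists any place where two consecutive scheduled runs are more than 5 minutes apart. It assumes your scheduler events carry the scheduled_time field and uses "my_summary_search" as a placeholder for your saved search name:

```spl
index="_internal" sourcetype="scheduler" savedsearch_name="my_summary_search"
| sort 0 scheduled_time
| streamstats current=f last(scheduled_time) AS prev_scheduled_time
| eval interval_sec = scheduled_time - prev_scheduled_time
| where interval_sec > 300
| table scheduled_time prev_scheduled_time interval_sec status
```

No results would suggest the scheduler did not miss any 5 minute slot in the time range you searched.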
Yeah, that can be tricky to spot. I assume you know about the _indextime field (added in 4.0), which can be quite helpful in tracking down this kind of issue. I think the general rule of thumb is to simply delay your summary indexing searches to the point at which you are certain all your events are loaded, but that may not be an option for you. (The file polling / indexing performance of 4.1 is much better than earlier versions, so if you're running an older version and you're mostly looking at events coming from log files, then upgrading may help here.) Best of luck!
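For what it's worth, here is a sketch of how _indextime can be used to measure how late events arrive. The index and sourcetype are placeholders, and the 300 second threshold is just an assumption matching the 5 minute delay built into the schedule described in the question:

```spl
index=main sourcetype=my_sourcetype
| eval index_lag_sec = _indextime - _time
| eval late = if(index_lag_sec > 300, 1, 0)
| stats count sum(late) AS late_events max(index_lag_sec) AS max_lag_sec BY sourcetype
```

Events counted under late_events were indexed more than 5 minutes after their own timestamp, so they may have arrived after the summary search for their window had already run.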
We think we found our issue: some of the events get logged a lot later, but have a timestamp that sometimes falls in a summary indexing window that has already passed. At least we can confirm that summary indexing seems to work reliably. Will raise a new question for this backfill challenge. Thanks!
Is it possible that no events occurred within a 5 minute window? I've added a search above to check for that.
realtime_schedule is set to 0 for the saved searches in question.
I found some skipped saved searches using your search, but not for the day in question. I verified that the scheduled_time field of the scheduler events was correct (i.e. at 5 minute intervals).
Will need to dig deeper to find out why our summary index is missing events.