Currently I have two search head clusters. One has a smaller number of users and therefore fewer scheduled searches; its latency is generally around 3 seconds, which is great!
However, my other cluster, which has 4 nodes, can go as high as 30+ seconds of latency during busy periods.
Since the default Splunk schedules ("run every 5 minutes", "run every hour") default to being on the hour, the problem usually occurs around times such as 10 o'clock, 11 o'clock, et cetera.
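For anyone hitting the same pileup: one common workaround is to stagger cron schedules away from the top of the hour rather than using the UI presets. A sketch in savedsearches.conf (the search names here are hypothetical):

```
# savedsearches.conf -- hypothetical searches, staggered off the hour
[Alert - Disk Usage]
cron_schedule = 7 * * * *    # runs at :07 past each hour instead of :00

[Alert - Login Failures]
cron_schedule = 23 * * * *   # runs at :23 past each hour
```

This spreads the scheduler's work out but doesn't help searches that genuinely must all run at the same time.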
The search heads have a limited number of CPUs; however, they are only utilising 20-30% CPU on the Linux machines.
There are a large (and growing) number of alerts that will be run by these search heads.
I've checked very carefully and, as far as I can see, none of the searches are delayed due to quota enforcement.
In terms of configuration changes, I have tried making the captain an ad-hoc search head, and I've increased the number of non-ad-hoc search heads from 3 to 4; this produced a very slight reduction in latency.
No search skipping is occurring, so it's purely a latency issue at execution time. Is there anything I can tune?
I am running Splunk 6.5.2 and looking at 6.6.1 now...
Happy to award the 25 points to anyone who can help in the tuning process and help resolve this!
I found out recently that when saving an alert or scheduled search, Splunk does not set schedule_window by default:
schedule_window = <unsigned int> | auto
* When schedule_window is non-zero, it indicates to the scheduler that the search does not require a precise start time. This gives the scheduler greater flexibility when it prioritizes searches.
* When schedule_window is set to an integer greater than 0, it specifies the "window" of time (in minutes) a search may start within.
  + The schedule_window must be shorter than the period of the search.
  + Schedule windows are not recommended for searches that run every minute.
* When set to 0, there is no schedule window. The scheduler starts the search as close to its scheduled time as possible.
* When set to "auto," the scheduler calculates the schedule_window value automatically.
  + For more information about this calculation, see the search scheduler documentation.
* Defaults to 0 for searches that are owned by users with the edit_search_schedule_window capability. For such searches, this value can be changed.
* Defaults to "auto" for searches that are owned by users that do not have the edit_search_schedule_window capability. For such searches, this setting cannot be changed.
* A non-zero schedule_window is mutually exclusive with a non-default schedule_priority (see schedule_priority for details).
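Based on that spec, the override can also be set directly per search in savedsearches.conf rather than through the UI (the search name below is hypothetical):

```
# savedsearches.conf -- hypothetical stanza
[Alert - Disk Usage]
schedule_window = auto    # let the scheduler pick a start time within an auto-sized window
```

Remember the window must be shorter than the search's period, so this is not suitable for every-minute searches.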
I performed the following search to check the status of this config:
| rest /servicesNS/-/-/saved/searches | search is_scheduled=1 | table title eai:acl.app eai:acl.owner cron_schedule next_scheduled_time schedule_window search
To my surprise, all searches had a value of either "default" or "0" in the schedule_window field.
Maybe the reason is that the dialog that pops up right away when you save a search can be a little confusing (see screenshot). In any case, changing all schedule_window values to "auto" immediately reduced the number of skipped searches and the latency in the environment I was working on.
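If you have many searches to update, the same change can be scripted via the REST API instead of clicking through the UI. A sketch only; the host, credentials, namespace, and search name are all placeholders you'd replace for your environment:

```
# Sketch: set schedule_window=auto on one saved search via the management port.
# Host, credentials, app/owner namespace, and search name are placeholders.
curl -k -u admin:changeme \
  https://sh1.example.com:8089/servicesNS/nobody/search/saved/searches/My%20Alert \
  -d schedule_window=auto
```

Loop over the output of the `| rest /servicesNS/-/-/saved/searches` query above to cover every scheduled search.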
Hope it helps.
I gave this an upvote as it's a good answer, but it's not quite what I'm looking for.
I suspect there is a setting I can use to tune the scheduler threadpool or similar, or at least there should be.
Thank you for your time and the tip!
This assumes you are not I/O bound, which is worth double-checking, and that your indexers are able to keep up, which is often the source of latency.
You could increase the number of executor_workers in your server.conf:
* Number of threads that can be used by the search head clustering
* Defaults to 10. A value of 0 will be interpreted as 1.
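If you want to try raising it, the setting lives under the [shclustering] stanza in server.conf on each cluster member; the value below is an arbitrary example, not a recommendation:

```
# server.conf on each SHC member -- restart required for it to take effect
[shclustering]
executor_workers = 20    # default is 10; example value only
```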
But again, I would verify that your indexers are keeping up with the additional calls during your more heavily loaded windows. In my experience, latency is often imparted by the indexing layer, not the SHC.
I actually saw this setting, but I'm unclear on what "Number of threads that can be used by the search head clustering" means.
Is this the pool that relates to scheduling?
If so I'll accept the answer and request a clarification from the documentation team.
It is the number of threads in the threadpool; it controls how many threads are available to the scheduler.
I'm going to suggest to the documentation team that the above wording (or something similar) is used so it's clearer...
Suggestion submitted. FYI, that has managed to drop my scheduler latency to between 20 and 35 seconds; however, on later review that might be coincidence, so I will need to re-measure over a longer period of time.
Any other settings I can tweak?
I did notice there are 2-4 simultaneous scheduled searches running per second per search head, so I'm unsure if that can be changed.
FYI, the indexers average 36% CPU around the times when the scheduler latency is high. Note that my issue is not how long the searches take to run; the problem is how quickly the searches are kicked off by the scheduler.
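On the concurrency point: the number of searches the scheduler may run at once is derived from limits.conf, so that is another place to look. A sketch of the relevant settings; the values shown are the documented defaults as I understand them for 6.x, so verify against your version's limits.conf.spec before changing anything:

```
# limits.conf -- settings that bound scheduler concurrency (defaults shown)
[search]
max_searches_per_cpu = 1   # concurrent historical searches allowed per CPU core
base_max_searches = 6      # constant added to the per-CPU figure

[scheduler]
max_searches_perc = 50     # percent of the total search slots the scheduler may use
```

Since your CPUs are only 20-30% utilised, raising max_searches_perc is the lever usually discussed for scheduler-bound search heads, but test it in stages.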