Currently I have two search head clusters. One has a smaller number of users and therefore fewer scheduled searches; its latency is generally around 3 seconds, which is great!
However my other cluster which has 4 nodes can go as high as 30+ seconds of latency during busy periods.
Since the default Splunk "run every 5 minutes" or "run every hour" schedules default to running on the hour, the problem usually occurs around times such as 10 o'clock, 11 o'clock, et cetera.
The search heads have a limited number of CPUs; however, they are only utilising 20-30% CPU on the Linux machines.
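To illustrate the pile-up, here is a minimal Python sketch that counts how many schedules fire on each minute of the hour. The cron strings are hypothetical examples, and the parser only handles the simple minute-field forms shown; it is not Splunk's own cron implementation.

```python
from collections import Counter


def minutes_matched(minute_field):
    """Expand a cron minute field into the set of minutes (0-59) it matches.

    Handles only "*", "N", "A-B", comma lists, and an optional "/step".
    """
    matched = set()
    for part in minute_field.split(","):
        rng, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if rng == "*":
            lo, hi = 0, 59
        elif "-" in rng:
            lo_s, hi_s = rng.split("-")
            lo, hi = int(lo_s), int(hi_s)
        else:
            lo = hi = int(rng)
        matched.update(range(lo, hi + 1, step_n))
    return matched


def load_per_minute(cron_schedules):
    """Count how many schedules fire on each minute of the hour.

    Only the minute field is inspected; the other cron fields are ignored.
    """
    load = Counter()
    for cron in cron_schedules:
        for minute in minutes_matched(cron.split()[0]):
            load[minute] += 1
    return load


# Hypothetical schedules: three default-style ones and one offset by 3 minutes.
schedules = ["*/5 * * * *", "0 * * * *", "*/15 * * * *", "3-58/5 * * * *"]
load = load_per_minute(schedules)
print(load[0], load[5])  # minute 0 is hit by three schedules, minute 5 by one
```

The offset schedule never lands on minute 0, which is why staggering cron expressions is one way to flatten the top-of-the-hour peak.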
There are a large (and growing) number of alerts that will be run by these search heads.
I've very carefully checked and none of the searches are delayed due to quota enforcement that I can see.
In terms of configuration changes, I have tried making the captain an ad-hoc search head and increasing the number of non-ad-hoc search heads from 3 to 4, and there has been a very slight reduction in latency.
No search skipping is occurring, so it's just a latency issue when executing. Is there anything I can tune?
I am running Splunk 6.5.2 and looking at 6.6.1 now...
Happy to award the 25 points to anyone who can help in the tuning process and help resolve this!
You need to review your job inspector and possibly search.log and determine where the latency is occurring.
Also, if you are on 6.6 or later, you need to revert max_searches_per_process back to the default of 500. Many orgs ran into the bug prior to 6.6 which causes slow search start up and still have max_searches_per_process = 1.
Assuming that you are not I/O bound (which is worth double-checking) and that your indexers are able to keep up (they are often the source of latency), you could increase the number of executor_workers in your server.conf:
[shclustering]
executor_workers =
* Number of threads that can be used by the search head clustering
threadpool.
* Defaults to 10. A value of 0 will be interpreted as 1.
But again, I would evaluate whether your indexers are able to keep up with the additional calls during your more heavily loaded windows. In my experience the latency often comes from the indexing layer, not the SHC.
I actually saw this setting but I'm unclear on what the "Number of threads that can be used by the search head clustering threadpool" is.
Is this the pool that relates to scheduling?
If so I'll accept the answer and request a clarification from the documentation team.
I've unaccepted the answer so we can continue this discussion. After double-checking today, I can see minimal difference from this setting change.
I noticed the scheduler will run approximately 2-4 jobs simultaneously, so I assume there is either another setting or something that cannot be changed by the admin. I can see my search heads peak around 61 scheduled searches for a block of time; however, the scheduler generally runs 2-3 searches at a time. I have one occurrence where it spiked to 7 searches with the same dispatch time.
Search head CPU remains under 40% average across the board so it is busy but not as busy as I would expect.
Is there anything else I can tweak?
So this gets to the structure of how Splunk works: the searches are dispatched to the indexers for them to do the work, so the number of open threads is not where you are running into the constraint. I would still assume it is related to the indexers not returning results quickly enough to the search head. I would increase the number of concurrent searches you allow and see if that changes what you see on the SHC. Also look at the I/O on the indexers. That is so often the bottleneck, which is why I keep pointing back that way.
When I refer to latency, I'm referring to the dispatched time of the search from the search head vs the scheduled time mentioned in the scheduler.log file.
So when I say I see a 50 second latency, I am saying a search scheduled for 12:00:00 actually ran at 12:00:50
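As a minimal sketch of that definition, the following Python computes latency as dispatch time minus scheduled time. The function name and the HH:MM:SS input format are assumptions for illustration only, and the sketch does not handle times that cross midnight.

```python
from datetime import datetime


def scheduler_latency_seconds(scheduled, dispatched, fmt="%H:%M:%S"):
    """Return how many seconds after its scheduled time a search was dispatched."""
    delta = datetime.strptime(dispatched, fmt) - datetime.strptime(scheduled, fmt)
    return delta.total_seconds()


# The example from this post: scheduled for 12:00:00, actually ran at 12:00:50.
latency = scheduler_latency_seconds("12:00:00", "12:00:50")
print(latency)  # 50.0
```

Note that the run time of the search does not enter this calculation at all; only the gap before it starts is measured.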
I can see the indexers are below 40% CPU and do not have disk busy times, the searches are not hitting a quota that I can see.
In the 50-second example, the first mention of the search in scheduler.log is at around 48 seconds.
Therefore I still suspect it is the search head causing the latency.
Please note that while some of the searches have an execution time of 2 seconds, they are delayed from starting until say 50 seconds past the minute, and then complete 2 seconds later.
Reading past presentations and watching my environment I can see some level of latency (<10 seconds) is perfectly normal, however I'm unsure why it takes 50 seconds to get scheduled if the server isn't really busy...(it's semi-busy at 30-50% CPU usage at the search head level, but it's not sitting at 80% CPU or similar).
Is the indexer speed involved in the dispatch time of the search? If it is I'm confused as to why 1 search head cluster stays under 10 seconds and the other is often around 50.
If it's not the indexer it does make sense as 1 search head cluster has 5X as many scheduled searches.
Was there a final resolution on this issue?
It is the number of threads in the threadpool, which controls the number of threads that can be addressed by the scheduler.
I'm going to put a suggestion to the documentation team that the above wording is used, or something similar, so it's clearer...
Thank you
Suggestion done. FYI, that has managed to drop my scheduler latency to between 20 and 35 seconds; however, upon later review that might be a coincidence. I will need to re-measure after a longer period of time.
Any other settings I can tweak?
I did notice there are 2-4 simultaneous scheduled searches running per second per search head so I'm unsure if that can be changed.
FYI, the indexers average 36% CPU around the time when the latency is high on the scheduler. Note that my issue is not the time the searches take to run; the problem is how quickly the searches are getting kicked off by the scheduler.
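As a back-of-the-envelope check on the numbers in this thread, the Python sketch below assumes a deliberately simplified model: all searches are due at the same instant and the scheduler starts a fixed number of them per second. This is an assumption for illustration, not a description of how Splunk's scheduler is documented to work.

```python
def start_delays(num_searches, dispatched_per_second):
    """Seconds after the scheduled time at which each queued search starts,
    assuming a constant number of dispatches per second (a simplification)."""
    return [i // dispatched_per_second for i in range(num_searches)]


# 61 searches due together (the peak seen above), dispatched 2, 3, or 4 per second.
for rate in (2, 3, 4):
    delays = start_delays(61, rate)
    print(rate, "per second -> last search starts after", max(delays), "seconds")
```

Under this model the last search waits 30, 20, or 15 seconds respectively, which is in the same range as the observed latency even though CPU is far from saturated.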
Hello there,
I found out lately that, when saving an alert or scheduled search, Splunk does not set schedule_window by default.
savedsearches.conf.spec:
schedule_window = <unsigned int> | auto
* When schedule_window is non-zero, it indicates to the scheduler that the
search does not require a precise start time. This gives the scheduler
greater flexibility when it prioritizes searches.
* When schedule_window is set to an integer greater than 0, it specifies the
"window" of time (in minutes) a search may start within.
+ The schedule_window must be shorter than the period of the search.
+ Schedule windows are not recommended for searches that run every minute.
* When set to 0, there is no schedule window. The scheduler starts the search
as close to its scheduled time as possible.
* When set to "auto," the scheduler calculates the schedule_window value
automatically.
+ For more information about this calculation, see the search scheduler
documentation.
* Defaults to 0 for searches that are owned by users with the
edit_search_schedule_window capability. For such searches, this value can be
changed.
* Defaults to "auto" for searches that are owned by users that do not have the
edit_search_window capability. For such searches, this setting cannot be
changed.
* A non-zero schedule_window is mutually exclusive with a non-default
schedule_priority (see schedule_priority for details).
I performed the following search to check the status of this config:
| rest /servicesNS/-/-/saved/searches
| search is_scheduled=1
| table title eai:acl.app eai:acl.owner cron_schedule next_scheduled_time schedule_window search
To my surprise, all searches had a value of either "default" or "0" in the schedule_window field.
Maybe the reason is that when you save a search, the following is what pops up right away and may be a little confusing (see screenshot). In any case, changing all schedule_window values to "auto" immediately reduced the number of skipped searches and the latency in the environment I was working in.
Hope it helps.
I provided a vote as it's a good answer, though not quite what I'm looking for.
I suspect there is a setting I can use to tune the scheduler threadpool or similar, or at least there should be.
Thank you for your time and the tip!
thanks!
It will be interesting to know of such a setting.
So what did you finally end up doing?