Solved: Search Head Cluster scheduled searches: What are t...

sat94541 · ‎09-08-2016

Looking at scheduler.log from a search head cluster member, the status field can have the following values:

delegated_remote
delegated_remote_completion
delegated_remote_error
skipped
success

It would be useful to know what each of these values means in the context of Search Head Clustering.

index=case_356897 host=* sourcetype=*sched* source="*scheduler.log" | timechart span=6month count by status

rbal_splunk · ‎09-08-2016

The captain is the scheduler in a Search Head Cluster.

Total scheduled searches = (((base_max_searches + cpu_count * max_searches_per_cpu) * max_searches_perc) / 100) * num_members

num_members = number of SHC members that do not have "adhoc_searchhead = true" or "captain_is_adhoc_searchhead = true" set (i.e., members that can run scheduled searches)

Note that if a member goes into a DOWN state for a while, it is still counted toward the quota but might not receive any searches from the captain. The load on the rest of the system therefore goes up, because the quota still allows the full number of searches.
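As a worked example of the quota formula above, here is a minimal Python sketch. The function name and the sample values (6, 16, 1, 50, 3) are illustrative assumptions, not taken from any particular deployment, and the post does not say whether Splunk rounds the result, so the sketch leaves it unrounded.

```python
def shc_scheduled_search_quota(base_max_searches, cpu_count,
                               max_searches_per_cpu, max_searches_perc,
                               num_members):
    """Cluster-wide scheduled search quota, following the formula in the post."""
    # Per-member share of concurrent searches reserved for the scheduler.
    per_member = (base_max_searches + cpu_count * max_searches_per_cpu) * max_searches_perc / 100
    # The captain multiplies by the number of members that can run scheduled searches.
    return per_member * num_members


# Illustrative values only: 3 eligible members with 16 CPUs each.
quota = shc_scheduled_search_quota(
    base_max_searches=6, cpu_count=16, max_searches_per_cpu=1,
    max_searches_perc=50, num_members=3,
)
print(quota)
```

With these values each member contributes (6 + 16 * 1) * 50 / 100 = 11, so the cluster-wide quota is 33. A member in a DOWN state would still be counted in num_members here.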

The captain checks this quota (there is no parameter to turn the check on or off) before it schedules a search. A search then goes through the following steps:

1. The captain dispatches the search to an UP member selected based on load metrics, and adds the unique identifier (dsi_id, dispatch search Id) to activeScheduledSearches, aka delegated_scheduled.
2. The member receives the search, creates an SID (search id/dispatch dir name) for it, and replies to the captain that the job is dispatched (delegated_remote). The captain removes the dsi_id from activeScheduledSearches and adds the SID to activeDelegatedSearch, aka delegated_waiting.
3. When the job is done, the member informs the captain (delegated_remote_completion) and the captain removes the SID from activeDelegatedSearch.
Error during delegation

If there is an error on the member and the job is not dispatched, the search still goes through step 1. When the captain then receives the error message (delegated_remote_error), it removes that dsi_id from activeScheduledSearches.

The scheduler/captain calculates the total number of scheduled searches as activeScheduledSearches.size + activeDelegatedSearch.size
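The bookkeeping described above can be sketched as a toy Python model. This is not Splunk code: the class and method names are hypothetical, and only the two collections and the four status messages come from the description in this post.

```python
class CaptainBookkeeping:
    """Toy model of the two collections the captain maintains (hypothetical names)."""

    def __init__(self):
        self.active_scheduled = set()  # dsi_ids sent, not yet acknowledged (delegated_scheduled)
        self.active_delegated = set()  # SIDs running on members (delegated_waiting)

    def dispatch(self, dsi_id):
        # Step 1: the captain sends the search to a member.
        self.active_scheduled.add(dsi_id)

    def on_delegated_remote(self, dsi_id, sid):
        # Step 2: the member acknowledged the dispatch and reported its SID.
        self.active_scheduled.discard(dsi_id)
        self.active_delegated.add(sid)

    def on_delegated_remote_completion(self, sid):
        # Step 3: the member reported that the job finished.
        self.active_delegated.discard(sid)

    def on_delegated_remote_error(self, dsi_id):
        # Error path: the member could not dispatch the job.
        self.active_scheduled.discard(dsi_id)

    def total_scheduled(self):
        # The count the captain compares against the quota.
        return len(self.active_scheduled) + len(self.active_delegated)


captain = CaptainBookkeeping()
captain.dispatch("dsi-1")
captain.dispatch("dsi-2")
captain.on_delegated_remote("dsi-1", "sid-A")   # dsi-1 is now running as sid-A
captain.on_delegated_remote_error("dsi-2")      # dsi-2 failed to dispatch
total_while_running = captain.total_scheduled()
captain.on_delegated_remote_completion("sid-A")
total_after_completion = captain.total_scheduled()
print(total_while_running, total_after_completion)
```

In this model a search that never gets a delegated_remote or delegated_remote_error reply stays in the first set indefinitely, which is the situation a large delegated_scheduled count points to.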

Both of these are emitted to metrics.log on the captain, as shown below (see delegated_waiting and delegated_scheduled):

09-08-2016 12:13:31.469 -0700 INFO  Metrics - group=searchscheduler, eligible=7, delayed=0, dispatched=6, skipped=1, total_lag=6, max_lag=1, window_max_lag=0, window_total_lag=0, delegated=6, delegated_waiting=1, delegated_scheduled=0, max_running=0, actions_triggered=0, completed=0, total_runtime=0.000, max_runtime=0.000
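To pull these two counters out of a line like the one above outside of Splunk, a small Python sketch could look like this. The helper name parse_metrics_fields is hypothetical, and the regular expression assumes the comma-separated key=value layout shown in the sample line.

```python
import re


def parse_metrics_fields(line):
    """Return the key=value pairs of one metrics.log line as a dict of strings."""
    # Keys are word characters; values run up to the next comma or whitespace.
    return dict(re.findall(r"(\w+)=([^,\s]+)", line))


sample = (
    "09-08-2016 12:13:31.469 -0700 INFO  Metrics - group=searchscheduler, "
    "eligible=7, delayed=0, dispatched=6, skipped=1, total_lag=6, max_lag=1, "
    "window_max_lag=0, window_total_lag=0, delegated=6, delegated_waiting=1, "
    "delegated_scheduled=0, max_running=0, actions_triggered=0, completed=0, "
    "total_runtime=0.000, max_runtime=0.000"
)

fields = parse_metrics_fields(sample)
# Sum of the two counters, i.e. activeDelegatedSearch.size + activeScheduledSearches.size.
in_flight = int(fields["delegated_waiting"]) + int(fields["delegated_scheduled"])
print(in_flight)
```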

What to do if I have a large count of activeScheduledSearches aka delegated_scheduled?

The delegation message from the captain to a member is sent as a REST message of type SHPDelegateSearchJob. Check metrics.log on the SHC captain to see whether these messages have piled up. Sample lines:

09-07-2016 22:50:07.127 -0700 INFO Metrics - group=executor, name=poolmember_executor, jobs_added=6, jobs_finished=6, current_size=0, smallest_size=0, largest_size=1, max_size=0
09-07-2016 22:50:07.127 -0700 INFO Metrics - group=jobs, name=poolmember, SHPDelegateSearchJob=6

If jobs_added is far higher than jobs_finished (by roughly 300), consider doubling the number of executor threads from 10 to 20 in server.conf, as follows:

[shclustering]
executor_workers = 20
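As a rough way to apply that check to a single group=executor line, here is a hedged Python sketch. The function name and the threshold constant are hypothetical, and the threshold reads "~300 ish" as the size of the gap between jobs_added and jobs_finished, which is an interpretation of the advice above rather than a documented rule.

```python
import re

# Assumption: "~300 ish" is read as the size of the jobs_added - jobs_finished gap.
BACKLOG_THRESHOLD = 300


def executor_backlog(line):
    """Return jobs_added - jobs_finished for one group=executor metrics line."""
    added = re.search(r"\bjobs_added=(\d+)", line)
    finished = re.search(r"\bjobs_finished=(\d+)", line)
    if added is None or finished is None:
        raise ValueError("not an executor metrics line")
    return int(added.group(1)) - int(finished.group(1))


sample = (
    "09-07-2016 22:50:07.127 -0700 INFO Metrics - group=executor, "
    "name=poolmember_executor, jobs_added=6, jobs_finished=6, current_size=0, "
    "smallest_size=0, largest_size=1, max_size=0"
)

backlog = executor_backlog(sample)
needs_more_workers = backlog >= BACKLOG_THRESHOLD
print(backlog, needs_more_workers)
```

For the sample line the gap is 0, so no change to executor_workers would be suggested. A single line is only a snapshot, so it is worth looking at the trend over several intervals before changing the setting.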

