Deployment Architecture

Search Head Cluster scheduled searches: What are these Status values in scheduler.log?

sat94541
Communicator

Looking at scheduler.log on a search head cluster member, the status column can take the following values:

delegated_remote
delegated_remote_completion
delegated_remote_error
skipped
success

It would be useful to have information about each of these values in the context of Search Head Clustering.

index=case_356897 host=* sourcetype=*sched* source="*scheduler.log" | timechart span=6month count by status


1 Solution

rbal_splunk
Splunk Employee

The captain is the scheduler in a Search Head Cluster.

Total Scheduled searches = (((base_max_searches + cpu_count*max_search_per_cpu) * max_searches_perc) / 100) * num_members

num_members = the number of members in the SHC that do not have "adhoc_searchhead = true" set (and, for the captain, "captain_is_adhoc_searchhead = true"), i.e. the members that can run scheduled searches.

Note that if a member stays in a DOWN state for a while, it is still counted toward the quota but might not receive any searches from the captain; the overall load on the system therefore goes up, since the total number of allowed searches increases while fewer members actually run them.
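The quota formula above can be sketched as a small calculation. This is a hypothetical illustration only: the parameter names mirror the settings in the formula, and the numeric values are made-up examples, not recommendations.

```python
# Illustration of the scheduled-search quota formula above.
# All numbers below are example values, not defaults.

def total_scheduled_searches(base_max_searches, cpu_count,
                             max_search_per_cpu, max_searches_perc,
                             num_members):
    """Quota of concurrent scheduled searches across the whole SHC."""
    per_member = base_max_searches + cpu_count * max_search_per_cpu
    return (per_member * max_searches_perc / 100) * num_members

# Example: 6 base searches, 16 CPUs, 1 search per CPU, 50% of slots
# reserved for the scheduler, and 3 members eligible for scheduled work.
quota = total_scheduled_searches(6, 16, 1, 50, 3)
print(quota)  # 33.0
```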

The captain checks this quota (there is no parameter to turn the check on or off) before it schedules a search; a search then goes through the following steps.

  1. The captain dispatches the search to any UP member, selected based on load metrics, and puts the unique identifier (dsi_id, the dispatch search id) into activeScheduledSearches, aka delegated_scheduled.

  2. The member receives the search, creates an SID (search id / dispatch directory name) for it, and replies to the captain that the job is dispatched (delegated_remote). The captain removes the dsi_id from activeScheduledSearches and adds the SID to activeDelegatedSearch, aka delegated_waiting.

  3. When the job is done, the member informs the captain (delegated_remote_completion) and the captain removes the SID from activeDelegatedSearch.

Error during delegation

If there is an error on the member and the job is not dispatched, the search still goes through step 1, but when the captain receives the error message (delegated_remote_error) it removes that dsi_id from activeScheduledSearches.

The scheduler/captain calculates the total number of scheduled searches as activeScheduledSearches.size + activeDelegatedSearch.size
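The bookkeeping in steps 1–3 and the error path can be sketched as follows. This is a minimal illustration, not Splunk source code; the class and method names are hypothetical, while dsi_id, SID, and the two sets follow the terminology above.

```python
# Hypothetical sketch of the captain's delegation bookkeeping.

class Captain:
    def __init__(self):
        self.active_scheduled = set()  # dsi_ids, aka delegated_scheduled
        self.active_delegated = set()  # SIDs, aka delegated_waiting

    def dispatch(self, dsi_id):
        # Step 1: delegate the search to an UP member.
        self.active_scheduled.add(dsi_id)

    def on_delegated_remote(self, dsi_id, sid):
        # Step 2: member confirmed dispatch and returned an SID.
        self.active_scheduled.discard(dsi_id)
        self.active_delegated.add(sid)

    def on_delegated_remote_completion(self, sid):
        # Step 3: job finished on the member.
        self.active_delegated.discard(sid)

    def on_delegated_remote_error(self, dsi_id):
        # Error during delegation: the job was never dispatched.
        self.active_scheduled.discard(dsi_id)

    def total_in_flight(self):
        # activeScheduledSearches.size + activeDelegatedSearch.size
        return len(self.active_scheduled) + len(self.active_delegated)
```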

Both counters are emitted to metrics.log on the captain instance, as in the sample below (see delegated_waiting and delegated_scheduled):

09-08-2016 12:13:31.469 -0700 INFO  Metrics - group=searchscheduler, eligible=7, delayed=0, dispatched=6, skipped=1, total_lag=6, max_lag=1, window_max_lag=0, window_total_lag=0, delegated=6, delegated_waiting=1, delegated_scheduled=0, max_running=0, actions_triggered=0, completed=0, total_runtime=0.000, max_runtime=0.000

What to do if I have a large count of activeScheduledSearches aka delegated_scheduled?

The delegation message from the captain to a member flows as a REST message of type SHPDelegateSearchJob. Check whether these messages have piled up in metrics.log on the SHC captain; a sample is below.

09-07-2016 22:50:07.127 -0700 INFO Metrics - group=executor, name=poolmember_executor, jobs_added=6, jobs_finished=6, current_size=0, smallest_size=0, largest_size=1, max_size=0
09-07-2016 22:50:07.127 -0700 INFO Metrics - group=jobs, name=poolmember, SHPDelegateSearchJob=6

If jobs_added is far greater than jobs_finished (a gap of roughly 300 or more), consider doubling the number of executor threads from the default of 10 to 20 in server.conf as follows:

[shclustering]
executor_workers = 20
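The backlog check above can be done with a quick parse of the captain's metrics.log. This is a hedged sketch: the regex assumes the comma-separated "key=value" format of the poolmember_executor sample shown earlier, and the helper function name is made up for illustration.

```python
import re

def executor_backlog(line):
    """Return jobs_added - jobs_finished for a poolmember_executor line."""
    fields = dict(re.findall(r"(\w+)=([\w.]+)", line))
    return int(fields["jobs_added"]) - int(fields["jobs_finished"])

# Sample line from the SHC captain's metrics.log (as shown above).
line = ("09-07-2016 22:50:07.127 -0700 INFO Metrics - group=executor, "
        "name=poolmember_executor, jobs_added=6, jobs_finished=6, "
        "current_size=0, smallest_size=0, largest_size=1, max_size=0")
print(executor_backlog(line))  # 0
```

A sustained backlog of several hundred here is the signal to raise executor_workers.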


rajim
Path Finder

What about the other two statuses, skipped and success? How can I identify failed saved searches? That is, if I want to report only on failed scheduled searches, which status should I look at?
