Deployment Architecture

Search Head Cluster scheduled searches: What are these Status values in scheduler.log?

Communicator

Looking at scheduler.log on a search head cluster member, the status column can take the following values:

delegated_remote
delegated
delegated_remote_completion
delegated_remote_error
skipped
success

It would be useful to get information about each of these values in the context of Search Head Clustering.

index=case_356897 host=* sourcetype=*sched* source="*scheduler.log" | timechart span=6month count by status


1 Solution

Splunk Employee

Captain is the scheduler in a Search Head Cluster.

Total scheduled searches = (((base_max_searches + cpu_count * max_searches_per_cpu) * max_searches_perc) / 100) * num_members

num_members = number of members in the SHC that do not have "adhoc_searchhead = true" or "captain_is_adhoc_searchhead = true" (i.e., members that can run scheduled searches)

Note that if a member goes into a DOWN state for a while, it is still counted toward the quota but might not receive any searches from the captain; the overall load on the remaining members therefore goes up, because the total number of allowed searches stays the same while fewer members run them.
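As a worked example of the quota formula above, here is the arithmetic with illustrative numbers (the defaults shown for base_max_searches, max_searches_per_cpu, and max_searches_perc are assumptions; check your own limits.conf):

```python
# Worked example of the scheduler quota formula (illustrative numbers only).
base_max_searches = 6       # assumed limits.conf default
max_searches_per_cpu = 1    # assumed limits.conf default
cpu_count = 16              # assumed cores per member
max_searches_perc = 50      # assumed scheduler percentage
num_members = 3             # SHC members eligible for scheduled searches

per_member = base_max_searches + cpu_count * max_searches_per_cpu   # 22
quota = (per_member * max_searches_perc // 100) * num_members       # 33

print(quota)  # 33
```

So with these numbers the captain would allow at most 33 concurrently scheduled searches across the cluster.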

The captain checks the above quota before it schedules a search (there is no parameter to turn this check on or off), and each search then goes through the following steps.

  1. The captain dispatches the search to any UP member, selected based on load metrics, and puts the unique identifier (dsi_id, the dispatch search id) into activeScheduledSearches, a.k.a. delegated_scheduled.

  2. The member receives the search, creates an SID (search id / dispatch directory name) for it, and replies to the captain that the job is dispatched (delegated_remote). The captain removes the dsi_id from activeScheduledSearches and adds the SID to activeDelegatedSearch, a.k.a. delegated_waiting.

  3. When the job is done, the member informs the captain (delegated_remote_completion) and the captain removes the SID from activeDelegatedSearch.

Error during delegation

If there is an error on the member and the job is not dispatched, the search still goes through step 1; when the captain receives the error message (delegated_remote_error), it removes the dsi_id from activeScheduledSearches.

The scheduler/captain calculates the total number of scheduled searches as activeScheduledSearches.size + activeDelegatedSearch.size
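The lifecycle above can be sketched as a small bookkeeping class. The names mirror the text (dsi_id, SID, the two tracking structures); this is an illustration of the described flow, not Splunk source code:

```python
# Minimal sketch of the captain's delegation bookkeeping described above.
# Illustrative only -- names mirror the text, not actual Splunk internals.
class CaptainScheduler:
    def __init__(self):
        self.active_scheduled = set()   # dsi_ids awaiting dispatch ack (delegated_scheduled)
        self.active_delegated = set()   # SIDs of dispatched jobs (delegated_waiting)

    def delegate(self, dsi_id):
        # Step 1: search handed to an UP member; tracked by dsi_id.
        self.active_scheduled.add(dsi_id)

    def on_dispatched(self, dsi_id, sid):
        # Step 2: member replies "dispatched" (delegated_remote); swap dsi_id for SID.
        self.active_scheduled.discard(dsi_id)
        self.active_delegated.add(sid)

    def on_completion(self, sid):
        # Step 3: member reports delegated_remote_completion.
        self.active_delegated.discard(sid)

    def on_error(self, dsi_id):
        # Member failed to dispatch (delegated_remote_error).
        self.active_scheduled.discard(dsi_id)

    def total_scheduled(self):
        # The quota check uses the sum of both structures.
        return len(self.active_scheduled) + len(self.active_delegated)
```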

Both of these are emitted to metrics.log on the captain instance as below (see delegated_waiting and delegated_scheduled):

09-08-2016 12:13:31.469 -0700 INFO  Metrics - group=searchscheduler, eligible=7, delayed=0, dispatched=6, skipped=1, total_lag=6, max_lag=1, window_max_lag=0, window_total_lag=0, delegated=6, delegated_waiting=1, delegated_scheduled=0, max_running=0, actions_triggered=0, completed=0, total_runtime=0.000, max_runtime=0.000
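If you want to pull those two counters out of metrics.log programmatically, a hypothetical helper like the following would do it (parse_metrics_line is an illustrative name, not a Splunk utility):

```python
import re

# Hypothetical helper: extract the key=value pairs from a searchscheduler
# metrics.log line like the sample above.
def parse_metrics_line(line):
    return {k: v for k, v in re.findall(r"(\w+)=([\w.]+)", line)}

line = ('09-08-2016 12:13:31.469 -0700 INFO  Metrics - group=searchscheduler, '
        'delegated=6, delegated_waiting=1, delegated_scheduled=0')
m = parse_metrics_line(line)
print(m["delegated_waiting"], m["delegated_scheduled"])  # 1 0
```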

What to do if I have a large count of activeScheduledSearches aka delegated_scheduled?

The delegation message from captain to member flows as a REST message of type SHPDelegateSearchJob. Check whether these messages have piled up in metrics.log on the SHC captain (sample below).

09-07-2016 22:50:07.127 -0700 INFO Metrics - group=executor, name=poolmember_executor, jobs_added=6, jobs_finished=6, current_size=0, smallest_size=0, largest_size=1, max_size=0
09-07-2016 22:50:07.127 -0700 INFO Metrics - group=jobs, name=poolmember, SHPDelegateSearchJob=6

If jobs_added is far ahead of jobs_finished (a gap of roughly 300 or more), consider doubling the number of executor threads from 10 to 20 in server.conf as follows:

[shclustering]
executor_workers = 20
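The jobs_added vs. jobs_finished comparison above can be automated with a small check against the executor metrics line; the threshold of 300 reflects the rough figure mentioned, and executor_backlog is an illustrative helper name:

```python
import re

# Illustrative backlog check for the captain's poolmember_executor metrics:
# flag when jobs_added has outrun jobs_finished by ~300 or more.
BACKLOG_THRESHOLD = 300

def executor_backlog(metrics_line):
    # Collect only the numeric key=value fields from the line.
    fields = dict(re.findall(r"(\w+)=(\d+)", metrics_line))
    return int(fields.get("jobs_added", 0)) - int(fields.get("jobs_finished", 0))

sample = ('09-07-2016 22:50:07.127 -0700 INFO Metrics - group=executor, '
          'name=poolmember_executor, jobs_added=650, jobs_finished=320, current_size=330')
if executor_backlog(sample) > BACKLOG_THRESHOLD:
    print("consider raising executor_workers in [shclustering]")
```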


Path Finder

What about the other two statuses, skipped and success? How can I identify failed saved searches? That is, if I have to report only the failed scheduled searches, which status should I consider?
