How to Calculate total Search Load for my Search Head Clustering Deployment?

sat94541 · ‎09-08-2016

Can you please explain how to calculate when the search concurrency limit is being hit in my Search Head Cluster deployment? We would like to investigate whenever we see a scheduled search being skipped.

rbal_splunk · ‎09-08-2016

Also look at https://confluence.splunk.com/display/~mjose/Scheduler+activity+Debugging+in+SHC

efavreau · ‎01-16-2018

Link is broken


rbal_splunk · ‎09-08-2016

The answer to your question is not simple. At a high level, Splunk runs the following types of searches:

@Ad hoc searches
@Scheduled searches (running and delegated)
@Report acceleration (running and delegated)
@Data model acceleration (running and delegated)

To calculate the number of SHC-wide concurrent searches running at any given time, you need to add up ad hoc searches + scheduled searches + report acceleration scheduled searches + data model acceleration scheduled searches + delegated searches.
Here are various logs and searches that can be leveraged to get some stats, but these searches won't provide complete data. Splunk currently has an open Enhancement Request (SPL-125101: Comprehensive search concurrency metrics) to streamline these stats for reporting needs.
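
To make the sum above concrete, here is a minimal Python sketch. The `SearchCounts` class and the numbers in it are purely illustrative assumptions (they are not Splunk objects or measured values); the point is only that every category, including delegated searches, has to be counted at the same instant.

```python
from dataclasses import dataclass


@dataclass
class SearchCounts:
    """Concurrent searches per category at one instant (hypothetical structure)."""
    adhoc: int
    scheduled: int
    report_acceleration: int
    datamodel_acceleration: int
    delegated: int

    def total(self) -> int:
        # SHC-wide concurrency is the sum of all five categories.
        return (self.adhoc + self.scheduled + self.report_acceleration
                + self.datamodel_acceleration + self.delegated)


# Made-up numbers, for illustration only.
snapshot = SearchCounts(adhoc=4, scheduled=10, report_acceleration=2,
                        datamodel_acceleration=3, delegated=5)
print(snapshot.total())
```

None of the searches below yields all five inputs at once, which is why each of them gives only a partial picture.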

1) The introspection log provides a snapshot of all searches running on the SHC members. This snapshot is taken every 10 seconds for scheduled searches + report acceleration + data model acceleration. You can use the search below to see the trend of searches being run in each category.

index=_internal ( host=<> ….)
    sourcetype=splunk_resource_usage component=PerProcess data.search_props.sid=*
| eval data.search_props.type = if(like('data.search_props.sid',"%_scheduler_%"),"scheduled",'data.search_props.type')
| bin _time span=10s
| stats dc(data.search_props.sid) AS distinct_search_count by _time,data.search_props.type
| timechart bins=200 max(distinct_search_count) AS "max search concurrency" by data.search_props.type
| addtotals

Stats from introspection data have the following challenges:
@@Introspection data is sampled every 10 seconds, which means searches that start and finish between two samples won't be counted.
@@Introspection data also doesn't account for delegated searches.

Due to these challenges, introspection data can only be used to see the trend and may show stats below the actual search load.
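
To illustrate why sampling undercounts, here is a small Python simulation. It is only a sketch under assumptions: the `sampled_sids` helper, the search names, and the start/end times are all hypothetical, and a search is treated as visible only if it is running at the exact moment a snapshot is taken.

```python
def sampled_sids(searches, interval=10, horizon=60):
    """Return the ids of searches visible in at least one snapshot.

    searches: list of (sid, start, end) in seconds. A search is visible
    at snapshot time t when start <= t < end.
    """
    seen = set()
    for t in range(0, horizon + 1, interval):
        for sid, start, end in searches:
            if start <= t < end:
                seen.add(sid)
    return seen


# Hypothetical searches: one long-running, two short ones.
searches = [
    ("long", 2, 27),      # running at the t=10 and t=20 snapshots
    ("short_a", 11, 14),  # starts and ends between two snapshots
    ("short_b", 31, 39),  # starts and ends between two snapshots
]

seen = sampled_sids(searches)
missed = sorted(sid for sid, _, _ in searches if sid not in seen)
print(missed)  # ['short_a', 'short_b']
```

Both short searches consumed a concurrency slot while they ran, yet neither shows up in any snapshot. The same reasoning applies to the 30-second sampling of metrics.log.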

2) I have been researching how to get the delegated searches over the last few days, and development has provided useful tips, as published in https://answers.splunk.com/answers/449024/search-head-cluster-scheduled-searches-and-status.html

Based on this, the total number of scheduled searches as calculated by the scheduler/captain can be derived from metrics (group=searchscheduler) as activeScheduledSearches.size + activeDelegatedSearch.size. A sample search is below, but this metric is missing ad hoc searches.
Another limitation of this search is that it's sampled (snapshotted) every 30 seconds, so even this data will miss searches that finished in between those 30 seconds.

Scheduler Activity (based on metrics.log) :

index=_internal sourcetype=splunkd source=*metrics.log* group=searchscheduler | timechart span=3m sum(dispatched) as dispatched, sum(skipped) as skipped, sum(delegated) as delegated, max(delegated_waiting) as delegated_waiting, sum(delegated_scheduled) as delegated_scheduled, max(max_pending) as max_pending, max(max_running) as max_running

3) Here is another search that can be used to get scheduled searches (running + skipped) from scheduler.log along with ad hoc searches from _audit. To get meaningful data you need to run it over a long time period, such as 4 hours or more. This is also missing delegated searches. Another challenge is that the audit log is not always complete for ad hoc searches, so the numbers may be a bit skewed.

Skipped searches vs concurrency:

host=<SHC_HOST_NAME>
(index=_internal source=*/scheduler.log*  (status=success run_time=*) OR status=skipped) OR
((index=_audit action=search info=completed) (NOT search_id='scheduler_*' NOT search_id='rsa_*'))

| eval type=if(status="skipped", "skipped", "completed")
| eval run_time=coalesce(run_time, total_run_time)
| eval counter=-1
| appendpipe [
    | eval counter=1
    | eval _time=_time - run_time
]

| sort 0 _time
| streamstats sum(counter) as concurrency by type
| table _time concurrency counter run_time type
| timechart partial=f sep=_ span=1m count min(concurrency) as tmin max(concurrency) as tmax by type
| rename count_skipped as skipped     tmin_completed as min_concurrency     tmax_completed as max_concurrency
| fields + _time skipped *_concurrency
| filldown *_concurrency
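
The appendpipe + streamstats portion of the search above amounts to a sweep over start and end events. Here is a minimal Python sketch of that idea, assuming each completed search is given as a hypothetical (end_time, run_time) pair. The tie-breaking at equal timestamps is my own choice; the SPL `sort 0 _time` does not define it.

```python
def concurrency_timeline(completed):
    """Running concurrency from completed searches.

    completed: list of (end_time, run_time) pairs in seconds.
    Returns a list of (time, concurrency) after each start/end event.
    """
    events = []
    for end_time, run_time in completed:
        events.append((end_time - run_time, 1))  # search starts: counter=+1
        events.append((end_time, -1))            # search ends:   counter=-1
    # Tuples sort by time, then by counter, so at equal times an end (-1)
    # is applied before a start (+1) and back-to-back searches do not overlap.
    events.sort()
    running = 0
    timeline = []
    for t, delta in events:
        running += delta  # same role as streamstats sum(counter)
        timeline.append((t, running))
    return timeline


# Hypothetical searches running over [10,30], [25,40] and [40,45].
timeline = concurrency_timeline([(30, 20), (40, 15), (45, 5)])
print(timeline)
print(max(c for _, c in timeline))  # 2
```

The peak of 2 comes from the overlap between t=25 and t=30. Only searches that have already logged a completion event contribute, which is one reason the SPL version needs a long time range to give meaningful numbers.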

Delayed-minutes vs concurrency:

host=<SHC_HOST_NAME>
index=_audit
(action=search info=completed)
(NOT search_id='scheduler_*' NOT search_id='rsa_*')

| eval run_time=coalesce(run_time, total_run_time)
| eval counter=-1
| appendpipe [
    | eval counter=1
    | eval _time=_time - run_time
]

| sort 0 _time
| streamstats sum(counter) as concurrency
| timechart partial=f sep=_ span=1m min(concurrency) as min_concurrency max(concurrency) as max_concurrency
| filldown *_concurrency

| join _time [
    search index=_internal host=<SHC_HOST_NAME>  source=*/scheduler.log* (status=success OR status=continued OR status=skipped)
    | eval dispatch_time =  coalesce(dispatch_time, _time)
    | eval scheduled_time = if(scheduled_time > 0, scheduled_time, null())
    | eval window_time =    coalesce(window_time, 0)
    | eval execution_latency = max(dispatch_time - (scheduled_time + window_time), 0)
    | timechart partial=f sep=_ span=1m sum(execution_latency) as delayed_seconds
    | eval delayed_minutes=coalesce(delayed_seconds/60, 0)
    | fields + _time delayed_minutes
]
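
The delayed-minutes part of the search above can be sketched in Python as follows. This is an illustration under assumptions: the function names and the sample runs are hypothetical, times are integer epoch seconds, and a run with no schedule window uses `window_time=0`.

```python
def execution_latency(dispatch_time, scheduled_time, window_time=0):
    """Seconds a run was dispatched after its schedule (plus window), floored at 0."""
    return max(dispatch_time - (scheduled_time + window_time), 0)


def delayed_minutes_per_bucket(runs, span=60):
    """Sum latency per time bucket and convert to minutes.

    runs: list of (event_time, dispatch_time, scheduled_time, window_time).
    Returns {bucket_start: delayed_minutes}, like timechart span=1m.
    """
    buckets = {}
    for event_time, dispatch, scheduled, window in runs:
        bucket = event_time - event_time % span
        latency = execution_latency(dispatch, scheduled, window)
        buckets[bucket] = buckets.get(bucket, 0) + latency
    return {b: secs / 60 for b, secs in sorted(buckets.items())}


# Hypothetical scheduler.log events.
runs = [
    (125, 120, 60, 0),    # dispatched 60s late
    (130, 130, 120, 30),  # inside its 30s window, so no delay
    (200, 190, 180, 0),   # dispatched 10s late
]
result = delayed_minutes_per_bucket(runs)
print(result)
```

A run dispatched inside its schedule window contributes zero, so only dispatches that slipped past the window add to the delayed minutes that are joined against concurrency.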

Because of these limitations, Splunk currently makes it challenging to obtain comprehensive search concurrency metrics.
