
Trying to understand "Average Execution lag" on the scheduler_status view

Lowell
Super Champion

I'm looking at the "Overview" (scheduler_status) view in the Splunk 4.1 Search app, and I'm trying to understand exactly what the "Average Execution lag" report is showing me.

I'm seeing a very strong saw-tooth pattern in the MaxLag, which spikes multiple times a day. Immediately after each spike, the MaxLag drops significantly and then starts steadily climbing again. I'm not sure where this performance degradation is coming from, but I'd like to know.

Screenshot:

screenshot of graph

This graph uses the following search:

index="_internal" splunk_server="local" source="*metrics.log" group="searchscheduler" dispatched!="0" | timechart partial=false span=10m max(max_lag) As MaxLag, eval(sum(total_lag)/sum(dispatched)) AS AverageLag
1 Solution

gkanapathy
Splunk Employee

Well, it doesn't seem like there's a problem. The max_lag for any period in metrics is just the longest delay past its scheduled time that a particular job ran, i.e., if a job is scheduled for 4:00 and doesn't run until 45 seconds later, that's a 45-second lag.

Some lag always occurs, but longer lags happen mostly because there is a limit on the number of scheduled jobs that a server will launch concurrently. In 4.1.2, this is 25% * (4 + 4*n_cpus), that is, the number of CPUs/cores plus one. If that many jobs are already running, any newly scheduled jobs wait until a scheduler job slot becomes available.

So, if you have a bunch of summaries or other scheduled searches on regular schedules and one of them happens to take a long time, other jobs scheduled after it can get queued up waiting for it (or another job) to finish. Your graph looks like you have a bunch of short searches every 15 minutes, some others every 30 minutes, and then some long-running one(s) every 4 hours. That looks pretty reasonable, especially since your max lag appears to stay under 60 seconds anyway.
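
If you want to sanity-check the arithmetic on that limit, a throwaway eval search does it. (This uses makeresults, which may not exist on older versions, so treat it purely as a sketch of the math rather than anything the 4.1 scheduler actually runs.)

| makeresults
| eval n_cpus = 4
| eval scheduler_slots = floor(0.25 * (4 + 4 * n_cpus))

On a 4-core box that works out to 5 concurrent scheduled-search slots; anything scheduled while all 5 are busy just waits, and that wait is what shows up as lag.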


Lowell
Super Champion

Also, I wasn't sure what "Execution lag" meant, so thanks for helping me understand that.


Lowell
Super Champion

You're right, it's under a minute, which isn't bad. The line just looked way too straight to be ignored. As it turns out, the real issue is that I was looking at metrics from both my search instance and one forwarder that is running something I must have forgotten to turn off. Once I limited the search to just the correct host, the weird increasing pattern went away. There could still be something odd going on there, but I'm less concerned about saved searches running on forwarders, which have no local data to begin with.
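
For anyone who lands here with the same symptom, the only change needed was adding a host filter to the dashboard search so that only the search head's own scheduler metrics get charted. Something along these lines, where searchhead01 is just a stand-in for your actual search head's host name:

index="_internal" splunk_server="local" source="*metrics.log" group="searchscheduler" host="searchhead01" dispatched!="0" | timechart partial=false span=10m max(max_lag) AS MaxLag, eval(sum(total_lag)/sum(dispatched)) AS AverageLag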
