I'm looking at the "Overview" (scheduler_status) view in the Splunk 4.1 Search app and I'm trying to understand what exactly the "Average Execution Lag" report is showing me.
I'm seeing a very strong saw-tooth-like pattern in the MaxLag, which spikes multiple times a day. Immediately after each spike, the MaxLag drops significantly and then starts steadily climbing again. I'm not sure where this performance degradation is coming from, but I'd like to know.
This graph uses the following search:
index="_internal" splunk_server="local" source="*metrics.log" group="searchscheduler" dispatched!="0" | timechart partial=false span=10m max(max_lag) As MaxLag, eval(sum(total_lag)/sum(dispatched)) AS AverageLag
Well, it doesn't seem like there's a problem. The max_lag for any period in metrics is just the longest delay past its scheduled time that a particular job ran; i.e., if a job is scheduled for 4:00 and doesn't run until 45 seconds later, that's a 45-second lag.
Some lag is normal, but longer lags occur mostly because there is a limit on the number of scheduled jobs that a server will launch. In 4.1.2, this is 25%*(4+(4*n_cpus)), that is, the number of CPUs/cores plus one; on a 4-core box, for example, that works out to 5 concurrent scheduled jobs. If that many jobs are already running, any newly scheduled jobs will wait until a scheduler job slot becomes available.
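If you want to see whether you're actually bumping into that limit, the searchscheduler metrics should also carry a skipped count alongside dispatched (I'm assuming that field is present in your version); a rough sketch of a search that plots them together:

index="_internal" splunk_server="local" source="*metrics.log" group="searchscheduler" | timechart partial=false span=10m sum(dispatched) AS Dispatched, sum(skipped) AS Skipped, max(max_lag) AS MaxLag

If Skipped stays at zero and MaxLag stays small, the scheduler is keeping up even when jobs occasionally have to wait for a slot.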
So, if you have a bunch of summaries or other scheduled searches on regular schedules, and one of them happens to take a long time, other jobs scheduled after it can get queued up waiting for it (or another job) to finish. Your graph looks like you have a bunch of short searches every 15 minutes, some others every 30 minutes, and then some long-running one(s) every 4 hours. That looks pretty reasonable, especially since your max lag appears to be under 60 seconds anyway.
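If you want to line those intervals up against the saw-tooth, you can list the scheduled searches and their cron schedules; assuming the rest search command and these field names are available in your version, something like this is a reasonable sketch:

| rest /servicesNS/-/-/saved/searches | search is_scheduled=1 | table title, cron_schedule

Any schedule that matches the spike period (e.g., a 4-hour cron) is a good candidate for the long-running job that backs the others up.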
Also, I wasn't sure what "Execution lag" meant, so thanks for helping me understand that.
You're right, it's under a minute, which isn't bad. The line just looked way too straight to be ignored. As it turns out, the real issue here is that I was looking at metrics from both my search instance and one forwarder that is running something I must have forgotten to turn off. Once I limited the search to just the correct host, the weird increasing pattern went away. There could still be something odd going on there, but I'm less concerned about saved searches running on forwarders, which have no local data to begin with.
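For reference, this is roughly the search I ended up with; the host value is just a placeholder for my actual search head:

index="_internal" host="<your-search-head>" splunk_server="local" source="*metrics.log" group="searchscheduler" dispatched!="0" | timechart partial=false span=10m max(max_lag) AS MaxLag, eval(sum(total_lag)/sum(dispatched)) AS AverageLag

Adding the host filter keeps the forwarder's scheduler metrics from being mixed into the same timechart.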