How to do a RCA on "The maximum number of concurrent running jobs ... on this cluster has been reached"?

Glasses2 — Fri, 24 Feb 2023 15:21:25 GMT

Hi,

When I inherited this deployment, there were a lot of skipped searches.

The 3-node SHC was under-resourced, but by skewing cron schedules, tuning the limits, removing zombie scheduled searches, and optimizing some searches, I reduced the skips considerably. However, some intensive apps were still causing skipped searches.

So we added a 4th node to the SHC, and it was running smoothly without a skipped search.

Recently, I started seeing a persistent skipped-search warning. No new scheduled searches had been added and resource usage looked good, but I kept seeing "The maximum number of concurrent running jobs for this historical scheduled search on this cluster has been reached".

I could see the jobs that were skipped, but I am not finding a way to see which jobs piled up during a time interval that caused the skipped search and the warning.

I did notice that some of the skipped searches were throwing warnings and errors. I am wondering whether one of those left a hanging job that kept counting toward the limit and created a skipping loop.

If anyone has a way to see the scheduled searches that accumulate and cause this error and the skipping, please advise.

Thank you!

Re: How to do a RCA on "The maximum number of concurrent running jobs ... on this cluster has been reached"?

acharlieh — Sat, 25 Feb 2023 04:05:17 GMT

The key words there are "for this historical scheduled search"... So likely looking at a search job that's taking longer than its scheduled period to execute. I'd start with looking at the runtimes of the skipping search you've already found.
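
A minimal sketch of that first step, assuming the default _internal index and scheduler sourcetype used elsewhere in this thread, and assuming that completed runs log run_time in seconds while skipped events do not. For every search that was skipped at least once, it lists the skip count next to the average and maximum runtime, so the maximum can be compared by hand with that search's schedule interval:

```
index=_internal sourcetype=scheduler
| stats count(eval(status="skipped")) AS skipped_count
        avg(run_time) AS avg_run_time
        max(run_time) AS max_run_time
        BY app, savedsearch_name
| where skipped_count > 0
| sort - max_run_time
```

The time range matters: pick a window that covers both the skips and enough completed runs before them to give a meaningful maximum.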

(of course not ruling out something crazy like the job wasn't running but the SHC captain thought it was...)

Re: How to do a RCA on "The maximum number of concurrent running jobs ... on this cluster has been reached"?

Glasses2 — Mon, 27 Feb 2023 14:47:47 GMT

Well I understand your point about "this"... but that's the problem, I couldn't find an error with the skipped searches... unless I am missing something.

Since I did the rolling restart (reset) there are no more skipped searches.

Previously I looked for the longest running searches and none were over-running their schedules, that I could see. For example one search took an hour approx., but it ran every 4 hours.

Since I did some optimizing, only 3 scheduled searches produced the warning, which I identified with:

index="_internal" sourcetype="scheduler"
| eval scheduled=strftime(scheduled_time, "%Y-%m-%d %H:%M:%S")
| stats values(scheduled) as scheduled values(savedsearch_name) as search_name values(status) as status values(reason) as reason values(run_time) as run_time values(dm_node) as dm_node values(sid) as sid by _time,savedsearch_name
| sort -scheduled
| table scheduled, search_name, status, reason, run_time

When I looked back at those 3 specific searches, they were not over-running the schedules, so I was wondering how it got stuck thinking it was "piling up" vs "still running".
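
One way to narrow down which limit was hit is to group the skipped events by the concurrency fields that the scheduler log usually attaches to them. This is only a sketch: the field names concurrency_category, concurrency_context and concurrency_limit are an assumption about what this Splunk version writes to scheduler.log, so check a raw skipped event first. If any of these fields is absent, stats will drop those events from the result.

```
index=_internal sourcetype=scheduler status=skipped
| stats count AS skipped_count
        min(_time) AS first_skip
        max(_time) AS last_skip
        BY savedsearch_name, reason, concurrency_category, concurrency_context, concurrency_limit
| convert ctime(first_skip) ctime(last_skip)
```

If the limit shown is 1 and the context refers to the saved search itself, the limit being hit is probably the per-search max_concurrent setting in savedsearches.conf (default 1) and not the overall scheduler quota, which would fit the "for this historical scheduled search" wording.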

What I am trying to investigate is this: if a search is "skipped", then when the SHC scheduler tries that search again at its next scheduled time, how can I see that the SHC captain thinks it is still running?
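
A possible starting point, offered as a sketch and not as a view of the captain's internal bookkeeping: the search/jobs REST endpoint lists the jobs a member currently knows about, including ones that have not finished. The search below assumes it is run on each SHC member while the skipping is happening (splunk_server=local restricts it to the member it runs on), and that the account running it can see other users' jobs.

```
| rest /services/search/jobs splunk_server=local
| search isSavedSearch=1 isDone=0
| table label, sid, dispatchState, runDuration, isZombie
| sort - runDuration
```

A scheduled search that normally finishes in seconds but shows up here with a large runDuration, or with isZombie set, would be a candidate for the job that is still being counted against the limit.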

Also, the "skipped" events don't contain "run_time", so I looked back historically for a day with a high value. But when the searches were running, they took at most 4 seconds, with an average of 2 seconds, to complete, which is why I thought the scheduled searches were piling up. Hope that makes sense.

The only other variable I can think of is that these searches use the "| dbxquery" command from the Splunk DB Connect app.

So did the SHC just get stuck?

Any further thoughts appreciated.