<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to do a RCA on "The maximum number of concurrent running jobs ...  on this cluster has been reached"? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/How-to-do-a-RCA-on-quot-The-maximum-number-of-concurrent-running/m-p/632373#M219673</link>
    <description>&lt;P&gt;Well, I understand your point about "this"... but that's the problem: I couldn't find an error with the skipped searches... unless I am missing something.&lt;/P&gt;&lt;P&gt;Since I did the rolling restart (reset) there are no more skipped searches.&lt;/P&gt;&lt;P&gt;Previously I looked for the longest-running searches, and none were over-running their schedules that I could see. For example, one search took approximately an hour, but it ran every 4 hours.&lt;/P&gt;&lt;P&gt;After I did some optimizing, there were only 3 scheduled searches that produced the warning, which I identified with:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;index="_internal" sourcetype="scheduler" 
            | eval scheduled=strftime(scheduled_time, "%Y-%m-%d %H:%M:%S") 
            | stats values(scheduled) as scheduled
                    values(savedsearch_name) as search_name
                    values(status) as status
                    values(reason) as reason
                    values(run_time) as run_time 
                    values(dm_node) as dm_node
                    values(sid) as sid
                    by _time, savedsearch_name
            | sort -scheduled
            | table scheduled, search_name, status, reason, run_time&lt;/LI-CODE&gt;&lt;P&gt;When I looked back at those 3 specific searches, they were not over-running their schedules, so I was wondering how the scheduler got stuck thinking they were "piling up" vs. "still running".&lt;/P&gt;&lt;P&gt;What I am trying to investigate: if a search is "skipped", then when the SHC scheduler retries that previously skipped search at its next run time, how can I see that the SHC captain thinks it's still running?&lt;/P&gt;&lt;P&gt;Also, looking back at the "skipped" events, they don't contain "run_time", so I looked back historically to find a day with a high value. But when the searches did run, they took at most 4 seconds, with an average of 2 seconds, to complete, which is why I thought the scheduled searches were piling up. Hope that makes sense.&lt;/P&gt;&lt;P&gt;The only other variable I can think of is that these searches use the "| dbxquery" command from the Splunk DB Connect app.&lt;/P&gt;&lt;P&gt;So did the SHC just get stuck?&lt;/P&gt;&lt;P&gt;Any further thoughts appreciated.&lt;/P&gt;&lt;P&gt;TY&lt;/P&gt;</description>
    <pubDate>Mon, 27 Feb 2023 14:47:47 GMT</pubDate>
    <dc:creator>Glasses2</dc:creator>
    <dc:date>2023-02-27T14:47:47Z</dc:date>
    <item>
      <title>How to do a RCA on "The maximum number of concurrent running jobs ...  on this cluster has been reached"?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-do-a-RCA-on-quot-The-maximum-number-of-concurrent-running/m-p/632219#M219615</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;When I inherited this deployment, there were a lot of skipped searches.&lt;/P&gt;&lt;P&gt;The 3-node SHC was under-resourced, but with some cron skewing, tuning the limits, reducing zombie scheduled searches, and optimizing some searches... I reduced them a lot. However, some intensive apps were still causing skipped searches.&lt;/P&gt;&lt;P&gt;So we added a 4th node to the SHC, and it was running smoothly without a skipped search.&lt;/P&gt;&lt;P&gt;Recently, I started seeing a persistent skipped search warning. No new scheduled searches were added and resource usage looked good, but I kept seeing &amp;gt;&amp;gt;&lt;SPAN&gt;"The maximum number of concurrent running jobs for this historical scheduled search on this cluster has been reached".&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I could see the jobs that were skipped, but I am not finding a way to see which jobs piled up during a time interval and caused the skipped search and the warning.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I did notice some of the skipped searches were throwing warnings and errors. I am wondering if that caused a hanging job that added to the count and created a skipping loop.&lt;/P&gt;&lt;P&gt;If anyone has a way to see the scheduled searches that accumulate and cause this error and skipping, PLEASE advise.&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Fri, 24 Feb 2023 15:21:25 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-do-a-RCA-on-quot-The-maximum-number-of-concurrent-running/m-p/632219#M219615</guid>
      <dc:creator>Glasses2</dc:creator>
      <dc:date>2023-02-24T15:21:25Z</dc:date>
    </item>
    <item>
      <title>Re: How to do a RCA on "The maximum number of concurrent running jobs ...  on this cluster has been reached"?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-do-a-RCA-on-quot-The-maximum-number-of-concurrent-running/m-p/632261#M219629</link>
      <description>&lt;P&gt;The key words there are "&lt;SPAN&gt;for &lt;EM&gt;&lt;STRONG&gt;this&lt;/STRONG&gt;&lt;/EM&gt; historical scheduled search&lt;/SPAN&gt;&lt;SPAN&gt;"... So you're likely looking at a search job that's taking longer than its scheduled period to execute. I'd start by looking at the run times of the skipping search you've already found.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;(Of course, that doesn't rule out something crazy, like the job wasn't running but the SHC captain thought it was...)&lt;/P&gt;</description>
      <pubDate>Sat, 25 Feb 2023 04:05:17 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-do-a-RCA-on-quot-The-maximum-number-of-concurrent-running/m-p/632261#M219629</guid>
      <dc:creator>acharlieh</dc:creator>
      <dc:date>2023-02-25T04:05:17Z</dc:date>
    </item>
    <item>
      <title>Re: How to do a RCA on "The maximum number of concurrent running jobs ...  on this cluster has been reached"?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-do-a-RCA-on-quot-The-maximum-number-of-concurrent-running/m-p/632373#M219673</link>
      <description>&lt;P&gt;Well, I understand your point about "this"... but that's the problem: I couldn't find an error with the skipped searches... unless I am missing something.&lt;/P&gt;&lt;P&gt;Since I did the rolling restart (reset) there are no more skipped searches.&lt;/P&gt;&lt;P&gt;Previously I looked for the longest-running searches, and none were over-running their schedules that I could see. For example, one search took approximately an hour, but it ran every 4 hours.&lt;/P&gt;&lt;P&gt;After I did some optimizing, there were only 3 scheduled searches that produced the warning, which I identified with:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;index="_internal" sourcetype="scheduler" 
            | eval scheduled=strftime(scheduled_time, "%Y-%m-%d %H:%M:%S") 
            | stats values(scheduled) as scheduled
                    values(savedsearch_name) as search_name
                    values(status) as status
                    values(reason) as reason
                    values(run_time) as run_time 
                    values(dm_node) as dm_node
                    values(sid) as sid
                    by _time, savedsearch_name
            | sort -scheduled
            | table scheduled, search_name, status, reason, run_time&lt;/LI-CODE&gt;&lt;P&gt;When I looked back at those 3 specific searches, they were not over-running their schedules, so I was wondering how the scheduler got stuck thinking they were "piling up" vs. "still running".&lt;/P&gt;&lt;P&gt;What I am trying to investigate: if a search is "skipped", then when the SHC scheduler retries that previously skipped search at its next run time, how can I see that the SHC captain thinks it's still running?&lt;/P&gt;&lt;P&gt;Also, looking back at the "skipped" events, they don't contain "run_time", so I looked back historically to find a day with a high value. But when the searches did run, they took at most 4 seconds, with an average of 2 seconds, to complete, which is why I thought the scheduled searches were piling up. Hope that makes sense.&lt;/P&gt;&lt;P&gt;The only other variable I can think of is that these searches use the "| dbxquery" command from the Splunk DB Connect app.&lt;/P&gt;&lt;P&gt;So did the SHC just get stuck?&lt;/P&gt;&lt;P&gt;Any further thoughts appreciated.&lt;/P&gt;&lt;P&gt;TY&lt;/P&gt;</description>
      <pubDate>Mon, 27 Feb 2023 14:47:47 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-do-a-RCA-on-quot-The-maximum-number-of-concurrent-running/m-p/632373#M219673</guid>
      <dc:creator>Glasses2</dc:creator>
      <dc:date>2023-02-27T14:47:47Z</dc:date>
    </item>
  </channel>
</rss>

