
max concurrent rt searches

damucka
Builder

Hello,

I have an issue with extending the number of concurrent RT searches. I can see a constant count of 36 RT searches executing on the system. It also happens that when I try to execute my search, it gets queued.

I have the following parameters active on the search head:

max_searches_per_cpu = 40
base_max_searches = 10
max_searches_perc = 77
max_rt_search_multiplier = 1
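
For context, these are the [search] stanza settings in limits.conf on the search head, i.e. roughly:

[search]
max_searches_per_cpu = 40
base_max_searches = 10
max_searches_perc = 77
max_rt_search_multiplier = 1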

So, based on the formulas (with the 8 CPU cores on this search head), the derived number of RT searches would be:

max_hist_searches = max_searches_per_cpu x number_of_cpus + base_max_searches = 40 x 8 + 10 = 330
max_rt_searches = max_rt_search_multiplier x max_hist_searches = 1 x 330 = 330

... but on the chart there are only 36 concurrent RT searches visible, so I guess there must be a limit set somewhere else.

And I assume we are hitting this limit, and that is why my searches so often go into the "queued" status.

Could you please advise?

Kind Regards,

Kamil


nickhills
Ultra Champion

Firstly, are you sure you really need RT searches?
Before you answer that, read this: https://answers.splunk.com/answers/734767/why-are-realtime-searches-disliked-in-the-splunk-w.html

The number of searches you can run (regardless of what you set in limits.conf) is limited by the processors in your deployment - and that is even more true of RT searching, which consumes one core PER SEARCH (on the SH and on each indexer).
Without digging into your config yet: if you really need more RT searches available, you are going to need more cores on all your search heads and indexers.
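
As a quick check of how many cores Splunk actually sees on the SH and the peers, something along these lines should work (just a sketch, adjust the server filter as needed):

| rest /services/server/info splunk_server=*
| table splunk_server numberOfCores numberOfVirtualCores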

With reference to the doc above, you may want to consider whether you can instead 'make do' with historical searches (which often give the same, if not better, results) at a far more efficient use of resources.

If my comment helps, please give it a thumbs up!


harsmarvania57
Ultra Champion

I would like to point out that max_searches_per_cpu = 40 is very bad for a production environment. Simply increasing max_searches_per_cpu to run more searches won't help; it will reduce your search performance a lot.


damucka
Builder

@harsmarvania57

Thank you.
How would I deal with the situation where I still have free CPU capacity, let us say 50-70% idle, and want to get more jobs/alerts processed in parallel?
I understand that overcommitment should be avoided, but when I define max_searches_per_cpu = 1 as recommended, then only around 30 parallel searches are possible, and at the moment we have a demand of up to 200. For me the situation is quite "easy":
- if I have free resources, I try to parallelize more.

Kind Regards,
Kamil


nickhills
Ultra Champion

The amount of 'work' a core is doing (i.e. how much of the processor is committed at any point in time) is not a direct indication of how 'limited' your deployment is with regard to RT capacity.

Because real-time searching can be an intensive process, when a user dispatches an RT search a core is dedicated to that search, and it remains allocated until the job completes - which is never for scheduled RT searches, or otherwise for as long as the user (or dashboard) keeps the search running.

This essentially means that if you have 4 cores, you are not going to be able to run more than 4 RT searches (although in practice Splunk and the OS need some of that processor time, so it will be fewer).

If you overcommit max_searches_per_cpu, it won't change how many RT searches can run at once, but it will impact your other searches, because now each of the cores not assigned to an RT search is going to be heavily oversubscribed. Increasing this from 1 to 40 means you can run 40 jobs per core at once, but each job will take 40 times as long.
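
To put numbers on that, assuming your 8-core SH: max_searches_per_cpu = 1 gives max_hist_searches = 1 x 8 + 10 = 18 concurrent searches, while a value of 40 gives 330 - but either way those jobs are contending for the same 8 cores.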

If you are seeing 30-36 concurrent RT searches, I am guessing you must have quite a few cores in your SH - you may even see this count reach higher than your 'real' core count, but that is just an artifact of the way Splunk reports the current concurrency.

I think you are suggesting that you seem limited to 36 RT searches, but that the overall processor use on your 32(???) cores is low?
That is the trade-off of RT searching: it is a very inefficient use of your processors, and if at all possible (for all the reasons in the @woodcock post I linked above) you should try to avoid it.

In short, if you need 200 rt searches at once, you need 200 cores.

If my comment helps, please give it a thumbs up!

damucka
Builder

Thank you. You convinced me :-).

I am going to:
- lower max_searches_per_cpu from 40 to 4
- try to increase the number of CPUs from 8 to 12
- persuade the project colleagues to turn the RT alerts into scheduled ones.
At the moment I am using the following search to identify them:

| rest /services/search/jobs | search isRealTimeSearch=1 | table label, author, dispatchState, eai:acl.owner, isRealTimeSearch, performance.dispatch.stream.local.duration_secs, runDuration, searchProviders, splunk_server, title

At the moment the above search returns 8 lines, which I guess correspond to the 8 RT alerts we have in the system. The 36 RT searches I was referring to before come from the following:

index=_internal sourcetype=splunkd source=*metrics.log group=search_concurrency "system total" 
         | timechart span=1m max(active_hist_searches) as "Historical Searches" max(active_realtime_searches) as "Real-time Searches"

... which actually delivers the maximum per minute. So the correct number of active RT searches is actually 8, which is still too high for an 8-CPU SH.

@ddrillic:
Forgive my ignorance, but what do you mean by "near real time searches"?
Would that be, e.g., an RT alert turned into a scheduled one on a 1-minute schedule?

Kind Regards,
Kamil


nickhills
Ultra Champion

Hi Kamil, that sounds very sensible.
Yes - @ddrillic means repeating a search every few minutes over a time range of the same length.
Like adding earliest=-6m@m latest=-1m@m to your search and scheduling it to run every 5 minutes.
(Shorten or lengthen those times to fit your needs.)
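
A minimal savedsearches.conf sketch of that pattern (the stanza name and the search itself are just placeholders for your alert):

[my_near_realtime_alert]
search = index=main sourcetype=my_sourcetype ERROR | stats count
dispatch.earliest_time = -6m@m
dispatch.latest_time = -1m@m
enableSched = 1
cron_schedule = */5 * * * *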

If my comment helps, please give it a thumbs up!

ddrillic
Ultra Champion

Right - that's the idea ;-)


damucka
Builder

Thank you, this was really helpful.
There is one more RT alert left in the system; I am chasing the end user to change it :-).

One last question:
- As it seems that the worst thing that can happen is the RT alerts, I would like to take away from the end users the authorization to create them.
Could you tell me what the corresponding role setting would be?

Kind Regards,
Kamil


nickhills
Ultra Champion

If you remove the rtsearch and schedule_rtsearch capabilities, users will not be able to run those types of searches.
If you want to change the config directly, you can use @woodcock's suggestion in authorize.conf:

[default]
# https://answers.splunk.com/answers/734767/why-does-everybody-hate-realtime-searches-what-is.html
# Kill all ability to do realtime (rt) searches because each one
# permanently locks 1 CPU core on the Search Head and EACH Indexer!
# Also set this for EVERY existing role.
rtsearch = disabled
schedule_rtsearch = disabled
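
To verify afterwards which roles still carry those capabilities, something like this should do (a sketch against the standard authorization REST endpoint):

| rest /services/authorization/roles
| search capabilities=rtsearch OR capabilities=schedule_rtsearch
| table title capabilities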
If my comment helps, please give it a thumbs up!

damucka
Builder

Thank you.
If I would still like to be able to execute ad-hoc RT searches with SPL, but forbid RT alerts/reports, then the only thing to change would be:

schedule_rtsearch = disabled

Is that right?


nickhills
Ultra Champion

Yes, but you need to do that for every role. You might want to leave it enabled for admin, though - that way admins can still schedule RT searches if they are ever absolutely necessary.
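
For example, a sketch assuming a role named "user" (repeat the disabled line for every other non-admin role):

[role_user]
schedule_rtsearch = disabled

[role_admin]
schedule_rtsearch = enabled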

If my comment helps, please give it a thumbs up!

damucka
Builder

Thank you, it was very helpful for me.
Today the SH became responsive again ...

Kind Regards,
Kamil


harsmarvania57
Ultra Champion

As you say you are 50-70% idle: have you checked CPU utilization at the times when Splunk complains that the number of searches has reached the maximum limit, or during peak time (like midnight, because I have seen many daily reports run at midnight in many organizations)? To fulfill the demand I would recommend adding more CPUs or more hardware. I would increase max_searches_per_cpu from 1 to 2 only in a very rare scenario, because I have seen that when you increase max_searches_per_cpu, more jobs will run but the completion time of each job will increase.
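
If you want to chart that from Splunk itself, something like this over the introspection data should work (a sketch; it assumes the default _introspection logging is enabled, and <your_SH> is a placeholder):

index=_introspection component=Hostwide host=<your_SH>
| eval cpu_pct = 'data.cpu_system_pct' + 'data.cpu_user_pct'
| timechart span=1m max(cpu_pct) AS "Host CPU %"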


ddrillic
Ultra Champion

Absolutely, try to shift towards near real time searches.


damucka
Builder

Sorry, I forgot to attach the picture. Basically it shows around 180-200 historical searches and 36-39 RT searches constantly throughout the last 7 days.

What now comes to my mind: what would fit from the math point of view is if

max_rt_searches + max_hist_searches <= max_searches_per_cpu x max_searches_perc

Is that so?
What I mean is that RT and historical searches together cannot be higher than max_searches_per_cpu times max_searches_perc. Only then would it kind of match.

Kind Regards,
Kamil


nickhills
Ultra Champion

Just checking - that's not a typo?
max_searches_per_cpu should be 1, and you should not really have changed this - certainly not to 40!

If my comment helps, please give it a thumbs up!