Monitoring Splunk

Servers idling but searches delayed

PickleRick
SplunkTrust

I have an indexer cluster consisting of 6 or so indexers. I also have a search-head cluster consisting of 3 SHs.

In webui I'm getting:

  • The percentage of high priority searches delayed (76%) over the last 24 hours is very high and exceeded the red thresholds (10%) on this Splunk instance. Total Searches that were part of this percentage=55. Total delayed Searches=42
  • The percentage of non high priority searches delayed (77%) over the last 24 hours is very high and exceeded the red thresholds (20%) on this Splunk instance. Total Searches that were part of this percentage=440. Total delayed Searches=339

Also, users report problems with dashboards refreshing very slowly, and so on.

But the Splunk components themselves do not seem to be stressed that much.

The machines have 64 GB of RAM each and 24 CPUs (indexers) or 32 CPUs (search heads), but the load tops out at around 10 on the SHs and 12 on the indexers. If I run vmstat I see the processors mostly idling, and about half of the memory on the search heads is unused (even counting cache as used memory).

So something is definitely wrong but I can't pinpoint the cause.

What can I check?

I do see, though, that the search heads are writing heavily to disk almost all the time.

Maybe I should then tweak some memory limits for the SHs to make them write to disk less? But which ones?

Any hints?

Of course, at first glance it looks as if I should raise the number of allowed concurrent searches because the CPUs are idle, but if storage is the bottleneck that won't help much, since I'd just be hitting the same streaming-to-disk problem with more concurrent searches.
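For the record, this is roughly how I've been watching the write load on the search heads. It's nothing Splunk-specific, just iostat from the sysstat package (assuming it's installed), keeping an eye on the write and utilization columns for the device holding $SPLUNK_HOME:

iostat -x 5

If that device sits near 100% utilization while the CPUs idle, that would back up the storage-bottleneck theory.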


codebuilder
Influencer

Be sure that you have the recommended ulimits set for the user running Splunk. I usually set them in the unit file.
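For example, with a systemd-managed splunkd, something along these lines in the [Service] section of the unit file (the values are illustrative, roughly in line with Splunk's recommendations, not something to copy blindly):

LimitNOFILE=65536
LimitNPROC=16384
LimitFSIZE=infinity

A systemctl daemon-reload and a restart of splunkd are needed afterwards for the new limits to take effect.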

You can verify with something like:

ps -ef | grep -i splunk
Grab the PID of the main splunkd process from the output, then run:
cat /proc/<splunkd_pid>/limits
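Or, assuming pgrep is available and the main process is named splunkd, the same check as a one-liner:

cat /proc/$(pgrep -o splunkd)/limits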

----
An upvote would be appreciated and Accept Solution if it helps!

codebuilder
Influencer

The first, and easiest, check would be to ensure that you have ulimits set correctly on all nodes.

https://docs.splunk.com/Documentation/Splunk/8.2.1/Troubleshooting/ulimitErrors

You can also check whether your concurrent-search limits are set to values sufficient for your workload. Delayed searches are often an indicator that those limits are set too low.

https://docs.splunk.com/Documentation/SplunkCloud/8.2.2106/Admin/ConcurrentLimits
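As a rough sketch, the relevant knobs live in limits.conf on the search heads. The values below are the defaults as far as I recall, shown only to illustrate where to look, not as recommendations:

[search]
base_max_searches = 6
max_searches_per_cpu = 1

[scheduler]
max_searches_perc = 50

The overall concurrency cap works out to roughly max_searches_per_cpu times the number of CPUs plus base_max_searches, and scheduled searches only get max_searches_perc percent of that, which is typically the pool those delayed-search warnings are about.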

----
An upvote would be appreciated and Accept Solution if it helps!

PickleRick
SplunkTrust

Ahhh, forgot to mention that. Ulimits were raised (to around 64k open files, I think) right at installation time, so it's definitely not that.

Since I'm seeing a significant _write_ load on the search heads, I'd suspect some memory-sizing issue, but that's just my "overall computing experience" hunch, not Splunk-specific 😉 That's also why I suspect that raising the concurrent-search limit would just get us stuck on I/O anyway, only with a higher number of searches. But I think I'll try that anyway, just to confirm my suspicions.

Oh, one more thing, we have ES installed.
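In case it's useful for anyone following along: my plan is to first see which scheduled searches are actually being delayed (ES schedules a lot of them). Something along these lines against the scheduler log should show it, assuming the usual fields in _internal:

index=_internal sourcetype=scheduler status=delayed
| stats count by app, savedsearch_name
| sort - count

If most of the delayed ones turn out to be ES correlation searches, that narrows it down considerably.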
