Monitoring Splunk

Iowait: Sum of 3 highest per-cpu iowaits reached x threshold of n

fatsug
Builder

Hello community, I have a question that has been floating around here for quite some time, and though I've seen quite a few conversations and tips, I have not found a single definitive source of truth.

At some point, some time ago, when bumping Splunk from v8 to v9, we started noticing IOWait alerts from the health monitor. I've checked resource usage on our indexers (which are generating the alerts), and the cause of the alerts seems to be spikes in resource usage: 3 out of x indexers have spikes in resource usage within 10 minutes, which triggers an alert.

Most of the time these alerts seem wound really tight and somewhat overblown; on the other hand, they should be there for a reason, and I am not sure whether tuning the alert levels is the right way to go.

I have gone through the following threads:

https://community.splunk.com/t5/Monitoring-Splunk/Why-is-IOWait-red-after-upgrade/m-p/600262#M8968
https://community.splunk.com/t5/Deployment-Architecture/IOWAIT-alert/m-p/666536#M27634
https://community.splunk.com/t5/Splunk-Enterprise/Why-am-I-receiving-this-error-message-IOWait-Resou...
https://community.splunk.com/t5/Splunk-Search/Configure-a-Vsphere-VM-for-Splunk/td-p/409840
https://community.splunk.com/t5/Monitoring-Splunk/Running-Splunk-on-a-VM-CPU-contention/m-p/107582

Some recommendations are to either ignore the alerts or adjust the thresholds. Continuously ignoring them seems like a slippery slope to desensitization, while leaving them as they are adds to the risk of alert fatigue. Others recommend ensuring adequate resources to solve the core issue, which seems logical, though I am unsure how.

I am left with two questions:

1) What concrete actions could be taken to minimize the chance of these alerts/issues in a deployment based on Linux servers running on VMware? In other words, what can/should I forward to the server group so that they have something to work with, check, and confirm in order to minimize the chance of these alerts? (See the sketch after this list.)
2) What recommendations, if any, exist regarding modifying the default thresholds? I could set the thresholds high enough not to alert on "normal activity"; is this the recommended adjustment, or are there other concrete recommended modifications?
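
For question 1, a minimal sketch of the kind of OS-level checks that could be forwarded to the server group, using standard sysstat tooling (the sampling intervals below are arbitrary placeholders):

    # Sample overall CPU usage, including %iowait, every 5 seconds for 10 samples
    sar -u 5 10

    # Per-device utilization and wait times; sustained high %util or long await
    # values point at slow or overloaded storage
    iostat -x 5 10

    # Quick look at processes blocked on I/O (the 'b' column)
    vmstat 5 10

On the VMware side, the usual suspects for iowait spikes are datastore latency and CPU ready/co-stop time, which the VM team can check in vCenter for the indexer guests.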

richgalloway
SplunkTrust

The IOWait health check is far too sensitive.  The threshold should be adjusted so normal activity does not trigger an alert.

---
If this reply helps you, Karma would be appreciated.
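
For reference, the iowait thresholds live under the [feature:iowait] stanza of health.conf. A minimal sketch of what an override might look like, assuming the indicator names from the default health.conf spec in recent Splunk versions (the values are placeholders, not recommendations; verify the exact indicator names against $SPLUNK_HOME/etc/system/default/health.conf on your own installation):

    [feature:iowait]
    # "Sum of 3 highest per-cpu iowaits" indicator from the alert in question
    indicator:sum_top3_cpu_percs__max_last_3m:yellow = 15
    indicator:sum_top3_cpu_percs__max_last_3m:red = 25
    # Average iowait across all CPUs
    indicator:avg_cpu__max_perc_last_3m:yellow = 10
    indicator:avg_cpu__max_perc_last_3m:red = 20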

fatsug
Builder

I'll mark this down as the solution and figure out how to push the modified settings from a manager.

fatsug
Builder

Thank you

So the acceptable solution to these issues is to adjust the thresholds so they do not trigger under "normal operation".

The follow-up concerns the threshold settings. From what I understand, these alerts are generated locally on the indexers in the indexer cluster. The health.conf settings are apparently not synced within an indexer cluster, only in the search head cluster, where any changes have no effect (already tried).

If the thresholds are to be modified in the indexer cluster, which file and values should be pushed from the manager to change them? I have not been able to identify these in the documentation. If not in the indexer cluster, then where?
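
If the overrides do need to live on the peers, one approach (sketched here under the assumption that health.conf is distributed like any other peer configuration; the _cluster app is simply the conventional bundle location) is to push it from the cluster manager via the configuration bundle:

    # On the cluster manager: add the override to the configuration bundle
    mkdir -p $SPLUNK_HOME/etc/manager-apps/_cluster/local
    $EDITOR $SPLUNK_HOME/etc/manager-apps/_cluster/local/health.conf   # [feature:iowait] overrides

    # Validate, then push the bundle to all peers
    $SPLUNK_HOME/bin/splunk validate cluster-bundle
    $SPLUNK_HOME/bin/splunk apply cluster-bundle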

richgalloway
SplunkTrust

The Search Head determines whether the alert should be displayed or not, so that is where the threshold is set. Go to Settings -> Health Report Manager to change the threshold.

---
If this reply helps you, Karma would be appreciated.

fatsug
Builder

Yes, "should" seems correct 😊 However, I did change these levels and there was no effect. This led me to the documentation pointing at local changes on the indexers.

The warnings are presented for the indexers under "Health of Distributed Splunk Deployment", though I am not sure where the warnings are generated or where they are "collected" before being presented in the search head cluster. I also cannot figure out where in the health.conf file these levels could/should be modified (health.conf | Splunk Docs), so I'm guessing it is likely somewhere else.
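
Two checks that might help narrow down where the thresholds are actually being read from (btool is standard; the REST endpoint below is assumed from recent Splunk versions and worth confirming against the REST API reference):

    # Show the effective health.conf settings and which file each value comes from
    $SPLUNK_HOME/bin/splunk btool health list feature:iowait --debug

    # Query the local health report on an indexer to see the raw indicator values
    curl -k -u admin https://localhost:8089/services/server/health/splunkd/details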
