Hello community,

I have a question that has been floating around here for quite some time. Although I've seen quite a few conversations and tips, I have not found a "single definitive source of truth".

Some time ago, when we upgraded Splunk from v8 to v9, we started seeing IOWait alerts from the health monitor. I've checked resource usage on our indexers (which are generating the alerts), and the cause of the alerts seems to be spikes in resource usage: 3 out of x indexers have spikes in resource usage within 10 minutes, which triggers an alert. Most of the time the thresholds seem wound really tight and the alerts somewhat overblown. On the other hand, they should be there for a reason, and I am not sure whether tuning the alert levels is the right way to go.

I have gone through the following threads:

https://community.splunk.com/t5/Monitoring-Splunk/Why-is-IOWait-red-after-upgrade/m-p/600262#M8968
https://community.splunk.com/t5/Deployment-Architecture/IOWAIT-alert/m-p/666536#M27634
https://community.splunk.com/t5/Splunk-Enterprise/Why-am-I-receiving-this-error-message-IOWait-Resource-usage/m-p/578077#M10932
https://community.splunk.com/t5/Splunk-Search/Configure-a-Vsphere-VM-for-Splunk/td-p/409840
https://community.splunk.com/t5/Monitoring-Splunk/Running-Splunk-on-a-VM-CPU-contention/m-p/107582

Some recommendations are to either ignore the alerts or adjust the thresholds. Continuously ignoring them seems like a slippery slope to desensitization, and continuously monitoring them adds to the risk of alert fatigue. Others recommend ensuring adequate resources to solve the core issue, which seems logical, though I am unsure how to go about it.

I am left with two questions:

1) What concrete actions can be taken to minimize the chance of these alerts/issues in a deployment based on VMware Linux servers? In other words, what can/should I forward to the server group that they can work with, check, and confirm in order to minimize the chance of these alerts?
2) What recommendations, if any, exist regarding modifying the default thresholds? I could set the thresholds high enough to not alert on "normal activity". Is this the recommended adjustment, or are there any concrete recommended modifications?
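
For context on question 1, one way the server group could cross-check the health monitor against the OS would be to sample iowait directly from /proc/stat on an indexer. The following is only a minimal sketch, assuming a Linux host with the standard /proc/stat field order. The function names are hypothetical, and this is not necessarily how Splunk itself computes the IOWait indicator.

```python
import time


def parse_cpu_line(line):
    """Return (iowait_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    iowait = values[4]
    # guest and guest_nice are already counted in user and nice,
    # so only the first eight fields are summed for the total.
    total = sum(values[:8])
    return iowait, total


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two 'cpu' line samples."""
    io0, t0 = parse_cpu_line(before)
    io1, t1 = parse_cpu_line(after)
    delta_total = t1 - t0
    if delta_total <= 0:
        return 0.0
    return 100.0 * (io1 - io0) / delta_total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    interval = 5  # seconds between samples; an arbitrary choice for illustration
    first = read_cpu_line()
    time.sleep(interval)
    second = read_cpu_line()
    print(f"iowait over {interval}s: {iowait_percent(first, second):.1f}%")
```

Running something like this at the times the health monitor turns yellow or red could show whether the spikes are also visible at the OS level, which would help decide whether this is a resource problem or a threshold problem.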