Dear All,
I have a Search Head, Deployment Server, Monitoring Console, a Cluster Manager, an Indexer Cluster and two unclustered Indexers.
On the Monitoring Console, I get alerts about IOWait being high on the two unclustered indexers, and this has only been happening since we upgraded to 8.2.5.
There is no evidence of any issues other than this alert in SplunkWeb, so I want to disable it. I am using the following KB article:
https://docs.splunk.com/Documentation/Splunk/8.2.5/Admin/Healthconf
On the Monitoring Console server, I have put the following into the etc\apps\search\local\health.conf file:
[feature:iowait]
alert:sum_top3_cpu_percs__max_last_3m.disabled = 1
However, I am still getting the alert appearing in SplunkWeb on the Monitoring Console server.
Why is this? Am I configuring health.conf on the wrong server or in the wrong folder, or something else? When I run cmd btool health list, I see the configuration there, but Splunk is not doing as it is told! If I am going about this the wrong way, can someone point me to documentation that explains what I should be doing?
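For reference, this is roughly the check I ran (sketched assuming a Linux-style install path; adjust for your platform). Adding --debug shows which file each setting is picked up from:
$SPLUNK_HOME/bin/splunk cmd btool health list --debug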
Thanks in advance!
The Health feature has caused some confusion regarding local vs. distributed config. I have investigated this and it is very flexible to configure, even though the docs are not so clear about it. So far I have not found any Answers posts that aren't possible to solve using standard config.
If you do the configuration locally on the Monitoring Console (DMC), as you described, that threshold is only valid for the DMC host itself. There is no distributed threshold. You need to configure the threshold on each and every enterprise instance (e.g. your standalone indexers). Either do the config in Splunk Web under the Settings menu on each instance and just click Save, or toggle Status Disable/Enable. This takes effect immediately and does not require a restart.
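If you prefer config files over Splunk Web, here is a rough sketch of what a per-indicator threshold override could look like in health.conf on one of the indexers. The indicator name matches the one in your alert; the numeric values are purely illustrative, not recommendations:
[feature:iowait]
# illustrative values only - tune them to your own benchmarks
indicator:sum_top3_cpu_percs__max_last_3m:red = 250
indicator:sum_top3_cpu_percs__max_last_3m:yellow = 175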
If you instead want to disable the iowait feature entirely, put this in health.conf on each instance:
[feature:iowait]
disabled = 1
Then you need to reload the configuration, e.g. via http://<your_splunk>:<splunk_port>/debug/refresh
If you have an indexer cluster and push the change out as a cluster bundle, applying the bundle will trigger a restart of the peers (verified on version 8.2.4).
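For completeness, a sketch of how such a push from the cluster manager typically looks (assuming health.conf sits in an app under $SPLUNK_HOME/etc/master-apps/, the pre-9.0 location):
splunk validate cluster-bundle
splunk apply cluster-bundle --answer-yes
# the apply step is what triggers the rolling restart of the peers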
Hope this solves your issue.
In addition, you can use these searches to benchmark iowait performance over time, so you can set thresholds relevant to your environment. Just replace the hostname:
CPU IOwait average
index=_internal source="*/splunk/var/log/splunk/health.log"
feature=IOWait component=PeriodicHealthReporter node_type=indicator
indicator=avg_cpu__max_perc_last_3m host=ind0*
| timechart span=30s max(measured_value) min(due_to_threshold_value) by host
CPU IOwait single CPU
index=_internal source="*/splunk/var/log/splunk/health.log"
feature=IOWait component=PeriodicHealthReporter node_type=indicator
indicator=single_cpu__max_perc_last_3m host=ind0*
| timechart span=300s max(measured_value) min(due_to_threshold_value) by host
CPU IOwait top3 CPU
index=_internal source="*/splunk/var/log/splunk/health.log"
feature=IOWait component=PeriodicHealthReporter node_type=indicator
indicator=sum_top3_cpu_percs__max_last_3m host=ind0*
| timechart span=30s max(measured_value) min(due_to_threshold_value) by host
Did you restart the MC server after changing the config file?
Have you tried making the same health.conf change on the indexers?
Yes, I restarted the MC several times after the changes to the configurations.
No, I have not edited the health.conf on the Indexers. They are quite difficult to restart at the moment and I was hoping that someone would have a KB or document that could help (or know definitively) before I went there.