Monitoring Splunk

Documentation for "Health of Splunk Deployment" calculation

michel_wolf
Path Finder

Hi all,

in Splunk there is always this icon next to your user for the "Health of Splunk Deployment".
You can change these indicators and features or their thresholds, but I can't find anything about what Splunk actually does in the background to collect these values.

You can find something like this in health.conf:

[feature:iowait]
display_name = IOWait 
indicator:avg_cpu__max_perc_last_3m:description = This indicator tracks the average IOWait percentage across all CPUs on the machine running the Splunk Enterprise instance, over the last 3 minute window. By default, this indicator will turn Yellow if the percentage exceeds 1% and Red if it exceeds 3% during this window. 
indicator:avg_cpu__max_perc_last_3m:red = 3
indicator:avg_cpu__max_perc_last_3m:yellow = 1
indicator:single_cpu__max_perc_last_3m:description = This indicator tracks the IOWait percentage for the single most bottle-necked CPU on the machine running the Splunk Enterprise instance, over the last 3 minute window. By default, this indicator will turn Yellow if the percentage exceeds 5% and Red if it exceeds 10% during this window. 
indicator:single_cpu__max_perc_last_3m:red = 10
indicator:single_cpu__max_perc_last_3m:yellow = 5 
indicator:sum_top3_cpu_percs__max_last_3m:description = This indicator tracks the sum of IOWait percentage for the three most bottle-necked CPUs on the machine running the Splunk Enterprise instance, over the last 3 minute window. By default, this indicator will turn Yellow if the sum exceeds 7% and Red if it exceeds 15% during this window. 
indicator:sum_top3_cpu_percs__max_last_3m:red = 15
indicator:sum_top3_cpu_percs__max_last_3m:yellow = 7    
 
I can't find out how Splunk generates this data or how this alert/indicator is created. There must be some kind of process, like a scripted input, that executes a command such as top to read the CPU iowait time and writes it to health.log, which Splunk then ingests, plus a search that provides the values for these indicators.
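The thread never confirms how Splunk collects the metric, but the OS-level iowait figure that tools like top report can be derived from two reads of /proc/stat. The following is a hypothetical sketch of what such a collector could do on Linux, not Splunk's actual implementation:

```shell
#!/bin/sh
# Sketch: compute the system-wide iowait percentage over a short window
# from /proc/stat (Linux only). Field order on the "cpu" line is:
# user nice system idle iowait irq softirq steal guest guest_nice.
read -r _ u1 n1 s1 i1 w1 rest < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 rest < /proc/stat
# Delta of total jiffies vs. delta of jiffies spent waiting on I/O.
dt=$(( (u2 + n2 + s2 + i2 + w2) - (u1 + n1 + s1 + i1 + w1) ))
dw=$(( w2 - w1 ))
IOWAIT_PCT=$(( 100 * dw / dt ))   # integer percent of elapsed time in iowait
echo "iowait: ${IOWAIT_PCT}%"
```

A per-CPU variant (matching the single_cpu and sum_top3 indicators) would read the cpu0, cpu1, ... lines the same way and rank the deltas.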
 

TomK
Observer

Are you guys still experiencing this error?
I can't get rid of it, even though everything looks fine on my instance and the OS.

THP are disabled.
Health Check does not report any issue with limits.
CPU usage is fine.
RAM usage is fine.

IOwait on disk is also fine.

[Screenshot: TomK_0-1627037926114.png]

Max measured values in the past 24h.

[Screenshot: TomK_1-1627038276018.png]

 

Any ideas?

0 Karma

michel_wolf
Path Finder

Nope. For my part I decided to adjust the thresholds or disable this health check, because it's meaningless without documentation on how and how often Splunk checks this iowait.

In general I have no problems with my systems: my searches are fast and I have no indexing delay. So if you are interested in this, I think the best option is to create a support case. I can't find any searches that create this data, so I think there is a "hidden" scripted input for it.

0 Karma

jotne
Path Finder

Can you post how/where you adjusted the threshold?

0 Karma

michel_wolf
Path Finder

You can go to Settings --> "Health Report Manager" and then just search for iowait; there you can enable or disable the alert and edit the thresholds.

Or you can use health.conf and copy the stanza to system/local, like:

[feature:iowait]
display_name = IOWait 
indicator:avg_cpu__max_perc_last_3m:description = This indicator tracks the average IOWait percentage across all CPUs on the machine running the Splunk Enterprise instance, over the last 3 minute window. By default, this indicator will turn Yellow if the percentage exceeds 1% and Red if it exceeds 3% during this window.
indicator:avg_cpu__max_perc_last_3m:red = 3
indicator:avg_cpu__max_perc_last_3m:yellow = 1 
indicator:single_cpu__max_perc_last_3m:description = This indicator tracks the IOWait percentage for the single most bottle-necked CPU on the machine running the Splunk Enterprise instance, over the last 3 minute window. By default, this indicator will turn Yellow if the percentage exceeds 5% and Red if it exceeds 10% during this window.
indicator:single_cpu__max_perc_last_3m:red = 10
indicator:single_cpu__max_perc_last_3m:yellow = 5 
indicator:sum_top3_cpu_percs__max_last_3m:description = This indicator tracks the sum of IOWait percentage for the three most bottle-necked CPUs on the machine running the Splunk Enterprise instance, over the last 3 minute window. By default, this indicator will turn Yellow if the sum exceeds 7% and Red if it exceeds 15% during this window.
indicator:sum_top3_cpu_percs__max_last_3m:red = 15
indicator:sum_top3_cpu_percs__max_last_3m:yellow = 7
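For example, a minimal override in $SPLUNK_HOME/etc/system/local/health.conf only needs the keys you want to change (the values here are arbitrary examples; Splunk's config layering keeps the defaults for everything you omit):

```ini
# $SPLUNK_HOME/etc/system/local/health.conf
# Raise only the average-CPU iowait thresholds; all other indicator
# settings fall back to the defaults in system/default/health.conf.
[feature:iowait]
indicator:avg_cpu__max_perc_last_3m:yellow = 5
indicator:avg_cpu__max_perc_last_3m:red = 10
```

Depending on your version, a restart may be needed before the file edit takes effect; changes made through the Health Report Manager UI apply immediately.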

hannyt
Engager

I also want to know the actual SPL used.

0 Karma

richgalloway
SplunkTrust

See and control what is in the health report using the Health Report Manager at https://<localhost>:8000/en-US/manager/system/health/manager

You can read about the health report at https://docs.splunk.com/Documentation/Splunk/8.1.2/DMC/Aboutfeaturemonitoring?ref=hk

---
If this reply helps you, an upvote would be appreciated.
0 Karma

michel_wolf
Path Finder

Yes, this is OK, but if I take a look at /en-US/manager/system/health/manager or the documentation, I find nothing about the calculation behind the thresholds, and the link to feature monitoring also doesn't provide any information about this data and its calculation, or I am blind.

Maybe two examples:

1. https://docs.splunk.com/Documentation/Splunk/8.1.2/DMC/Usefeaturemonitoring#1._Review_the_splunkd_he...

In this screenshot you can see an error and some related INFO messages for this behavior, to understand why the indicator is yellow or red.

So I expect there is an alerting saved search or something similar, like: index=_internal sourcetype=splunkd component="CMMaster" "streaming error" | stats count as count | eval threshold=if(count<10,"yellow","red")

If you don't have the information why the indicator is red or yellow, the indicator can mean anything.

2. You get an error like this:

[Screenshot: michel_wolf_0-1624976119853.png]

 

So how do you troubleshoot this warning or error? The only information you have is "System iowait reached yellow threshold of 1", but I can't find anything about which data Splunk calculates this information from or how this data is generated.

The only thing I can find is the settings for the thresholds, but nothing about the calculation behind them, which makes the alert meaningless to me.

[Screenshot: michel_wolf_1-1624976653595.png]

0 Karma

richgalloway
SplunkTrust

I understand what you're asking for now.  I think you'll find the information you seek in $SPLUNK_HOME/etc/system/default/health.conf.

---
If this reply helps you, an upvote would be appreciated.
0 Karma

michel_wolf
Path Finder

Unfortunately no. In this configuration you can only see what the indicator is supposed to stand for, but not how the data is collected and evaluated. However, I have made some progress: I was able to find out that in the app splunk_instrumentation the following searches are responsible for it:

 instrumentation.usage.healthMonitor.report

instrumentation.usage.healthMonitor.currentState

And in the currentState search there is the join to the data from health.log for the threshold:

index=_internal earliest=-1d source=*health.log component=PeriodicHealthReporter node_path=splunkd.resource_usage.iowait

In this case you can replace iowait with the feature you want to look at in more detail.
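Building on that search, you can also chart how an indicator's state changed over time. The color field name here is an assumption based on typical PeriodicHealthReporter events; verify it with the field picker on your instance:

```spl
index=_internal earliest=-24h source=*health.log component=PeriodicHealthReporter node_path=splunkd.resource_usage.iowait
| timechart span=5m count by color
```

Spikes in the yellow/red series should line up with the times the health badge changed, which helps correlate them with OS-level metrics from the same window.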

The last step that is still missing is how Splunk generates health.log, since the state is already created there for the evaluation.

 

0 Karma