Monitoring Splunk

Documentation for "Health of Splunk Deployment" calculation

michel_wolf
Path Finder

Hi all,

In Splunk there is always this icon next to your user name for the "Health of Splunk Deployment".
You can change these indicators and features or their thresholds, but I can't find anything about what Splunk actually does in the background to collect these values.

You can find something like this in health.conf:

[feature:iowait]
display_name = IOWait 
indicator:avg_cpu__max_perc_last_3m:description = This indicator tracks the average IOWait percentage across all CPUs on the machine running the Splunk Enterprise instance, over the last 3 minute window. By default, this indicator will turn Yellow if the percentage exceeds 1% and Red if it exceeds 3% during this window. 
indicator:avg_cpu__max_perc_last_3m:red = 3
indicator:avg_cpu__max_perc_last_3m:yellow = 1
indicator:single_cpu__max_perc_last_3m:description = This indicator tracks the IOWait percentage for the single most bottle-necked CPU on the machine running the Splunk Enterprise instance, over the last 3 minute window. By default, this indicator will turn Yellow if the percentage exceeds 5% and Red if it exceeds 10% during this window. 
indicator:single_cpu__max_perc_last_3m:red = 10
indicator:single_cpu__max_perc_last_3m:yellow = 5 
indicator:sum_top3_cpu_percs__max_last_3m:description = This indicator tracks the sum of IOWait percentage for the three most bottle-necked CPUs on the machine running the Splunk Enterprise instance, over the last 3 minute window. By default, this indicator will turn Yellow if the sum exceeds 7% and Red if it exceeds 15% during this window. 
indicator:sum_top3_cpu_percs__max_last_3m:red = 15
indicator:sum_top3_cpu_percs__max_last_3m:yellow = 7    
 
I can't find out how Splunk generates this data and how this alert or indicator is created. There must be some kind of process, like a scripted input that runs a top command to check the CPU wait time and writes it to health.log, which Splunk then ingests, plus a search that provides the information for these indicators.
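To make the question concrete, here is a minimal Python sketch of what the indicator descriptions above seem to imply. It is not Splunk's implementation: the function names are hypothetical, and the assumption that each indicator is the maximum over the 3 minute window of a per-measurement aggregate is only my reading of the indicator names. The thresholds are the defaults from the stanza above.

```python
from typing import Dict, List

# Default thresholds copied from the [feature:iowait] stanza above.
THRESHOLDS = {
    "avg_cpu__max_perc_last_3m": {"yellow": 1, "red": 3},
    "single_cpu__max_perc_last_3m": {"yellow": 5, "red": 10},
    "sum_top3_cpu_percs__max_last_3m": {"yellow": 7, "red": 15},
}


def indicator_values(samples: List[List[float]]) -> Dict[str, float]:
    """Aggregate a window of measurements (hypothetical helper).

    Each element of `samples` is one measurement: a list with the IOWait
    percentage of every CPU at that moment. Assumption: each indicator is
    the maximum over the window of a per-measurement aggregate.
    """
    return {
        # Highest average across all CPUs seen in the window.
        "avg_cpu__max_perc_last_3m": max(sum(s) / len(s) for s in samples),
        # Highest value of any single CPU seen in the window.
        "single_cpu__max_perc_last_3m": max(max(s) for s in samples),
        # Highest sum of the three busiest CPUs seen in the window.
        "sum_top3_cpu_percs__max_last_3m": max(
            sum(sorted(s, reverse=True)[:3]) for s in samples
        ),
    }


def color(value: float, yellow: float, red: float) -> str:
    """Map a value to a color; the descriptions say 'exceeds', so use '>'."""
    if value > red:
        return "red"
    if value > yellow:
        return "yellow"
    return "green"


def evaluate(samples: List[List[float]]) -> Dict[str, str]:
    """Return the color of every IOWait indicator for one window."""
    values = indicator_values(samples)
    return {name: color(v, **THRESHOLDS[name]) for name, v in values.items()}
```

With a single measurement of four CPUs at 0, 0, 0 and 12 percent, evaluate([[0.0, 0.0, 0.0, 12.0]]) reports the single-CPU indicator as red and the other two as yellow, which shows how one busy CPU could be enough to turn the feature red under these assumptions.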
 

nunoaragao
Explorer

I created two Splunk Support cases related to another item on Splunk Health, and it would be simpler if these had better support documentation.

The item was actually Buckets, which is listed in health.conf as the feature:buckets stanza with two indicators, one of them percent_small_buckets_created_last_24h.
What I found was that the splunk_instrumentation app has a saved search called [instrumentation.deployment.index] that writes these metrics to _internal by querying the REST endpoint /services/data/indexes.

10-21-2021 12:02:06.956 +0100 INFO  PeriodicHealthReporter - feature="Buckets" color=red indicator="percent_small_buckets_created_last_24h"

And then there is a second saved search, with the stanza [instrumentation.usage.healthMonitor.currentState], that joins a table returned from REST endpoints with these logs read from _internal. The search returns a JSON structure with the overall Health Report stats.

Is this structure used to populate the Health Report in the web interface? Unsure. As I said: two tickets.
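For anyone who wants to inspect PeriodicHealthReporter lines like the one above outside Splunk, here is a minimal Python sketch that pulls out the key=value pairs. The function name is hypothetical, and the only format it assumes is the one visible in the log line quoted above (quoted or unquoted values after an equals sign).

```python
import re
from typing import Dict

# Matches key=value where the value is either "quoted" or a bare token.
KEY_VALUE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_health_line(line: str) -> Dict[str, str]:
    """Return the key=value pairs of one health.log line as a dict."""
    fields = {}
    for match in KEY_VALUE.finditer(line):
        key, quoted, bare = match.groups()
        # Exactly one of the two value groups participates in a match.
        fields[key] = quoted if quoted is not None else bare
    return fields


line = (
    '10-21-2021 12:02:06.956 +0100 INFO  PeriodicHealthReporter - '
    'feature="Buckets" color=red '
    'indicator="percent_small_buckets_created_last_24h"'
)
print(parse_health_line(line))
```

This only shows what the reporter has already decided (feature, color, indicator); it says nothing about how the underlying value was measured.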


TomK
Observer

Are you guys still experiencing this error?
I can't get rid of it, even though everything looks fine on my instance and OS.

THP are disabled.
Health Check does not report any issue with limits.
CPU usage is fine.
RAM usage is fine.

IOwait on disk is also fine.

TomK_0-1627037926114.png

Max measured values in the past 24h.

TomK_1-1627038276018.png

 

Any ideas?


michel_wolf
Path Finder

Nope. I decided to adjust the threshold or disable this health check, because without documentation on how and how often Splunk checks this iowait, it is meaningless.

In general I have no problems with my systems: my searches are fast and I have no indexing delay. So if you are interested in this, I think the best option is to create a support case. I can't find any searches that create this data; I think there is a "hidden" scripted input for it.


jotne
Builder

Can you post how/where you adjusted the threshold?


michel_wolf
Path Finder

You can go to Settings --> "Health Report Manager" and search for iowait. There you can enable the alert or edit the thresholds.

Or you can copy the stanza from health.conf into $SPLUNK_HOME/etc/system/local/health.conf and edit it there, like:

[feature:iowait]
display_name = IOWait 
indicator:avg_cpu__max_perc_last_3m:description = This indicator tracks the average IOWait percentage across all CPUs on the machine running the Splunk Enterprise instance, over the last 3 minute window. By default, this indicator will turn Yellow if the percentage exceeds 1% and Red if it exceeds 3% during this window.
indicator:avg_cpu__max_perc_last_3m:red = 3
indicator:avg_cpu__max_perc_last_3m:yellow = 1 
indicator:single_cpu__max_perc_last_3m:description = This indicator tracks the IOWait percentage for the single most bottle-necked CPU on the machine running the Splunk Enterprise instance, over the last 3 minute window. By default, this indicator will turn Yellow if the percentage exceeds 5% and Red if it exceeds 10% during this window.
indicator:single_cpu__max_perc_last_3m:red = 10
indicator:single_cpu__max_perc_last_3m:yellow = 5 
indicator:sum_top3_cpu_percs__max_last_3m:description = This indicator tracks the sum of IOWait percentage for the three most bottle-necked CPUs on the machine running the Splunk Enterprise instance, over the last 3 minute window. By default, this indicator will turn Yellow if the sum exceeds 7% and Red if it exceeds 15% during this window.
indicator:sum_top3_cpu_percs__max_last_3m:red = 15
indicator:sum_top3_cpu_percs__max_last_3m:yellow = 7
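
If you want to check which thresholds a stanza like the one above actually sets (for example to compare default and local), here is a minimal Python sketch using the standard configparser module. The function name is hypothetical, and it assumes the text contains only stanzas in the format shown above; a real Splunk .conf file may use syntax that configparser does not accept.

```python
import configparser
from typing import Dict


def read_thresholds(conf_text: str, stanza: str) -> Dict[str, Dict[str, int]]:
    """Collect the red/yellow thresholds of every indicator in one stanza."""
    parser = configparser.ConfigParser(
        delimiters=("=",),   # keys contain ':', so only '=' may split key and value
        interpolation=None,  # descriptions contain '%', which must stay literal
    )
    parser.read_string(conf_text)
    thresholds: Dict[str, Dict[str, int]] = {}
    for key, value in parser[stanza].items():
        parts = key.split(":")
        # Keep only keys shaped like indicator:<name>:red or indicator:<name>:yellow.
        if len(parts) == 3 and parts[0] == "indicator" and parts[2] in ("red", "yellow"):
            thresholds.setdefault(parts[1], {})[parts[2]] = int(value)
    return thresholds


sample = """[feature:iowait]
display_name = IOWait
indicator:avg_cpu__max_perc_last_3m:description = Yellow above 1% and Red above 3%.
indicator:avg_cpu__max_perc_last_3m:red = 3
indicator:avg_cpu__max_perc_last_3m:yellow = 1
"""
print(read_thresholds(sample, "feature:iowait"))
```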

hannyt
Engager

I also want to know the actual SPL used.


richgalloway
SplunkTrust

See and control what is in the health report using the Health Report Manager at https://<localhost>:8000/en-US/manager/system/health/manager

You can read about the health report at https://docs.splunk.com/Documentation/Splunk/8.1.2/DMC/Aboutfeaturemonitoring?ref=hk

---
If this reply helps you, Karma would be appreciated.

michel_wolf
Path Finder

Yes, that is fine, but when I look at /en-US/manager/system/health/manager or the documentation, I find nothing about how the threshold values are calculated. The feature monitoring link doesn't provide any information about this data and its calculation either, or I am blind.

Maybe two examples:

1. https://docs.splunk.com/Documentation/Splunk/8.1.2/DMC/Usefeaturemonitoring#1._Review_the_splunkd_he...

In this screenshot you can see an error and some related INFO messages that help you understand why the indicator is yellow or red.

So I expect there is an alert, a saved search, or something else along the lines of:

index=_internal sourcetype=splunkd component="CMMaster" "streaming error" | stats count as count | eval threshold=if(count<10,"yellow","red")

If you don't know why the indicator is red or yellow, the indicator could mean anything.

2. You get an error like this:

michel_wolf_0-1624976119853.png

 

So how do you troubleshoot this warning or error? The only information you have is "System iowait reached yellow threshold of 1", but I cannot find anything about which data Splunk uses to calculate this or how that data was generated.

The only thing I can find is the threshold settings, but nothing about how the values are calculated, which makes the alert meaningless to me.

michel_wolf_1-1624976653595.png

 

 

 

richgalloway
SplunkTrust

I understand what you're asking for now.  I think you'll find the information you seek in $SPLUNK_HOME/etc/system/default/health.conf.

---
If this reply helps you, Karma would be appreciated.

michel_wolf
Path Finder

Unfortunately no. That configuration only shows what the indicator is supposed to represent, not how the data is collected and evaluated. I have made some progress, though: I found that in the splunk_instrumentation app the following searches are responsible for it:

 instrumentation.usage.healthMonitor.report

instrumentation.usage.healthMonitor.currentState

The currentState search contains the join to the health.log data for the threshold:

index=_internal earliest=-1d source=*health.log component=PeriodicHealthReporter node_path=splunkd.resource_usage.iowait

Here you can replace iowait with the feature you want to look at in more detail.

The step that is still missing is how Splunk generates health.log, since by that point the state has already been determined.

 


bharathkumarnec
Contributor

Hi,

Please run the search below to see the IOWait usage that Splunk tracks by default:

index=_introspection sourcetype=splunk_resource_usage component=IOWait 

Looks like they are taking it from there.

Regards,

BK
