Splunk ITSI

Issue with Service Health Score in ITSI

svendby90
Path Finder

We are experiencing an issue where services' health scores alternate between 0 and 100 in the Service Analyzer in ITSI.
The health score shows 0 even though all the underlying KPIs are OK. This happens for all of our defined services. The simplest case is a service "Azure Status" with only one defined KPI: "AzureStatus".

We recently updated to 3.0.0, but experienced the same issue before the upgrade (version 2.4.0).

Any ideas on what could cause this or what the issue is?

1 Solution

svendby90
Path Finder

It turned out our dev search head (SH) kept writing to the itsi_summary index. ITSI is installed on it and the services are defined there, but not all of the inputs are in place. The result was two log entries a minute, one with the right service health score and one constantly at zero, causing the health score to be calculated incorrectly.
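
For anyone hitting the same symptom, here is a minimal sketch of the check that would have exposed it: look for minutes in which more than one writer reported a health score. The entries, the writer names ("prod-sh", "dev-sh"), and the helper functions are all hypothetical and only illustrate the idea; this is not how ITSI stores or computes scores.

```python
from collections import defaultdict

# Hypothetical health-score entries: (minute, writer, score).
# Writer names and values are made up for illustration only.
entries = [
    ("12:00", "prod-sh", 100),
    ("12:00", "dev-sh", 0),
    ("12:01", "prod-sh", 100),
    ("12:01", "dev-sh", 0),
]


def writers_per_minute(rows):
    """Collect the set of writers seen in each minute."""
    seen = defaultdict(set)
    for minute, writer, _score in rows:
        seen[minute].add(writer)
    return dict(seen)


def conflicting_minutes(rows):
    """Return the minutes in which more than one writer reported a score."""
    per_minute = writers_per_minute(rows)
    return sorted(m for m, writers in per_minute.items() if len(writers) > 1)


print(conflicting_minutes(entries))  # ['12:00', '12:01']
```

If every minute shows up as conflicting, a second instance is probably writing scores for the same services.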

skoelpin
SplunkTrust

Is that KPI running a base search or adhoc search?

svendby90
Path Finder

The KPI is running an adhoc search.

skoelpin
SplunkTrust

Are you running on a single search head or in a cluster?

svendby90
Path Finder

Single search head.

skoelpin
SplunkTrust

Can you move your "Azure Status" service to a glass table icon and see if you're still getting zero? This will tell us if it's a Service Analyzer or ITSI issue.

svendby90
Path Finder

It looks to be alternating. The KPI's value is constant, but the health is switching from 100 to 0 at random intervals.

What's interesting is that I tried adding some of the other services' health scores to the same glass table, and all the scores alternate between 0 and 100 at exactly the same time, even though there are no defined dependencies between them.

skoelpin
SplunkTrust

Can you share your adhoc search?

svendby90
Path Finder

Yes! Search:
index=azure host=azure_rss sourcetype=azure_status
| eval value=if(StatusMessage="An issue has been discovered",0,1)

Threshold field: value
Split by entity: No
Calculating Average of aggregate over the last 15 minute(s) every 5 minutes.

skoelpin
SplunkTrust

I see the issue. You are returning a value of 0 if the condition is true and a value of 1 if it is false. When ITSI averages those values over the window, it will never work out correctly.

A better approach would be not to average the results but to sum them over the 5-minute span. If the count goes over a specified threshold, it can change the color of the KPI.

If you take this approach, your eval should look like this:
| eval value=if(StatusMessage="An issue has been discovered",1,0)
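
To make the difference concrete, here is a small sketch comparing the two calculations on one window of events. The sample messages and both function names are hypothetical, and the code only mirrors the arithmetic of the two evals; it is not ITSI's actual implementation.

```python
ISSUE = "An issue has been discovered"

# Hypothetical StatusMessage values seen in one calculation window.
messages = ["All good", ISSUE, "All good"]


def average_ok_flag(msgs):
    """Original eval: 0 when an issue is reported, 1 otherwise, then averaged."""
    flags = [0 if m == ISSUE else 1 for m in msgs]
    return sum(flags) / len(flags)


def issue_count(msgs):
    """Suggested eval: 1 when an issue is reported, 0 otherwise, then summed."""
    return sum(1 if m == ISSUE else 0 for m in msgs)


print(average_ok_flag(messages))  # a fraction between 0 and 1
print(issue_count(messages))      # number of issue events in the window
```

The averaged flag gives a fraction that depends on how many events fall in the window, while the summed flag is a plain count of issue events that can be compared directly with a threshold.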

svendby90
Path Finder

Ok, I see. Thanks. I will try and see how it goes 🙂

However, as I briefly mentioned earlier, this happens with all of our defined services. Another example is our AD monitoring, where the four KPIs defining the service are green and the overall service health is red at 0. These KPIs are based on counters in standard perfmon logs, have no dependencies, and are adhoc searches. An example of one of them is:

index=perfmon_ad host=<host-prefix>*  source="Perfmon:CPU Load" counter="% Processor Time"

Here, the threshold field is the field "Value", and we calculate the maximum per entity and the average of the aggregate over the last 5 minute(s) every 5 minute(s). The calculated value is typically 4-5%, and the threshold level "medium" is triggered when it reaches about 80%.

This is the case with the other counters as well. They are well below the trigger thresholds. So I don't see why the overall service should be red...

skoelpin
SplunkTrust

Do you have the correct lag set? This could affect your output.

Try creating another service and adding a single adhoc KPI to it to see if the service reports the KPI score. Start small, then add more KPIs to your service and verify it's working. You should also create a glass table and add your service and KPIs to the glass table rather than viewing them in the Service Analyzer.

I've noticed it's easy to corrupt services when removing KPIs in earlier versions.

skoelpin
SplunkTrust

Did this work for you?

svendby90
Path Finder

Sorry, haven't been working on this for a few days. However, the problem has now been solved!

It turned out the dev. SH kept writing to the itsi_summary index. ITSI is installed and the services are defined, but the inputs are not all in place. The result was two log entries a minute, one with the right service health score and one constantly at zero, causing the health score to be incorrectly calculated.

Anyway, thanks for your input, skoelpin! Now I know more about how to troubleshoot ITSI issues : )
