Splunk IT Service Intelligence

Issue with Service Health Score in ITSI

Path Finder

We are experiencing issues with services' health scores alternating between 0 and 100 in the Service Analyzer in ITSI.
The health score shows 0 even though all the underlying KPIs are OK. This happens for all of our defined services. The simplest case is shown below. Here we have a service "Azure Status" with only one defined KPI: "AzureStatus".

We recently updated to 3.0.0, but experienced the same issue before the upgrade (version 2.4.0).

Does anyone have an idea what could be causing this?

0 Karma
1 Solution

Path Finder

It turned out that our dev search head (SH) kept writing to the itsi_summary index. ITSI is installed there and the services are defined, but not all of the inputs are in place. The result was two log entries a minute, one with the right service health score and one constantly at zero, which caused the health score to be calculated incorrectly.
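A search along the following lines might help spot this kind of double writing. It is only a sketch: it assumes that health score events in itsi_summary carry kpi="ServiceHealthScore" with the score in alert_value and the service ID in serviceid, and that the host field identifies the search head that wrote the event. Check those field names against your own summary events before relying on it.

```
index=itsi_summary kpi="ServiceHealthScore"
| bin _time span=1m
| stats count AS events dc(host) AS writers values(host) AS writer_hosts values(alert_value) AS scores BY _time serviceid
| where events > 1
```

Under those assumptions, any row returned is a minute in which a service received more than one health score entry. A second name in writer_hosts would point to another search head writing to the same index.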

0 Karma

SplunkTrust

Is that KPI running a base search or adhoc search?

0 Karma

Path Finder

The KPI is running an adhoc search.

0 Karma

SplunkTrust

Are you running on a single search head or in a cluster?

0 Karma

Path Finder

Single search head.

0 Karma

SplunkTrust

Can you add your "Azure Status" service to a glass table as an icon and see if you're still getting zero? This will tell us whether it's a Service Analyzer issue or an ITSI issue.

0 Karma

Path Finder

It looks to be alternating. The KPI's value is constant, but the health is switching from 100 to 0 at random intervals.

Interestingly, when I added some of the other services' health scores to the same glass table, all the scores alternated between 0 and 100 at exactly the same time, even though there are no defined dependencies between them.

0 Karma

SplunkTrust

Can you share your adhoc search?

0 Karma

Path Finder

Yes! Search:
index=azure host=azurerss sourcetype=azurestatus
| eval value=if(StatusMessage="An issue has been discovered",0,1)

Threshold field: value
Split by entity: No
Calculating Average of aggregate over the last 15 minute(s) every 5 minutes.

0 Karma

SplunkTrust

I see the issue. You are returning a value of 0 if the condition is true and a value of 1 if the condition is false. When ITSI averages the two values, it will never work out correctly.

A better approach would be to not average the results but rather sum them over the 5-minute span. If the count goes over a specified threshold, it can change the color of the KPI.

If you take this approach, your eval should look like this:
| eval value=if(StatusMessage="An issue has been discovered",1,0)
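
As a rough sketch of the sum-based approach, the flipped eval can be previewed in the search bar with an explicit stats. The stats line and the issue_count field name are assumptions for previewing only. In ITSI the aggregation (Sum) would normally be selected in the KPI configuration rather than written into the search.

```
index=azure host=azurerss sourcetype=azurestatus
| eval value=if(StatusMessage="An issue has been discovered",1,0)
| stats sum(value) AS issue_count
```

With this shape, issue_count is 0 while no issue is reported and grows with each matching event in the window, so a threshold just above 0 would be a natural starting point.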

0 Karma

Path Finder

Ok, I see. Thanks. I will try and see how it goes 🙂

However, as I briefly mentioned earlier, this happens with all of our defined services. Another example is our AD monitoring, where the four KPIs defining the service are green and the overall service health is red at 0. These KPIs are based on counters in standard perfmon logs, have no dependencies, and are ad hoc searches. An example of one of them is:

index=perfmon_ad host=<host-prefix>*  source="Perfmon:CPU Load" counter="% Processor Time"

Here, the threshold field is the field "Value", and we calculate the maximum per entity and the average of the aggregate over the last 5 minute(s) every 5 minute(s). The calculated value is typically 4-5%, and the threshold level "medium" is triggered when it reaches about 80%.

This is the case with the other counters as well. They are well below the trigger thresholds, so I don't see why the overall service should be red.

0 Karma

SplunkTrust

Do you have the correct lag set? This could affect your output.

Try creating another service and adding a single ad hoc KPI to it to see if the service reports the KPI score. Start small, then add more KPIs to your service and verify that it's working. You should also create a glass table and add your service and KPIs to the glass table rather than viewing them in the Service Analyzer.

I've noticed it's easy to corrupt services when removing KPIs in earlier versions.

0 Karma

SplunkTrust

Did this work for you?

0 Karma

Path Finder

Sorry, I haven't been working on this for a few days. However, the problem has now been solved!

It turned out that the dev search head (SH) kept writing to the itsi_summary index. ITSI is installed there and the services are defined, but the inputs are not all in place. The result was two log entries a minute, one with the right service health score and one constantly at zero, which caused the health score to be calculated incorrectly.

Anyway, thanks for your input, skoelpin! Now I know more about how to troubleshoot ITSI issues : )

0 Karma