We are experiencing issues with services' health score alternating between 0 and 100 in the Service Analyzer in ITSI.
The health score shows 0 even though all the underlying KPIs are OK. This happens for all of our defined services. The simplest case is shown below. Here we have a service "Azure Status" with only one defined KPI: "AzureStatus".
We recently updated to 3.0.0, but experienced the same issue before the upgrade (version 2.4.0).
Any ideas what would cause this or what the issue is?
It turned out our dev. SH kept writing to the itsi_summary index. ITSI is installed and the services are defined, but not all of the inputs are in place. The result was two log entries a minute, one with the right service health score and one constantly at zero, causing the health score to be incorrectly calculated.
Is that KPI running a base search or adhoc search?
The KPI is running an adhoc search.
Are you running on a single search head or in a cluster?
Single search head.
Can you move your "Azure Status" service to a glass table icon and see if you're still getting zero? This will tell us if it's a Service Analyzer or ITSI issue.
It looks to be alternating. The KPI's value is constant, but the health is switching from 100 to 0 at random intervals.
What's interesting is that I tried adding some of the other services' health scores to the same glass table, and all the scores alternate between 0 and 100 at the exact same time, even though there are no defined dependencies between them.
Can you share your adhoc search?
Yes! Search:
index=azure host=azure_rss sourcetype=azure_status
| eval value=if(StatusMessage="An issue has been discovered",0,1)
Threshold field: value
Split by entity: No
Calculating Average of aggregate over the last 15 minute(s) every 5 minutes.
I see the issue. You are returning a value of 0 if the condition is true and a value of 1 if the condition is false. When ITSI averages the two values, it will never work out correctly.
A better approach would be to not average the results but rather sum them over the 5 minute span and if the count goes over a specified threshold, it can change the color of the KPI.
If you take this approach then your eval should look like this
| eval value=if(StatusMessage="An issue has been discovered",1,0)
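To illustrate the difference between averaging and summing the flag, here is a rough sketch in plain Python. It is not SPL and not how ITSI calculates anything internally. The sample flags and the threshold of 1 are made up for illustration.

```python
# Hypothetical sample: one flag per event in the search window.
# 1 means "An issue has been discovered", 0 means healthy (the corrected eval).
flags = [0, 0, 1, 0, 1, 0]

# Hypothetical threshold: alert as soon as one issue event shows up.
ISSUE_THRESHOLD = 1


def average(values):
    """Mean of the flags, i.e. the fraction of events reporting an issue."""
    return sum(values) / len(values) if values else 0.0


issue_ratio = average(flags)   # fractional value, harder to threshold
issue_count = sum(flags)       # whole number of issue events in the window
status = "critical" if issue_count >= ISSUE_THRESHOLD else "normal"

print(issue_ratio, issue_count, status)
```

With a sum, the threshold is a plain count of issue events, which is easier to reason about than a fraction that changes with the number of events in the window.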
Ok, I see. Thanks. I will try and see how it goes 🙂
However, as I briefly mentioned earlier, this happens with all of our defined services. Another example is our AD monitoring, where the four KPIs defining the service are green and the overall service health is red at 0. These KPIs are based on counters in standard perfmon logs, have no dependencies and are adhoc searches. An example of one of them is:
index=perfmon_ad host=<host-prefix>* source="Perfmon:CPU Load" counter="% Processor Time"
Here, the threshold field is the field "Value" and we calculate maximum per entity, average of aggregate over the last 5 minute(s) every 5 minute(s). The calculated value is typically 4-5% and threshold level "medium" is triggered when it reaches about 80%.
This is the case with the other counters as well. They are well below the trigger thresholds. So I don't see the reason why the overall service should be red...
Do you have the correct lag set? This could affect your output.
Try creating another service and adding a single ad-hoc KPI to it to see if the service reports the KPI score. Start small, add more KPIs to your service and verify it's working. You should also create a glass table and add your service and KPIs to the glass table rather than viewing it in the Service Analyzer.
I've noticed it's easy to corrupt services when removing KPIs in earlier versions.
Did this work for you?
Sorry, haven't been working on this for a few days. However, the problem has now been solved!
It turned out the dev. SH kept writing to the itsi_summary index. ITSI is installed and the services are defined, but the inputs are not all in place. The result was two log entries a minute, one with the right service health score and one constantly at zero, causing the health score to be incorrectly calculated.
Anyway, thanks for your input, skoelpin! Now I know more about how to troubleshoot ITSI issues : )
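For anyone hitting the same symptom, the duplicate-writer situation described above can be sketched in plain Python. This is only an illustration of the idea: the host names, minutes and scores are hypothetical, and in practice you would look at the real events in itsi_summary rather than a hard-coded list.

```python
from collections import defaultdict

# Hypothetical events: (minute, writing host, service health score).
# Two search heads write a score for the same service in the same minute.
events = [
    (0, "prod-sh", 100), (0, "dev-sh", 0),
    (1, "prod-sh", 100), (1, "dev-sh", 0),
    (2, "prod-sh", 100),
]


def writers_per_minute(rows):
    """Map each minute to the set of hosts that wrote a score in it."""
    by_minute = defaultdict(set)
    for minute, host, _score in rows:
        by_minute[minute].add(host)
    return dict(by_minute)


def conflicting_minutes(rows):
    """Minutes in which more than one host wrote a health score."""
    return sorted(m for m, hosts in writers_per_minute(rows).items() if len(hosts) > 1)


print(conflicting_minutes(events))
```

Any minute that shows up in the result has more than one writer, which is the pattern that made the health score flip between 0 and 100 here.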