We have per-entity thresholding enabled for a number of our services. If an individual host (entity) exceeds an entity threshold but the aggregate threshold isn't breached, the overall Service Health Score drops and changes colour, but the underlying KPI (the one the entity belongs to) doesn't. The only way to find out what is happening is to click through the KPIs.
Is there a way to have the KPI reflect the state of an underlying entity that has hit a per-entity threshold?
Does the entity thresholding need to be adjusted? That sounds like a strange issue. My entity thresholds are much lower, though, since the per-entity values on those KPIs will be less than the service-level value.
What kind of value do your KPIs hold: a percentage or a count? And how is it calculated per entity and for the service?
This is an example of how we have configured the threshold levels aggregate versus entity:
Dashboard Load Test KPI:
Low Alert: > 500ms
Medium Alert: > 1000ms
High Alert: > 2000ms
Critical Alert: > 5000ms
Per Entity Value:
Low Alert: > 1000ms
Medium Alert: > 2000ms
High Alert: > 4000ms
Critical Alert: > 7000ms
In the composite score, this KPI has an importance of 11.
What we see is that when an individual host hits a threshold, this impacts the overall service health score but isn't reflected in the colouring of the KPI unless it is enough to cause an aggregate threshold breach.
Which is, as I said, a bit annoying: you know an entity threshold has been breached, but it isn't clear in which underlying KPI.
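To make the mismatch concrete, here is a rough Python sketch. It is not how ITSI computes anything internally: the host names and readings are made up, and it assumes the KPI aggregates entities by average. It only shows how one host can breach its per-entity threshold while the aggregate value stays below every aggregate threshold.

```python
from statistics import mean

# Threshold levels from the post, in ms (value must exceed the bound).
AGGREGATE = [(5000, "critical"), (2000, "high"), (1000, "medium"), (500, "low")]
PER_ENTITY = [(7000, "critical"), (4000, "high"), (2000, "medium"), (1000, "low")]

def severity(value_ms, levels):
    """Return the highest severity whose bound the value exceeds."""
    for bound, name in levels:  # levels are ordered from highest bound down
        if value_ms > bound:
            return name
    return "normal"

# Hypothetical readings: one slow host, three healthy ones.
hosts = {"host-a": 1500, "host-b": 200, "host-c": 100, "host-d": 100}

# Assumption: the KPI's aggregate value is the average across entities.
aggregate_value = mean(hosts.values())
aggregate_severity = severity(aggregate_value, AGGREGATE)
entity_severities = {h: severity(v, PER_ENTITY) for h, v in hosts.items()}

print(aggregate_value, aggregate_severity)  # 475 normal
print(entity_severities["host-a"])          # low
```

With these numbers the KPI tile would stay "normal" on the aggregate even though host-a is in "low", which is the situation described above.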
I am facing a similar problem; please let me know if you have found a solution. If one of your infrastructure components has a problem, the aggregate KPI should raise an alert. Aggregated CPU utilization or memory on its own doesn't make sense.
I think that example might show the issue. If you had just one entity and one KPI using those thresholds, I think that when the KPI on that single entity hits 501ms, the service (aggregate) will pass the low threshold and change colors, but the entity will still be below its low threshold. When the entity hits 1001ms, the service will pass the medium threshold but the entity will only just pass the low one, and so on. I'll have to do some testing.
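Until I can test it properly, here is the single-entity case worked through as a small Python sketch. It only compares a value against the two threshold sets posted above and assumes that with one entity the aggregate value equals the entity value. It says nothing about how ITSI actually evaluates them.

```python
# Threshold levels from the example above, in ms.
AGGREGATE = {"low": 500, "medium": 1000, "high": 2000, "critical": 5000}
PER_ENTITY = {"low": 1000, "medium": 2000, "high": 4000, "critical": 7000}

def severity(value_ms, levels):
    """Return the highest severity whose bound the value exceeds."""
    result = "normal"
    for name, bound in sorted(levels.items(), key=lambda kv: kv[1]):
        if value_ms > bound:
            result = name
    return result

# Assumption: one entity only, so aggregate value == entity value.
for value in (501, 1001, 2001):
    print(value, severity(value, AGGREGATE), severity(value, PER_ENTITY))
# 501 low normal
# 1001 medium low
# 2001 high medium
```

So with these settings the aggregate is always one level ahead of the entity.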
In mine, the entity/aggregate thresholds are reversed. The low alert on the entity is >500, and on the aggregate of the entities it is >1000, since one entity being in a bad state won't necessarily affect the service, but will affect the entity.