ITSI how to implement complex ServiceHealthScore KPIs?

oshirnin · ‎07-03-2019

Hello guys!

I am currently working on ITSI service models and health scores. I want my service models to be lightweight and elegant, to support drill-down to the exact problematic component, and, most importantly, to reflect the real service state from the user's perspective. I have run many experiments, but I still cannot implement some ITSI services and KPIs in a way that satisfies my requirements.

For example, I want to build a 3-node cluster service containing host1, host2 and host3. I have a KPI query that returns the UP or DOWN state of each host.

I want to achieve the following:

1 The cluster service KPI value should not decrease at all until at least two of the three nodes are DOWN. The cluster ServiceHealthScore should decrease only once two of the three nodes are DOWN:

If any single node goes DOWN, the cluster ServiceHealthScore KPI should remain Normal (green) with value 100. The ServiceHealthScore of the broken host itself should be High (orange).
If two nodes go DOWN, the cluster ServiceHealthScore KPI should be Low (yellow). The broken hosts should be High (orange).
If all three nodes go DOWN, the cluster ServiceHealthScore KPI should be High (orange). All three broken hosts should be High (orange).

2 Visualize the state of each host in the cluster in ITSI Service Analyzer.

3 Be able to alert on state changes of each individual host and of the whole cluster through ITSI Notable Event Aggregation Policies.
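To make requirement 1 unambiguous, here is a minimal sketch of the mapping I am after, written in plain Python rather than SPL. It is not ITSI code; the function names `host_severity` and `cluster_severity` are hypothetical, and the rule is hard-wired to the three-node case described above.

```python
def host_severity(state):
    """Severity of a single host: a DOWN host should show as High."""
    return "High" if state == "DOWN" else "Normal"


def cluster_severity(host_states):
    """Severity of the 3-node cluster, derived from the number of DOWN hosts.

    host_states maps a host name to "UP" or "DOWN".
    """
    if len(host_states) != 3:
        raise ValueError("this sketch only covers the 3-node cluster")
    down = sum(1 for state in host_states.values() if state == "DOWN")
    if down == 3:
        return "High"    # whole cluster is gone
    if down == 2:
        return "Low"     # redundancy exhausted, one node left
    return "Normal"      # zero or one node down: users are not affected


states = {"host1": "UP", "host2": "DOWN", "host3": "UP"}
print(cluster_severity(states))
print({host: host_severity(state) for host, state in states.items()})
```

The point is that the cluster severity depends only on the count of DOWN hosts, while each host keeps its own severity for drill-down and alerting.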

Below is what I managed to achieve.

Test 1: A very simple service model with one KPI counting the number of alive nodes. I know how many hosts in the cluster are UP, and I can create events on service state changes. However, I can neither show the state of individual hosts (poor drill-down) nor alert on a host state change. The administrator must find the affected nodes outside the Service Analyzer. This is unacceptable.

Service health is calculated as I want it:

If one host is DOWN, the service has Normal severity (correct).

If two hosts are DOWN, the service has Low severity (correct).

If all three hosts are DOWN, the service has High severity (correct).

Test 2: The same simple service model, but with the KPI split by entity. I know how many hosts in the cluster are UP, and I know which hosts are UP and which are DOWN.

Unfortunately, ServiceHealthScore does not work as it should for the cluster:

If one host is DOWN, the service has High severity (wrong, even though the aggregate KPI is correctly Normal).

If two hosts are DOWN, the service has High severity (wrong, even though the aggregate KPI is correctly Low).

If all three hosts are DOWN, the service has High severity (correct).

Test 3: This is the layout I like the most in Service Analyzer. The UP/DOWN KPIs belong to individual services representing the hosts, and the cluster state is calculated automatically from those services.

I could not find any way to make the ServiceHealthScore KPI behave like a cluster:

If one host is DOWN, the service has High severity (wrong).

If two hosts are DOWN, the service has High severity (wrong).

If all three hosts are DOWN, the service has High severity (correct).

Ideally, I want to implement my KPIs with the Test 3 service model, which I personally like the most. Can anyone help me get ServiceHealthScore to behave this way? Or is there a workaround?

yannK · ‎10-24-2019

Not sure what is doable, but if you are going down the service-dependencies road, look at the way the service health scores are calculated:

https://docs.splunk.com/Documentation/ITSI/4.4.0/Configure/HealthScores

The weights can be a way to assign importance to your sub-services, but there is no way to express something like "2 of 3 are up" with the weights.
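To illustrate why weights alone cannot express "2 of 3 are up", here is a rough sketch in Python. It assumes a plain weighted average of sub-service scores, which is a simplification and not necessarily the exact ITSI calculation (see the linked docs for that). The function name `weighted_health`, the weights, and the score of 30 for a DOWN host are hypothetical values chosen for the example.

```python
def weighted_health(scores, weights):
    """Plain weighted average of sub-service scores (simplified model)."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical equal weights for the three host services.
weights = {"host1": 5, "host2": 5, "host3": 5}

all_up = {"host1": 100, "host2": 100, "host3": 100}
one_down = {"host1": 100, "host2": 30, "host3": 100}  # 30 is an assumed score

print(weighted_health(all_up, weights))
print(weighted_health(one_down, weights))
```

With any positive weight on a host, its drop pulls the average below 100, so the parent score already moves when a single node fails. A "stay at 100 until two nodes are down" rule is a threshold on a count, not an average.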

For service dependencies, the upper service can only look at the sub-services' scores from the previous minute (the current ones are not yet calculated), so there is a cascade effect and a one-minute delay before updates are visible.
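The cascade can be pictured with a small Python sketch. It assumes each dependency level reads the previous minute's score of the level below, and that the very first minute simply reuses the initial value. The function name `propagate` and the score values are hypothetical; this is a toy model, not how ITSI schedules its searches.

```python
def propagate(leaf_series, depth):
    """Score series seen `depth` levels above the leaf, one value per minute.

    Each level is assumed to lag the level below it by exactly one minute.
    """
    series = list(leaf_series)
    for _ in range(depth):
        # Shift by one minute; the first minute repeats the initial value.
        series = [series[0]] + series[:-1]
    return series


# The leaf (host) service drops from 100 to 30 at minute 2.
leaf = [100, 100, 30, 30, 30]

print(propagate(leaf, 1))  # direct parent sees the drop at minute 3
print(propagate(leaf, 2))  # grandparent sees the drop at minute 4
```

So the deeper the dependency tree, the later the top-level service reflects a change at the bottom.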
