Splunk IT Service Intelligence

ITSI: how to implement complex ServiceHealthScore KPIs?


Hello guys!

I am currently working on ITSI service models and health scores. I want my service models to be lightweight and elegant, but they must support drill-down to the exact problematic component and, most importantly, reflect the real service state from the user's perspective. I have run many experiments, but I still cannot implement some ITSI services and KPIs that satisfy my requirements.

For example, I want to build a 3-node cluster service containing host1, host2 and host3. I have a KPI query that returns the UP or DOWN state of each host.

I want to achieve the following:

1 The cluster service KPI value does not decrease at all until at least two of the three nodes are DOWN. The service's ServiceHealthScore decreases only once two of the three nodes are DOWN:

  • If any single node goes DOWN, the cluster service ServiceHealthScore KPI should remain Normal (green) with a value of 100. The ServiceHealthScore of the broken host itself should be High (orange).

  • If two nodes go DOWN, the cluster service ServiceHealthScore KPI should be Low (yellow). The broken hosts should be High (orange).

  • If all three nodes go DOWN, the cluster service ServiceHealthScore KPI should be High (orange). All three broken hosts should be High (orange).

2 Visualize the state of each host in the cluster in the ITSI Service Analyzer.

3 Be able to alert on state changes of each individual host and of the whole cluster through ITSI Notable Event Aggregation Policies.
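To make requirement 1 unambiguous, here is a minimal Python sketch of the mapping I am after. It is only an illustration of the desired behaviour, not ITSI code: the function names `cluster_severity` and `host_severity` are hypothetical, and the thresholds are simply the three cases listed above for a 3-node cluster.

```python
from typing import Dict


def host_severity(is_up: bool) -> str:
    """Desired severity of a single host service (hypothetical helper)."""
    return "Normal" if is_up else "High"


def cluster_severity(host_up: Dict[str, bool]) -> str:
    """Desired severity of the 3-node cluster service (hypothetical helper).

    0 or 1 hosts DOWN -> Normal, 2 hosts DOWN -> Low, all 3 DOWN -> High.
    """
    down = sum(1 for is_up in host_up.values() if not is_up)
    if down <= 1:
        return "Normal"
    if down == 2:
        return "Low"
    return "High"


# One node DOWN: the cluster stays Normal, the broken host is High.
state = {"host1": True, "host2": False, "host3": True}
print(cluster_severity(state), host_severity(state["host2"]))
```

The point is that the cluster severity depends only on the number of DOWN hosts, while each host keeps its own, independent severity.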

Below is what I managed to achieve.

Test 1: A very simple service model with one KPI counting the number of alive nodes. I know how many hosts in the cluster are UP, and I can create events on service state changes. However, I can neither show the state of individual hosts (poor drill-down) nor alert on host state changes. The administrator must find the affected nodes outside the Service Analyzer. This is unacceptable.


Service health is calculated the way I want:

  • If one host is DOWN, the service has Normal severity (correct).


  • If two hosts are DOWN, the service has Low severity (correct).


  • If all three hosts are DOWN, the service has High severity (correct).


Test 2: The same simple service model, but the KPI is split by entity. I know how many hosts in the cluster are UP, and I know which hosts are UP and which are DOWN.


Unfortunately, ServiceHealthScore does not behave as it should for the cluster:

  • If one host is DOWN, the service has High severity, which is wrong (even though the aggregate KPI is Normal, which is correct).


  • If two hosts are DOWN, the service has High severity, which is wrong (even though the aggregate KPI is Low, which is correct).


  • If all three hosts are DOWN, the service has High severity (correct).


Test 3: This is the model I like the most in the Service Analyzer. The UP/DOWN KPIs are linked to individual services representing the hosts, and the cluster state is calculated automatically.


I could not find any way to make the ServiceHealthScore KPI behave like a cluster:

  • If one host is DOWN, the service has High severity (wrong).


  • If two hosts are DOWN, the service has High severity (wrong).


  • If all three hosts are DOWN, the service has High severity (correct).


Ideally, I want to implement my KPIs with the Test 3 service model, which I personally like the most. Can anyone help me get ServiceHealthScore to behave this way? Alternatively, is there a workaround?

Splunk Employee

I am not sure what is doable, but if you are going down the service-dependencies road, keep the following in mind.

With service dependencies, the upper (parent) service can only look at the sub-services' scores from the previous minute (the current ones are not yet calculated), so there is a cascade effect and a one-minute delay before updates become visible.
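To illustrate the cascade, here is a rough Python sketch. It is a deliberately simplified model, not how ITSI computes scores: it assumes one score per minute, that a parent simply mirrors its single child's score from the previous minute, and that every service starts at 100. The function name `lagged_scores` and the `levels` parameter are made up for this example.

```python
from typing import List


def lagged_scores(child_scores: List[int], levels: int = 1, initial: int = 100) -> List[int]:
    """Return what a service `levels` steps above the child would show per minute.

    Each dependency level sees the scores of the level below it one minute late
    (simplifying assumption for illustration only).
    """
    scores = list(child_scores)
    for _ in range(levels):
        # Shift by one minute: at minute t the upper service sees minute t-1.
        scores = [initial] + scores[:-1]
    return scores


host = [100, 30, 30, 100]            # host goes DOWN at minute 1, recovers at minute 3
print(lagged_scores(host, levels=1))  # direct parent: [100, 100, 30, 30]
print(lagged_scores(host, levels=2))  # grandparent:   [100, 100, 100, 30]
```

So with a host, cluster, and top-level business service chained together, a host failure would take roughly one extra minute per level to show up at the top under these assumptions.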
