
Why doesn't a maintenance window in ITSI put the service in maintenance

las
Contributor

We are having a problem with maintenance windows in Splunk IT Service Intelligence.

We have a common service that two other services depend on; on top of those two, there are further services that depend on them.

Service a (in maintenance)          Service b (not in maintenance)
              \                            /
               \                          /
                     Common Service

 

With the current implementation in ITSI, we are forced to put both "Service a" and "Common Service" into maintenance mode to avoid getting wrong health scores for "Service a".

This creates a problem for us: if an error occurs in "Common Service" during the maintenance window, it will not be reflected correctly in "Service b", so we will not be able to detect failures that affect our users.
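To make the dilemma concrete, here is a toy sketch of how a dependency's score can roll up into its parents. This is not ITSI's actual health score algorithm and the numbers are made up; it only illustrates why suppressing "Common Service" (by putting it in maintenance) also hides its real failures from "Service b":

    # Toy rollup, NOT ITSI's real health score calculation: a parent's score is the
    # worst of its own score and its dependencies' scores. Putting a dependency in
    # maintenance suppresses its bad score for *every* parent, not just the one
    # that is actually under maintenance.
    def rollup(own_score, dependency_scores, in_maintenance=frozenset()):
        effective = [own_score]
        for name, score in dependency_scores.items():
            # a dependency in maintenance stops reporting its real (bad) score
            effective.append(100 if name in in_maintenance else score)
        return min(effective)

    deps = {"Common Service": 20}  # Common Service degrades during the window

    # Without also flagging Common Service, "Service a" shows a bad score:
    print(rollup(95, deps))                                     # -> 20
    # Flagging Common Service fixes "Service a"...
    print(rollup(95, deps, in_maintenance={"Common Service"}))  # -> 95
    # ...but "Service b" sees the same suppressed score, so the user-facing
    # failure in Common Service goes undetected there as well:
    print(rollup(90, deps, in_maintenance={"Common Service"}))  # -> 90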

We raised a support ticket, which correctly stated that this works as designed and documented.

We have submitted an idea, ITSIID-I-359, but so far it hasn't been upvoted.

Kind regards

1 Solution

las
Contributor

Disclaimer:

This is in no way supported and will break Splunk support.

Don't use this in a production environment.

There are better ways to solve this, I'm just not smart enough to figure them out.

This hack always breaks at the next update, so take extra care.

 

I think it would be very nice if Splunk could support the desired behavior, maybe as an option or configuration setting.

 

 

There are three places that influence the maintenance window calculation:

service_health_metrics_monitor:

Original:

| mstats latest(alert_level) AS alert_level WHERE `get_itsi_summary_metrics_index` AND  
  `service_level_max_severity_metric_only` by itsi_kpi_id, itsi_service_id, kpi, kpi_importance
  | lookup kpi_alert_info_lookup alert_level OUTPUT severity_label AS alert_name | `mark_services_in_maintenance`
  | `reorganize_metrics_healthscore_results` | gethealth | `get_info_time_without_sid`
  | lookup service_kpi_lookup _key AS itsi_service_id OUTPUT sec_grp AS itsi_team_id
  | search itsi_team_id=*
  | fields - alert_severity, color, kpi, kpiid, serviceid, severity_label, severity_value
  | rename health_score AS service_health_score | eval is_null_alert_value=if(service_health_score="N/A", 1, 0), 
  service_health_score=if(service_health_score="N/A", 0, service_health_score)

This could be changed to:

Modified:

| mstats latest(alert_level) AS alert_level WHERE `get_itsi_summary_metrics_index` AND 
  `service_level_max_severity_metric_only` by itsi_kpi_id, itsi_service_id, kpi, kpi_importance
  | lookup kpi_alert_info_lookup alert_level OUTPUT severity_label AS alert_name | `mark_services_in_maintenance`
  | `reorganize_metrics_healthscore_results` | gethealth | `get_info_time_without_sid`
  | lookup service_kpi_lookup _key AS itsi_service_id OUTPUT sec_grp AS itsi_team_id
  | fields - alert_severity, color, kpi, kpiid, serviceid, severity_label, severity_value
  | rename health_score AS service_health_score | `mark_services_in_maintenance` | eval is_null_alert_value=if(service_health_score="N/A", 1, 0), 
  service_health_score=if(service_health_score="N/A", 0, service_health_score), alert_level=if(is_service_in_maintenance=1 AND alert_level>-2,-2,alert_level)

I have added an extra call to the macro "mark_services_in_maintenance" and expanded the last eval to set alert_level to the maintenance level (-2).

 

service_health_monitor:

Original:

`get_itsi_summary_index` host=atp-00pshs* `service_level_max_severity_event_only` 
| stats latest(urgency) AS urgency latest(alert_level) AS alert_level latest(alert_severity) as alert_name latest(service) AS service latest(is_service_in_maintenance) AS is_service_in_maintenance latest(kpi) AS kpi by kpiid, serviceid 
  | lookup service_kpi_lookup _key AS serviceid OUTPUT sec_grp AS itsi_team_id
  | search itsi_team_id=*
| gethealth 
| `gettime`

 

Could be changed to:

Modified:

`get_itsi_summary_index` `service_level_max_severity_event_only` 
| stats latest(urgency) AS urgency latest(alert_level) AS alert_level latest(alert_severity) as alert_name latest(service) AS service latest(is_service_in_maintenance) AS is_service_in_maintenance latest(kpi) AS kpi by kpiid, serviceid 
| gethealth 
| `gettime`
| `mark_services_in_maintenance`
| eval alert_level=if(is_service_in_maintenance=1 AND alert_level>-2,-2,alert_level), color=if(is_service_in_maintenance=1 AND alert_level=-2,"#5C6773",color), severity_label=if(is_service_in_maintenance=1 AND alert_level=-2,"maintenance",severity_label), alert_severity=if(is_service_in_maintenance=1 AND alert_level=-2,"maintenance",alert_severity)

Again, an extra call to the macro "mark_services_in_maintenance", plus the eval at the bottom to set the service to maintenance.

 

These two changes ensure the service appears as in maintenance in the Service Analyzer and on glass tables. I think they also take care of deep dives, but those don't appear to turn dark grey.

To ensure correct calculation, we also have to make changes to the "gethealth" search command. The Python script of interest is located here:

"SPLUNK_HOME/etc/apps/SA-ITOA/lib/itsi/searches/compute_health_score.py"

Search for "If a dependent service is disabled, its health should not affect other services"

You should find code that looks like this:

            for depends_on in service.get('services_depends_on', []):

                # If a dependent service is disabled, its health should not affect other services
                dependent_service_id = depends_on.get('serviceid')
                dependency_enabled = [
                    svc.get('enabled', 1) for svc in self.all_services if dependent_service_id == svc.get('_key')
                ]
                if len(dependency_enabled) == 1 and dependency_enabled[0] == 0:
                    continue

                for kpi in depends_on.get('kpis_depending_on', []):
                    # Get urgencies for dependent services

What I want is to replicate the behavior of a disabled service:

            for depends_on in service.get('services_depends_on', []):

                # If a dependent service is disabled, its health should not affect other services
                dependent_service_id = depends_on.get('serviceid')
                dependency_enabled = [
                    svc.get('enabled', 1) for svc in self.all_services if dependent_service_id == svc.get('_key')
                ]
                if len(dependency_enabled) == 1 and dependency_enabled[0] == 0:
                    continue

                # If a dependent service is in maintenance, its health should not affect other services - ATP
                maintenance_service_id = depends_on.get('serviceid')
                # self.maintenance_services may not exist yet when gethealth runs,
                # so make sure it is initialized before calling the helper below
                try:
                    isinstance(self.maintenance_services, list)
                except AttributeError:
                    self.maintenance_services = None
                if self._is_service_currently_in_maintenance(maintenance_service_id):
                    self.logger.info('ATP - is service in maintenance %s', self._is_service_currently_in_maintenance(maintenance_service_id))
                    continue
 
                for kpi in depends_on.get('kpis_depending_on', []):
                    # Get urgencies for dependent services

So I added a call to the existing function _is_service_currently_in_maintenance. Unfortunately this fails on its own, because self.maintenance_services is uninitialized (hence the try/except block that sets it up). With that in place, it is just a simple check: if the service we depend on is in maintenance, we skip it with continue.
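For reference, the same guard can be written without a bare try/except. This is only a sketch of the lines inside the loop shown above, and it assumes, exactly as the hack does, that _is_service_currently_in_maintenance tolerates self.maintenance_services being None:

    # Sketch only: drop-in for the guard inside the "for depends_on in ..." loop above.
    # Assumes, like the hack itself, that the helper copes with the attribute being None.
    if not hasattr(self, 'maintenance_services'):
        self.maintenance_services = None  # attribute may not exist yet when gethealth runs

    if self._is_service_currently_in_maintenance(maintenance_service_id):
        # the dependency is in maintenance, so don't let it affect this service's health
        continue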

 

Again, this is NOT supported in any way, should not be used in production, and will break at the next update.

Kind regards

