Splunk ITSI

How does aggregate command (like count, max, avg) in search for ITSI KPI?

htramtran83
Explorer

I have a search:

index=_internal sourcetype=scheduler | eval window_time = if(isnotnull(window_time), window_time, 0) | eval execution_latency = max(dispatch_time - (scheduled_time + window_time), 0) | timechart avg(execution_latency) AS avg_exec_latency, count(eval(status=="completed" OR status=="skipped")) AS total_exec, count(eval(status=="skipped")) AS skipped_exec | eval skip_ratio = round(skipped_exec / total_exec * 100, 2) 

It works perfectly in search, but when I use this to create KPI inside ITIS service to set "skipped ratio" number.
It doesn't give the result running.

Do anyone have explanation or solution?

1 Solution

skoelpin
SplunkTrust
SplunkTrust

You need to select the service, open the KPI, edit the KPI, scroll to the bottom and open the "Generated Search" box. Open this up to see how ITSI is appending aggregation functions and macro to your search. Then hit ctrl + shift + E to expand the macros, copy and paste. Look at the first stats command which ITSI applied and see whats causing it not to return values

View solution in original post

0 Karma

skoelpin
SplunkTrust
SplunkTrust

You need to select the service, open the KPI, edit the KPI, scroll to the bottom and open the "Generated Search" box. Open this up to see how ITSI is appending aggregation functions and macro to your search. Then hit ctrl + shift + E to expand the macros, copy and paste. Look at the first stats command which ITSI applied and see whats causing it not to return values

0 Karma

jojonguyen
Engager

I find out the macro that doesn't work inside the query. After I update, how can I fix back to the query?. I cannot copy and past back to "General Search"

Thanks

0 Karma

skoelpin
SplunkTrust
SplunkTrust

I don't understand, please elaborate

0 Karma

jojonguyen
Engager

like the long query back end: (Sorry it is long)

index=_internal sourcetype=scheduler
| eval window_time = if(isnotnull(window_time), window_time, 0)
| eval execution_latency = max(dispatch_time - (scheduled_time + window_time), 0)
| timechart avg(execution_latency) AS avg_exec_latency, count(eval(status=="completed" OR status=="skipped")) AS total_exec, count(eval(status=="skipped")) AS skipped_exec
| eval skip_ratio = round(skipped_exec / total_exec * 100, 2)
| eval avg_exec_latency = round(avg_exec_latency, 2)
| stats latest(skip_ratio) AS alert_value by host
| addinfo
| eval info_max_time=if(info_max_time="+Infinity",now() + 315569260,info_max_time)
| eval _time=info_max_time
| fields - info_min_time info_max_time info_search_time info_sid
| eval sec_grp = "default_itsi_security_group"
| lookup itsi_entities identifier.values as host, sec_grp as sec_grp OUTPUT title as entity_title, _key as entity_key, services._key as serviceid, sec_grp as entity_sec_grp
| eval maintenance_object_type = "entity", maintenance_object_key = entity_key
| lookup operative_maintenance_log maintenance_object_type, maintenance_object_key OUTPUT _key as maintenance_log_key
| eval in_maintenance = if(IsNull(maintenance_log_key), 0, 1)
| fields - maintenance_object_key, maintenance_object_type, maintenance_log_key
| eval is_entity_defined=if(isnull(entity_key), "0", "1"), entity_key=if(isnull(entity_key), "N/A", entity_key), entity_title=coalesce(entity_title,'host'), is_service_aggregate="0", is_entity_in_maintenance = in_maintenance
| fields - host, in_maintenance, entity_sec_grp
| eval serviceid = "a3b8a294-60e8-4517-9311-b89040cfa8bb"
| appendpipe
[ stats max(alert_value) AS alert_value by serviceid, is_entity_in_maintenance
| presort 0 auto(serviceid) auto(is_entity_in_maintenance)
| sort 0 serviceid is_entity_in_maintenance
| dedup consecutive=t serviceid
| eval is_all_entities_in_maintenance=is_entity_in_maintenance, is_service_aggregate="1", is_entity_defined="0", entity_key="service_aggregate", entity_title="service_aggregate"]
| addinfo
| eval info_max_time=if(info_max_time="+Infinity",now() + 315569260,info_max_time)
| eval _time=info_max_time
| fields - info_min_time info_max_time info_search_time info_sid
| eval maintenance_service_id = "a3b8a294-60e8-4517-9311-b89040cfa8bb"
| eval maintenance_object_type = "service", maintenance_object_key = maintenance_service_id
| lookup operative_maintenance_log maintenance_object_type, maintenance_object_key OUTPUT _key as maintenance_log_key
| eval in_maintenance = if(IsNull(maintenance_log_key), 0, 1)
| fields - maintenance_object_key, maintenance_object_type, maintenance_log_key
| eval is_service_in_maintenance = in_maintenance
| fields - in_maintenance, maintenance_service_id
| setseverityfields serviceid="a3b8a294-60e8-4517-9311-b89040cfa8bb", kpiid="8209f5d9ed8cd590a79e7174" handle_no_data=true generate_max_severity_event=true fill_data_gaps=true
| fields - is_all_entities_in_maintenance
| eval kpi="Skipped 2", urgency="5", alert_period="5", serviceid="a3b8a294-60e8-4517-9311-b89040cfa8bb"
| eval urgency = if (is_service_in_maintenance == 1, 0, urgency)

There are two lines that I bold need to fix. After I fix them, How can I bring it back to ITSI? I am confused

0 Karma

skoelpin
SplunkTrust
SplunkTrust

You have two issues here.. First one is your round, you're 100 is in the wrong spot, it should look like this

| eval skip_ratio = round(skipped_exec / total_exec, 2)*100

Next up, you're entity is host but your timechart command doesn't pass host. ITSI best practices call for not using transformational commands, but rather let ITSI do all of that. If you still need to use timechart and think eventstats wont work, then ditch timechart and pass host through your stats like this

| stats avg(execution_latency) AS avg_exec_latency, count(eval(status=="completed" OR status=="skipped")) AS total_exec, count(eval(status=="skipped")) AS skipped_exec by _time, host

Heres your total query which will work

index=_internal sourcetype=scheduler 
| eval window_time = if(isnotnull(window_time), window_time, 0) 
| eval execution_latency = max(dispatch_time - (scheduled_time + window_time), 0) 
| stats avg(execution_latency) AS avg_exec_latency, count(eval(status=="completed" OR status=="skipped")) AS total_exec, count(eval(status=="skipped")) AS skipped_exec by _time, host
| eval skip_ratio = round(skipped_exec / total_exec, 2)*100
| eval avg_exec_latency = round(avg_exec_latency, 2)
| stats latest(skip_ratio) AS alert_value by host
0 Karma

jojonguyen
Engager

Thank you for helping me figure out.
I just don't agree with round: I believe the correct I use is
| eval skip_ratio = round((skipped_exec / total_exec)*100 , 2)

and I also try with eventstats and split by host field inside, took out the last line about
| stats latest(skip_ratio) AS alert_value by host

and let ITSI do it (as you said)

index=_internal sourcetype=scheduler (status="completed" OR status="skipped" OR status="deferred")
| eval window_time = if(isnotnull(window_time), window_time, 0)
| eval execution_latency = max(dispatch_time - (scheduled_time + window_time), 0)
| eventstats avg(execution_latency) AS avg_exec_latency
count(eval(status=="skipped")) AS skipped_exec , count(eval(status=="completed" OR status=="skipped")) AS total_exec
| eval skip_ratio = round((skipped_exec / total_exec)*100,2)
| eval avg_exec_latency = round(avg_exec_latency, 2)

0 Karma

skoelpin
SplunkTrust
SplunkTrust

Your original round looks like this, which is incorrect

| eval skip_ratio = round(skipped_exec / total_exec 100, 2) 

The way you wrote it, is correct along with mine.

Can you accept/upvote the answer since your issue is resolved?

0 Karma

jojonguyen
Engager

| stats latest(skip_ratio) AS alert_value by host
+++++++
| presort 0 auto(serviceid) auto(is_entity_in_maintenance)*

these two lines I will fix and delete

0 Karma

jojonguyen
Engager

anyone have thought. Can we replace timechart with any command? with Timechart, it doesn't work when place in ITSI

0 Karma
Get Updates on the Splunk Community!

Automatic Discovery Part 1: What is Automatic Discovery in Splunk Observability Cloud ...

If you’ve ever deployed a new database cluster, spun up a caching layer, or added a load balancer, you know it ...

Real-Time Fraud Detection: How Splunk Dashboards Protect Financial Institutions

Financial fraud isn't slowing down. If anything, it's getting more sophisticated. Account takeovers, credit ...

Splunk + ThousandEyes: Correlate frontend, app, and network data to troubleshoot ...

 Are you tired of troubleshooting delays caused by siloed frontend, application, and network data? We've got a ...