Splunk Enterprise Security

How to create a dashboard that will show SLA for alerts received on the Incident Review dashboard?

Aziz94
New Member

Hi Everyone,

I am struggling a lot to create a dashboard that shows the SLA for alerts received on the Incident Review dashboard.

Basically, I need only two things:

1. SLA from alert received until assigned (from status New to status In Progress)

2. SLA from alert pending to closure (from status Pending to status Closed)

I am facing many issues with empty fields in alert urgency and creation time.

I have spent a week creating the query below:

 

| tstats `summariesonly` earliest(_time) as incident_creation_time from datamodel=Incident_Management.Notable_Events_Meta by source,Notable_Events_Meta.rule_id
| `drop_dm_object_name("Notable_Events_Meta")`
| `get_correlations`
| join type=outer rule_id
    [| from inputlookup:incident_review_lookup
    | eval _time=time
    | stats earliest(_time) as review_time by rule_id, owner, user, status, urgency]
| rename user as reviewer
| lookup update=true user_realnames_lookup user as "reviewer" OUTPUTNEW realname as "reviewer_realname"
| eval reviewer_realname=if(isnull(reviewer_realname),reviewer,reviewer_realname), nullstatus=if(isnull(status),"true","false"), temp_status=if(isnull(status),-1,status)
| lookup update=true reviewstatuses_lookup _key as temp_status OUTPUT status,label as status_label,description as status_description,default as status_default,end as status_end
| eval incident_duration_minutes=round(((review_time-incident_creation_time)/60),0)
| eval sla=case(urgency="critical" AND incident_duration_minutes>15, "breached", urgency="high" AND incident_duration_minutes>15, "breached", urgency="medium" AND incident_duration_minutes>45, "breached", urgency="low" AND incident_duration_minutes>70, "breached", isnull(review_time), "incident not assigned", 1=1, "not breached")
| convert timeformat="%F %T" ctime(review_time) AS review_time, ctime(incident_creation_time) AS incident_creation_time
| fields rule_id, source, urgency, reviewer_realname, incident_creation_time, review_time, incident_duration_minutes, sla, status_label
| table rule_id, source, urgency, reviewer_realname, incident_creation_time, review_time, incident_duration_minutes, sla, status_label

But a lot of things are still missing. Could you please help me create a small dashboard with the requirements below?

1. SLA from alert received until assigned (from status New to status In Progress)

2. SLA from alert pending to closure (from status Pending to status Closed)

Many thanks in advance


tscroggins
Influencer

@Aziz94 

Which version of Splunk Enterprise Security are you running? Splunk Enterprise Security 7.0 introduces the Executive Summary and SOC Operations dashboards, along with the Mean Time to Triage and Mean Time to Resolution metrics.

Whether you use Splunk Enterprise Security 7.0 or not, you can download the app from Splunkbase and read the metric searches for inspiration. (It may not be appropriate to copy and paste them here.)

Aziz94
New Member

Thank you so much, brother.

I checked that, but it's not giving me the details I need.

I just need two queries:

1. The time between status_label New and status_label In Progress

2. The time difference between status_label Pending and status_label Closed

And an sla field to show whether the SLA was breached or not.

Any help is highly appreciated, as I have tried with many Splunk experts without success.

tscroggins
Influencer

@Aziz94 

Mean Time to Triage is supposed to measure the difference between status New and the first update, irrespective of status. (There may be a defect in the product's use of earliest() instead of min() when comparing values in the multi-valued time field. In my test environment, the triage metric has the same value as the resolution metric when all notables are closed.)

Mean Time to Resolution measures the difference between status New and status Closed.

Based on the searches behind those metrics, we can combine data from the Incident Management data model, the incident_review_lookup lookup, and the reviewstatuses_lookup lookup to calculate new metrics.
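
As a starting point (this is only a sketch, not one of the product searches), you can look at the raw material directly: each row in incident_review_lookup records a review action, so the first and last update per notable can be read with min() and max() over the time field:

| inputlookup incident_review_lookup
| stats min(time) as first_update_time max(time) as last_update_time by rule_id
| convert timeformat="%F %T" ctime(first_update_time) ctime(last_update_time)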

For reference, the status values include (the sketch after this list shows how to confirm them in your environment):

New (-1, 1, or null)
Unassigned (0)
In Progress (2)
Pending (3)
Resolved (4)
Closed (5)
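
These are the defaults; you can confirm the codes and labels in your own environment with a quick look at reviewstatuses_lookup (a sketch using the same fields your search already references):

| inputlookup reviewstatuses_lookup
| table _key status label end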

You should read and understand "Restrict notable event status transitions" <https://docs.splunk.com/Documentation/ES/latest/Admin/Customizenotables#Restrict_notable_event_statu...> before proceeding. Your Status Configuration options, including which transitions are allowed and by whom, may invalidate the examples below.

To measure the time difference between status New and status In Progress:

| tstats summariesonly=true earliest(_time) as _time from datamodel=Incident_Management by "Notable_Events_Meta.rule_id"
| rename "Notable_Events_Meta.*" as "*"
| eval status=2
| lookup update=true incident_updates_lookup rule_id status outputnew time
| search time=*
| stats earliest(_time) as create_time max(time) as in_progress_time by rule_id
| eval diff=in_progress_time-create_time
| stats avg(diff) as mean_assignment_time
| fieldformat mean_assignment_time=tostring(mean_assignment_time, "duration")
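
If you also want a per-notable breach flag rather than a mean, the same search can keep the rule_id rows and compare each one to a target. This is only a sketch, and the 10-minute target is an example value, not a product default:

| tstats summariesonly=true earliest(_time) as _time from datamodel=Incident_Management by "Notable_Events_Meta.rule_id"
| rename "Notable_Events_Meta.*" as "*"
| eval status=2
| lookup update=true incident_updates_lookup rule_id status outputnew time
| search time=*
| stats earliest(_time) as create_time max(time) as in_progress_time by rule_id
| eval diff=in_progress_time-create_time
``` example target: assigned within 10 minutes = 600 seconds ```
| eval sla=if(diff<=600, "not breached", "breached")
| convert timeformat="%F %T" ctime(create_time) ctime(in_progress_time)
| fieldformat diff=tostring(diff, "duration")
| table rule_id create_time in_progress_time diff sla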

To measure the time difference between status Pending and status Closed:

| tstats summariesonly=true earliest(_time) as _time from datamodel=Incident_Management by "Notable_Events_Meta.rule_id"
| rename "Notable_Events_Meta.*" as "*"
| eval status_pending=3, status_closed=5
| lookup update=true incident_updates_lookup rule_id status as status_pending output time as pending_time
| lookup update=true incident_updates_lookup rule_id status as status_closed output time as closed_time
| search pending_time=* closed_time=*
| stats max(pending_time) as pending_time max(closed_time) as closed_time by rule_id
| eval diff=closed_time-pending_time
| stats avg(diff) as mean_closure_time
| fieldformat mean_closure_time=tostring(mean_closure_time, "duration")
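
If you need the same per-notable view for the Pending to Closed interval, the pattern from the assignment sketch above applies here as well: keep the rule_id rows after the stats command, add an eval such as sla=if(diff<=86400, "not breached", "breached") against an example 24-hour target, and table the fields you want to display.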

You now have two metrics, mean_assignment_time and mean_closure_time.

To measure service levels, start with the values from your service level agreements, nonfunctional requirements, etc. For example:

90% of notable events must be assigned within 10 minutes
85% of notable events must be closed within 24 hours

Calculate the assignment service level:

| tstats summariesonly=true earliest(_time) as _time from datamodel=Incident_Management by "Notable_Events_Meta.rule_id"
| rename "Notable_Events_Meta.*" as "*"
| eval status=2
| lookup update=true incident_updates_lookup rule_id status outputnew time
| search time=*
| stats earliest(_time) as create_time max(time) as in_progress_time by rule_id
| eval diff=in_progress_time-create_time
``` 10 minutes = 600 seconds ```
| stats sum(eval(if(diff<=600, 1, 0))) as assignment_service_level_met count
| eval assignment_service_level=round(100*assignment_service_level_met/count, 0)."%"

Calculate the closure service level:

| tstats summariesonly=true earliest(_time) as _time from datamodel=Incident_Management by "Notable_Events_Meta.rule_id"
| rename "Notable_Events_Meta.*" as "*"
| eval status_pending=3, status_closed=5
| lookup update=true incident_updates_lookup rule_id status as status_pending outputnew time as pending_time
| lookup update=true incident_updates_lookup rule_id status as status_closed outputnew time as closed_time
| search pending_time=* closed_time=*
| stats max(pending_time) as pending_time max(closed_time) as closed_time by rule_id
| eval diff=closed_time-pending_time
``` 24 hours = 86400 seconds ```
| stats sum(eval(if(diff<=86400, 1, 0))) as closure_service_level_met count
| eval closure_service_level=round(100*closure_service_level_met/count, 0)."%"

You can use the searches in a dashboard, add where or search commands to compare the service levels to your agreements, etc.
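
For example, here is a sketch that keeps the assignment service level numeric, flags it against the 90% target from above, and only adds the percent sign at display time (the sla_status field name is just an example):

| tstats summariesonly=true earliest(_time) as _time from datamodel=Incident_Management by "Notable_Events_Meta.rule_id"
| rename "Notable_Events_Meta.*" as "*"
| eval status=2
| lookup update=true incident_updates_lookup rule_id status outputnew time
| search time=*
| stats earliest(_time) as create_time max(time) as in_progress_time by rule_id
| eval diff=in_progress_time-create_time
``` 10 minutes = 600 seconds ```
| stats sum(eval(if(diff<=600, 1, 0))) as assignment_service_level_met count
| eval assignment_service_level=round(100*assignment_service_level_met/count, 0)
| eval sla_status=if(assignment_service_level>=90, "met", "missed")
| fieldformat assignment_service_level=assignment_service_level."%"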
