I've got a multisearch query, basically using inputlookups, to trace a sprawling Kafka setup, collecting all the various latencies from source to destination and grouping the results per application, e.g. AppA avg latency is 1.09 sec, AppX avg latency is 0.9 sec.
Application _time total_avg total_max
AppA 28/6/2023 0:00 0.05 0.09
AppA 28/6/2023 1:00 0.05 0.1
AppA 28/6/2023 2:00 0.05 0.08
AppB 28/6/2023 0:00 0.05 0.09
AppB 28/6/2023 1:00 0.22 2.72
AppB 28/6/2023 2:00 0.05 0.09
AppC 28/6/2023 0:00 0.06 0.1
AppC 28/6/2023 1:00 0.05 0.09
AppC 28/6/2023 2:00 0.05 0.09
AppX 28/6/2023 0:00 0.05 0.09
AppX 28/6/2023 1:00 0.04 0.09
AppX 28/6/2023 2:00 0.04 0.09
There are many other numeric result columns, but for the sake of simplicity and the end goal of evaluating the SLA %, they're irrelevant.
I'm trying to extend the query generating this output and build a dashboard to track the SLA across all applications. Simply: was the app's latency below that app's specific SLA expectation/threshold (OK), or was it over (BREACH), per hourly span... and of course, what's the resulting SLA % per app per day/week/month?
I'm appending the following to the query that produces the output above:
| makecontinuous _time span=60m
| filldown Application
| fillnull value="-1"
| lookup SLA.csv Application AS Application OUTPUT SLA_threshold
| eval spans = if(isnull(spans),1,spans)
| fields _time Application spans SLA_threshold total_avg total_max
| eval SLA_status = if(total_avg > SLA_threshold, "BREACH", "OK")
| eval SLA_nodata = if(total_avg < 0, "NODATA", "OK")
| eval BREACH = case(SLA_status == "BREACH", 1)
| eval OK = case(SLA_status == "OK", 1)
| eval NODATA = case(SLA_nodata == "NODATA", 1)
| stats sum(spans) as TotalSpans, count(OK) as OK, count(BREACH) as BREACH, count(NODATA) as NODATA by Application, SLA_threshold
| eval SLA=OK/(TotalSpans)*100
This is mostly working okay and returns results for a dashboard like:
Application TotalSpans SLA_threshold OK BREACH NODATA SLA %
AppA 24 1.5 24 0 1 100
…
AppX 24 1 23 0 1 100
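For the per-day/week/month breakdown, I'm assuming the final stats just gains a time bucket, roughly like this untested sketch (span=1d swapped for 1w/1mon as needed):

```
| bin _time span=1d
| stats sum(spans) as TotalSpans, count(OK) as OK, count(BREACH) as BREACH, count(NODATA) as NODATA by Application, _time
| eval SLA=round(OK/TotalSpans*100,2)
```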
But unfortunately, there's a central problem I need to take into account: sometimes the apps don't have any data for their latency calculations, which ends up null, and this throws off the SLA results because it leaves missing buckets/spans.
Application _time total_avg total_max
AppA 28/6/2023 0:00 0.17 2.72
AppA 28/6/2023 1:00 0.04 0.09
AppA 28/6/2023 2:00 0.05 0.1
AppX 28/6/2023 0:00 0.04 1.09
AppX 28/6/2023 2:00 0.04 1.09
The SLA % eval will be off for AppX, which ends up with one less span.
Ideally, I need to fill in those empty buckets with something, not only to correctly count spans per app (so the SLA % calculation isn't affected), but also to flag the missing data somehow. I want to be able to distinguish between an SLA breach for data above threshold and a breach for no data, or at least have the option to choose how I treat it. Something like:
Application _time total_avg total_max
AppX 28/6/2023 0:00 0.04 1.09
AppX 28/6/2023 1:00 -1 -1
AppX 28/6/2023 2:00 0.04 1.09
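One idea I've been toying with (untested sketch): pivot to a column per application so makecontinuous/fillnull create the missing app/hour combinations, pivot back, then make the status eval NODATA-aware:

```
| xyseries _time Application total_avg
| makecontinuous _time span=60m
| fillnull value=-1
| untable _time Application total_avg
| lookup SLA.csv Application OUTPUT SLA_threshold
| eval SLA_status = case(total_avg < 0, "NODATA", total_avg > SLA_threshold, "BREACH", true(), "OK")
```

Since xyseries only carries a single data field, total_max would need the same treatment separately. The case() ordering means NODATA spans no longer land in OK, so I could then decide whether missing data counts against the SLA (OK/TotalSpans) or is excluded (OK/(OK+BREACH)).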