Hey all, I've got a multisearch query that uses inputlookups to untangle a sprawling Kafka setup. It collects the various latencies from source to destination, evaluates them, and groups the results per application to give an overall time, e.g. AppA avg latency is 1.09 sec, AppX avg latency is 0.9 sec. The output looks like this:
Application  _time            total_avg  total_max
AppA         28/6/2023 0:00   0.05       0.09
AppA         28/6/2023 1:00   0.05       0.1
AppA         28/6/2023 2:00   0.05       0.08
AppB         28/6/2023 0:00   0.05       0.09
AppB         28/6/2023 1:00   0.22       2.72
AppB         28/6/2023 2:00   0.05       0.09
AppC         28/6/2023 0:00   0.06       0.1
AppC         28/6/2023 1:00   0.05       0.09
AppC         28/6/2023 2:00   0.05       0.09
AppX         28/6/2023 0:00   0.05       0.09
AppX         28/6/2023 1:00   0.04       0.09
AppX         28/6/2023 2:00   0.04       0.09
I'm trying to extend this query with another lookup that holds the SLA threshold for each app, and use the above output to calculate an SLA % for a dashboard that tracks SLA across all Applications. It's pretty basic: was the App's latency below its specific SLA threshold and thus "OK", or was it over and in "BREACH", per span (hourly by default)? And of course, what's the resulting SLA % per day/week/month for each App?
I'm using the following query on the above output:
| makecontinuous _time span=60m
| filldown Application
| fillnull value="-1"
| lookup SLA.csv Application AS Application OUTPUT SLA_threshold
| eval spans = if(isnull(spans),1,spans)
| fields _time Application spans SLA_threshold total_avg total_max
| eval SLA_status = if(total_avg > SLA_threshold, "BREACH", "OK")
| eval SLA_nodata = if(total_avg < 0, "NODATA", "OK")
| eval BREACH = case(SLA_status == "BREACH", 1)
| eval OK = case(SLA_status == "OK", 1)
| eval NODATA = case(SLA_nodata == "NODATA", 1)
| stats sum(spans) as TotalSpans, count(OK) as OK, count(BREACH) as BREACH, count(NODATA) as NODATA by Application
| eval SLA=OK/(TotalSpans)*100
This is mostly working okay, returning results for a dashboard like:
Application  TotalSpans  SLA_threshold  OK  BREACH  NODATA  SLA %
AppA         24          1.5            24  0       1       100
…
AppX         24          1              23  0       1       100
Unfortunately, there's a central problem I need to take into account: sometimes the apps don't have any data for their latency calculations, so the values end up null. That means a missing bucket/span, which throws off the results of the SLA eval. For example:
Application  _time            total_avg  total_max
AppA         28/6/2023 0:00   0.17       2.72
AppA         28/6/2023 1:00   0.04       0.09
AppA         28/6/2023 2:00   0.05       0.1
AppX         28/6/2023 0:00   0.04       1.09
AppX         28/6/2023 2:00   0.04       1.09
The SLA % eval will be off for AppX, which has no 1:00 bucket and so is calculated over one fewer span.
Ideally, I need to fill in those empty buckets with something, not only to count spans per App correctly without affecting the SLA % calculation, but also to flag the missing-data spans somehow. I'd like to be able to distinguish between an SLA breach for data above threshold and a breach for, say, "NODATA", so at least I have the option to choose how I treat those, or to have a secondary SLA. Something like:
Application  _time            total_avg  total_max
AppX         28/6/2023 0:00   0.04       1.09
AppX         28/6/2023 1:00   -1         -1
AppX         28/6/2023 2:00   0.04       1.09
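To make the "distinguish BREACH from NODATA" idea concrete, here is a minimal, untested sketch of a single three-way status field. It assumes the missing spans have already been filled with -1 as in the table above and that SLA_threshold has already been added by the lookup. The field name SLA_excl_nodata is hypothetical, just one way a secondary SLA could be expressed.

```spl
| eval SLA_status = case(
    total_avg < 0, "NODATA",
    total_avg > SLA_threshold, "BREACH",
    true(), "OK")
| stats count AS TotalSpans,
        count(eval(SLA_status="OK")) AS OK,
        count(eval(SLA_status="BREACH")) AS BREACH,
        count(eval(SLA_status="NODATA")) AS NODATA
        BY Application
| eval SLA = round(OK / TotalSpans * 100, 2)
| eval SLA_excl_nodata = if(TotalSpans > NODATA, round(OK / (TotalSpans - NODATA) * 100, 2), null())
```

Because case() returns the first matching branch, the NODATA test has to come before the threshold test; otherwise a -1 placeholder falls below the threshold and is classed as OK. Here SLA treats NODATA spans as not OK, while SLA_excl_nodata leaves them out of the denominator, so either treatment can be shown on the dashboard.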
Now, the eval case logic for the different SLA and data conditions is not optimal or even right (a way to eval things as NODATA and class them as OK would be good). Either way, as mentioned, this approach works okay with the above output when a single specific App is being queried. But once I search Application="*", the approach with "makecontinuous _time span=60m" plus the eval spans and other case logic no longer works as desired: the _time buckets already exist because of the other Applications that have all their results data, so makecontinuous doesn't add any missing buckets, and nothing gets filled in with "-1" for the Apps that don't.
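One possible way around this, as an untested sketch: pivot the results so each Application becomes its own column, fill the gaps, then unpivot again. With one column per Application, a missing span shows up as a null cell in that App's column, which fillnull can then replace. This assumes the input rows carry _time, Application and total_avg as in the tables above; total_max is not carried through here and would need the same treatment in a separate pass.

```spl
| xyseries _time Application total_avg
| makecontinuous _time span=60m
| fillnull value=-1
| untable _time Application total_avg
| lookup SLA.csv Application OUTPUT SLA_threshold
```

After untable, every Application has one row per hourly bucket, with -1 marking the spans that had no data, so the span counts per App line up. As far as I know, makecontinuous only fills between the earliest and latest _time present in the results, so buckets missing at the very start or end of the search window for all Apps would still not be created by this.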