Hey all, I've got a multisearch query using inputlookups to untangle a sprawling Kafka setup: it gathers the various latencies from source to destination, evaluates them, and groups the results per application for an overall time, e.g. AppA avg latency is 1.09sec, AppX avg latency is 0.9sec.

The simplified output of the main query looks like this for a 3-hour window (only with 8 or so other columns holding the times that get summed):

Application _time total_avg total_max
AppA 28/6/2023 0:00 0.05 0.09
AppA 28/6/2023 1:00 0.05 0.1
AppA 28/6/2023 2:00 0.05 0.08
AppB 28/6/2023 0:00 0.05 0.09
AppB 28/6/2023 1:00 0.22 2.72
AppB 28/6/2023 2:00 0.05 0.09
AppC 28/6/2023 0:00 0.06 0.1
AppC 28/6/2023 1:00 0.05 0.09
AppC 28/6/2023 2:00 0.05 0.09
AppX 28/6/2023 0:00 0.05 0.09
AppX 28/6/2023 1:00 0.04 0.09
AppX 28/6/2023 2:00 0.04 0.09

I'm trying to extend this query against another lookup holding the SLA threshold for each app, and from the above output calculate an SLA % for a dashboard that tracks the SLA across all Applications. Pretty basic: was the App's latency below its specific SLA threshold and thus "OK", or was it over and in "BREACH", per span (hourly by default)... and of course, what's the resulting SLA % per day/week/month for each App.

I'm using the following query on the above output:

| makecontinuous _time span=60m
| filldown Application
| fillnull value="-1"
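``` intended: makecontinuous adds the missing hourly buckets, filldown carries Application into them, and fillnull marks the empty metrics with -1 ```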
| lookup SLA.csv Application AS Application OUTPUT SLA_threshold
| eval spans = if(isnull(spans),1,spans)
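``` spans is never set upstream, so this defaults every row to one 1-hour span ```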
| fields _time Application spans SLA_threshold total_avg total_max
| eval SLA_status = if(total_avg > SLA_threshold, "BREACH", "OK")
| eval SLA_nodata = if(total_avg < 0, "NODATA", "OK")
| eval BREACH = case(SLA_status == "BREACH", 1)
| eval OK = case(SLA_status == "OK", 1)
| eval NODATA = case(SLA_nodata == "NODATA", 1)
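``` each flag is 1 when its condition matches and null otherwise, so count() tallies them; note a -1 row evaluates as both OK and NODATA ```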
| stats sum(spans) as TotalSpans, values(SLA_threshold) as SLA_threshold, count(OK) as OK, count(BREACH) as BREACH, count(NODATA) as NODATA by Application
| eval SLA=OK/(TotalSpans)*100

This is mostly working okay, returning results for a dashboard like:

Application TotalSpans SLA_threshold OK BREACH NODATA SLA %
AppA 24 1.5 24 0 1 100
…
AppX 24 1 23 0 1 100

...Unfortunately, there's a central problem I need to take into account: sometimes the apps don't have any data for their latency calculations, so they end up null, which means a missing bucket/span, and this throws off the SLA eval.

For the sake of space, let's say that over a 3-hour period AppA is normal, with 3x 1h span buckets of latency results output -- the SLA % eval will work fine. But say that, for whatever reason, AppX has missing results for bucket 01h, looking like this:

Application _time total_avg total_max
AppA 28/6/2023 0:00 0.17 2.72
AppA 28/6/2023 1:00 0.04 0.09
AppA 28/6/2023 2:00 0.05 0.1
AppX 28/6/2023 0:00 0.04 1.09
AppX 28/6/2023 2:00 0.04 1.09

The SLA % eval will be off for AppX, being calculated over one less span. Ideally, I need to fill in those empty buckets with something, not only to correctly count spans per App without affecting the SLA % calculation, but also to flag the missing-data spans somehow -- being able to distinguish between a breach for data above the threshold and a "NODATA" span, so I at least have the option to choose how I treat those, or run a secondary SLA...

As above, my current approach has been to use makecontinuous _time span=60m and fillnull value="-1". The -1 results can hit an eval for "NODATA" and be taken into account separately from buckets which "BREACH" their SLA latency threshold, e.g.:

Application _time total_avg total_max
AppX 28/6/2023 0:00 0.04 1.09
AppX 28/6/2023 1:00 -1 -1
AppX 28/6/2023 2:00 0.04 1.09

Now, the eval case logic for the different SLA and data conditions is not optimal, or even right (a way to eval things as NODATA and still class them as OK would be good)...

Either way, as mentioned, this approach works okay with the above output when a single specific App is being queried. But once I search Application="*", the approach with makecontinuous _time span=60m and the spans/case eval logic no longer works as desired, because the _time buckets already exist for the other Applications that have all their results data, so makecontinuous doesn't add any missing buckets or fill in "-1" for the Apps that don't.

I've also tried timechart, which will fill things in for all Apps, but then I'm faced with another problem: because Application is a non-numeric field, it adds a gazillion columns, e.g. "total_avg: AppA" ... "total_avg: AppX" etc., and there are a dozen other numeric results columns. I'd prefer the output to stay simple and Application-specific.

Any suggestions for a tweak, an alternate way to make makecontinuous _time work on a per-Application basis, or a way to simplify or pivot off of the timechart output?
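For reference, the closest I've gotten to sketching that pivot is something like this -- untested, assuming a single metric (total_avg) for simplicity, and re-applying the SLA lookup after the pivot since timechart drops the threshold field:

| timechart span=60m avg(total_avg) by Application limit=0
| fillnull value="-1"
| untable _time Application total_avg
| lookup SLA.csv Application OUTPUT SLA_threshold
| eval SLA_status = case(total_avg < 0, "NODATA", total_avg > SLA_threshold, "BREACH", true(), "OK")

The idea being that timechart creates every span for every Application (so makecontinuous isn't needed), fillnull marks the empty cells with -1 before untable would otherwise drop them, and untable pivots the per-App columns back into one row per Application per span. The dozen other metric columns would each need the same treatment, which is part of what I'm hoping to avoid.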