MultiSearch and Missing Buckets Messing with SLA Calculations

interrobang · 06-28-2023

I've got a multisearch query that uses inputlookups to trace a sprawling Kafka setup. It collects the various latencies from source to destination and groups the results per application, e.g. AppA avg latency is 1.09 sec, AppX avg latency is 0.9 sec.

For example:

Application	_time	total_avg	total_max
AppA	28/6/2023 0:00	0.05	0.09
AppA	28/6/2023 1:00	0.05	0.1
AppA	28/6/2023 2:00	0.05	0.08
AppB	28/6/2023 0:00	0.05	0.09
AppB	28/6/2023 1:00	0.22	2.72
AppB	28/6/2023 2:00	0.05	0.09
AppC	28/6/2023 0:00	0.06	0.1
AppC	28/6/2023 1:00	0.05	0.09
AppC	28/6/2023 2:00	0.05	0.09
AppX	28/6/2023 0:00	0.05	0.09
AppX	28/6/2023 1:00	0.04	0.09
AppX	28/6/2023 2:00	0.04	0.09

There are many other numeric result columns, but for the sake of simplicity, and given the end goal of evaluating the SLA %, they're irrelevant.

I'm trying to extend the query that generates this output and build a dashboard to track the SLA across all Applications. Simply put: per span (hourly), was the App's latency below its App-specific SLA threshold (OK), or was it over (in BREACH)? And, of course, what is the resulting SLA % per App per day/week/month?

I append the following to the query that produces the output above:

| makecontinuous _time span=60m
| filldown Application
| fillnull value="-1"
| lookup SLA.csv Application AS Application OUTPUT SLA_threshold
| eval spans = if(isnull(spans),1,spans)
| fields _time Application spans SLA_threshold total_avg total_max
| eval SLA_status = if(total_avg > SLA_threshold, "BREACH", "OK")
| eval SLA_nodata = if(total_avg < 0, "NODATA", "OK")
| eval BREACH = case(SLA_status == "BREACH", 1)
| eval OK = case(SLA_status == "OK", 1)
| eval NODATA = case(SLA_nodata == "NODATA", 1)
| stats sum(spans) as TotalSpans, count(OK) as OK, count(BREACH) as BREACH, count(NODATA) as NODATA by Application
| eval SLA=OK/(TotalSpans)*100

This mostly works and returns results for a dashboard like:

Application	TotalSpans	SLA_threshold	OK	BREACH	NODATA	SLA %
AppA	24	1.5	24	0	1	100
…						
AppX	24	1	23	0	1	100

Unfortunately, there's a central problem I need to take into account: sometimes the apps have no data for their latency calculations, so the values end up null. This throws off the SLA results because it leaves missing buckets/spans.

For the sake of space, let's say that over a 3-hour period AppA is normal, with 3x 1h span buckets of latency data, so the SLA % eval works fine. But AppX has no results for the 01h bucket, looking like this:

Application	_time	total_avg	total_max
AppA	28/6/2023 0:00	0.17	2.72
AppA	28/6/2023 1:00	0.04	0.09
AppA	28/6/2023 2:00	0.05	0.1
AppX	28/6/2023 0:00	0.04	1.09
AppX	28/6/2023 2:00	0.04	1.09

The SLA % eval will be off for AppX because it is calculated over two spans instead of three.

Ideally, I need to fill in those empty buckets with something, not only to count spans per App correctly so the SLA % calculation isn't affected, but also to flag the missing data somehow. I'd like to be able to distinguish between an SLA breach for data above the threshold and a breach for, say, no data, or at least have the option to choose how I treat it.

My current approach, as above, has been to use makecontinuous _time span=60m and fillnull value="-1".

The -1 results can hit an eval for "NODATA" and be counted separately from a "BREACH", e.g.:

Application	_time	total_avg	total_max
AppX	28/6/2023 0:00	0.04	1.09
AppX	28/6/2023 1:00	-1	-1
AppX	28/6/2023 2:00	0.04	1.09

Now, the eval case logic for the different SLA and data conditions is not optimal, or even right (a way to eval things as NODATA and class them as OK would be good). Either way, as mentioned, this approach works okay with the above output when a single specific App is being queried. But once I search Application="*", the approach with makecontinuous _time span=60m, the eval spans and the other case logic no longer works as desired: the _time buckets already exist for the other Applications that have all their results data, so makecontinuous doesn't add any missing buckets, and nothing gets filled in with -1 for the Apps that don't.
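
A minimal sketch of how the status logic could separate the three cases, assuming missing spans have already been filled with -1 and that SLA_threshold is present on every row. The output field names SLA_pct and SLA_pct_nodata_ok are hypothetical, introduced only for illustration:

```spl
| eval SLA_status = case(total_avg < 0, "NODATA", total_avg > SLA_threshold, "BREACH", true(), "OK")
| stats count AS TotalSpans,
        count(eval(SLA_status=="OK")) AS OK,
        count(eval(SLA_status=="BREACH")) AS BREACH,
        count(eval(SLA_status=="NODATA")) AS NODATA,
        values(SLA_threshold) AS SLA_threshold
        BY Application
| eval SLA_pct = round(OK / TotalSpans * 100, 2)
| eval SLA_pct_nodata_ok = round((OK + NODATA) / TotalSpans * 100, 2)
```

Because case() returns the first matching branch, the NODATA test has to come before the threshold comparison; otherwise a -1 value would fall through to "OK". Keeping two percentages leaves the choice open between treating a NODATA span as a miss (SLA_pct) or as OK (SLA_pct_nodata_ok).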

I've also tried timechart, which will fill things in for all Apps, but then I'm faced with another problem: because Application is a non-numeric field, it adds a gazillion columns, e.g. "total_avg: AppA" ... "total_avg: AppX" etc., and there are a dozen other numeric results columns. I'd prefer simpler, Application-specific output.
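
For a single metric, one possible way to get per-Application rows back out of timechart is to unpivot the result with untable. This is only a sketch, under the assumption that total_avg alone is needed for the SLA calculation; with several aggregates in one timechart, the columns become "metric: App" pairs and would need further splitting:

```spl
| timechart span=1h limit=0 avg(total_avg) AS total_avg BY Application
| fillnull value=-1
| untable _time Application total_avg
```

Here timechart creates one column per Application, with an empty cell for each hour that has no data. fillnull turns those cells into -1, and untable converts the columns back into one row per _time and Application. limit=0 keeps every Application as its own column instead of grouping the smaller ones together.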

Any suggestions for a tweak or an alternate way to make makecontinuous _time work on a per-Application basis, or a way to simplify or pivot off of the timechart output?
