Issue with MultiSearch and Missing Buckets with SLA Eval

interrobang · ‎06-29-2023

Hey all, I've got a multisearch query using inputlookups to untangle a sprawling Kafka setup. It gathers the various latencies along the path from source to destination, evaluates them, and groups the results per application for an overall time, e.g. AppA avg latency is 1.09 sec, AppX avg latency is 0.9 sec.

The simplified output of the main query looks like this for a 3-hour window (the real output has 8 or so other columns with the times that get summed).

Application	_time	total_avg	total_max
AppA	28/6/2023 0:00	0.05	0.09
AppA	28/6/2023 1:00	0.05	0.1
AppA	28/6/2023 2:00	0.05	0.08
AppB	28/6/2023 0:00	0.05	0.09
AppB	28/6/2023 1:00	0.22	2.72
AppB	28/6/2023 2:00	0.05	0.09
AppC	28/6/2023 0:00	0.06	0.1
AppC	28/6/2023 1:00	0.05	0.09
AppC	28/6/2023 2:00	0.05	0.09
AppX	28/6/2023 0:00	0.05	0.09
AppX	28/6/2023 1:00	0.04	0.09
AppX	28/6/2023 2:00	0.04	0.09

I'm trying to extend this query with another lookup that holds the SLA threshold for each app, and use the above output to calculate an SLA % for a dashboard that tracks the SLA across all Applications. Pretty basic: was the App's latency below its specific SLA threshold and thus "OK", or was it over and in "BREACH", per span (hourly by default)... and of course, what's the resulting SLA % per day/week/month for each App.

Using the following query below on the above output:

| makecontinuous _time span=60m
| filldown Application
| fillnull value="-1"
| lookup SLA.csv Application AS Application OUTPUT SLA_threshold
| eval spans = if(isnull(spans),1,spans)
| fields _time Application spans SLA_threshold total_avg total_max
| eval SLA_status = if(total_avg > SLA_threshold, "BREACH", "OK")
| eval SLA_nodata = if(total_avg < 0, "NODATA", "OK")
| eval BREACH = case(SLA_status == "BREACH", 1)
| eval OK = case(SLA_status == "OK", 1)
| eval NODATA = case(SLA_nodata == "NODATA", 1)
| stats sum(spans) as TotalSpans, count(OK) as OK, count(BREACH) as BREACH, count(NODATA) as NODATA by Application
| eval SLA=OK/(TotalSpans)*100

Which is mostly working okay, returning results for a dashboard like:

Application	TotalSpans	SLA_threshold	OK	BREACH	NODATA	SLA %
AppA	24	1.5	24	0	1	100
…						
AppX	24	1	23	0	1	100

...Unfortunately, there's a central problem I need to take into account: sometimes the apps don't have any data for their latency calculations, so the results end up null. That means a missing bucket/span, and this is throwing off the results of the SLA eval.

For the sake of space, let's say over a 3-hour period AppA is normal, with 3x 1h span buckets of latency results output -- the SLA % eval will work fine. But say for whatever reason AppX has missing results for the 01:00 bucket, looking like this:

Application	_time	total_avg	total_max
AppA	28/6/2023 0:00	0.17	2.72
AppA	28/6/2023 1:00	0.04	0.09
AppA	28/6/2023 2:00	0.05	0.1
AppX	28/6/2023 0:00	0.04	1.09
AppX	28/6/2023 2:00	0.04	1.09

The SLA% eval will be off for AppX, being calculated over one less span.
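To make the effect concrete, here is a minimal arithmetic sketch using makeresults. It assumes AppX's two observed spans are both OK and, purely for illustration, that the missing hour should count against the SLA. The field names observed_spans, expected_spans, SLA_over_observed and SLA_over_expected are hypothetical and not part of the original query.

```spl
| makeresults
| eval OK=2, observed_spans=2, expected_spans=3
| eval SLA_over_observed=round(OK/observed_spans*100, 1)
| eval SLA_over_expected=round(OK/expected_spans*100, 1)
| table OK observed_spans expected_spans SLA_over_observed SLA_over_expected
```

Over the two observed spans the result is 100%, while over the three expected spans it is about 66.7%, which is the gap the missing bucket hides.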

Ideally, I need to fill in those empty buckets with something, not only to correctly count spans per App without affecting the SLA % calculation, but also to flag the missing-data spans somehow. I'd like to be able to distinguish between an SLA breach for data above the threshold and a breach for, say, "NODATA", so at least I have the option to choose how I treat those or have a secondary SLA...

As above, my current approach to this has been to use makecontinuous _time span=60m and fillnull value="-1".

The -1 results can hit an eval for "NODATA" and be taken into account separately from buckets which "BREACH" their SLA latency threshold, e.g.

Application	_time	total_avg	total_max
AppX	28/6/2023 0:00	0.04	1.09
AppX	28/6/2023 1:00	-1	-1
AppX	28/6/2023 2:00	0.04	1.09
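One way the three conditions could be kept mutually exclusive is a single case() that checks the -1 sentinel first, so a filled bucket is never also counted as "OK". This is only a sketch: it assumes the -1 fill shown above and the SLA_threshold field from SLA.csv, and it counts rows per Application rather than summing a spans field.

```spl
| lookup SLA.csv Application OUTPUT SLA_threshold
| eval SLA_status = case(total_avg < 0, "NODATA", total_avg > SLA_threshold, "BREACH", true(), "OK")
| stats count as TotalSpans, count(eval(SLA_status="OK")) as OK, count(eval(SLA_status="BREACH")) as BREACH, count(eval(SLA_status="NODATA")) as NODATA by Application
| eval SLA = round(OK / TotalSpans * 100, 2)
```

If NODATA spans should be classed as OK instead, the last line could use (OK + NODATA) / TotalSpans. To exclude them from the SLA entirely, it could divide OK by (TotalSpans - NODATA).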

Now, the eval case logic for the different SLA and data conditions is not optimal or even right (a way to eval things as NODATA and class them as OK would be good)... Either way, as mentioned, this approach works okay with the above output when a single specific App is being queried. But once I search Application="*", the approach with "makecontinuous _time span=60m" plus the eval spans and other case logic no longer works as desired: the _time buckets already exist for the other Applications that have all their results data, so makecontinuous doesn't add any missing buckets or fill in "-1" for the Apps that don't.
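A possible way around this is to pivot so that every Application becomes a column before filling, then pivot back. The sketch below carries only total_avg, which is an assumption to keep it short; the other latency columns would need the same treatment or a join back afterwards.

```spl
| xyseries _time Application total_avg
| makecontinuous _time span=60m
| fillnull value=-1
| untable _time Application total_avg
```

After xyseries each hour is a single row with one column per Application, so fillnull writes -1 for any Application that is missing from an hour another Application has. makecontinuous still covers hours that every Application is missing, though by default only between the earliest and latest bucket present. untable then returns one row per Application per hour, ready for the SLA lookup and eval.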

I've also tried timechart, which will fill things in for all Apps, but then I'm faced with another problem: because Application is a non-numeric field, it adds a gazillion columns, e.g. "total_avg: AppA" ... "total_avg: AppX" etc., and there are a dozen other numeric results columns. I'd prefer simpler, Application-specific output.

Any suggestions for a tweak or an alternate way to make makecontinuous _time work on a per-Application basis, or a way to simplify or pivot off of the timechart output?
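For pivoting off of timechart, one option that may fit is untable, which turns the per-Application columns back into rows. This is a hedged sketch for a single metric only: with one aggregation and a by clause the columns are named after the Application values, and limit=0 is used on the assumption that there are more Applications than the default series limit allows.

```spl
| timechart span=1h limit=0 avg(total_avg) as total_avg by Application
| fillnull value=-1
| untable _time Application total_avg
```

The fillnull runs before untable so that empty buckets become explicit -1 rows rather than being left out. Since each Application has at most one row per hour here, avg() should simply pass the existing value through.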
