Alerting

Conditional Alert based on comparison of results of 30mins window

poddraj
Explorer

Hi,
My scenario is that I have Counts of Total Requests, Success, Failure & Failure% for time span of every 30 mins over last 2 hours
Let's say first 30mins I got 100 hits and failure% is more than 60% then I want to send an alert immediately but let's say if first 30mins failure% is between 30-50% then I want to see the failure% of previous 30mins and if the failure% of this 30mins is also b/w 30-50% then I want to see one more previous 30mins failure% and if that interval also has same failure% then I want to trigger alert but if the 2nd or 3rd 30min interval has less then 30% failure then I do not want to send alert

I want this alert to be running for every 15 mins. How can I do this in splunk?
I have written below query to get the events for last 2 hours but could not move ahead on the next steps.

index=dte_fios sourcetype=dte2_Fios FT=*FT | eval Interval=strftime('_time',"%d-%m-%Y %H:%M:%S")
| eval Status=case(Error_Code=="0000","Success",1=1,"Failure")
| timechart span=30m count by Status
| eval Total = Success + Failure
| eval Failure%=round(Failure/Total*100)
| table _time,Total,Success,Failure,Failure%

Output is below:
_time Total Success Failure Failure%
2020-04-20 05:00:00 75 61 14 19
2020-04-20 05:30:00 207 129 78 38
2020-04-20 06:00:00 25 10 15 60

Labels (1)
0 Karma
1 Solution

to4kawa
Ultra Champion
| makeresults
| eval _raw="time,Total,Success,Failure,Failure_perc
2020-04-20 05:00:00,75,61,14,19
2020-04-20 05:30:00,207,129,78,38
2020-04-20 06:00:00,25,10,15,60"
| multikv forceheader=1
| eval _time=strptime(time,"%F %T")
| table _time,Total,Success,Failure,Failure_perc
| rename COMMENT as "this is your result. from here, the logic"
| autoregress Failure_perc p=2 as F_anHourAgo 
| autoregress Failure_perc p=1 as F_30MinsAgo
| eval alert=case(Failure_perc > 60, 1
, (30 <= Failure_perc AND Failure_perc <= 50) AND (30 <= F_30MinsAgo AND F_30MinsAgo <= 50) AND (30 <= F_anHourAgo AND F_anHourAgo <= 50), 1
, true(), 0)

timechart is ascending order by default.
and autoregress can keep old value.

...
| where alert = 1

For alerting.
By the way, between 50% and 60%, do you fire alert?

View solution in original post

to4kawa
Ultra Champion
| makeresults
| eval _raw="time,Total,Success,Failure,Failure_perc
2020-04-20 05:00:00,75,61,14,19
2020-04-20 05:30:00,207,129,78,38
2020-04-20 06:00:00,25,10,15,60"
| multikv forceheader=1
| eval _time=strptime(time,"%F %T")
| table _time,Total,Success,Failure,Failure_perc
| rename COMMENT as "this is your result. from here, the logic"
| autoregress Failure_perc p=2 as F_anHourAgo 
| autoregress Failure_perc p=1 as F_30MinsAgo
| eval alert=case(Failure_perc > 60, 1
, (30 <= Failure_perc AND Failure_perc <= 50) AND (30 <= F_30MinsAgo AND F_30MinsAgo <= 50) AND (30 <= F_anHourAgo AND F_anHourAgo <= 50), 1
, true(), 0)

timechart is ascending order by default.
and autoregress can keep old value.

...
| where alert = 1

For alerting.
By the way, between 50% and 60%, do you fire alert?

poddraj
Explorer

Thanks to4kawa...

It worked for me, however I am trying some modification as per my scenario but the query you gave really helped alot. Will come back to you for any further help..

0 Karma

poddraj
Explorer

Hi to4kawa,
I am trying to handle couple of scenarios based out of above query..
I want to alert the transaction details of failures which fall into alert=1 category and have written below query to achieve it

index=dte_fios sourcetype=dte2_Fios FT=*FT Error_Code!=0000 earliest=04/20/2020:11:00:00 latest=04/20/2020:13:00:00
| bin _time span=15m
| stats count as Total, count(eval(Error_Code!="0000")) AS Failure by FT,_time
| eval Failurepercent=round(Failure/Total*100)
| multikv forceheader=1
| table _time,FT,Total,Failure,Failurepercent
| autoregress Failurepercent p=3 as F_45MinsAgo
| autoregress Failurepercent p=2 as F_30MinsAgo
| autoregress Failurepercent p=1 as F_15MinsAgo
| lookup ftthresholdlkp FT
| eval alert=case(Minhits<=Total AND Total<=Maxhits AND Failurepercent > Failure_Threshold1, 1
, (Minhits<=Total AND Total<=Maxhits) AND (Failure_Threshold2 <= Failurepercent AND Failurepercent <= Failure_Threshold3) AND (Failure_Threshold2 <= F_15MinsAgo AND F_15MinsAgo <= Failure_Threshold3) AND (Failure_Threshold2 <= F_30MinsAgo AND F_30MinsAgo <= Failure_Threshold3) AND (Failure_Threshold2 <= F_45MinsAgo AND F_45MinsAgo <= Failure_Threshold3), 1
, (Minhits<=Total AND Total<=Maxhits) AND (Failure_Threshold3 <= Failurepercent AND Failurepercent <= Failure_Threshold4) AND (Failure_Threshold3 <= F_15MinsAgo AND F_15MinsAgo <= Failure_Threshold4) AND (Failure_Threshold3 <= F_30MinsAgo AND F_30MinsAgo <= Failure_Threshold4) ,1,(Minhits<=Total AND Total<=Maxhits) AND (Failure_Threshold4 <= Failurepercent AND Failurepercent <= Failure_Threshold5) AND (Failure_Threshold4 <= F_15MinsAgo AND F_15MinsAgo <= Failure_Threshold5),1,true(), 0)
| where alert=1
| map search="search index=dte_fios sourcetype=dte2_Fios FT=$FT$ earliest=04/20/2020:12:45:00 latest=04/20/2020:13:00:00 |table _time,WPID,MGRID,Host,System,DIP_Command,CID,DTE_Command,FT,OSS,Error_Code,Error_Msg"

If you observe line 3 of my query where I did stats by FT,_time.. this is required for me because I want to compare the failures for every 15mins for each domain in my application and whichever domain has highest failures I want to send those transaction details of that domain over the last 15mins in my alert email which you can see in my map query.

I have 2 doubts here:
1. If I get multiple domain satisfying the alert condition how can I get only the top domain from the where output so that I can use only that FT in my second query.
2. If suppose my alert is running at 1 PM and I have records during 12:45 - 1 PM but do not have records during 12:30 - 12:45 PM but have records during 12:15 - 12:30 PM then this F_15Minago is showing the Failurepercent value of duration 12:15 - 12:30 PM because I do not have row for 12:30-12:45 PM interval.. How can I handle this as this is incorrect

0 Karma

to4kawa
Ultra Champion
  1. what's top domain ? you can use lookup. so , use it and select with where
  2. Alert is scheduled search. select appropriate time picker and add where _time > as_you_kike you can use addinfo and relative_time()
0 Karma

poddraj
Explorer

What I meant by top domain is if I get two domains satisfying alert=1 condition how can I get the domain with high failure percent as input to my second query i.e. $FT$ value as I want to send top failing domain details only in alert email.

  1. Let us say my alert is running at 13:00PM then my req is to check for any domain crossing failure threshold during last 15 mins i.e. 12:45-1300 and I was asked to send the Total & Failures count over previous 4 15mins interval in alert email along with latest 15 mins interval. What I observed is if count is zero for 12:30-12:45 and >0 for other intervals (i.e. 12:15:12:30 & 12:00-12:15) then above query is not returning a row with that interval due to which the failurepercent value of F_15minsago is showing the count of interval 12:15-12:30 which should be the value of F-30minsago. Is the a way I can get row for every 15mins interval even though that interval doesn't have count for that domain
0 Karma

to4kawa
Ultra Champion

how can I get the domain with high failure percent

sort - Failurepercent | head 1

  1. Let us ...

| bin _time span=15m | stats count as Total, count(eval(Error_Code!="0000")) AS Failure by FT,_time

use timechart span=15min

0 Karma

poddraj
Explorer

Hi to4kawa,
When I use timechart in place of bin _time span=15m it is not giving me the desired table output and I cannot add timechart command after my stats command as it will limit the columns in the O/P

0 Karma

to4kawa
Ultra Champion

timechart in place of bin _time span=15m it is not giving me the desired table output

your query is wrong.
please fix it.

after my stats this can't work.

0 Karma

rmmiller
Contributor

Neat example for autoregress!

0 Karma

to4kawa
Ultra Champion
0 Karma

poddraj
Explorer

Hi
I couldn't get an idea on how to use this autoregress. I have added it to my query and got below O/P. How can I compare the failure% values of 3 intervals and take decision to alert.

Query:

index=dte_fios sourcetype=dte2_Fios FT=*FT
| eval Status=case(Error_Code=="0000","Success",1=1,"Failure")
| bin _time span=30m
| stats count as totalCount , count(eval(Error_Code="0000")) AS Success, count(eval(Error_Code!="0000")) AS Failure by _time
| eval Failurepercent=round(Failure/totalCount*100)
| autoregress Failurepercent p=1
| autoregress totalCount AS old_total p=1

OUTPUT

_time totalCount Success Failure Failurepercent Failurepercent_p1 old_total

2020-04-20 05:00:00 75 61 14 19

2020-04-20 05:30:00 207 129 78 38 19 75
2020-04-20 06:00:00 25 10 15 60 38 207

0 Karma
Get Updates on the Splunk Community!

.conf24 | Day 0

Hello Splunk Community! My name is Chris, and I'm based in Canberra, Australia's capital, and I travelled for ...

Enhance Security Visibility with Splunk Enterprise Security 7.1 through Threat ...

(view in My Videos)Struggling with alert fatigue, lack of context, and prioritization around security ...

Troubleshooting the OpenTelemetry Collector

  In this tech talk, you’ll learn how to troubleshoot the OpenTelemetry collector - from checking the ...