Alerting

Conditional Alert based on comparison of results of 30mins window

poddraj
Explorer

Hi,
My scenario is that I have Counts of Total Requests, Success, Failure & Failure% for time span of every 30 mins over last 2 hours
Let's say first 30mins I got 100 hits and failure% is more than 60% then I want to send an alert immediately but let's say if first 30mins failure% is between 30-50% then I want to see the failure% of previous 30mins and if the failure% of this 30mins is also b/w 30-50% then I want to see one more previous 30mins failure% and if that interval also has same failure% then I want to trigger alert but if the 2nd or 3rd 30min interval has less then 30% failure then I do not want to send alert

I want this alert to be running for every 15 mins. How can I do this in splunk?
I have written below query to get the events for last 2 hours but could not move ahead on the next steps.

index=dte_fios sourcetype=dte2_Fios FT=*FT | eval Interval=strftime('_time',"%d-%m-%Y %H:%M:%S")
| eval Status=case(Error_Code=="0000","Success",1=1,"Failure")
| timechart span=30m count by Status
| eval Total = Success + Failure
| eval Failure%=round(Failure/Total*100)
| table _time,Total,Success,Failure,Failure%

Output is below:
_time Total Success Failure Failure%
2020-04-20 05:00:00 75 61 14 19
2020-04-20 05:30:00 207 129 78 38
2020-04-20 06:00:00 25 10 15 60

Labels (1)
0 Karma
1 Solution

to4kawa
Ultra Champion
| makeresults
| eval _raw="time,Total,Success,Failure,Failure_perc
2020-04-20 05:00:00,75,61,14,19
2020-04-20 05:30:00,207,129,78,38
2020-04-20 06:00:00,25,10,15,60"
| multikv forceheader=1
| eval _time=strptime(time,"%F %T")
| table _time,Total,Success,Failure,Failure_perc
| rename COMMENT as "this is your result. from here, the logic"
| autoregress Failure_perc p=2 as F_anHourAgo 
| autoregress Failure_perc p=1 as F_30MinsAgo
| eval alert=case(Failure_perc > 60, 1
, (30 <= Failure_perc AND Failure_perc <= 50) AND (30 <= F_30MinsAgo AND F_30MinsAgo <= 50) AND (30 <= F_anHourAgo AND F_anHourAgo <= 50), 1
, true(), 0)

timechart is ascending order by default.
and autoregress can keep old value.

...
| where alert = 1

For alerting.
By the way, between 50% and 60%, do you fire alert?

View solution in original post

to4kawa
Ultra Champion
| makeresults
| eval _raw="time,Total,Success,Failure,Failure_perc
2020-04-20 05:00:00,75,61,14,19
2020-04-20 05:30:00,207,129,78,38
2020-04-20 06:00:00,25,10,15,60"
| multikv forceheader=1
| eval _time=strptime(time,"%F %T")
| table _time,Total,Success,Failure,Failure_perc
| rename COMMENT as "this is your result. from here, the logic"
| autoregress Failure_perc p=2 as F_anHourAgo 
| autoregress Failure_perc p=1 as F_30MinsAgo
| eval alert=case(Failure_perc > 60, 1
, (30 <= Failure_perc AND Failure_perc <= 50) AND (30 <= F_30MinsAgo AND F_30MinsAgo <= 50) AND (30 <= F_anHourAgo AND F_anHourAgo <= 50), 1
, true(), 0)

timechart is ascending order by default.
and autoregress can keep old value.

...
| where alert = 1

For alerting.
By the way, between 50% and 60%, do you fire alert?

poddraj
Explorer

Thanks to4kawa...

It worked for me, however I am trying some modification as per my scenario but the query you gave really helped alot. Will come back to you for any further help..

0 Karma

poddraj
Explorer

Hi to4kawa,
I am trying to handle couple of scenarios based out of above query..
I want to alert the transaction details of failures which fall into alert=1 category and have written below query to achieve it

index=dte_fios sourcetype=dte2_Fios FT=*FT Error_Code!=0000 earliest=04/20/2020:11:00:00 latest=04/20/2020:13:00:00
| bin _time span=15m
| stats count as Total, count(eval(Error_Code!="0000")) AS Failure by FT,_time
| eval Failurepercent=round(Failure/Total*100)
| multikv forceheader=1
| table _time,FT,Total,Failure,Failurepercent
| autoregress Failurepercent p=3 as F_45MinsAgo
| autoregress Failurepercent p=2 as F_30MinsAgo
| autoregress Failurepercent p=1 as F_15MinsAgo
| lookup ftthresholdlkp FT
| eval alert=case(Minhits<=Total AND Total<=Maxhits AND Failurepercent > Failure_Threshold1, 1
, (Minhits<=Total AND Total<=Maxhits) AND (Failure_Threshold2 <= Failurepercent AND Failurepercent <= Failure_Threshold3) AND (Failure_Threshold2 <= F_15MinsAgo AND F_15MinsAgo <= Failure_Threshold3) AND (Failure_Threshold2 <= F_30MinsAgo AND F_30MinsAgo <= Failure_Threshold3) AND (Failure_Threshold2 <= F_45MinsAgo AND F_45MinsAgo <= Failure_Threshold3), 1
, (Minhits<=Total AND Total<=Maxhits) AND (Failure_Threshold3 <= Failurepercent AND Failurepercent <= Failure_Threshold4) AND (Failure_Threshold3 <= F_15MinsAgo AND F_15MinsAgo <= Failure_Threshold4) AND (Failure_Threshold3 <= F_30MinsAgo AND F_30MinsAgo <= Failure_Threshold4) ,1,(Minhits<=Total AND Total<=Maxhits) AND (Failure_Threshold4 <= Failurepercent AND Failurepercent <= Failure_Threshold5) AND (Failure_Threshold4 <= F_15MinsAgo AND F_15MinsAgo <= Failure_Threshold5),1,true(), 0)
| where alert=1
| map search="search index=dte_fios sourcetype=dte2_Fios FT=$FT$ earliest=04/20/2020:12:45:00 latest=04/20/2020:13:00:00 |table _time,WPID,MGRID,Host,System,DIP_Command,CID,DTE_Command,FT,OSS,Error_Code,Error_Msg"

If you observe line 3 of my query where I did stats by FT,_time.. this is required for me because I want to compare the failures for every 15mins for each domain in my application and whichever domain has highest failures I want to send those transaction details of that domain over the last 15mins in my alert email which you can see in my map query.

I have 2 doubts here:
1. If I get multiple domain satisfying the alert condition how can I get only the top domain from the where output so that I can use only that FT in my second query.
2. If suppose my alert is running at 1 PM and I have records during 12:45 - 1 PM but do not have records during 12:30 - 12:45 PM but have records during 12:15 - 12:30 PM then this F_15Minago is showing the Failurepercent value of duration 12:15 - 12:30 PM because I do not have row for 12:30-12:45 PM interval.. How can I handle this as this is incorrect

0 Karma

to4kawa
Ultra Champion
  1. what's top domain ? you can use lookup. so , use it and select with where
  2. Alert is scheduled search. select appropriate time picker and add where _time > as_you_kike you can use addinfo and relative_time()
0 Karma

poddraj
Explorer

What I meant by top domain is if I get two domains satisfying alert=1 condition how can I get the domain with high failure percent as input to my second query i.e. $FT$ value as I want to send top failing domain details only in alert email.

  1. Let us say my alert is running at 13:00PM then my req is to check for any domain crossing failure threshold during last 15 mins i.e. 12:45-1300 and I was asked to send the Total & Failures count over previous 4 15mins interval in alert email along with latest 15 mins interval. What I observed is if count is zero for 12:30-12:45 and >0 for other intervals (i.e. 12:15:12:30 & 12:00-12:15) then above query is not returning a row with that interval due to which the failurepercent value of F_15minsago is showing the count of interval 12:15-12:30 which should be the value of F-30minsago. Is the a way I can get row for every 15mins interval even though that interval doesn't have count for that domain
0 Karma

to4kawa
Ultra Champion

how can I get the domain with high failure percent

sort - Failurepercent | head 1

  1. Let us ...

| bin _time span=15m | stats count as Total, count(eval(Error_Code!="0000")) AS Failure by FT,_time

use timechart span=15min

0 Karma

poddraj
Explorer

Hi to4kawa,
When I use timechart in place of bin _time span=15m it is not giving me the desired table output and I cannot add timechart command after my stats command as it will limit the columns in the O/P

0 Karma

to4kawa
Ultra Champion

timechart in place of bin _time span=15m it is not giving me the desired table output

your query is wrong.
please fix it.

after my stats this can't work.

0 Karma

rmmiller
Contributor

Neat example for autoregress!

0 Karma

to4kawa
Ultra Champion
0 Karma

poddraj
Explorer

Hi
I couldn't get an idea on how to use this autoregress. I have added it to my query and got below O/P. How can I compare the failure% values of 3 intervals and take decision to alert.

Query:

index=dte_fios sourcetype=dte2_Fios FT=*FT
| eval Status=case(Error_Code=="0000","Success",1=1,"Failure")
| bin _time span=30m
| stats count as totalCount , count(eval(Error_Code="0000")) AS Success, count(eval(Error_Code!="0000")) AS Failure by _time
| eval Failurepercent=round(Failure/totalCount*100)
| autoregress Failurepercent p=1
| autoregress totalCount AS old_total p=1

OUTPUT

_time totalCount Success Failure Failurepercent Failurepercent_p1 old_total

2020-04-20 05:00:00 75 61 14 19

2020-04-20 05:30:00 207 129 78 38 19 75
2020-04-20 06:00:00 25 10 15 60 38 207

0 Karma
Get Updates on the Splunk Community!

Webinar Recap | Revolutionizing IT Operations: The Transformative Power of AI and ML ...

The Transformative Power of AI and ML in Enhancing Observability   In the realm of IT operations, the ...

.conf24 | Registration Open!

Hello, hello! I come bearing good news: Registration for .conf24 is now open!   conf is Splunk’s rad annual ...

ICYMI - Check out the latest releases of Splunk Edge Processor

Splunk is pleased to announce the latest enhancements to Splunk Edge Processor.  HEC Receiver authorization ...