Alerting
Highlighted

Conditional Alert based on comparison of results of 30mins window

Explorer

Hi,
My scenario is that I have Counts of Total Requests, Success, Failure & Failure% for time span of every 30 mins over last 2 hours
Let's say first 30mins I got 100 hits and failure% is more than 60% then I want to send an alert immediately but let's say if first 30mins failure% is between 30-50% then I want to see the failure% of previous 30mins and if the failure% of this 30mins is also b/w 30-50% then I want to see one more previous 30mins failure% and if that interval also has same failure% then I want to trigger alert but if the 2nd or 3rd 30min interval has less then 30% failure then I do not want to send alert

I want this alert to be running for every 15 mins. How can I do this in splunk?
I have written below query to get the events for last 2 hours but could not move ahead on the next steps.

index=dtefios sourcetype=dte2Fios FT=FT | eval Interval=strftime('time',"%d-%m-%Y %H:%M:%S")
| eval Status=case(Error
Code=="0000","Success",1=1,"Failure")
| timechart span=30m count by Status
| eval Total = Success + Failure
| eval Failure%=round(Failure/Total
100)
| table _time,Total,Success,Failure,Failure%

Output is below:
_time Total Success Failure Failure%
2020-04-20 05:00:00 75 61 14 19
2020-04-20 05:30:00 207 129 78 38
2020-04-20 06:00:00 25 10 15 60

Labels (1)
0 Karma
Highlighted

Re: Conditional Alert based on comparison of results of 30mins window

Ultra Champion
0 Karma
Highlighted

Re: Conditional Alert based on comparison of results of 30mins window

Explorer

Hi
I couldn't get an idea on how to use this autoregress. I have added it to my query and got below O/P. How can I compare the failure% values of 3 intervals and take decision to alert.

Query:

index=dtefios sourcetype=dte2Fios FT=FT
| eval Status=case(ErrorCode=="0000","Success",1=1,"Failure")
| bin _time span=30m
| stats count as totalCount , count(eval(Error
Code="0000")) AS Success, count(eval(Error_Code!="0000")) AS Failure by _time
| eval Failurepercent=round(Failure/totalCount
100)
| autoregress Failurepercent p=1
| autoregress totalCount AS old_total p=1

OUTPUT

time totalCount Success Failure Failurepercent Failurepercentp1 old_total

2020-04-20 05:00:00 75 61 14 19

2020-04-20 05:30:00 207 129 78 38 19 75
2020-04-20 06:00:00 25 10 15 60 38 207

0 Karma
Highlighted

Re: Conditional Alert based on comparison of results of 30mins window

Ultra Champion
| makeresults
| eval _raw="time,Total,Success,Failure,Failure_perc
2020-04-20 05:00:00,75,61,14,19
2020-04-20 05:30:00,207,129,78,38
2020-04-20 06:00:00,25,10,15,60"
| multikv forceheader=1
| eval _time=strptime(time,"%F %T")
| table _time,Total,Success,Failure,Failure_perc
| rename COMMENT as "this is your result. from here, the logic"
| autoregress Failure_perc p=2 as F_anHourAgo 
| autoregress Failure_perc p=1 as F_30MinsAgo
| eval alert=case(Failure_perc > 60, 1
, (30 <= Failure_perc AND Failure_perc <= 50) AND (30 <= F_30MinsAgo AND F_30MinsAgo <= 50) AND (30 <= F_anHourAgo AND F_anHourAgo <= 50), 1
, true(), 0)

timechart is ascending order by default.
and autoregress can keep old value.

...
| where alert = 1

For alerting.
By the way, between 50% and 60%, do you fire alert?

View solution in original post

Highlighted

Re: Conditional Alert based on comparison of results of 30mins window

Contributor

Neat example for autoregress!

0 Karma
Highlighted

Re: Conditional Alert based on comparison of results of 30mins window

Explorer

Thanks to4kawa...

It worked for me, however I am trying some modification as per my scenario but the query you gave really helped alot. Will come back to you for any further help..

0 Karma
Highlighted

Re: Conditional Alert based on comparison of results of 30mins window

Explorer

Hi to4kawa,
I am trying to handle couple of scenarios based out of above query..
I want to alert the transaction details of failures which fall into alert=1 category and have written below query to achieve it

index=dtefios sourcetype=dte2Fios FT=FT ErrorCode!=0000 earliest=04/20/2020:11:00:00 latest=04/20/2020:13:00:00
| bin _time span=15m
| stats count as Total, count(eval(Error
Code!="0000")) AS Failure by FT,_time
| eval Failurepercent=round(Failure/Total
100)
| multikv forceheader=1
| table time,FT,Total,Failure,Failurepercent
| autoregress Failurepercent p=3 as F
45MinsAgo
| autoregress Failurepercent p=2 as F30MinsAgo
| autoregress Failurepercent p=1 as F
15MinsAgo
| lookup ftthresholdlkp FT
| eval alert=case(Minhits<=Total AND Total<=Maxhits AND Failurepercent > FailureThreshold1, 1
, (Minhits<=Total AND Total<=Maxhits) AND (Failure
Threshold2 <= Failurepercent AND Failurepercent <= FailureThreshold3) AND (FailureThreshold2 <= F15MinsAgo AND F15MinsAgo <= FailureThreshold3) AND (FailureThreshold2 <= F30MinsAgo AND F30MinsAgo <= FailureThreshold3) AND (FailureThreshold2 <= F45MinsAgo AND F45MinsAgo <= FailureThreshold3), 1
, (Minhits<=Total AND Total<=Maxhits) AND (Failure
Threshold3 <= Failurepercent AND Failurepercent <= FailureThreshold4) AND (FailureThreshold3 <= F15MinsAgo AND F15MinsAgo <= FailureThreshold4) AND (FailureThreshold3 <= F30MinsAgo AND F30MinsAgo <= FailureThreshold4) ,1,(Minhits<=Total AND Total<=Maxhits) AND (FailureThreshold4 <= Failurepercent AND Failurepercent <= FailureThreshold5) AND (FailureThreshold4 <= F15MinsAgo AND F15MinsAgo <= FailureThreshold5),1,true(), 0)
| where alert=1
| map search="search index=dte
fios sourcetype=dte2Fios FT=$FT$ earliest=04/20/2020:12:45:00 latest=04/20/2020:13:00:00 |table _time,WPID,MGRID,Host,System,DIPCommand,CID,DTECommand,FT,OSS,ErrorCode,Error_Msg"

If you observe line 3 of my query where I did stats by FT,_time.. this is required for me because I want to compare the failures for every 15mins for each domain in my application and whichever domain has highest failures I want to send those transaction details of that domain over the last 15mins in my alert email which you can see in my map query.

I have 2 doubts here:
1. If I get multiple domain satisfying the alert condition how can I get only the top domain from the where output so that I can use only that FT in my second query.
2. If suppose my alert is running at 1 PM and I have records during 12:45 - 1 PM but do not have records during 12:30 - 12:45 PM but have records during 12:15 - 12:30 PM then this F_15Minago is showing the Failurepercent value of duration 12:15 - 12:30 PM because I do not have row for 12:30-12:45 PM interval.. How can I handle this as this is incorrect

0 Karma
Highlighted

Re: Conditional Alert based on comparison of results of 30mins window

Ultra Champion
  1. what's top domain ? you can use lookup. so , use it and select with where
  2. Alert is scheduled search. select appropriate time picker and add where _time > as_you_kike you can use addinfo and relative_time()
0 Karma
Highlighted

Re: Conditional Alert based on comparison of results of 30mins window

Explorer

What I meant by top domain is if I get two domains satisfying alert=1 condition how can I get the domain with high failure percent as input to my second query i.e. $FT$ value as I want to send top failing domain details only in alert email.

  1. Let us say my alert is running at 13:00PM then my req is to check for any domain crossing failure threshold during last 15 mins i.e. 12:45-1300 and I was asked to send the Total & Failures count over previous 4 15mins interval in alert email along with latest 15 mins interval. What I observed is if count is zero for 12:30-12:45 and >0 for other intervals (i.e. 12:15:12:30 & 12:00-12:15) then above query is not returning a row with that interval due to which the failurepercent value of F_15minsago is showing the count of interval 12:15-12:30 which should be the value of F-30minsago. Is the a way I can get row for every 15mins interval even though that interval doesn't have count for that domain
0 Karma
Highlighted

Re: Conditional Alert based on comparison of results of 30mins window

Ultra Champion

how can I get the domain with high failure percent

sort - Failurepercent | head 1

  1. Let us ...

| bin _time span=15m | stats count as Total, count(eval(Error_Code!="0000")) AS Failure by FT,_time

use timechart span=15min

0 Karma
Speak Up for Splunk Careers!

We want to better understand the impact Splunk experience and expertise has has on individuals' careers, and help highlight the growing demand for Splunk skills.