Hi,
My scenario is that I have Counts of Total Requests, Success, Failure & Failure% for time span of every 30 mins over last 2 hours
Let's say first 30mins I got 100 hits and failure% is more than 60% then I want to send an alert immediately but let's say if first 30mins failure% is between 30-50% then I want to see the failure% of previous 30mins and if the failure% of this 30mins is also b/w 30-50% then I want to see one more previous 30mins failure% and if that interval also has same failure% then I want to trigger alert but if the 2nd or 3rd 30min interval has less then 30% failure then I do not want to send alert
I want this alert to be running for every 15 mins. How can I do this in splunk?
I have written below query to get the events for last 2 hours but could not move ahead on the next steps.
index=dte_fios sourcetype=dte2_Fios FT=*FT | eval Interval=strftime('_time',"%d-%m-%Y %H:%M:%S")
| eval Status=case(Error_Code=="0000","Success",1=1,"Failure")
| timechart span=30m count by Status
| eval Total = Success + Failure
| eval Failure%=round(Failure/Total*100)
| table _time,Total,Success,Failure,Failure%
Output is below:
_time Total Success Failure Failure%
2020-04-20 05:00:00 75 61 14 19
2020-04-20 05:30:00 207 129 78 38
2020-04-20 06:00:00 25 10 15 60
| makeresults
| eval _raw="time,Total,Success,Failure,Failure_perc
2020-04-20 05:00:00,75,61,14,19
2020-04-20 05:30:00,207,129,78,38
2020-04-20 06:00:00,25,10,15,60"
| multikv forceheader=1
| eval _time=strptime(time,"%F %T")
| table _time,Total,Success,Failure,Failure_perc
| rename COMMENT as "this is your result. from here, the logic"
| autoregress Failure_perc p=2 as F_anHourAgo
| autoregress Failure_perc p=1 as F_30MinsAgo
| eval alert=case(Failure_perc > 60, 1
, (30 <= Failure_perc AND Failure_perc <= 50) AND (30 <= F_30MinsAgo AND F_30MinsAgo <= 50) AND (30 <= F_anHourAgo AND F_anHourAgo <= 50), 1
, true(), 0)
timechart
is ascending order by default.
and autoregress
can keep old value.
...
| where alert = 1
For alerting.
By the way, between 50% and 60%, do you fire alert?
| makeresults
| eval _raw="time,Total,Success,Failure,Failure_perc
2020-04-20 05:00:00,75,61,14,19
2020-04-20 05:30:00,207,129,78,38
2020-04-20 06:00:00,25,10,15,60"
| multikv forceheader=1
| eval _time=strptime(time,"%F %T")
| table _time,Total,Success,Failure,Failure_perc
| rename COMMENT as "this is your result. from here, the logic"
| autoregress Failure_perc p=2 as F_anHourAgo
| autoregress Failure_perc p=1 as F_30MinsAgo
| eval alert=case(Failure_perc > 60, 1
, (30 <= Failure_perc AND Failure_perc <= 50) AND (30 <= F_30MinsAgo AND F_30MinsAgo <= 50) AND (30 <= F_anHourAgo AND F_anHourAgo <= 50), 1
, true(), 0)
timechart
is ascending order by default.
and autoregress
can keep old value.
...
| where alert = 1
For alerting.
By the way, between 50% and 60%, do you fire alert?
Thanks to4kawa...
It worked for me, however I am trying some modification as per my scenario but the query you gave really helped alot. Will come back to you for any further help..
Hi to4kawa,
I am trying to handle couple of scenarios based out of above query..
I want to alert the transaction details of failures which fall into alert=1 category and have written below query to achieve it
index=dte_fios sourcetype=dte2_Fios FT=*FT Error_Code!=0000 earliest=04/20/2020:11:00:00 latest=04/20/2020:13:00:00
| bin _time span=15m
| stats count as Total, count(eval(Error_Code!="0000")) AS Failure by FT,_time
| eval Failurepercent=round(Failure/Total*100)
| multikv forceheader=1
| table _time,FT,Total,Failure,Failurepercent
| autoregress Failurepercent p=3 as F_45MinsAgo
| autoregress Failurepercent p=2 as F_30MinsAgo
| autoregress Failurepercent p=1 as F_15MinsAgo
| lookup ftthresholdlkp FT
| eval alert=case(Minhits<=Total AND Total<=Maxhits AND Failurepercent > Failure_Threshold1, 1
, (Minhits<=Total AND Total<=Maxhits) AND (Failure_Threshold2 <= Failurepercent AND Failurepercent <= Failure_Threshold3) AND (Failure_Threshold2 <= F_15MinsAgo AND F_15MinsAgo <= Failure_Threshold3) AND (Failure_Threshold2 <= F_30MinsAgo AND F_30MinsAgo <= Failure_Threshold3) AND (Failure_Threshold2 <= F_45MinsAgo AND F_45MinsAgo <= Failure_Threshold3), 1
, (Minhits<=Total AND Total<=Maxhits) AND (Failure_Threshold3 <= Failurepercent AND Failurepercent <= Failure_Threshold4) AND (Failure_Threshold3 <= F_15MinsAgo AND F_15MinsAgo <= Failure_Threshold4) AND (Failure_Threshold3 <= F_30MinsAgo AND F_30MinsAgo <= Failure_Threshold4) ,1,(Minhits<=Total AND Total<=Maxhits) AND (Failure_Threshold4 <= Failurepercent AND Failurepercent <= Failure_Threshold5) AND (Failure_Threshold4 <= F_15MinsAgo AND F_15MinsAgo <= Failure_Threshold5),1,true(), 0)
| where alert=1
| map search="search index=dte_fios sourcetype=dte2_Fios FT=$FT$ earliest=04/20/2020:12:45:00 latest=04/20/2020:13:00:00 |table _time,WPID,MGRID,Host,System,DIP_Command,CID,DTE_Command,FT,OSS,Error_Code,Error_Msg"
If you observe line 3 of my query where I did stats by FT,_time.. this is required for me because I want to compare the failures for every 15mins for each domain in my application and whichever domain has highest failures I want to send those transaction details of that domain over the last 15mins in my alert email which you can see in my map query.
I have 2 doubts here:
1. If I get multiple domain satisfying the alert condition how can I get only the top domain from the where output so that I can use only that FT in my second query.
2. If suppose my alert is running at 1 PM and I have records during 12:45 - 1 PM but do not have records during 12:30 - 12:45 PM but have records during 12:15 - 12:30 PM then this F_15Minago is showing the Failurepercent value of duration 12:15 - 12:30 PM because I do not have row for 12:30-12:45 PM interval.. How can I handle this as this is incorrect
top domain
? you can use lookup
. so , use it and select with where
where _time > as_you_kike
you can use addinfo
and relative_time()
What I meant by top domain is if I get two domains satisfying alert=1 condition how can I get the domain with high failure percent as input to my second query i.e. $FT$ value as I want to send top failing domain details only in alert email.
how can I get the domain with high failure percent
sort - Failurepercent | head 1
- Let us ...
| bin _time span=15m | stats count as Total, count(eval(Error_Code!="0000")) AS Failure by FT,_time
use timechart span=15min
Hi to4kawa,
When I use timechart in place of bin _time span=15m it is not giving me the desired table output and I cannot add timechart command after my stats command as it will limit the columns in the O/P
timechart in place of bin _time span=15m it is not giving me the desired table output
your query is wrong.
please fix it.
after my stats
this can't work.
Neat example for autoregress!
use autoregress
to Failure%
https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Autoregress
Hi
I couldn't get an idea on how to use this autoregress. I have added it to my query and got below O/P. How can I compare the failure% values of 3 intervals and take decision to alert.
index=dte_fios sourcetype=dte2_Fios FT=*FT
| eval Status=case(Error_Code=="0000","Success",1=1,"Failure")
| bin _time span=30m
| stats count as totalCount , count(eval(Error_Code="0000")) AS Success, count(eval(Error_Code!="0000")) AS Failure by _time
| eval Failurepercent=round(Failure/totalCount*100)
| autoregress Failurepercent p=1
| autoregress totalCount AS old_total p=1
_time totalCount Success Failure Failurepercent Failurepercent_p1 old_total
2020-04-20 05:00:00 75 61 14 19
2020-04-20 05:30:00 207 129 78 38 19 75
2020-04-20 06:00:00 25 10 15 60 38 207