Splunk Search

Search to check if request duration search has recovered

guywood13
Path Finder

 

index=my_index source="/var/log/nginx/access.log"
| stats avg(request_time) as Average_Request_Time 
| where Average_Request_Time >1

 

I have this query set up as an alert that fires when my web app's average request duration goes over 1 second; it searches back over a 30-minute window.

I want to know when this alert has recovered.  So I guess I'd effectively run this query twice: against the first 30 minutes of an hour, then the second 30 minutes, and get back a result I can alert on.  The result would indicate that the average duration was over 1 second in the first 30 minutes and under 1 second in the second 30 minutes, and thus that it recovered.

I have no idea where to start with this!  But I do want to keep the alert query above as my main alert for an issue, and have a second alert query for this recovery element.  Hope this is possible.


guywood13
Path Finder

Thanks again @ITWhisperer.  Is there any way to restrict the query to the previous two time bins?  The cron scheduler doesn't fire exactly on the hour, so I'm getting 3 bins as you said.  If a run at 1:05pm could pick up the 12:30-45 and 12:45-1:00 bins, I think that would work well.


ITWhisperer
SplunkTrust
index=my_index source="/var/log/nginx/access.log"
    [| makeresults
    | addinfo
    | bin info_min_time as earliest span=15m
    | bin info_max_time as latest span=15m
    | table earliest latest]
| bin _time span=15m
| stats avg(request_time) as Average_Request_Time by _time
| streamstats count as weight
| eval alert=if(Average_Request_Time>1,weight,0)
| stats sum(alert) as alert
| where alert==1
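To see why this restricts the results to whole bins: `addinfo` exposes the search window's boundaries as `info_min_time`/`info_max_time`, and binning each of those to the 15-minute span floors both ends to bin boundaries, so a run that fires a few minutes late still covers exactly two complete bins. A rough Python sketch of that snapping (the 1:05pm run time and 30-minute lookback are hypothetical, not from the query itself):

```python
# Sketch of what the makeresults/addinfo subsearch achieves: snap both
# ends of the search window down to 15-minute boundaries so exactly two
# full bins are covered even when cron fires a few minutes late.
# Times are expressed as seconds-of-day for simplicity.
SPAN = 15 * 60  # 15-minute bin span, matching span=15m

def snap_window(earliest, latest, span=SPAN):
    """Floor both window ends to the bin span, as
    `bin info_min_time` / `bin info_max_time` do."""
    return earliest - earliest % span, latest - latest % span

# Hypothetical run at 13:05 with a 30-minute lookback:
run_time = 13 * 3600 + 5 * 60                       # 13:05:00
earliest, latest = snap_window(run_time - 30 * 60, run_time)
# earliest snaps 12:35 -> 12:30, latest snaps 13:05 -> 13:00,
# giving the two complete 12:30-12:45 and 12:45-13:00 bins.
```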

guywood13
Path Finder

Absolutely perfect, thank you!


yuanliu
SplunkTrust

Use the /services/search/jobs REST API.  When you set up an alert, you must have a saved search.  Assuming you give it the name "My alarming alert: Everybody panic!", the following search will tell you when the last alert happened and when the first clear occurred. (It will display the time of the latest clean search if the last alert has already expired.)

| rest /services/search/jobs
| where isDone = 1 AND label == "My alarming alert: Everybody panic!"
| fields updated resultCount label
| eval _time = strptime(updated, "%FT%T.%3N%z")
| transaction startswith="resultCount>0" endswith="resultCount=0" keeporphans=1
| fields - _*
| where closed_txn == 1 OR resultCount == 0
| eval last_alert_count = max(resultCount)
| eval last_alert_time = min(updated)
| fields label last_alert_time last_alert_count
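The core of this approach is the `transaction` step: it stitches each breaching run (`resultCount>0`) together with the clean run (`resultCount=0`) that follows it, so a closed transaction marks a recovery. A rough Python sketch of that pairing logic, with entirely hypothetical run records:

```python
# Sketch of the transaction pairing: walk saved-search runs oldest-first
# and pair each breach (result_count > 0) with the first clean run
# (result_count == 0) that follows it. Run records are hypothetical
# (time, result_count) tuples, not real REST output.
def alert_spans(runs):
    """Yield (alert_time, clear_time) pairs from runs sorted
    oldest-first; clear_time is None for a still-open alert."""
    start = None
    for t, count in runs:
        if count > 0 and start is None:
            start = t           # analogous to startswith="resultCount>0"
        elif count == 0 and start is not None:
            yield start, t      # analogous to endswith="resultCount=0"
            start = None
    if start is not None:
        yield start, None       # breach with no clear yet

list(alert_spans([(1, 0), (2, 3), (3, 5), (4, 0), (5, 2)]))
# -> [(2, 4), (5, None)]: one recovered alert, one still open
```

Note this is only an illustration of the pairing idea; Splunk's `transaction` has its own event-ordering and field semantics.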

 


ITWhisperer
SplunkTrust

 

Set this alert to run every 30 minutes looking back for 1 hour

index=my_index source="/var/log/nginx/access.log"
| stats avg(request_time) as Average_Request_Time 
| streamstats count as weight
| eval alert=if(Average_Request_Time>1,weight,0)
| stats sum(alert) as alert
| where alert==1

 


guywood13
Path Finder

Thanks @ITWhisperer, but this doesn't seem to work.  I've simulated the average request time being over 1 second in the logs, and this search returns alert=1 straight away, when I'd want it returned only when searching the second time window, to say that we'd actually recovered from the high request time.

Can you explain what is happening from streamstats onwards, as I can't get my head round it?  I don't get how this separates the two time windows.  I've been running the search manually looking back 30 minutes and it just returns alert=1 every time.

FYI, I initially got my time window wrong: it is actually checking every 15-minute window, so I'd want to compare the two 15-minute windows over the last 30 minutes to see if it has recovered.  I don't think this makes a difference to the query, though.

 


ITWhisperer
SplunkTrust

My mistake - I missed out the time bin part.  Try this over the previous 30 minutes for 15-minute groups - you may need to align your time period so that you only get two 15-minute bins:

index=my_index source="/var/log/nginx/access.log"
| bin _time span=15m
| stats avg(request_time) as Average_Request_Time by _time
| streamstats count as weight
| eval alert=if(Average_Request_Time>1,weight,0)
| stats sum(alert) as alert
| where alert==1
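To unpack the streamstats part: each time bin, oldest first, gets a "weight" equal to its position (1, 2, ...). A breaching bin contributes its weight to the sum; a healthy bin contributes 0. A total of exactly 1 can therefore only mean the oldest bin breached and every later bin was healthy, i.e. the alert has recovered. A minimal Python sketch of the same logic (the threshold and sample averages are made up):

```python
# Sketch of the streamstats/weight trick: bins ordered oldest-first,
# each contributing its 1-based position if it breached the threshold.
def recovered(bin_averages, threshold=1.0):
    """Return True when only the FIRST bin breached the threshold,
    mirroring `streamstats count as weight | eval alert=... | stats
    sum(alert) | where alert==1` in the SPL above."""
    total = sum(weight if avg > threshold else 0
                for weight, avg in enumerate(bin_averages, start=1))
    return total == 1

recovered([1.4, 0.6])  # breach then recovery -> True (sum = 1)
recovered([1.4, 1.2])  # still breaching -> False (sum = 1 + 2 = 3)
recovered([0.6, 0.5])  # never breached -> False (sum = 0)
```

This is also why the bin alignment matters: a stray third bin at the end can only keep the sum at 1 if it is healthy, but a stray breaching bin at the start would break the "weight 1" assumption.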