Splunk Search

Search to check if request duration search has recovered

guywood13
Path Finder

 

index=my_index source="/var/log/nginx/access.log"
| stats avg(request_time) as Average_Request_Time 
| where Average_Request_Time >1

 

I have this query set up as an alert that fires when my web app's average request duration goes over 1 second; it searches back over a 30-minute window.

I want to know when this alert has recovered.  So I guess I'd effectively run this query twice: against the first 30 minutes of an hour, then the second 30 minutes, and get back a result I can alert on.  The result would indicate that the average duration was over 1 second in the first 30 minutes and under 1 second in the second 30 minutes, and thus that it recovered.

I have no idea where to start with this!  But I do want to keep the alert query above as my main alert for an issue, and have a second alert query for this recovery element.  Hope this is possible.


guywood13
Path Finder

Thanks again @ITWhisperer.  Is there any way to restrict the query to the previous two time bins?  The cron scheduler doesn't fire exactly on the hour, so I'm getting 3 bins as you said.  If a run at 1:05pm could pick up the 12:30-45 and 12:45-1:00 bins, I think that would work well.


ITWhisperer
SplunkTrust
index=my_index source="/var/log/nginx/access.log"
    [| makeresults
    | addinfo
    | bin info_min_time as earliest span=15m
    | bin info_max_time as latest span=15m
    | table earliest latest]
| bin _time span=15m
| stats avg(request_time) as Average_Request_Time by _time
| streamstats count as weight
| eval alert=if(Average_Request_Time>1,weight,0)
| stats sum(alert) as alert
| where alert==1
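To see why this restricts the results to whole bins: `addinfo` exposes the search window's boundaries as `info_min_time`/`info_max_time`, and binning each of those to the 15-minute span floors both ends to bin boundaries, so a run that fires a few minutes late still covers exactly two complete bins. A rough Python sketch of that snapping (the 1:05pm run time and 30-minute lookback are hypothetical, not from the query itself):

```python
# Sketch of what the makeresults/addinfo subsearch achieves: snap both
# ends of the search window down to 15-minute boundaries so exactly two
# full bins are covered even when cron fires a few minutes late.
# Times are expressed as seconds-of-day for simplicity.
SPAN = 15 * 60  # 15-minute bin span, matching span=15m

def snap_window(earliest, latest, span=SPAN):
    """Floor both window ends to the bin span, as
    `bin info_min_time` / `bin info_max_time` do."""
    return earliest - earliest % span, latest - latest % span

# Hypothetical run at 13:05 with a 30-minute lookback:
run_time = 13 * 3600 + 5 * 60                       # 13:05:00
earliest, latest = snap_window(run_time - 30 * 60, run_time)
# earliest snaps 12:35 -> 12:30, latest snaps 13:05 -> 13:00,
# giving the two complete 12:30-12:45 and 12:45-13:00 bins.
```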

guywood13
Path Finder

Absolutely perfect, thank you!


yuanliu
SplunkTrust

Use the /services/search/jobs REST API.  When you set up an alert, you must have a saved search.  Assuming you give it the name "My alarming alert: Everybody panic!", the following search will tell you when the last alert happened and when the first clear occurred. (It will display the time of the latest clean search if the last alert has already expired.)

| rest /services/search/jobs
| where isDone = 1 AND label == "My alarming alert: Everybody panic!"
| fields updated resultCount label
| eval _time = strptime(updated, "%FT%T.%3N%z")
| transaction startswith="resultCount>0" endswith="resultCount=0" keeporphans=1
| fields - _*
| where closed_txn == 1 OR resultCount == 0
| eval last_alert_count = max(resultCount)
| eval last_alert_time = min(updated)
| fields label last_alert_time last_alert_count
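The core of this approach is the `transaction` step: it stitches each breaching run (`resultCount>0`) together with the clean run (`resultCount=0`) that follows it, so a closed transaction marks a recovery. A rough Python sketch of that pairing logic, with entirely hypothetical run records:

```python
# Sketch of the transaction pairing: walk saved-search runs oldest-first
# and pair each breach (result_count > 0) with the first clean run
# (result_count == 0) that follows it. Run records are hypothetical
# (time, result_count) tuples, not real REST output.
def alert_spans(runs):
    """Yield (alert_time, clear_time) pairs from runs sorted
    oldest-first; clear_time is None for a still-open alert."""
    start = None
    for t, count in runs:
        if count > 0 and start is None:
            start = t           # analogous to startswith="resultCount>0"
        elif count == 0 and start is not None:
            yield start, t      # analogous to endswith="resultCount=0"
            start = None
    if start is not None:
        yield start, None       # breach with no clear yet

list(alert_spans([(1, 0), (2, 3), (3, 5), (4, 0), (5, 2)]))
# -> [(2, 4), (5, None)]: one recovered alert, one still open
```

Note this is only an illustration of the pairing idea; Splunk's `transaction` has its own event-ordering and field semantics.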

 


ITWhisperer
SplunkTrust

 

Set this alert to run every 30 minutes looking back for 1 hour

index=my_index source="/var/log/nginx/access.log"
| stats avg(request_time) as Average_Request_Time 
| streamstats count as weight
| eval alert=if(Average_Request_Time>1,weight,0)
| stats sum(alert) as alert
| where alert==1

 


guywood13
Path Finder

Thanks @ITWhisperer, but this doesn't seem to work.  I've simulated the average request time being over 1 second in the logs, and this search returns alert=1 straight away, when I'd want it returned only when searching the second time window, to say that we'd actually recovered from the high request time.

Can you explain what is happening from streamstats onwards, as I can't get my head round it?  I don't get how this separates the two time windows.  I've been running the search manually looking back 30 minutes and it just returns alert=1 every time.

FYI, I initially got my time window wrong: it is actually checking every 15-minute window, so I'd want to compare the two 15-minute windows over the last 30 minutes to see if it has recovered.  I don't think this makes a difference to the query, though.

 


ITWhisperer
SplunkTrust

My mistake - I missed out the time bin part.  Try this over the previous 30 minutes for 15-minute groups - you may need to align your time period so that you only get two 15-minute bins:

index=my_index source="/var/log/nginx/access.log"
| bin _time span=15m
| stats avg(request_time) as Average_Request_Time by _time
| streamstats count as weight
| eval alert=if(Average_Request_Time>1,weight,0)
| stats sum(alert) as alert
| where alert==1
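To unpack the streamstats part: each time bin, oldest first, gets a "weight" equal to its position (1, 2, ...). A breaching bin contributes its weight to the sum; a healthy bin contributes 0. A total of exactly 1 can therefore only mean the oldest bin breached and every later bin was healthy, i.e. the alert has recovered. A minimal Python sketch of the same logic (the threshold and sample averages are made up):

```python
# Sketch of the streamstats/weight trick: bins ordered oldest-first,
# each contributing its 1-based position if it breached the threshold.
def recovered(bin_averages, threshold=1.0):
    """Return True when only the FIRST bin breached the threshold,
    mirroring `streamstats count as weight | eval alert=... | stats
    sum(alert) | where alert==1` in the SPL above."""
    total = sum(weight if avg > threshold else 0
                for weight, avg in enumerate(bin_averages, start=1))
    return total == 1

recovered([1.4, 0.6])  # breach then recovery -> True (sum = 1)
recovered([1.4, 1.2])  # still breaching -> False (sum = 1 + 2 = 3)
recovered([0.6, 0.5])  # never breached -> False (sum = 0)
```

This is also why the bin alignment matters: a stray third bin at the end can only keep the sum at 1 if it is healthy, but a stray breaching bin at the start would break the "weight 1" assumption.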