Solved: Trigger Alert and avoid false positives

shashank_24 · ‎02-04-2022

Hi, I've created an alert for one of my main API services. It runs every 30 mins and looks at the failure rate and the number of failed requests. When the threshold (failedRequest > 200 AND failureRate > 10%) is met, it triggers the alert and raises an incident.

Sometimes, within those 30 mins, there is a short blip of 5 mins with a large number of errors while the rest of the window is normal. The alert still fires in that case because the threshold is met. How can I avoid that?

Is it possible to check whether the errors are consistent for, say, 20 or 30 mins, and only then trigger the alert? How can I achieve that?

Here is my sample query. Let me know if anyone can advise on this; it would be immensely helpful.

index=myapp_prod source=myapp "message.logPoint"=OUTGOING_RESPONSE (message.httpResponseCode=50* OR message.httpResponseCode=20*)
| rename message.serviceName as serviceName message.httpResponseCode as httpResponseCode 
| where(serviceName LIKE "my-service") 
| stats count as totalrequests count(eval(httpResponseCode=200)) as successrequest count(eval(httpResponseCode=500 OR httpResponseCode=502 OR httpResponseCode=503)) as failedrequest
| eval Total = successrequest + failedrequest 
| eval failureRatePercentage = round(((failedrequest/Total) * 100),2)
| where failureRatePercentage > 10 AND failedrequest > 200

impurush · ‎02-04-2022

Hi @shashank_24

You can use timechart instead of stats to break the 30 mins down into 5-min buckets, then trigger an alert if the failure rate is greater than 10% for more than 20 mins.

index=myapp_prod source=myapp "message.logPoint"=OUTGOING_RESPONSE (message.httpResponseCode=50* OR message.httpResponseCode=20*)
| rename message.serviceName as serviceName message.httpResponseCode as httpResponseCode 
| where(serviceName LIKE "my-service") 
| timechart span=5m count as totalrequests count(eval(httpResponseCode=200)) as successrequest count(eval(httpResponseCode=500 OR httpResponseCode=502 OR httpResponseCode=503)) as failedrequest
| eval failureRatePercentage = round(((failedrequest/totalrequests) * 100),2)
| where failureRatePercentage > 10 AND failedrequest > 200

Try the query above and, in the trigger condition, choose to trigger the alert if the number of results is greater than 4.

Note: Adjust the failedrequest threshold to suit a 5-min bucket, since 200 was chosen for the full 30 mins.
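
On the note about the threshold: the original limit of 200 failed requests applied to the whole 30-minute window, so it is probably too high for a single 5-minute bucket. A minimal sketch, assuming failures are spread roughly evenly across the window, scales it to 200 / 6 ≈ 33 per bucket. The value 33 and the field name perBucketFailedThreshold are illustrative assumptions, not from the original posts; tune them to your own traffic.

```spl
index=myapp_prod source=myapp "message.logPoint"=OUTGOING_RESPONSE (message.httpResponseCode=50* OR message.httpResponseCode=20*)
| rename message.serviceName as serviceName message.httpResponseCode as httpResponseCode
| where serviceName LIKE "my-service"
| timechart span=5m count as totalrequests count(eval(httpResponseCode=500 OR httpResponseCode=502 OR httpResponseCode=503)) as failedrequest
| eval perBucketFailedThreshold = 33
| eval failureRatePercentage = round((failedrequest / totalrequests) * 100, 2)
| where failureRatePercentage > 10 AND failedrequest > perBucketFailedThreshold
```

With the alert scheduled every 30 mins over the last 30 mins, this returns at most 6 rows, one per breaching bucket, so "number of results greater than 4" means at least 5 of the 6 buckets breached.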

shashank_24 · ‎02-04-2022

Hi @impurush, that's perfect. I think that gives me what I was looking for, but I have one concern. This will obviously output more than 1 row, BUT our ticketing system is designed so that if the alert outputs more than 1 row, it creates more than 1 ticket. So 5 rows means 5 tickets will be raised.

With your query, I get rows only when the errors have persisted for more than 15 mins across regular intervals, so there is definitely a genuine problem and we need to trigger the alert. But how can I reduce that to just one row that still gives information about the alert, such as the failure percentage?

Let me know if that is possible or if it will be too complex.

impurush · ‎02-04-2022

Hi @shashank_24

I have the same scenario in my environment. I selected the trigger option "Once" instead of "For each result", so the alert triggers only once irrespective of the number of rows, and only when more than 4 rows are returned. Hope this solves the problem.
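
If the ticketing integration still opens one ticket per result row, the row count can also be checked inside the search so that only a single summary row comes out. This is only a sketch under that assumption: the field name breachingBuckets is hypothetical, and the summary row reports totals across the breaching buckets rather than the whole window.

```spl
index=myapp_prod source=myapp "message.logPoint"=OUTGOING_RESPONSE (message.httpResponseCode=50* OR message.httpResponseCode=20*)
| rename message.serviceName as serviceName message.httpResponseCode as httpResponseCode
| where serviceName LIKE "my-service"
| timechart span=5m count as totalrequests count(eval(httpResponseCode=500 OR httpResponseCode=502 OR httpResponseCode=503)) as failedrequest
| eval failureRatePercentage = round((failedrequest / totalrequests) * 100, 2)
| where failureRatePercentage > 10 AND failedrequest > 200
| stats count as breachingBuckets sum(failedrequest) as failedrequest sum(totalrequests) as totalrequests
| where breachingBuckets > 4
| eval failureRatePercentage = round((failedrequest / totalrequests) * 100, 2)
```

With this variant the alert's trigger condition would be "number of results greater than 0", because the search itself returns either one row or nothing.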

shashank_24 · ‎02-05-2022

Hi @impurush, thanks so much for the help. I've done it as below and it worked for me.

| eventstats count as rows
| table _time host totalrequests successrequest failedrequest failureRatePercentage rows
| search rows > 4
| sort - failedrequest limit=1
| fields - rows _time successrequest
| eval message= "myapp service is having consistently high failure rates for last 30 minutes"

I just have one more question. timechart works fine when the search is filtered to a single service or API, but it fails when you have multiple services and use a BY clause, like below:

index=myapp_prod source=myapp "message.logPoint"=OUTGOING_RESPONSE (message.httpResponseCode=50* OR message.httpResponseCode=20*)
| rename message.serviceName as serviceName message.httpResponseCode as httpResponseCode 
| timechart span=5m count as totalrequests count(eval(httpResponseCode=200)) as successrequest count(eval(httpResponseCode=500 OR httpResponseCode=502 OR httpResponseCode=503)) as failedrequest BY serviceName
| eval failureRatePercentage = round(((failedrequest/totalrequests) * 100),2)
| where failureRatePercentage > 10 AND failedrequest > 200

Let's say I have 10 critical APIs, and I want to trigger an alert if any of them fails consistently for, say, 30 mins. Is that achievable?

Let me know your thoughts.

impurush · ‎02-07-2022

Hi @shashank_24,

Yes, timechart will fail here. When you use BY in a timechart that has multiple aggregations, each output column is named after the aggregation combined with the API service name, so the eval and where that follow can no longer find fields called failedrequest and totalrequests. You can try something like the below, which uses the bucket command with stats instead of timechart:

index=myapp_prod source=myapp "message.logPoint"=OUTGOING_RESPONSE (message.httpResponseCode=50* OR message.httpResponseCode=20*)
| rename message.serviceName as serviceName message.httpResponseCode as httpResponseCode 
| bucket span=5m _time
| stats count as totalrequests count(eval(httpResponseCode=200)) as successrequest count(eval(httpResponseCode=500 OR httpResponseCode=502 OR httpResponseCode=503)) as failedrequest BY _time,serviceName
| eval failureRatePercentage = round(((failedrequest/totalrequests) * 100),2)
| where failureRatePercentage > 10 AND failedrequest > 200
| eventstats count by serviceName
| where count>4
| fields - count
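
The query above returns one row per breaching bucket per service. If one row per failing service is preferred (for example, when each result row becomes a ticket), a sketch along the following lines replaces the eventstats step with a second stats by serviceName. The field name breachingBuckets is hypothetical, and the 200 and 4 thresholds are carried over unchanged from the thread. Note that this counts breaching buckets within the window; it does not check that they are consecutive.

```spl
index=myapp_prod source=myapp "message.logPoint"=OUTGOING_RESPONSE (message.httpResponseCode=50* OR message.httpResponseCode=20*)
| rename message.serviceName as serviceName message.httpResponseCode as httpResponseCode
| bucket span=5m _time
| stats count as totalrequests count(eval(httpResponseCode=500 OR httpResponseCode=502 OR httpResponseCode=503)) as failedrequest BY _time, serviceName
| eval failureRatePercentage = round((failedrequest / totalrequests) * 100, 2)
| where failureRatePercentage > 10 AND failedrequest > 200
| stats count as breachingBuckets sum(failedrequest) as failedrequest sum(totalrequests) as totalrequests BY serviceName
| where breachingBuckets > 4
| eval failureRatePercentage = round((failedrequest / totalrequests) * 100, 2)
```

Each remaining row then describes one service that breached in more than 4 of the 5-minute buckets, with its combined failure counts and rate across those buckets.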

