Splunk Enterprise

Trigger Alert and avoid false positives

shashank_24
Path Finder

Hi, I've created an alert for one of my main API services. It runs every 30 minutes, looks at the failure rate and the number of failed requests, and if both exceed their thresholds (failedrequest > 200 AND failure rate > 10%), it triggers the alert and raises an incident.

Sometimes, within those 30 minutes, there is a short blip of about 5 minutes with a large number of errors while the rest of the window is normal. The alert still fires in that case because the totals meet the threshold. How can I avoid that?

Is it possible to check whether the errors are consistent over, say, 20 or 30 minutes, and trigger the alert only then? How can I achieve that?


Here is my sample query. Let me know if anyone can advise on this; it would be immensely helpful.

index=myapp_prod source=myapp "message.logPoint"=OUTGOING_RESPONSE (message.httpResponseCode=50* OR message.httpResponseCode=20*)
| rename message.serviceName as serviceName message.httpResponseCode as httpResponseCode 
| where(serviceName LIKE "my-service") 
| stats count as totalrequests count(eval(httpResponseCode=200)) as successrequest count(eval(httpResponseCode=500 OR httpResponseCode=502 OR httpResponseCode=503)) as failedrequest
| eval Total = successrequest + failedrequest 
| eval failureRatePercentage = round(((failedrequest/Total) * 100),2)
| where failureRatePercentage > 10 AND failedrequest > 200

 


impurush
Contributor

Hi @shashank_24 

You can use timechart instead of stats to break the 30-minute window into 5-minute buckets, and then trigger the alert only if the failure rate is above 10% in more than four of those buckets, that is, for more than 20 minutes.

 

index=myapp_prod source=myapp "message.logPoint"=OUTGOING_RESPONSE (message.httpResponseCode=50* OR message.httpResponseCode=20*)
| rename message.serviceName as serviceName message.httpResponseCode as httpResponseCode 
| where(serviceName LIKE "my-service") 
| timechart span=5m count as totalrequests count(eval(httpResponseCode=200)) as successrequest count(eval(httpResponseCode=500 OR httpResponseCode=502 OR httpResponseCode=503)) as failedrequest
| eval failureRatePercentage = round(((failedrequest/totalrequests) * 100),2)
| where failureRatePercentage > 10 AND failedrequest > 200

 

Try the query above and, in the alert's trigger condition, choose to trigger when the number of results is greater than 4.

Note: the failedrequest > 200 threshold now applies to each 5-minute bucket rather than to the whole 30 minutes, so adjust it to suit a 5-minute interval.
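
One thing to be aware of: the row-count condition counts breaching buckets anywhere in the window, not necessarily back to back. If the buckets need to be consecutive, a rough sketch using streamstats might look like the following. The field names breach and breachesInLast5 are made up for illustration, and the per-bucket threshold of 30 is only a placeholder to be tuned against real traffic.

```spl
index=myapp_prod source=myapp "message.logPoint"=OUTGOING_RESPONSE (message.httpResponseCode=50* OR message.httpResponseCode=20*)
| rename message.serviceName as serviceName message.httpResponseCode as httpResponseCode
| where(serviceName LIKE "my-service")
| timechart span=5m count as totalrequests count(eval(httpResponseCode=500 OR httpResponseCode=502 OR httpResponseCode=503)) as failedrequest
| eval failureRatePercentage = round(((failedrequest/totalrequests) * 100),2)
| eval breach = if(failureRatePercentage > 10 AND failedrequest > 30, 1, 0)
| streamstats window=5 sum(breach) as breachesInLast5
| where breachesInLast5 = 5
```

Here streamstats sums the breach flag over the current bucket and the four before it, so a row survives the final where only when five buckets in a row breached. With this variant the trigger condition would be "number of results is greater than 0".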

shashank_24
Path Finder

Hi @impurush, that's perfect. I think that gives me what I was looking for, but I have one concern. This will obviously output more than one row, and our ticketing system creates one ticket per row of alert output. So 5 rows would raise 5 tickets.

With your query I can tell when the error has persisted across regular intervals for more than 15 minutes, which means there is a genuine problem and the alert should fire. But how can I reduce the output to a single row that still carries information about the alert, such as the failure percentage?

Let me know whether that is possible or whether it would be too complex.


impurush
Contributor

Hi @shashank_24 

I have the same scenario in my environment. I selected the trigger option "Once" instead of "For each result", so the alert triggers only once regardless of the number of rows, and only when more than 4 rows are returned. Hope this solves the problem.


shashank_24
Path Finder

Hi @impurush, thanks so much for the help. I added the following to the end of the query and it worked for me.

| eventstats count as rows
| table _time host totalrequests successrequest failedrequest failureRatePercentage rows
| search rows > 4
| sort - failedrequest limit=1
| fields - rows _time successrequest
| eval message= "myapp service is having consistently high failure rates for last 30 minutes"
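
For comparison, another way to get a single row is to collapse the breaching buckets with stats, placed after the where step of the timechart query. This is only a sketch: breachingBuckets and peakFailureRate are illustrative field names, not anything defined elsewhere in this thread.

```spl
| stats count as breachingBuckets max(failureRatePercentage) as peakFailureRate sum(failedrequest) as failedrequest sum(totalrequests) as totalrequests
| where breachingBuckets > 4
| eval message = "myapp service is having consistently high failure rates for last 30 minutes"
```

Unlike keeping only the worst bucket, this reports the totals across all breaching buckets together with the peak failure rate, and it returns nothing at all when four or fewer buckets breached.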

 

I have one more question. timechart works fine when the search is filtered to a single service or API, but it fails when there are multiple services and you use a BY clause, as below:

index=myapp_prod source=myapp "message.logPoint"=OUTGOING_RESPONSE (message.httpResponseCode=50* OR message.httpResponseCode=20*)
| rename message.serviceName as serviceName message.httpResponseCode as httpResponseCode 
| timechart span=5m count as totalrequests count(eval(httpResponseCode=200)) as successrequest count(eval(httpResponseCode=500 OR httpResponseCode=502 OR httpResponseCode=503)) as failedrequest BY serviceName
| eval failureRatePercentage = round(((failedrequest/totalrequests) * 100),2)
| where failureRatePercentage > 10 AND failedrequest > 200

Let's say I have 10 critical APIs, and I want to trigger an alert if any of them fails consistently for, say, 30 minutes. Is that achievable?

Let me know your thoughts.


impurush
Contributor

Hi @shashank_24,

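Yes, timechart will fail here. When you use BY together with more than one aggregation, the result columns are named after both the aggregation and the service name, so plain fields such as failedrequest and totalrequests no longer exist for the eval and where steps. I tried something like the query below, which you can try as well. Instead of timechart it uses the bucket command followed by stats, which keeps one row per time bucket and service.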

index=myapp_prod source=myapp "message.logPoint"=OUTGOING_RESPONSE (message.httpResponseCode=50* OR message.httpResponseCode=20*)
| rename message.serviceName as serviceName message.httpResponseCode as httpResponseCode 
| bucket span=5m _time
| stats count as totalrequests count(eval(httpResponseCode=200)) as successrequest count(eval(httpResponseCode=500 OR httpResponseCode=502 OR httpResponseCode=503)) as failedrequest BY _time,serviceName
| eval failureRatePercentage = round(((failedrequest/totalrequests) * 100),2)
| where failureRatePercentage > 10 AND failedrequest > 200
| eventstats count by serviceName
| where count>4
| fields - count
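
If the ticketing system raises one ticket per result row, the query above would still return several rows for each failing service. A possible variation, offered as a sketch rather than a tested solution, replaces the final eventstats, where and fields steps with a stats that collapses to one row per service. The names breachingBuckets and peakFailureRate are illustrative.

```spl
| stats count as breachingBuckets max(failureRatePercentage) as peakFailureRate sum(failedrequest) as failedrequest BY serviceName
| where breachingBuckets > 4
```

With the "For each result" trigger option this should give one ticket per affected service, and with "Once" a single ticket covering all of them.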

 
