Smoothing a running average

Esperteyu

Hi,

So what I've been trying to do lately is create an alert on top of an errors/total ratio. The option I've focused on for the moment (not that I wouldn't like something more accurate if I can get it) is to alert when the error rate over the last 10 minutes exceeds the error rate over the last 24 hours by more than 20 percentage points.
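
To make "percentage points" concrete, with invented numbers: if an acquirer's 24-hour error rate is 5% and its last 10-minute bucket is at 27%, then 27 - 5 = 22 > 20 and it should alert; at 24% the difference is only 19 points, so it shouldn't.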

For that I tried to use trendline, with no luck, as I need a by clause (as per https://answers.splunk.com/answers/692424/trendline-to-work-grouped-by-field.html), then tried a few other things and eventually came up with "something".
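
For reference, the trendline attempt looked roughly like this (assuming error is already extracted as in the full query below; sma144 because 144 ten-minute buckets cover 24 hours, and the field names here are mine, for illustration):

index="logger" "Raw Notification"
| bin _time span=10m
| stats sum(error) as errors count as total by _time
| eval error_rate=100*errors/total
| trendline sma144(error_rate) as smoothed_24h_rate

Since trendline has no by clause, this collapses every acquirer into a single series, which is what killed it for me.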
Then I started thinking about smoothing the 24-hour average so that spikes wouldn't go undetected, and found the outlier command, which I've used in a very naive way, so the query I have at the moment is this one:

index="logger" "Raw Notification" 
| bin _time span=10m
| eval _raw=replace(_raw,"\\\\\"","\"") 
| rex "\"RawRequest\":\"(?<raw_request>.+)\"}$" 
| eval json= raw_request 
| spath input=json output=country_code path=customer.billingAddress.countryCode 
| spath input=json output=card_scheme path=paymentMethod.card.cardScheme 
| spath input=json output=acquirer_name path=processing.authResponse.acquirerName 
| spath input=json output=transaction_status path=transaction.status 
| spath input=json output=reason_messages path=history{}.reasonMessage
| eval acquirer= card_scheme . ":" . acquirer_name . ":" . country_code
| eval final_reason_message=mvindex(reason_messages, 1)
| eval error=if(like(transaction_status,"%FAILED%"),1,0)
| eval error_message=if(like(transaction_status,"%FAILED%"),final_reason_message, null()) 
| stats count as total sum(error) as errors mode(error_message) as most_common_error_message by _time, acquirer
| eval ten_minutes_error_rate=100*exact(errors)/exact(total) 
| outlier action=TF total errors
| sort 0 _time
| streamstats time_window=24h sum(total) as twentyfour_hours_total sum(errors) as twentyfour_hours_errors by acquirer
| eval twentyfour_hours_error_rate=100*exact(twentyfour_hours_errors)/exact(twentyfour_hours_total)
| eval outlier = ten_minutes_error_rate - twentyfour_hours_error_rate
| where outlier > 20
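
(On the outlier line: as I read the docs, it works on the interquartile range, treating values more than param times the IQR outside the 25th/75th percentiles as outliers, and action=TF truncates them to that boundary instead of dropping the row. Spelling out the defaults I believe I'm relying on, that line should be equivalent to:

| outlier action=transform param=2.5 uselower=false total errors

though those defaults are my reading of the documentation rather than something I've verified.)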

I would like to get critiques of it. With some sample data it detected what I expected it to detect, but I'm not sure whether I'm reinventing the wheel now that I know about the outlier command. (I don't think it's as simple as just pointing outlier at the rates, though: because I'm dealing with ratios, my understanding is that standard deviations and averages have to be applied carefully, which is why I compare against the 24-hour ratio itself and not the average of the 10-minute ratios.) Does the query make any sense at all, with or without the outlier step?
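
For context, the stdev-based variant I decided against would look something like this (the 3-sigma threshold is an arbitrary illustration), replacing the last four lines of the query above:

| streamstats time_window=24h avg(ten_minutes_error_rate) as avg_rate stdev(ten_minutes_error_rate) as stdev_rate by acquirer
| where ten_minutes_error_rate > avg_rate + 3*stdev_rate

My worry with it is that averaging per-bucket ratios gives a bucket with 10 transactions the same weight as one with 10,000.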

Thanks
