Alerting

Smoothing a running average

Esperteyu
Explorer

Hi,

What I've been trying to do lately is create an alert on top of an errors/total ratio. The option I've focused on for the moment (though I'd happily take something more accurate) is to alert when the last-10-minutes ratio exceeds the last-24-hours ratio by more than 20 percentage points.

For that I tried to use trendline, with no luck, since I need a by clause (see https://answers.splunk.com/answers/692424/trendline-to-work-grouped-by-field.html), tried a few other things, and eventually came up with "something".
Then I started to think about smoothing the 24-hour average so that spikes wouldn't go undetected, and found the outlier command, which I used in a fairly naive way. The query I have at the moment is this one:

index="logger" "Raw Notification" 
| bin _time span=10m
| eval _raw=replace(_raw,"\\\\\"","\"") 
| rex "\"RawRequest\":\"(?<raw_request>.+)\"}$" 
| eval json= raw_request 
| spath input=json output=country_code path=customer.billingAddress.countryCode 
| spath input=json output=card_scheme path=paymentMethod.card.cardScheme 
| spath input=json output=acquirer_name path=processing.authResponse.acquirerName 
| spath input=json output=transaction_status path=transaction.status 
| spath input=json output=reason_messages path=history{}.reasonMessage
| eval acquirer= card_scheme . ":" . acquirer_name . ":" . country_code
| eval final_reason_message=mvindex(reason_messages, 1)
| eval error=if(like(transaction_status,"%FAILED%"),1,0)
| eval error_message=if(like(transaction_status,"%FAILED%"),final_reason_message, null()) 
| stats count as total sum(error) as errors mode(error_message) as most_common_error_message by _time, acquirer
| eval ten_minutes_error_rate=100*exact(errors)/exact(total) 
| outlier action=TF total errors
| sort 0 _time
| streamstats time_window=24h sum(total) as twentyfour_hours_total sum(errors) as twentyfour_hours_errors by acquirer
| eval twentyfour_hours_error_rate=100*exact(twentyfour_hours_errors)/exact(twentyfour_hours_total)
| eval outlier = ten_minutes_error_rate - twentyfour_hours_error_rate
| where outlier > 20
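
For what it's worth, the core of the alert logic (leaving Splunk aside) reduces to: bucket events into 10-minute bins per acquirer, compute each bin's error rate, compare it against the rate over the trailing 24 hours, and flag when the difference exceeds 20 percentage points. A minimal Python sketch of that idea, with invented numbers (the trailing window includes the current bin, matching how streamstats time_window=24h behaves in the query):

```python
from collections import deque

def detect_spikes(buckets, window=144, threshold=20.0):
    """buckets: list of (errors, total) per 10-minute bin, oldest first.
    window: number of 10-minute bins in 24 hours (24 * 6 = 144).
    Returns indices of bins whose error rate exceeds the trailing
    24-hour rate by more than `threshold` percentage points."""
    trailing = deque(maxlen=window)  # rolling 24-hour window of bins
    flagged = []
    for i, (errors, total) in enumerate(buckets):
        trailing.append((errors, total))
        day_errors = sum(e for e, _ in trailing)
        day_total = sum(t for _, t in trailing)
        rate_10m = 100.0 * errors / total if total else 0.0
        rate_24h = 100.0 * day_errors / day_total if day_total else 0.0
        if rate_10m - rate_24h > threshold:
            flagged.append(i)
    return flagged

# Quiet traffic (2% errors) followed by one spike bin (50% errors)
history = [(2, 100)] * 143 + [(50, 100)]
print(detect_spikes(history))  # [143] -- only the spike bin is flagged
```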

I would like to get critiques of it. With some sample data it detected what I expected it to detect, but I'm not sure whether I'm reinventing the wheel now that I know about the outlier command (although I don't think it's as easy as just using it to detect the outliers in my case: since I'm working with ratios, my understanding is that standard deviations and averages have to be used carefully, which is why I compared against the ratio itself and not the average of the ratios), or whether the query makes any sense at all, with or without the outlier step.
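
On that last point, the caution seems justified: the ratio of the sums (what streamstats computes in the query) weights each bin by its traffic, while an average of per-bin ratios treats a 10-event bin the same as a 10,000-event bin, so a few quiet bins can badly skew the baseline. A small illustration with invented numbers:

```python
bins = [(1, 10), (50, 10000)]  # (errors, total) per bin

# Ratio of sums: total errors over total events, traffic-weighted
ratio_of_sums = 100 * sum(e for e, _ in bins) / sum(t for _, t in bins)

# Mean of per-bin ratios: each bin counts equally regardless of volume
mean_of_ratios = sum(100 * e / t for e, t in bins) / len(bins)

print(round(ratio_of_sums, 2))   # 0.51 -- dominated by the busy bin
print(round(mean_of_ratios, 2))  # 5.25 -- inflated by the tiny bin
```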

Thanks
