Smoothing a running average

Esperteyu

Hi,

I've been trying to create an alert on an errors/total ratio. The approach I'm focusing on for now (though I'd welcome something more accurate) is to alert when the error ratio over the last 10 minutes exceeds the error ratio over the last 24 hours by more than 20 percentage points.

I first tried trendline, with no luck, because I need a by clause (see https://answers.splunk.com/answers/692424/trendline-to-work-grouped-by-field.html). After trying a few other things I eventually came up with "something".
Then I started thinking about smoothing the 24-hour average so that some spikes don't go undetected, and I found the outlier command, which I used in a very naive way. The query I have at the moment is this one:

index="logger" "Raw Notification" 
| bin _time span=10m
| eval _raw=replace(_raw,"\\\\\"","\"") 
| rex "\"RawRequest\":\"(?<raw_request>.+)\"}$" 
| eval json= raw_request 
| spath input=json output=country_code path=customer.billingAddress.countryCode 
| spath input=json output=card_scheme path=paymentMethod.card.cardScheme 
| spath input=json output=acquirer_name path=processing.authResponse.acquirerName 
| spath input=json output=transaction_status path=transaction.status 
| spath input=json output=reason_messages path=history{}.reasonMessage
| eval acquirer= card_scheme . ":" . acquirer_name . ":" . country_code
| eval final_reason_message=mvIndex(reason_messages, 1)
| eval error=if(like(transaction_status,"%FAILED%"),1,0)
| eval error_message=if(like(transaction_status,"%FAILED%"),final_reason_message, NULL()) 
| stats count as total sum(error) as errors mode(error_message) as most_common_error_message by _time, acquirer
| eval ten_minutes_error_rate=100*exact(errors)/exact(total) 
| outlier action=TF total errors
| sort 0 _time
| streamstats time_window=24h sum(total) as twentyfour_hours_total sum(errors) as twentyfour_hours_errors by acquirer
| eval twentyfour_hours_error_rate=100*exact(twentyfour_hours_errors)/exact(twentyfour_hours_total)
| eval outlier = ten_minutes_error_rate - twentyfour_hours_error_rate
| where outlier > 20

I would like some feedback on it. With some sample data it detected what I expected it to detect, but I'm not sure whether I'm reinventing the wheel now that I know about the outlier command. I don't think it's as easy as just using outlier to detect the outliers in my case: since I'm working with ratios, my understanding is that standard deviations and averages have to be used carefully, which is why I used the overall ratio (sum of errors over sum of totals) rather than the average of the ratios. Does the query make sense at all, with or without the outlier step?
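
To make the ratio concern concrete, here is a minimal sketch with made-up numbers (the two buckets and their counts are hypothetical, not taken from the real data). One bucket has 1000 requests with 10 errors (1%), the other has 10 requests with 5 errors (50%). The average of the two ratios is 25.5%, while the ratio of the sums is 15/1010, roughly 1.49%, so a low-volume bucket can dominate an average of ratios:

```spl
| makeresults count=2
| streamstats count as n
| eval total=if(n=1, 1000, 10), errors=if(n=1, 10, 5)
| eval ratio=100*errors/total
| stats sum(errors) as errors sum(total) as total avg(ratio) as average_of_ratios
| eval ratio_of_sums=round(100*errors/total, 2)
```

This is why the query above sums total and errors over the 24-hour window first and only then divides, instead of averaging ten_minutes_error_rate.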

Thanks
