Hi, I have a few alerts that look at the failure rates of my services. Each one has a condition that triggers the alert if the failure rate is > 10% AND the number of failed requests is > 200.
This is not really an ideal way to do monitoring. Is there a way in Splunk to use AI to detect anomalies or outliers over time? Basically, if Splunk detects a failure pattern and that pattern is consistent, it should not trigger an alert; it should trigger only when the failures go beyond the usual threshold.
Can we do this kind of thing in Splunk using built-in ML or AI?
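Take a look at the Machine Learning Toolkit (MLTK); it has some good examples on outliers. You can also roll your own. For example, this type of search counts matching events per minute and flags any minute whose count falls more than 3 standard deviations from the average over a rolling 60-minute window: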
search error
| bin _time span=1m
| stats count by _time
| streamstats window=60 avg(count) as avg stdev(count) as stdev
| eval multiplier = 3
| eval lower_bound = avg - (stdev * multiplier)
| eval upper_bound = avg + (stdev * multiplier)
| eval outlier = if(count < lower_bound OR count > upper_bound, 1, 0)
| table _time count lower_bound upper_bound outlier
Results Example
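If the goal is an alert rather than a table, one possible variation is sketched below. It is untested against your data and rests on a few assumptions: that `search error` really matches your failed requests, that you only care about spikes (so the lower bound is dropped), and that the 200 floor from the question makes sense as a per-minute value, which you will likely need to adjust. `current=f` asks streamstats to leave the current minute out of its own baseline, so a spike does not inflate the average it is compared against.

```
search error
| bin _time span=1m
| stats count by _time
| streamstats current=f window=60 avg(count) as avg stdev(count) as stdev
| eval upper_bound = avg + (stdev * 3)
| where count > upper_bound AND count > 200
```

Saved as an alert that triggers when the number of results is greater than 0, this should stay quiet while the error count follows its usual pattern and fire only when a minute is both unusually high and above the absolute floor. Note that `stats count by _time` produces no row for minutes with zero matching events, so quiet minutes are not part of the baseline.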