Is there a way to create a general alert that triggers when anything suddenly shows a significant increase in log messages? For example...
any_host experiences a large number of the same event (login failures, multipath errors, read-only file systems, etc.) in a given time period
It's basically a catch-all type of alert. This is in place of writing an alert for every possible 'large increase in activity'. We're writing alerts as we see things come in, and a lot of stuff is falling through the cracks.
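To make the ask concrete, here is a rough sketch of the kind of check I mean, written in Python outside Splunk. Everything in it is an assumption for illustration: the function name `find_spikes`, the per-bucket dictionaries keyed by (host, event type), the 3-sigma threshold and the `min_count` floor.

```python
from statistics import mean, pstdev

def find_spikes(history, current, sigmas=3.0, min_count=10):
    """Return (host, event_type) keys whose latest count is far above baseline.

    history: list of dicts, one per past time bucket, mapping
             (host, event_type) -> event count in that bucket.
    current: dict of the same shape for the latest bucket.
    """
    spikes = []
    for key, count in current.items():
        # A key missing from a past bucket counts as zero events.
        past = [bucket.get(key, 0) for bucket in history]
        if not past:
            continue  # no baseline yet, so nothing to compare against
        threshold = mean(past) + sigmas * pstdev(past)
        # min_count keeps tiny absolute numbers from alerting.
        if count >= min_count and count > threshold:
            spikes.append(key)
    return sorted(spikes)

history = [
    {("web01", "login failure"): 2, ("db01", "multipath error"): 0},
    {("web01", "login failure"): 3, ("db01", "multipath error"): 1},
    {("web01", "login failure"): 2, ("db01", "multipath error"): 0},
    {("web01", "login failure"): 4, ("db01", "multipath error"): 1},
]
current = {("web01", "login failure"): 3, ("db01", "multipath error"): 40}
print(find_spikes(history, current))
```

The point is that one rule covers every host and event type, instead of one alert per known failure.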
A method I'm working on for this involves doing the following:
I'm working out the exact technical details, but the above is the approach I'm currently experimenting with making work.
Or you could just use the predict command. It calculates a band of normalcy, and then you eval whether your real flow has left the band. I wrote a blog post showing how to do that here: http://blogs.splunk.com/2012/11/04/predict-detect/
thanks for the follow up... indeed, non-normal data basically leaves you with two choices. A) don't do that, B) chop off the outliers. I hear that choice B plus an economics degree = profit! But seriously, if the consequences of a false positive are dire, it's best not to screw with algorithmic planning and detection unless you're going to add a layer of smarts to it. Bayesian smarts are fairly effective within a reasonable domain, but again you want to be thinking about the big picture of inputs and outcomes. For instance, SpamAssassin on your email is low impact, but HFT on your savings can leave a mark.
Actually, 90 makes the band tighter - 99 is the widest (least sensitive). I also tried the trendline solution above as well - similar result (http://i.imgur.com/CVEsMzF.png). No worries - it's just that some data doesn't follow a Gaussian distribution, so using averages and +/- standard deviations can give misleading results.
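A small illustration of that last point, as a Python sketch with made-up numbers (the `response_ms` sample and the `gaussian_band` helper are hypothetical, not taken from my real data):

```python
from statistics import mean, median, pstdev

def gaussian_band(values, k=2.0):
    """Return (low, high) as mean -/+ k standard deviations."""
    centre = mean(values)
    spread = pstdev(values)
    return centre - k * spread, centre + k * spread

# Skewed sample: mostly fast responses plus a couple of slow ones.
response_ms = [100] * 18 + [1000] * 2

low, high = gaussian_band(response_ms)
print("band:", low, high)
print("median:", median(response_ms))
```

With a long tail like this, the lower bound goes negative (impossible for a response time) and the centre of the band sits well above the typical value, so the band says little about what "normal" looks like.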
I think that goes the other way, as in 90 would have lower sensitivity than 99... not sure though. Also, here's another way to do it which doesn't look into the future (janked from a search Coccyx wrote):
... | trendline sma20(Sales) as trend | eventstats stdev(Sales) as stdev | eval trend=if(isnull(trend),Sales,trend) | eval "High Prediction"=trend+(2*stdev) | eval "Low Prediction"=if(trend-(1.5*stdev)>0,trend-(1.5*stdev),0) | fields - stdev, trend
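For anyone who wants to see what that search computes, here is a rough Python equivalent. It is only a sketch under assumptions: the function names are made up, the default window of 20 mirrors sma20, and I'm assuming the sample standard deviation over the whole series matches what eventstats stdev() returns.

```python
from statistics import stdev

def prediction_bands(values, window=20, high_k=2.0, low_k=1.5):
    """Return a (low, high) pair per point, mirroring the search above."""
    spread = stdev(values)  # one stdev over the whole series (needs >= 2 values)
    bands = []
    for i, value in enumerate(values):
        if i + 1 >= window:
            # Trailing simple moving average: current point plus the
            # previous window-1 points, so nothing from the future.
            trend = sum(values[i + 1 - window : i + 1]) / window
        else:
            # Not enough points yet; fall back to the raw value,
            # like the isnull(trend) case in the search.
            trend = value
        high = trend + high_k * spread
        low = max(trend - low_k * spread, 0.0)  # floor the low band at zero
        bands.append((low, high))
    return bands

def out_of_band(values, **kwargs):
    """Indexes of points that fall outside their own band."""
    bands = prediction_bands(values, **kwargs)
    return [i for i, (v, (low, high)) in enumerate(zip(values, bands))
            if v < low or v > high]

sales = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50, 10, 11]
print(out_of_band(sales, window=3))
```

Note that the first window-1 points can never be flagged, because their trend is just the value itself.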
Thanks for the suggestions! Yes, changing the algorithm to "LL" seems to be the best one for this kind of data, and I also changed the range to the 99th percentile (widest possible). The number of false alerts is much lower than before (no longer 50), but is still about 4-5 for a 4-hour window. (http://i.imgur.com/vcHfltD.png)
Agreed -- luckily there are some tuning options for that command. Here's the manual for reference: http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Predict
My initial inclination is to broaden the range from the 95th percentile, using the upper and lower options. You might also try out different algorithms. I suspect that fare request response times have some periodicity, and LLP or LLT might work better?
Hmmm...I tried the predict command as suggested in your blog on some response time data and during a 4-hour window of what I know are "normal" values, the upper95 and lower95 bands were crossed almost 50 times (http://i.imgur.com/IZIbgqY.png). That is a lot of false alerts.
The Prelert Anomaly Detective app uses machine-learning algorithms to automatically learn the baseline rates of your events (or the values of performance metrics) and uses that information to detect anomalies in current data. It can auto-learn the baseline in 3 modes:
Sounds like it would be useful for your use-case!