I have a challenge that I'm hoping someone can help with.
Around 24,000,000 events are indexed every 24 hours, and there are around 500 known error conditions, each identified by text within the event. The challenge is to alert on deviation from a 4-week time-slice average, both for the total number of events across all instances and for each individual instance.
With some input from the guys at .Conf I have opted to use eventtype definitions to identify known error events. The configuration for defining the eventtype tagging is deployed and working although it has slowed the responsiveness of the dashboards somewhat.
I need to raise an alert if the total of any eventtype across all application instances for the previous 4 minutes is 10% greater than a baseline calculated from the average of the previous 4 Mondays for the same period.
I also need to alert in a similar way if a single application instance's eventtype count deviation is 20% greater than the average across all application instances for the same time slice for the previous 4 Mondays.
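To make the two conditions concrete, here is a minimal Python sketch of the arithmetic only (not a Splunk search). The function names and input shapes are hypothetical, and it assumes the per-instance baseline is the mean of every instance's count over the 4 Mondays, which is just one way to read the requirement.

```python
from statistics import mean

def total_alert(current_total, baseline_totals, pct=10.0):
    """True if the current 4-minute total is more than pct percent
    above the mean of the totals from the previous 4 Mondays."""
    baseline = mean(baseline_totals)
    return current_total > baseline * (1 + pct / 100.0)

def instance_alerts(current_by_instance, baseline_by_instance, pct=20.0):
    """Return the instances whose current count is more than pct percent
    above the average across all instances for the previous 4 Mondays.

    baseline_by_instance maps instance -> list of 4 counts (assumed shape).
    """
    # Assumption: pool every instance's Monday counts into one average.
    all_counts = [c for counts in baseline_by_instance.values() for c in counts]
    limit = mean(all_counts) * (1 + pct / 100.0)
    return sorted(i for i, c in current_by_instance.items() if c > limit)
```

For example, `total_alert(120, [100, 100, 100, 100])` is true because 120 is more than 10% above the baseline of 100.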
I've created a search that counts the eventtype per application instance per minute and that is now going to a summary index and runs every hour.
source="*frame.log" OR source="*HHmm.log" earliest=-24h latest=-23h | stats count(eventtype) as etype by date_minute eventtype App_Instance
I guess the next step might be to create a search that runs every minute, calculates the count per App_Instance and the total for the previous 4-minute window, and writes the result to either an external file or another summary index.
Then I would create a search that uses a subsearch to populate the outer search, raising alerts for each eventtype.
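As a rough illustration of that roll-up step, the Python sketch below sums per-minute summary rows into 4-minute window counts, both per instance and in total. The row layout `(minute, eventtype, instance, count)` and the function name are assumptions for illustration, not the actual summary index format.

```python
from collections import defaultdict

def window_counts(rows, start_minute, width=4):
    """Sum per-minute counts that fall inside [start_minute, start_minute + width).

    rows: iterable of (minute, eventtype, instance, count) tuples (assumed layout).
    Returns (per_instance, total), where per_instance is keyed by
    (eventtype, instance) and total is keyed by eventtype.
    """
    per_instance = defaultdict(int)
    total = defaultdict(int)
    for minute, eventtype, instance, count in rows:
        if start_minute <= minute < start_minute + width:
            per_instance[(eventtype, instance)] += count
            total[eventtype] += count
    return dict(per_instance), dict(total)
```

Running the same function over the matching window on each of the previous 4 Mondays would give the inputs for the baseline averages.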
Any advice and help would be appreciated.
Good use case! To make sure I understand: you have a number of eventtypes (and applications) that you want to baseline, and you want to compute the difference between the current rate and the baseline rate to see whether the system is no longer behaving normally?
For example, for eventtype="foo", if you have a count of 250 in the last 4 minutes, how does this compare to the average count of eventtype="foo" over the same period on the previous 4 Mondays?
A challenge with these comparisons is the number of false positives and false negatives that can result, because a simple average and percentage deviation are often not sufficient to model the data accurately. Apologies for the statistical terminology, but an ideal baseline should be fit to a probability distribution function that accurately models the data. Different kinds of data may require different probability distribution functions. And if you are trying to do this across multiple data types or dimensions, it can be difficult to implement, as you'll need to store these multiple baselines across a large number of dimensions.
My company has developed an analytics app that will be able to accomplish this very thing and make it simple to use.
Here's a link to our app: http://splunk-base.splunk.com/apps/68765/prelert-anomaly-detective
I am doing something similar. I divided the time into buckets and then calculated the thresholds and standard deviation from them. As a performance tip, try to use summary indexing. Schedule a search that gathers just the data you need into a small summary, so that your searches don't get bogged down by the amount of data to be searched.
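A minimal Python sketch of the threshold idea, assuming the threshold is the mean of the historical bucket counts plus a multiple of their sample standard deviation. The multiplier `k` and the function names are my own illustrative choices, not something fixed by the approach.

```python
from statistics import mean, stdev

def threshold(history, k=3.0):
    """Upper threshold: mean of the historical bucket counts plus
    k sample standard deviations (k is an assumed tuning value)."""
    return mean(history) + k * stdev(history)

def is_anomalous(current, history, k=3.0):
    """True if the current bucket count exceeds the threshold."""
    return current > threshold(history, k)
```

With only four historical buckets, such as four Mondays, the standard deviation estimate is noisy, so `k` would need tuning against real data.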