We had an issue come up this morning: a sudden, huge spike in one type of error in our error logs. We normally see 1-100 of these a day, but in the first hour this morning we had over 1,000. We noticed it visually on one of our dashboards, so we were able to jump in and address it quickly. Yay for that. However, we had to find it visually, because at the moment I don't have an alert that does what I'd like it to.
I have an alert set up that looks for outliers using standard deviation. However, no matter how I tweak it, I either get so many alerts that it isn't useful, or I don't get the alerts I need.
Here's my current standard deviation search for the alert:
index=ecm sourcetype="ibm:was:system" host=PRDFNCM CIWEB AND Error AND "Exception" NOT "CIWEB.Plugin" | rex field=_raw ".(?&lt;ExceptionName&gt;\w*?Exception)" | bucket _time span=1d | stats count BY _time ExceptionName | eventstats stdev(count) as stdev BY ExceptionName | where count > (3 * stdev)
This morning the standard deviation was calculated as 477.73 and the count was 1377, so the count fell below 3 × stdev (≈ 1433) and it didn't alert.
Since a normal day for this error is under 100, it seems to me like the standard deviation is off, but I don't know how to fix it.
Any help or advice would be much appreciated.
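(Editor's note, hedged: one likely issue with the search above is that it compares the count against 3 × stdev alone, with no mean term, and the spike day itself is included in the stdev it is measured against. A minimal sketch of an avg + 3 × stdev variant of the same search — the field name "avg" is illustrative, and this is untested against the poster's data:)

index=ecm sourcetype="ibm:was:system" host=PRDFNCM CIWEB AND Error AND "Exception" NOT "CIWEB.Plugin" | rex field=_raw ".(?&lt;ExceptionName&gt;\w*?Exception)" | bucket _time span=1d | stats count BY _time ExceptionName | eventstats avg(count) as avg stdev(count) as stdev BY ExceptionName | where count > (avg + 3 * stdev)

(Even with the mean included, a large spike still inflates its own baseline; excluding the most recent day from the avg/stdev calculation would tighten this further.)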
We have something like this, specifically with HTTP 500 errors. We normally get around 50 or so an hour. So I set up the alert to simply search for 500s, run stats, add totals, and e-mail if the total is over 75.
index=application (host=TTAPPPEGACC*) sourcetype="apollo:prod:tomcat_access" httpcode=500 | eval host=upper(host) | stats count by host | addtotals col=true
I then set up the alert screen as shown.
I started with something like that, and it works nicely if you know what you're looking for. The problem is we're monitoring an unknown number of errors, and they all have different 'normal' thresholds. I'm trying to avoid hard-coding everything and updating the alert every week.
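(Editor's note, hedged: one common pattern for per-error dynamic thresholds is a scheduled search that writes each error's rolling baseline to a lookup, which the alert search then reads. The lookup name error_baselines.csv and the time windows below are illustrative, not from the original thread.)

Baseline search, scheduled e.g. nightly over the last 30 days:

index=ecm sourcetype="ibm:was:system" host=PRDFNCM CIWEB AND Error AND "Exception" NOT "CIWEB.Plugin" | rex field=_raw ".(?&lt;ExceptionName&gt;\w*?Exception)" | bucket _time span=1d | stats count BY _time ExceptionName | stats avg(count) as avg stdev(count) as stdev BY ExceptionName | outputlookup error_baselines.csv

Alert search, scheduled e.g. hourly over the last day:

index=ecm sourcetype="ibm:was:system" host=PRDFNCM CIWEB AND Error AND "Exception" NOT "CIWEB.Plugin" | rex field=_raw ".(?&lt;ExceptionName&gt;\w*?Exception)" | stats count BY ExceptionName | lookup error_baselines.csv ExceptionName OUTPUT avg stdev | where count > avg + (3 * stdev)

New exception names with no baseline yet would have null avg/stdev and fall out of the where clause, so they may deserve a separate "unknown error" alert.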