I'm looking for a way to make an alert trigger only if a certain number of events occurs in each minute of a 3-minute period.
Right now:
index=stuff earliest=-4m@m latest=-1m@m
| bucket span=1m _time
| stats count by _time errormsg
This runs every 3 minutes, with the alert set to trigger if the event count is > 5.
The problem is that a single-minute spike of more than 5 errors within that 3-minute search window triggers the alert. I want the alert to trigger only if every minute in that 3-minute window exceeds 5 errors.
Thanks in advance!
If your threshold is based on total errors per minute, you can do something like this:
earliest=-4m@m latest=-1m@m
| bucket span=1m _time
| stats count as TotalErrCount values(errormsg) by _time | where TotalErrCount>5
Then alert if you get 3 results. If you want to count each specific message instead, and look for cases where a specific errormsg occurs more than 5 times in each of the 3 minutes, the search below should do it, again alerting on 3 results. It would take some refinement to provide per-message counts in the results; the goal here was to return 3 results only if at least one specific errormsg occurred more than 5 times in each minute:
earliest=-4m@m latest=-1m@m
| bucket span=1m _time
| stats count as ErrCount by _time errormsg | where ErrCount>5 | stats values(errormsg) by _time
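As a possible refinement, here is a rough sketch that keeps the per-message counts and only returns a message if that same message exceeded 5 in every one of the 3 minutes. As far as I can tell, the search above returns a row for any minute in which some message exceeded 5, so 3 results do not guarantee it was the same message each time. This assumes the field is named errormsg and the index is stuff, as in the question; MinutesOver and TotalErrCount are just names I made up for illustration.

```
index=stuff earliest=-4m@m latest=-1m@m
| bucket span=1m _time
| stats count as ErrCount by _time errormsg
| where ErrCount>5
| stats dc(_time) as MinutesOver sum(ErrCount) as TotalErrCount by errormsg
| where MinutesOver=3
```

With this version you would alert when the number of results is greater than 0 rather than equal to 3, since each remaining row is one message that was over the threshold in all three minute buckets.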
Your 2nd query is what I was looking for. Thank you!
So are you trying to alert when the overall count of errors exceeds 5 for all 3 minutes or do the specific different errormsg values need to factor in?
Does your search always result in 3 rows? If so, you can try something like this:
index=stuff earliest=-4m@m latest=-1m@m | bucket span=1m _time | stats count by _time errormsg | eval flag = if(count >5, 1, 0) | eventstats sum(flag) as total | search total = 3
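If the search does not always return 3 rows (a minute with no events produces no row at all), a variation that counts qualifying minutes directly might be safer. This is only a sketch for the total-errors-per-minute case, again assuming index=stuff from the question; TotalErrCount and MinutesOver are illustrative names.

```
index=stuff earliest=-4m@m latest=-1m@m
| bucket span=1m _time
| stats count as TotalErrCount by _time
| where TotalErrCount>5
| stats count as MinutesOver
| where MinutesOver=3
```

This returns a single row only when all three minute buckets exceeded 5 errors, so the alert condition can simply be "number of results is greater than 0".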