Hi guys, I just need to pick your brains.
How can I create an alert that triggers only when errors persist for more than 2 minutes?
index=your_logs "error"
| bin _time span=1m
| stats count as error_count by _time
| streamstats window=2 current=t count(error_count) as consecutive_error_minutes
| where consecutive_error_minutes >= 2
| stats count as alert_trigger
But when I set up the alert like this:
Time Range: Set to Last 5 minutes
Cron Schedule: Run it every 5 minutes (e.g., */5 * * * *).
Trigger Condition: Set to Number of Results > 0.
Cron Schedule: */5 * * * * (Runs every 5 minutes).
Time Range:
Earliest: -7m
Latest: -2m
For some reason the Alert just triggered with
Alert Trigger = 0
Not sure what went wrong?
Adding to what has already been said, it seems you'd be best off with just streamstats, but with a time window.
index=your_logs "error"
| streamstats time_window=2m count values(_time) as _time
| where count>=2
If you have multiple types of error and want to check each of them separately you can add some form of a "by" clause to streamstats like
by errorcode
One caveat: for long-lasting errors it will give you several results, one for each subsequent error event.
Thank you, I'll give this a try.
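As a rough sketch of the "by" variant, assuming your events carry a field named errorcode (a placeholder name, substitute whatever field identifies the error type in your data), the search above would become something like:

```spl
index=your_logs "error"
| streamstats time_window=2m count values(_time) as _time by errorcode
| where count>=2
```

Each error code then gets its own rolling two-minute count, so two different errors close together no longer add up to 2.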
Then you need to decide what you mean by "within two minutes".
Does it mean clock minutes (xx:y1:zz), or does it mean that the events happened within a two-minute span, counting milliseconds too? If the first is enough, then bin is the correct answer, but if it's the second, then you need something like stats + range. And always try to use stats first instead of *stats, as this way you can take advantage of indexer parallelism (map + reduce), get better response times and use fewer resources!
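A minimal sketch of the stats + range idea, under the assumption that "persisting for more than 2 minutes" means the first and last error in the search window are at least 120 seconds apart. The field names error_count and error_span_seconds are made up for illustration, and this version does not check for gaps between the first and last error:

```spl
index=your_logs "error"
| stats count as error_count range(_time) as error_span_seconds
| where error_count >= 2 AND error_span_seconds >= 120
```

Because the final where drops the row when the condition is not met, a trigger condition of Number of Results > 0 only fires when the span is long enough.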
I think your logic is flawed. If there are gaps, i.e. minutes with no errors, then the rows will not be "consecutive" minutes, just adjacent rows in the results. So if you have errors at 8:01 and 8:03, you will get a count of 2 consecutive error minutes, which I assume is not what you want.
You would be better off using timechart, as that will give you a populated value for each time interval. See this example using timechart and a changed streamstats. Run it with a time range of the last 60 minutes and you can see the effect. Comment out the last line to see how the count is calculated.
| makeresults count=20
| eval _time=now() - ((random() % 30) * 60)
| timechart span=1m count as error_count
| streamstats window=2 current=t count(eval(error_count>0)) as consecutive_error_minutes
| where consecutive_error_minutes >= 2
This will return a list of minutes where the consecutive error count was >= 2.
Note that this will drop the minute in which the error first occurred, as streamstats will record that minute with a count of 1, so the results will not include the first minute of the error. Again, is this what you want?
Hopefully this helps, but add any extra detail if this does not get you to where you want to get to.
That is a good point!
What should happen is:
Errors at 8:01 and 8:03 should not trigger, because 8:02 is missing (no error event was logged for that minute).
But errors at 7:57, 7:58 and 7:59 should trigger, because the errors ran from 7:57 to 7:59, which is 2 minutes.
Hope this helps?
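One way to express that requirement, as an untested sketch, is to widen the streamstats window in the timechart approach to 3. This assumes that "7:57 to 7:59" means three consecutive one-minute buckets, each containing at least one error; your_logs is the placeholder index used throughout the thread:

```spl
index=your_logs "error"
| timechart span=1m count as error_count
| streamstats window=3 current=t count(eval(error_count>0)) as consecutive_error_minutes
| where consecutive_error_minutes >= 3
```

With errors at 8:01 and 8:03 only, timechart fills 8:02 with a count of 0, so no three-row window contains three error minutes and nothing is returned.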
And another small, but significant issue is that you will "miss" consecutive errors that occur on your search boundary, e.g. your search runs at 0, 5, 10... and searches 53-58, 58-03, 03-08...
But as you're requiring a count of 2 or more, you're only actually looking at 4 possible minutes.
So, if you have errors at 03 and 04, you will never see that as 2 consecutive errors. So, you want your search window to be -8 to -2, so there is a 1 minute overlap, so you're using a full 5 minute window for >1 error count.
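As a sketch of that overlapping window, the time range can also be written directly in the search with time modifiers. Snapping to the minute with @m is my own assumption, added so that the one-minute buckets line up from one run to the next; the rest of the pipeline stays as it is:

```spl
index=your_logs "error" earliest=-8m@m latest=-2m@m
| timechart span=1m count as error_count
```

With the alert still scheduled every 5 minutes, each run covers 6 minutes, so the last minute of one run is looked at again as the first minute of the next.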
When the query ends with stats count, it will always return one result. Therefore, Number of Results > 0 will always trigger the alert. Add a where command to the alert so it only returns results if there are consecutive errors.
index=your_logs "error"
| bin _time span=1m
| stats count as error_count by _time
| streamstats window=2 current=t count(error_count) as consecutive_error_minutes
| where consecutive_error_minutes >= 2
| stats count as alert_trigger
| where alert_trigger > 0
That said, I have doubts about the methodology used. The current query will trigger if two consecutive errors are detected, but what if they're different errors? Does it matter? I would think that two different errors would not be considered "persistence".
What should happen is:
Errors at 8:01 and 8:03 should not trigger, because 8:02 is missing (no error event was logged for that minute).
But errors at 7:57, 7:58 and 7:59 should trigger, because the errors ran from 7:57 to 7:59, which is 2 minutes.
what if they're different errors?
I've filtered my search to look for only one type of error.
Does it matter?
No.
Thank you, I'll give this a try.