I have a hypothetical search that runs every 5 minutes and scans the last hour's worth of data for certain errors:
index=log "error" earliest=-1h | eventstats c as user_errors by username | where user_errors>5 | dedup username | table ip, username, user_errors
I want to alert on users who cause more than 5 errors in any one-hour period.
This search generates approximately 10 alerts each run, of which 6-8 are for the same users, because the "sliding" window is an hour wide and is scanned every 5 minutes.
Ideally I'd like to throttle alerts per username.
So I want to receive alerts for all "unique" users, but never receive alerts for the same user more often than once an hour.
The only way I see to do that is to set "Alerting mode" to "Once per result" and "Per result throttling fields" to username.
What happens after I do that is that I receive an email alert for only one user and it misses all the other users. Then I receive no further alerts whatsoever until the next hour, and again only for one user.
Any way to fix that?
You can do this with some variation of dynamic lookups:
http://wiki.splunk.com/Dynamically_Editing_Lookup_Tables
One approach is like this:
You have a lookup table with input field username and output field last_alert_time.
Before you generate an alert (in the search), you do a lookup in last_alert_by_user_lookup.csv for username to get last_alert_time, and only alert if _time - last_alert_time > threshold_seconds (or if last_alert_time is null).
Every time that you generate an alert, you call a script to update last_alert_by_user_lookup.csv to "upsert" the user's last_alert_time (you could do this with another scheduled search, among other ways, but a script is probably easiest).
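A minimal sketch of such an upsert script in Python (the filename and column names are the hypothetical ones used above; in a real alert script you'd read the alerted usernames from the results Splunk hands the script):

```python
import csv
import os
import time

LOOKUP_PATH = "last_alert_by_user_lookup.csv"  # hypothetical lookup file path

def upsert_last_alert(username, alert_time, path=LOOKUP_PATH):
    """Insert or update the user's last_alert_time in the lookup CSV."""
    rows = {}
    # Load any existing rows so other users' timestamps are preserved.
    if os.path.exists(path):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                rows[row["username"]] = row["last_alert_time"]
    # Upsert: overwrite this user's timestamp, or add a new row.
    rows[username] = str(int(alert_time))
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["username", "last_alert_time"])
        for user, ts in rows.items():
            writer.writerow([user, ts])

if __name__ == "__main__":
    upsert_last_alert("some_user", time.time())
```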
Yes, I think this is a good approach.
It's reasonably flexible and can be applied to more complex throttling logic as well.
Thank you.
Please "accept" my answer if it works for you.
No prob, thank you.
You could change your search to this:
index=log "error" earliest=-1h | stats c as user_errors latest(ip) as ip by username | where user_errors > 5
It should be a bit more efficient than a late dedup
... though that's unrelated to the actual question, of course.
Thank you.