If you have a user who is "constantly failing" over a period of time, then that is a training problem.
If your job runs every 10 minutes over a 10-minute timeframe and decides whether to alert, you can just change it to run over a 20-minute timeframe and alert only for those users who deserve an alert in the second 10 minutes but did not deserve (and therefore probably did not receive) one in the first 10 minutes. It won't be the Splunk alert that suppresses the long-term fails, but the search itself.
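To make the comparison concrete, here is a minimal sketch of the two-window check in Python rather than SPL. It assumes each window has already been reduced to a list of usernames, one entry per failure event; the function name `users_to_alert` and the threshold of 5 failures are hypothetical and stand in for whatever your existing search uses to decide that a user "deserves" an alert.

```python
from collections import Counter

# Hypothetical threshold: failures per 10-minute window that warrant an alert.
THRESHOLD = 5


def users_to_alert(prev_window, curr_window, threshold=THRESHOLD):
    """Return users who cross the threshold now but did not in the prior window.

    Each argument is a list of usernames, one entry per failure event.
    """
    prev = Counter(prev_window)
    curr = Counter(curr_window)
    # Counter returns 0 for users with no failures in the prior window.
    return sorted(
        user
        for user, count in curr.items()
        if count >= threshold and prev[user] < threshold
    )


if __name__ == "__main__":
    first_10_min = ["bob"] * 5
    second_10_min = ["bob"] * 6 + ["alice"] * 5
    # bob was already over the threshold in the first window, so only alice is new.
    print(users_to_alert(first_10_min, second_10_min))
```

In a real search the same comparison would be done per user inside the 20-minute query, so only newly failing users ever reach the alert action.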
You could actually go one further, just in case someone keeps failing long-term. Do the calculation across a 30-minute period, split into three 10-minute periods. If the present period is an alert, suppress it only if the prior period was an alert but two periods ago was NOT an alert (meaning the alert already went out last time). Basically, if the guy has been failing for 30 minutes straight then there is something really wrong with him and we should send Mongo to go break his legs...
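The three-period rule is easy to get backwards, so here is a minimal sketch of it as a plain Python function. The name `alert_decision` and the boolean inputs are assumptions for illustration; each flag means "this user would have triggered an alert in that 10-minute period".

```python
def alert_decision(two_ago: bool, prior: bool, present: bool) -> bool:
    """Decide whether to send an alert for the present period.

    Suppress only when the prior period alerted and the one before it did not,
    i.e. when the user was already reported one period ago.
    """
    if not present:
        return False  # nothing to alert on right now
    already_reported = prior and not two_ago
    return not already_reported


if __name__ == "__main__":
    # Walk through a user who starts failing and never stops.
    history = [False, False, True, True, True, True]
    for i in range(2, len(history)):
        two_ago, prior, present = history[i - 2], history[i - 1], history[i]
        print(i, alert_decision(two_ago, prior, present))
```

With this rule a user who keeps failing is reported when the failures start, skipped once, and then reported again on every period from the third straight failing period onward.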