All,
Had an AD outage today which caused issues, but we had no alert. I was just going to compare time ranges in 15-minute buckets for spikes, but thought there might be a fancier way to do it. Any input on options for alerting against the authentication spike below?
Take a look at this Question&Answer:
https://answers.splunk.com/answers/511894/how-to-use-the-timewrap-command-and-set-an-alert-f.html
Basically, timechart is a pretty good base solution for this... but I'm wondering what your count field on that kind of event contains. Is it the count in a particular unit of time? Is there even such a field? When you then run count within timechart, it counts events rather than reading that field, so whatever value might have been in the earlier count field is gone and only the _time field is really passed through. We've changed count to index below to clarify what is really happening.
tag=authentication tag=failure
| fields index
| timechart span=5m count as FailCount
| eventstats avg(FailCount) as avgFail stdev(FailCount) as stdevFail max(FailCount) as maxFail
| eval Warning = avgFail + 2*stdevFail
| fields - avgFail stdevFail
The above will put a horizontal line on the timechart at the avg + 2 sd level (Warning), and another one at the maximum value found (maxFail). Any FailCount at or above the Warning line is something you can consider alerting on.
This would let all the records through if any of them are worth alerting on...
| where maxFail >= Warning
... or this would let through only the records that are themselves worth alerting on...
| where FailCount >= Warning
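Putting the pieces together for a scheduled alert, here is a rough sketch rather than something tested against your data. The 24-hour lookback and the 15-minute recency check are my assumptions, so adjust both to match how often the search runs. It keeps only recent 5-minute buckets that reach the Warning line, so the alert condition can simply be "number of results is greater than 0".

```
tag=authentication tag=failure earliest=-24h@h latest=now
| timechart span=5m count as FailCount
| eventstats avg(FailCount) as avgFail stdev(FailCount) as stdevFail
| eval Warning = avgFail + 2*stdevFail
| where FailCount >= Warning AND _time >= relative_time(now(), "-15m")
```

Keep in mind that the spike itself is included in avgFail and stdevFail, so a long outage pushes the Warning line up. A longer lookback dilutes that effect. The most recent bucket may also be only partly filled when the search runs.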