We have a requirement to provide guaranteed alerting.
We would like to understand what our options are, especially given the constraint of data lag.
If we have several hosts sending application data and there is a surge of log activity for several hours (let's say 10x the norm), the universal forwarders will start to limit bandwidth so as not to affect host performance, and we end up with lag between the time a log entry is written and the time it is indexed. Let's say a 1-hour lag.
How can we best provide guaranteed (and, where possible, timely) alerting?
Having real-time alerting is not a requirement, but guaranteed alerting of a critical event is.
1. Anticipating that up to an hour of lag can occur, always search from -75m to -60m (i.e., 1 hour in the past), scheduled every 15 minutes.
2. Real-time searching: rt to rt.
3. Are there any other options?
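To make the window arithmetic in option 1 concrete, here is a minimal sketch in plain Python (not Splunk code). The function name `lagged_window`, the default values, and the example date are assumptions chosen for illustration. It shows that a 15-minute schedule with a 60-minute offset produces back-to-back windows with no gaps or overlap.

```python
from datetime import datetime, timedelta


def lagged_window(run_time, max_lag=timedelta(minutes=60), period=timedelta(minutes=15)):
    """Return the (earliest, latest) event-time window for one scheduled run.

    The window is shifted back by max_lag so that events delayed by up to
    max_lag have had a chance to be indexed before the search looks for them.
    """
    latest = run_time - max_lag      # corresponds to -60m
    earliest = latest - period       # corresponds to -75m
    return earliest, latest


if __name__ == "__main__":
    first_run = datetime(2024, 1, 1, 12, 0)  # hypothetical start time
    for i in range(3):
        run = first_run + i * timedelta(minutes=15)
        earliest, latest = lagged_window(run)
        print(f"run at {run:%H:%M} searches {earliest:%H:%M} to {latest:%H:%M}")
```

The trade-off is visible in the arithmetic: every alert is delayed by at least the assumed maximum lag, and any event that arrives later than that is still missed.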
Well, "guaranteed" is very difficult from a rigorous mathematical or computer science perspective (that is, how can I prove that alerts will always fire?). So you have to define what you mean by "guaranteed". Is it "against most likely scenarios" or "no matter what"?
Against most likely scenarios, I would recommend pushing the forwarder throughput throttle all the way open (the maxKBps setting in the [thruput] stanza of limits.conf on the forwarder, where 0 means unlimited). A properly designed deployment should be able to handle forwarders sending more than their normal volume for a short time, ideally without greatly impacting host performance. (This assumes your hosts are modern and have ample idle CPU and network bandwidth to deal with the burst.)
I would also suggest alerting on forwarder lag and on event volumes. If these are your likely "will break" scenarios, then seeing your lag double or your volumes go up by 50% is a hint that something is wrong and needs checking.
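As a rough illustration of that kind of check, here is a minimal sketch in plain Python (not Splunk code). In Splunk terms the lag of an event is roughly `_indextime - _time`; the function names, the baseline inputs, and the default factors (2x lag, 1.5x volume, taken from the figures above) are assumptions for illustration only.

```python
def lag_seconds(event_time, index_time):
    """Lag is how long after the event occurred it was indexed (epoch seconds)."""
    return index_time - event_time


def health_warnings(baseline_lag, current_lag, baseline_volume, current_volume,
                    lag_factor=2.0, volume_factor=1.5):
    """Return a list naming which early-warning thresholds were crossed.

    "lag" is reported when the current lag is at least lag_factor times the
    baseline; "volume" when the current event volume is at least
    volume_factor times the baseline.
    """
    warnings = []
    if baseline_lag > 0 and current_lag >= lag_factor * baseline_lag:
        warnings.append("lag")
    if baseline_volume > 0 and current_volume >= volume_factor * baseline_volume:
        warnings.append("volume")
    return warnings


if __name__ == "__main__":
    # Hypothetical numbers: lag went from 30s to 70s, volume from 1000 to 1200 events.
    print(health_warnings(30, 70, 1000, 1200))
```

How the baseline is chosen (for example, the same hour on a previous day) is left open here and would need to fit your own traffic patterns.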
These are, however, just hedges against likely scenarios. If you are doing stock trading, for example, and a system failure could cause you to lose $500 million in minutes and almost go out of business, then "guaranteed" takes on a whole new meaning.
There are other things to consider beyond the alert itself. For example, an alert posted by email is hardly guaranteed. How does someone acknowledge that they received the alert and are acting on it?
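One common way to close that loop is to require an explicit acknowledgment and escalate when none arrives. The following is a minimal sketch in plain Python under that assumption; the `Alert` class, the `next_action` function, and the 15-minute timeout are all hypothetical and not part of any Splunk feature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    raised_at: float                         # epoch seconds when the alert fired
    acknowledged_at: Optional[float] = None  # set when a person confirms receipt


def next_action(alert, now, ack_timeout=900.0):
    """Decide what to do with an alert at time `now` (epoch seconds).

    Returns "none" once acknowledged, "escalate" if it has gone
    unacknowledged for at least ack_timeout seconds, otherwise "wait".
    """
    if alert.acknowledged_at is not None:
        return "none"
    if now - alert.raised_at >= ack_timeout:
        return "escalate"
    return "wait"


if __name__ == "__main__":
    alert = Alert(raised_at=0.0)
    print(next_action(alert, now=300.0))   # still inside the timeout
    print(next_action(alert, now=1000.0))  # past the timeout, unacknowledged
```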
You need to fully understand the "guaranteed" requirement. If the need is a highly robust guarantee, then I would recommend discussing it with Splunk Professional Services to help figure out all the pieces here.