We have a process that produces 8,000 requests per second, which are consumed by a server. We average only about 2 timeout events per second. A few times per day, however, the timeout rate will spike to around 1,000 timeouts for no more than a second or two. We don't care about these spikes, but we need to know as quickly as possible when the consuming service is down. How can I "smooth" the event count so that we ignore the spikes but are notified within about 5 minutes of an outage? I'm thinking of something like a Kalman filter (I'm not a mathematician) that acts on the past 5 minutes of data and runs every 5 minutes. A plain average won't do the trick because it can't tell the difference between real performance degradation and a momentary spike. It doesn't seem like the predictive functions native to Splunk would work right out of the box. Any other ideas? Thanks.
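
For what it's worth, here is a rough sketch of the kind of search I have in mind (the index, sourcetype, and threshold here are made up, and I'm not sure a rolling median is the right statistic). The idea is to count timeouts in 10-second buckets and look at the median of the last 30 buckets: a one- or two-second spike only lands in a single bucket, so the median stays low, while a sustained outage raises most of the buckets and pushes the median up within a few minutes.

index=my_app sourcetype=consumer_timeouts
| timechart span=10s count as timeouts
| streamstats window=30 median(timeouts) as med_timeouts
| where med_timeouts > 50

Is something along these lines reasonable, or is there a better built-in approach?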