How do you monitor/alert for spikes of negative events?


Hey everyone. In the data we're ingesting (SIP call dialogs), we can monitor for negative termination causes. Some are expected; what concerns us is sudden spikes. For instance, a sudden jump in the number of 404 responses would indicate a problem, and we would want to be notified about it.

Right now the approach I'm rolling around in my head is to do a timechart and then run a delta against each count column. Something like this:

|timechart span=5m count(eval(SIP_CODE="404")) AS 404_COUNT count(eval(SIP_CODE="487")) AS 487_COUNT ....
|delta 404_COUNT
|delta 487_COUNT

But at this point I'm stuck on how to calculate the rolling average and standard deviation to decide whether a change is a "spike" or not. Statistically speaking, a delta more than 3 standard deviations from the mean would be significant and would probably indicate something weird (for roughly normal data, about 99.7% of values fall within 3 standard deviations of the mean).

I know other folks have worked on similar things, so I'm curious how you approached it.

A simple threshold won't work because our numbers grow and wane throughout the day as they're dependent on business hours.


Re: How do you monitor/alert for spikes of negative events?



Try this:

index=SIP earliest=-24h@m latest=@m
| bucket span=5m _time
| stats count as Today by SIP_CODE _time
| eval time=strptime(strftime(_time,"%H:%M"),"%H:%M")
| fields time SIP_CODE Today
| join time SIP_CODE [ search index=SIP earliest=-7d@d latest=-24h@m
      | bucket span=5m _time
      | stats count as simpleCount by SIP_CODE _time
      | eval timespan=strftime(_time,"%H:%M")
      | stats p99(simpleCount) as UpperBound by SIP_CODE timespan
      | eval time=strptime(timespan,"%H:%M")
      | fields time SIP_CODE UpperBound ]
| fieldformat time=strftime(time,"%H:%M")
| table time SIP_CODE Today UpperBound

This computes a count per SIP_CODE for every 5-minute bucket of the last 24 hours. The subsearch then computes, for each SIP_CODE and each 5-minute time-of-day slot, the 99th percentile of the counts seen in that slot over the previous week. Finally, it joins the two result sets and shows them in a table.
To show only the time periods and SIP_CODEs where today's count exceeds the 99th percentile, add the following line at the end of the search:

| where Today > UpperBound
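
If you would rather use the mean plus three standard deviations mentioned in the question instead of the 99th percentile, one possible variation is to change the subsearch along these lines. This is only a sketch that I have not run against your data; Mean and StdDev are field names I made up for illustration, and it assumes the counts in each time slot are roughly normally distributed.

```
index=SIP earliest=-7d@d latest=-24h@m
| bucket span=5m _time
| stats count as simpleCount by SIP_CODE _time
| eval timespan=strftime(_time,"%H:%M")
| stats avg(simpleCount) as Mean stdev(simpleCount) as StdDev by SIP_CODE timespan
| eval UpperBound=Mean+3*StdDev
| eval time=strptime(timespan,"%H:%M")
| fields time SIP_CODE UpperBound
```

The rest of the search stays the same. Keep in mind that with only about a week of history each slot has just a handful of samples, so the standard deviation will be noisy; a longer lookback would likely help.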

If you want to run this frequently, or use it as the basis of an alert, I would consider using a summary index to calculate and store the 5-minute counts by SIP_CODE... but that is another question. As is "how would you chart this?"
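
As a rough sketch of the summary index idea, a scheduled search could write the 5-minute counts like this. It assumes the search is scheduled every 5 minutes on the 5-minute boundary and that a summary index named sip_summary already exists; both the schedule and the index name are assumptions for illustration.

```
index=SIP earliest=-5m@m latest=@m
| bucket span=5m _time
| stats count by SIP_CODE _time
| collect index=sip_summary
```

The baseline subsearch could then read from index=sip_summary and use stats sum(count) as simpleCount by SIP_CODE _time instead of recounting the raw events.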

Also, here are two links that may be generally useful for this sort of thing:

And there are many similar, though not identical, questions here on Splunk Answers.