I have a system that monitors a set of devices and outputs an alarm message every time one of those devices changes state. Possible states are UP and DOWN.
The system sends the messages to Splunk and I need to send email alerts based on certain conditions:
1) If DOWN is received and not followed by any other message in the next 5 minutes: send STATE CHANGED TO DOWN email alert
2) If UP is received and not followed by any other message in the next 5 minutes: send STATE CHANGED TO UP email alert
3) If DOWN or UP is received, followed by any other state change message in the next 5 minutes: do not send any email alerts (considered a false alarm)
I have tried implementing a search that "looks back" over the last 5 minutes, scheduled every minute, to find lone state change events. This fulfills conditions (1) and (2), but the logic falls apart on condition (3), as the last alarm in an event chain (e.g. UP-DOWN-UP-DOWN within seconds of each other) will always be sent as an alert.
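To pin down the intended behaviour outside Splunk, here is a minimal Python sketch (not SPL) of the rule. It rests on an assumption drawn from the event-chain example above: an event should alert only if it is isolated, meaning no other message for the same device arrives within 5 minutes before or after it. The function name `isolated_events`, the tuple layout, and the `now` parameter are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict

QUIET_SECONDS = 300  # the 5-minute window from the conditions above


def isolated_events(events, now, quiet=QUIET_SECONDS):
    """Return (device, state, time) for events with no neighbour within `quiet` seconds.

    events: iterable of (epoch_seconds, device, state) tuples (hypothetical layout).
    now:    current epoch time; an event younger than `quiet` is still undecided.
    """
    by_device = defaultdict(list)
    for t, device, state in events:
        by_device[device].append((t, state))

    alerts = []
    for device, items in by_device.items():
        items.sort()  # order each device's messages by time
        for i, (t, state) in enumerate(items):
            # Assumption: a message that closely follows another one is part
            # of a flapping chain, so it is suppressed as well.
            quiet_before = i == 0 or (t - items[i - 1][0]) > quiet
            if i + 1 < len(items):
                quiet_after = (items[i + 1][0] - t) > quiet
            else:
                # No later message yet: wait until the full window has elapsed.
                quiet_after = (now - t) >= quiet
            if quiet_before and quiet_after:
                alerts.append((device, state, t))
    return alerts
```

With a chain such as UP-DOWN-UP seconds apart, none of the three messages is returned, including the last one, which is the case the look-back search gets wrong. If the last message of a chain should alert after all, dropping the `quiet_before` check gives that variant.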
Does anyone have any ideas how I can achieve this in Splunk?
You can try this ...
(your search that returns up or down)
| table _time Device State
| bin _time span=10s
| stats values(State) as State by _time Device
| rename COMMENT as "The above groups all signals into 10-second intervals with either up, down or both."
| rename COMMENT as "If an interval has both, mark it as mixed."
| eval State=if(mvcount(State)>1,"mixed",State)
| rename COMMENT as "Calculate how long in seconds a steady state has been maintained."
| sort 0 Device _time
| streamstats window=60 global=f first(_time) as Start_Of_State by Device, State
| eval StateDuration=_time - Start_Of_State
| rename COMMENT as "Report the first record when a steady state has been maintained for exactly 5 minutes."
| where StateDuration = 300 AND State!="mixed"
| head 1
Changed to 10s intervals to process fewer records, sorted by Device, and computed StateDuration in a separate, named step to make it obvious what we are doing.
Hi DalJeanis, I tried your suggestion earlier. It works by dividing the data into discrete 10s time bins, but unfortunately this doesn't work for high-volume data like mine.
For example, suppose I receive DOWN at 10.49 and UP at 10.51. I understand your idea is to put them in a single bin and mark it as a "mixed" state that will be filtered out by the final 'where' command, but 'bin' will actually separate these two events into the 10.40 and 10.50 bins.
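The boundary effect can be reproduced outside Splunk with a small Python sketch (not SPL). It assumes that 10.49 and 10.51 read as minutes.seconds, that binning floors each timestamp to a multiple of the span, and it uses hypothetical helper names (`bin_start`, `same_bin`, `within_gap`).

```python
def bin_start(t, span=10):
    """Floor a timestamp in seconds to the start of its fixed-width bin."""
    return t - (t % span)


def same_bin(t1, t2, span=10):
    """True if both timestamps fall into the same fixed-width bin."""
    return bin_start(t1, span) == bin_start(t2, span)


def within_gap(t1, t2, max_gap=10):
    """True if the two timestamps are at most `max_gap` seconds apart."""
    return abs(t2 - t1) <= max_gap


# Assumption: 10.49 and 10.51 are minutes.seconds, expressed here in seconds.
down_t = 10 * 60 + 49
up_t = 10 * 60 + 51

print(bin_start(down_t), bin_start(up_t))  # 640 650
print(same_bin(down_t, up_t))              # False
print(within_gap(down_t, up_t))            # True
```

The two events are only 2 seconds apart but straddle a bin edge, so a fixed-bin grouping never sees them together. Comparing each event with its neighbours by elapsed time avoids the dependence on where the bin edges happen to fall.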