Alerting

How to create a real-time conditional alert by matching the results of events in 2 different rolling windows

New Member

Hi,
My scenario is that I have a set of commands, and for each command I have the total hits and total failures in the last 30 minutes.
Let's say Command A got 100 hits in the last 30 minutes and 30 of them failed. I now want to check the total hits and total failures for the same command in the previous 30 minutes. If I see the same pattern there, I want to check the 30 minutes before that, and if I see the same kind of failure % again, I want to trigger an alert.

How can I do this in Splunk?


SplunkTrust

Okay, first, if you're looking at 30m increments, you are probably not looking for a real-time search. How fast will the person have to respond? What is the actual SLA? If they don't have to respond to an alert within 5m, then you want a scheduled search.

Second, is your 30 minute window a rolling window, or a fixed window?

It's expensive to go back and do things a second or third time. Just get all the data at the same time. For what you describe, I would tend to do this:

 your search that gets the events for the last 90 minutes

| rename COMMENT as "divide up the three time periods"
| addinfo 
| eval timeframe= ceiling((_time - info_min_time)/1800)

| rename COMMENT as "set up all the fields you need to stats the three periods"
| eval command = (whatever the command was)
| eval errorMessage = coalesce( whatever the error message was, "(NONE)")
| stats count as totalCount  by command errorMessage timeframe 

Now you have records for each combination of time period, command and error message, with "(NONE)" for records with no errors.
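
If you want to sanity-check the bucket arithmetic outside Splunk, here is a rough Python sketch of what the timeframe eval and the stats step do. The event tuples and the info_min_time value are made up for illustration; only the 1800-second window and the "(NONE)" placeholder come from the search above.

```python
import math
from collections import Counter

WINDOW_SECONDS = 1800  # 30 minutes, as in the eval above


def timeframe(event_time, info_min_time):
    """Mirror ceiling((_time - info_min_time)/1800): 1 = oldest window, 3 = newest."""
    return math.ceil((event_time - info_min_time) / WINDOW_SECONDS)


# Hypothetical events: (epoch seconds, command, error message or None)
info_min_time = 1_000_000
events = [
    (info_min_time + 60, "A", None),
    (info_min_time + 1700, "A", "timeout"),
    (info_min_time + 1801, "A", "timeout"),
    (info_min_time + 5399, "A", None),
]

# Rough equivalent of: stats count as totalCount by command errorMessage timeframe
total_count = Counter(
    (command, error or "(NONE)", timeframe(t, info_min_time))
    for t, command, error in events
)
```

One thing the sketch makes visible: an event whose time is exactly info_min_time lands in timeframe 0 rather than 1, so you may want to decide how to treat that edge.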

| rename COMMENT as "find total of records for each command for each timeframe "
| eventstats sum(totalCount) as commandcount by command timeframe  

| rename COMMENT as "set the  _time to the end of the three time periods"
| eval _time=info_min_time + 1800*timeframe

Now you can look at the absolute number and/or percentage of errors in each timeframe that are not "(NONE)" and see whether you have a consistent error condition. One way would be to do this.

| eval errorpercent= totalCount /commandcount 
| eventstats min(errorpercent)  as minpercent max(errorpercent)  as maxpercent  by command
| where ... minpercent and maxpercent match some criteria you set.
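
As one possible reading of "some criteria you set", here is a small Python sketch of the final comparison. The threshold and tolerance values, the function name, and the sample counts are all assumptions for illustration, not anything Splunk prescribes.

```python
def consistent_failure(rates, threshold=0.25, tolerance=0.10):
    """rates: failure fractions for the three windows, oldest first.

    Alert only if every window is at or above the threshold (assumed 25%)
    and the spread between best and worst window is small (assumed 10 points).
    """
    if len(rates) < 3:
        # A window with no failures produces no row, so treat it as "not consistent".
        return False
    return min(rates) >= threshold and (max(rates) - min(rates)) <= tolerance


# Hypothetical per-window (failures, hits) for command A
windows = [(30, 100), (28, 95), (33, 110)]
rates = [failures / hits for failures, hits in windows]
alert = consistent_failure(rates)
```

The length check matters because a timeframe without any errors will not appear in the min/max at all, which could otherwise make an intermittent problem look consistent.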

Communicator

One thing you could try is to apply a time-based window of 30m to streamstats and build your alert condition based on that.
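
To show the idea of a trailing time window, here is a minimal Python sketch of a rolling 30-minute failure rate computed per event. It is only an approximation of the concept: the function name and sample events are hypothetical, and the exact boundary behaviour of streamstats may differ from what is assumed here.

```python
from collections import deque


def rolling_failure_rates(events, window_seconds=1800):
    """events: (epoch_seconds, failed) pairs sorted ascending by time.

    Returns one (time, failure_rate) pair per event, where the rate covers
    the events in the trailing window ending at that event.
    """
    window = deque()
    failures = 0
    out = []
    for t, failed in events:
        window.append((t, failed))
        failures += int(failed)
        # Evict events that have aged out of the trailing window (boundary is an assumption).
        while window[0][0] <= t - window_seconds:
            _, old_failed = window.popleft()
            failures -= int(old_failed)
        out.append((t, failures / len(window)))
    return out


# Hypothetical events for one command
sample = [(0, True), (600, False), (1200, True), (2000, False)]
rates = rolling_failure_rates(sample)
```

An alert condition could then compare the rate at the latest event with the rates 30 and 60 minutes earlier.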
