Alerting

How to create a real-time conditional alert by comparing event results across two different rolling windows

New Member

Hi,
My scenario is that I have a set of commands, and for each command I have the total hits and total failures over the last 30 minutes.
Say Command A had 100 hits in the last 30 minutes and 30 of them failed. I then want to check the same command's total hits and total failures for the previous 30 minutes, and if the failure rate is similar, check the 30 minutes before that. If all the windows show a similar failure percentage, I want to trigger an alert.

How can I do this in Splunk?


SplunkTrust

Okay, first, if you're looking at 30-minute increments, you are probably not looking for a real-time search. How fast will the person have to respond? What is the actual SLA? If they don't have to respond to an alert within 5 minutes, then you want a scheduled search.

Second, is your 30 minute window a rolling window, or a fixed window?

It's expensive to go back and run things a second or third time. Just get all the data at once. What I would tend to do for what you described is this -

 your search that gets the events for the last 90 minutes

| rename COMMENT as "divide up the three time periods"
| addinfo 
| eval timeframe= ceiling((_time - info_min_time)/1800)

| rename COMMENT as "set up all the fields you need to stats the three periods"
| eval command = (whatever the command was)
| eval errorMessage = coalesce((whatever the error message was), "(NONE)")
| stats count as totalCount  by command errorMessage timeframe 

Now you have records for each combination of time period, command and error message, with "(NONE)" for records with no errors.
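For instance (these numbers and error names are made up purely for illustration), the stats output at this point might look like:

command   errorMessage   timeframe   totalCount
A         (NONE)         1           70
A         timeout        1           30
A         (NONE)         2           68
A         timeout        2           32

Each row is one command/error/half-hour combination, which is what the later eventstats commands roll back up.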

| rename COMMENT as "find total of records for each command for each timeframe "
| eventstats sum(totalCount) as commandcount by command timeframe

| rename COMMENT as "set the _time to the end of the three time periods"
| eval _time = info_min_time + 1800*timeframe

Now you can look at the absolute number and/or percentage of errors in each timeframe that are not "(NONE)" and see whether you have a consistent error condition. One way would be to do this.

| eval errorpercent = totalCount/commandcount
| eventstats min(errorpercent) as minpercent max(errorpercent) as maxpercent by command
| where ... minpercent and maxpercent match some criteria you set

Communicator

One thing you could try is to apply a time-based window of 30m to streamstats and build your alert condition based on that.
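A minimal sketch of that idea (the field names command and errorMessage are assumptions - substitute your own; note that streamstats time_window requires the events to be in time order, which the default search order already provides):

 your search over the last 90 minutes
| eval is_error = if(isnotnull(errorMessage), 1, 0)
| streamstats time_window=30m count as hits sum(is_error) as failures by command
| eval errorpercent = failures/hits

Each event then carries the rolling 30-minute hit count and failure rate for its command, and you can alert when errorpercent stays above a threshold.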
