Alerting

Create a Splunk Alert that Evaluates Two Conditions

victorcorrea
Explorer

Hi Splunk Community,

I need to create an alert that is only triggered when two conditions are met. In fact, the conditions are layered:

  1. Search results are >3 in a 5-minute interval.
  2. Condition 1 is true 3 times over a 15-minute interval.

My plan was to create 3 sub-searches within the search, output each result into a "counter" field, and then run a search to check whether the "counter" values are >3:

index=foo mal_code="foo" source="foo.log"
| search "{\\\"status\\\":{\\\"serverStatusCode\\\":\\\"500\\\"" earliest=-5m@m latest=now
| stats count as event_count1
| search "{\\\"status\\\":{\\\"serverStatusCode\\\":\\\"500\\\"" earliest=-10m@m latest=-5m@m
| stats count as event_count2
| search "{\\\"status\\\":{\\\"serverStatusCode\\\":\\\"500\\\"" earliest=-15m@m latest=-10m@m
| stats count as event_count3
| search event_count*>0
| stats count as result


I am not sure my time modifiers are working correctly; in any case, I am not getting the results I expected.

I would appreciate some advice on how to go about this.


victorcorrea
Explorer

An update to my original question:


I managed to build a query that runs the 3 searches I need within the defined timeframes and validated that the results are good:

index=foo earliest=-5m@m latest=now
| search "Waiting"
| stats count as counter1
| append [
    | search "Waiting" index=its-em-pbus3-app earliest=-10m@m latest=-5m@m
    | stats count as counter2 ]
| append [
    | search "Waiting" index=its-em-pbus3-app earliest=-15m@m latest=-10m@m
    | stats count as counter3 ]


I have displayed the results from each query in a table and compared them against searches for the same timeframes to confirm that the values matched.

So that's part 1 dealt with.

Now I'm trying to figure out a way to generate a result from this query that indicates that each of the 3 counters is >= 3.

I tried using "case" to check each value individually and assign a value to a "results" field using eval:

| eval results = case(counter1 >= 3 AND counter2 >= 3 AND counter3 >= 3, "true")

 
My goal was to be able to search for the "results" field value to determine if my conditions were met, but no dice.
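
My suspicion is that, because each append produces its own result row, counter1, counter2 and counter3 never end up in the same row, so the case() has nothing to compare. If I stick with the append approach, an untested sketch like this might collapse the rows into one before the eval:

index=foo earliest=-5m@m latest=now "Waiting"
| stats count as counter1
| append [ search "Waiting" index=its-em-pbus3-app earliest=-10m@m latest=-5m@m | stats count as counter2 ]
| append [ search "Waiting" index=its-em-pbus3-app earliest=-15m@m latest=-10m@m | stats count as counter3 ]
| stats max(counter1) as counter1 max(counter2) as counter2 max(counter3) as counter3
| eval results = if(counter1 >= 3 AND counter2 >= 3 AND counter3 >= 3, "true", "false")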


ITWhisperer
SplunkTrust

Try something like this

index=foo earliest=-15m@m latest=@m
| bin _time span=5m aligntime=latest
| stats count by _time

bowesmana
SplunkTrust

You can probably make that initial search much faster without append, which you should try to avoid, as most of the time there is an alternative way to achieve the same result.

Try this

(index=foo earliest=-5m@m latest=@m) OR 
(index=its-em-pbus3-app earliest=-15m@m latest=-5m@m)
"Waiting" 
| bin _time span=5m aligntime=@m
| stats count by _time
| stats sum(eval(if(count>=3,1,0))) as AllOverThreshold

Your search is effectively looking for "Waiting" in 2 different indexes across 3 different time ranges. If you make the first range -5m@m to @m rather than latest=now, you can count the results by _time, where you would expect 3+ per 5-minute window.

The second stats creates a new field called AllOverThreshold, which will have the value 3 if all three window counts are at least 3.

Then you can simply use a where clause to say

| where AllOverThreshold=3

Then your alert will have no results unless all three counts are >= 3.

NB: If you use latest=now in the first query, you will get 4 rows of data, with the last covering the seconds from @m to now, and that row may or may not have results.

 

victorcorrea
Explorer

Thank you bowesmana,

That makes sense and simplifies the query significantly.

I added two different indexes by mistake when I wrote my reply; I only need to search a single index.

The only thing that is still not clear to me is the values I need to use for "earliest" and "latest".

So, I grabbed your query and set "latest" to "@m":

index=cts-ep-app earliest=-15m@m latest=@m "Waiting"
| bin _time span=5m aligntime=@m
| stats count by _time
| stats sum(eval(if(count>=3,1,0))) as AllOverThreshold
| where AllOverThreshold=3

 

And that seems to have done the trick.

Many thanks.

Appreciate you taking the time to chime in.


bowesmana
SplunkTrust

@victorcorrea Have a look at the time modifiers for the concept of 'snap to', which is the @ component of a time constraint. For example, if the search runs at 07:03:20, -5m@m means 06:58:00: go back 5 minutes to 06:58:20, then snap down to the start of the minute.

Generally with an alert, it is a good idea to understand whether there is any "lag" between data being generated by a source and that data arriving and being indexed in Splunk.

Consider an event generated at 6:59:58 by a system, which is sent to Splunk at 7:00:02 and is indexed at 7:00:03. 

If your alert runs at 7am and searches earliest=-5m@m latest=@m, the event with a timestamp of 6:59:58 will not yet be indexed in Splunk, so it will not be found by your alert. If this is one of your "Waiting" events, your alert may see a count of 2 and not fire, but if you look at that data later, you will find the count is actually 3, because that latest event is now in the index.

So, consider whether this is an issue for your alert - you can discover lag by doing

index=foo
| eval lag=_indextime-_time
| stats avg(lag)

If lag is significant, then shift your 5-minute time windows back far enough that you do not miss events.
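
For example, and this is just a sketch based on your cts-ep-app search with a made-up allowance of one minute of lag, you could pull every boundary back by a minute so the newest window has had time to be indexed:

index=cts-ep-app earliest=-16m@m latest=-1m@m "Waiting"
| bin _time span=5m aligntime=-1m@m
| stats count by _time
| stats sum(eval(if(count>=3,1,0))) as AllOverThreshold
| where AllOverThreshold=3

The aligntime=-1m@m keeps the 5-minute bins lined up with the shifted earliest/latest, so you still get exactly three full windows.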


victorcorrea
Explorer

Right now I am creating the alerts in our DEV environment and the lag is negligible, but that will definitely be something I'll keep in mind when we promote it to PAT and PROD.

Ultimately, the goal of the alert is to catch a trend across 5-minute intervals for a specific error code. If there are occasional spikes that are not sustained (say, in the first 5 minutes the count is 6 but in the second and third the count is 1 or 0), we don't want the alert to be triggered.
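
As a quick sanity check of that spike scenario, something like this throwaway makeresults sketch (made-up counts of 6, 1 and 0 standing in for the three windows) should return nothing, which is the behaviour we want:

| makeresults count=3
| streamstats count as row
| eval count=case(row=1, 6, row=2, 1, row=3, 0)
| stats sum(eval(if(count>=3,1,0))) as AllOverThreshold
| where AllOverThreshold=3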

But I'll be running some tests with the application folks by injecting the signatures into the logs at various rates, so I'll be able to determine if I'll have to shift the time windows.

Thanks again, Bowesmana.

You've been very helpful and I appreciate you sharing your knowledge.

