Hi Splunk Community,
I need to create an alert that only gets triggered if two conditions are met, and the conditions are layered: a specific error has to show up at least 3 times within a 5-minute window, and that has to hold for three consecutive 5-minute windows.
I thought I would create 3 sub-searches within the search, output the result of each into a "counter", and then run a check to see whether all the "counter" values are >= 3:
index=foo mal_code="foo" source="foo.log"
| search "{\\\"status\\\":{\\\"serverStatusCode\\\":\\\"500\\\"" earliest=-5m@m latest=now
| stats count as event_count1
| search "{\\\"status\\\":{\\\"serverStatusCode\\\":\\\"500\\\"" earliest=-10m@m latest=-5m@m
| stats count as event_count2
| search "{\\\"status\\\":{\\\"serverStatusCode\\\":\\\"500\\\"" earliest=-15m@m latest=-10m@m
| stats count as event_count3
| search event_count*>0
| stats count as result
I am not sure my time modifiers are working correctly; in any case, I am not getting the results I expected.
I would appreciate some advice on how to go about this.
An update to my original question:
I managed to build a query that runs the 3 searches I need within the defined timeframes and validated that the results are good:
index=foo earliest=-5m@m latest=now
| search "Waiting"
| stats count as counter1
| append [
| search "Waiting" index=its-em-pbus3-app earliest=-10m@m latest=-5m@m
| stats count as counter2 ]
| append [
| search "Waiting" index=its-em-pbus3-app earliest=-15m@m latest=-10m@m
| stats count as counter3 ]
I have displayed the results from each query in a table and compared them against searches for the same timeframes to confirm that the values matched.
So that's part 1 dealt with.
Now I'm trying to figure out a way to generate a result from this query that indicates whether each of the 3 counts is >= 3.
I tried using "case" to check each value individually and assign a value to a "results" field using eval:
| eval results = case(counter1 >= 3 AND counter2 >= 3 AND counter3 >= 3, "true")
My goal was to be able to search for the "results" field value to determine if my conditions were met, but no dice.
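My suspicion is that append leaves counter1, counter2 and counter3 on three separate rows, so a single eval never sees all of them at once. Something along these lines (untested, just a sketch reusing the same field names) might pull them onto one row before the check:
| stats max(counter1) as counter1 max(counter2) as counter2 max(counter3) as counter3
| eval results = if(counter1 >= 3 AND counter2 >= 3 AND counter3 >= 3, "true", "false")
But I'm not sure that's the cleanest way to do it.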
Try something like this
index=foo earliest=-15m@m latest=@m
| bin _time span=5m aligntime=latest
| stats count by _time
You can probably make that initial search much faster without using append, which you should try to avoid, as most of the time you can achieve the same thing another way.
Try this
(index=foo earliest=-5m@m latest=@m) OR
(index=its-em-pbus3-app earliest=-15m@m latest=-5m@m)
"Waiting"
| bin _time span=5m aligntime=@m
| stats count by _time
| stats sum(eval(if(count>=3,1,0))) as AllOverThreshold
Your search is effectively looking for Waiting in 2 different indexes across 3 different time ranges - if you make the first range -5m@m to @m rather than "now", you can count the results by _time and you would expect 3+ per time window.
The second stats just creates a new field called AllOverThreshold that will have the value 3 if all three counts are at least 3.
Then you can simply use a where clause to say
| where AllOverThreshold=3
Then your alert will only return a result when all counters are >= 3.
NB: If you use latest=now in the first query, you will get 4 rows of data, with the last covering the seconds from @m to now, and that partial bucket may or may not contain results.
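If you do want to run with latest=now, one option (just a sketch, not tested against your data) is to drop that partial bucket before doing the threshold check:
(index=foo earliest=-15m@m latest=now) "Waiting"
| bin _time span=5m aligntime=@m
| stats count by _time
| where _time < relative_time(now(), "@m")
| stats sum(eval(if(count>=3,1,0))) as AllOverThreshold
| where AllOverThreshold=3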
Thank you Bowesmana,
That makes sense and simplifies the query significantly.
I added two different indexes by mistake in my reply; I only need to search a single index.
The only thing that is still not quite clear to me is the values to use for "earliest" and "latest".
So, I took your query and set "latest" to "@m":
index=cts-ep-app earliest=-15m@m latest=@m "Waiting"
| bin _time span=5m aligntime=@m
| stats count by _time
| stats sum(eval(if(count>=3,1,0))) as AllOverThreshold
| where AllOverThreshold=3
And that seems to have done the trick.
Many thanks.
Appreciate you taking the time to chime in.
@victorcorrea Have a look at the time modifiers for the concept of 'snap to', which is the @ component of a time constraint.
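If it helps to see exactly what the snap-to resolves to, a quick throwaway search like this (purely illustrative) will print the window boundaries:
| makeresults
| eval window_start=strftime(relative_time(now(), "-5m@m"), "%Y-%m-%d %H:%M:%S")
| eval window_end=strftime(relative_time(now(), "@m"), "%Y-%m-%d %H:%M:%S")
Whatever second the search actually runs at, earliest=-5m@m latest=@m will always give you whole minutes.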
Generally with an alert, it is a good idea to understand whether you have any "lag" in data being generated by a source and then arriving and being indexed in Splunk.
Consider an event generated at 6:59:58 by a system, which is sent to Splunk at 7:00:02 and is indexed at 7:00:03.
If your alert runs at 7am and searches earliest=-5m@m latest=@m, the event with a timestamp of 6:59:58 will not yet be indexed in Splunk, so it will not be found by your alert. If it is one of your "Waiting" events, your alert may see a count of only 2 and not fire, but if you look at that data later you will find the count is actually 3, because the late event is now in the index.
So, consider whether this is an issue for your alert - you can discover lag by doing
index=foo
| eval lag=_indextime-_time
| stats avg(lag)
If lag is significant, then shift your 5-minute time windows back far enough that you do not miss events.
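For example, if the lag turns out to be noticeable but well under five minutes, the simplest option is to shift the whole search back by one bucket - something like this (same logic as above, just delayed by 5 minutes):
index=foo earliest=-20m@m latest=-5m@m "Waiting"
| bin _time span=5m aligntime=@m
| stats count by _time
| stats sum(eval(if(count>=3,1,0))) as AllOverThreshold
| where AllOverThreshold=3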
Right now I am creating the alerts in our DEV environment and the lag is negligible, but that will definitely be something I'll keep in mind once we promote it to PAT and PROD.
Ultimately, the goal of the alert is to catch a trend in 5-minute intervals for a specific error code. If there are occasional spikes that are not sustained (say, in the first 5 minutes the count is 6 but in the second and third the count is 1 or 0) we don't want the alert to be triggered.
But I'll be running some tests with the application folks by injecting the signatures into the logs at various rates, so I'll be able to determine if I'll have to shift the time windows.
Thanks again, Bowesmana.
You've been very helpful and I appreciate you sharing your knowledge.