I am trying to create a search to generate an alert if I find a host that has more than 1000 events for two consecutive 10 minute periods.
The first search would look for a particular string to see if there are more than 1000 occurrences (by host) from 20 minutes ago to 10 minutes ago.
Then I want to see if that same host has more than 1000 events from 10 minutes ago to now.
Would I use two searches with the same base search (index=anIndex source=aSource "aString") but different lookbacks:
(earliest=-20m latest=-10m) and (earliest=-10m latest=now), and then appendcols?
Where this stumps me is how would I make sure that it's the same host from the first search that is also found in the second search?
Or is there a different / better approach for this type of comparison search?
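Roughly, this is the shape of what I was picturing (the field names are just placeholders I made up), though I suspect appendcols doesn't line the rows up by host, which is exactly the part I'm unsure about:
index=anIndex source=aSource "aString" earliest=-20m latest=-10m
| stats count as earlierCount by host
| where earlierCount>1000
| appendcols [ search index=anIndex source=aSource "aString" earliest=-10m latest=now
    | stats count as recentCount by host
    | where recentCount>1000 ]
As I understand it, appendcols just glues the subsearch results onto the main results row by row, so row 1 of one side isn't guaranteed to be the same host as row 1 of the other.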
You can use the bin command to make the 10-minute time buckets:
index=anIndex source=aSource "aString" earliest=-20m@m latest=@m
| bin _time span=10m aligntime=@m
| stats count by host _time
| where count>1000
| stats sum(count) as events count by host
| where count=2
Assuming you set the time range in the alert as above, the earliest/latest is not needed in the search itself.
Note the aligntime=@m, which snaps the bucket boundaries to the minute the search runs, so the two 10-minute buckets line up with the -20m@m to @m window rather than with Splunk's default bucket alignment.
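For example, with the alert's time range picker set to earliest=-20m@m and latest=@m, the saved search itself only needs to be:
index=anIndex source=aSource "aString"
| bin _time span=10m aligntime=@m
| stats count by host _time
| where count>1000
| stats sum(count) as events count by host
| where count=2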
This works too !!!
I like that the snap-to on earliest/latest uses whole 10-minute time periods (if that is the correct term?).
I compared the results from your solution to @Tom_Lundie's and got the same results, so both are valid solutions.
I can't accept two solutions, so how should I go about giving 'credit' to both of you?
It's up to you to accept whichever one you end up going with for your particular use case. Just give karma to the posts that were useful and accept the one solution.
Note that the alert should run every 10 minutes, so you catch the situation where you get 500, 2000, 2000, 500 events per 10 minutes. If you run your search every 20 minutes, you will miss the middle two.
Yes, "snap to" is the correct term. Whenever you make alerts or reports and are specifying time, avoid "now" unless you really want it, as it's nondeterministic: it depends a bit on when the scheduler actually runs the alert/report.
Also, there can often be a delay between an event occurring on a host and it being ingested and indexed by Splunk, so a common technique, in your example of two 10-minute windows, is to set the search to something like
earliest=-22m@m latest=-2m@m
which gives you a 2-minute buffer to allow for event lag.
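For example (untested), applying that lag to the earlier search would look something like this, with the aligntime shifted so the two 10-minute buckets still line up with the new window:
index=anIndex source=aSource "aString" earliest=-22m@m latest=-2m@m
| bin _time span=10m aligntime=@m-2m
| stats count by host _time
| where count>1000
| stats sum(count) as events count by host
| where count=2
The alert itself would still be scheduled to run every 10 minutes.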
What about something like this?
index=anIndex source=aSource "aString" earliest=-20min latest=now
| eval time_range = if(_time > relative_time(now(), "-10min"), "lt_10_ago", "mt_10_ago")
| stats count by host time_range
| where count > 1000
| eventstats dc(time_range) as time_range_count by host
| where time_range_count = 2
This works. I modified the earliest/latest to match the other example so Splunk snaps to whole 10-minute buckets, if I understand how that works correctly.
index=anIndex source=aSource "aString" earliest=-20m@m latest=@m
I am comparing the results of your solution to the one provided by @bowesmana.
In your solution I like that I see two results, one for each 10-minute bucket, with a column for each range: '20 to 10 mins' and '10 to 0 mins'.
I ran both queries to validate the results and both return the same results.
Both are valid solutions. How do I give 'credit' to both of you !!!
Also, I know you mentioned in the OP that this was for an alert, but if you did want to keep and display each individual window for context, here is a hybrid solution from both of our answers!
index=anIndex source=aSource "aString" earliest=-20m@m latest=@m
| bin _time span=10m aligntime=@m
| eval series = strftime(_time, "%H:%M:%S")." - ".strftime(_time + 600, "%H:%M:%S")
| stats count by host, series
| where count > 1000
| eventstats dc(series) as time_ranges by host
| where time_ranges=2
| appendpipe [| stats sum(count) as count by host | eval series="Total"]
| chart values(count) over host by series
I would add to this that there's no need to calculate series before the initial stats, as you can calculate it on a much smaller number of events after the stats.
Just to give you more examples to play with...
index=anIndex source=aSource "aString" earliest=-20m@m latest=@m
| bin _time span=10m aligntime=@m
| stats count by host _time
| where count>1000
| eval series = strftime(_time, "%H:%M:%S")." - ".strftime(_time + 600, "%H:%M:%S")
| stats list(series) as series list(count) as eventsPerSeries sum(count) as totalEvents count by host
| where count=2
| fields - count
I was looking at the data being returned for this version and have a question.
I removed "| where count=2" and in my results I then see 3 series. Here is an example of the data; I ran this query at 10:40 AM and received these results.
Time range / range total
10:10 - 10:20 / 438
10:20 - 10:30 / 1642
10:30 - 10:40 / 1474
Total = 3554
Since I am looking at 10-minute time buckets, I ran the query again at 10:50 AM and got these results. I expected the last row to move to the middle, the middle row to move to the first row, and a new row to appear as the last, but instead this is the data I received:
10:20 - 10:30 / 406
10:30 - 10:40 / 1860
10:40 - 10:50 / 1601
Total = 3867
Is this working as expected or am I misunderstanding something?
If the search window is -20m@m to @m, then it is a 20-minute range, so I don't see how you can get three 10-minute time periods in the output.
However, as far as those counters 'moving' in the data goes, that's probably the ingestion lag issue I talked about.
Have a look at this variant:
index=anIndex source=aSource "aString" earliest=-25m@m latest=-5m@m
| eval event_time=_time
| bin _time span=10m aligntime=@m
| stats min(event_time) as first_event max(event_time) as last_event count by host _time
| where count>1000
| eval series = strftime(_time, "%H:%M:%S")." - ".strftime(_time + 600, "%H:%M:%S")
| foreach *_event [ eval <<FIELD>>=strftime('<<FIELD>>', "%T") ]
| stats list(series) as series list(count) as eventsPerSeries list(*event) as *event sum(count) as totalEvents count by host
| where count>=2
| fields - count
This will search the -25 to -5 minute period and also record the first and last event seen for each host in the final output. It uses count>=2 at the end.
Haha, glad to hear it!
I think @bowesmana's use of bin and @m makes it a bit cleaner, especially seeing as you're happy with it...
Feel free to accept their solution; you're welcome to share the love with the karma button though 🙂