I am looking for suggestions as to how best to implement an alerting request made by my users.
Summary
A query is run to count the number of events. The time weighted difference (in percentage) between one period of the next will be used to trigger the alert if the threshold is met.
Query
I have a query which I am already using on an existing dashboard. I am using TSTATS to count events, then a REX to group the events based on the first 3 characters for a field (DRA).
| tstats prestats=t count where index="foo" sourcetype="bar*" DRA=C* by DRA, _time
| rex field=DRA "^(?<pfx>\D\S{2}).*"
| timechart span=5m count by pfx useother=f limit=0 usenull=f
The groups and the number of these 'groups' will vary, but the result will be similar to the below :
_time
C27
C31
C33
2022-10-12 13:00:00
116
2
70
2022-10-12 13:05:00
287
3
20
2022-10-12 13:10:00
383
6
45
2022-10-12 13:15:00
259
7
41
I suspect the maximum number of DRA codes that we will see will be 25, although I can break this up into different queries and play with some timing and priorities with the running of the searches.
Goal
The goal is to alert when any percentage changes from one period to the next by more than a set percentage. So, for example, in the above, I might want an alert at 13:05 that 'C33' had changed by ~72% from the previous period.
I Have Tried
Using a mix of streamstats, eval and trendline statements, have the following which will alert for a single 'C' code.
| tstats count as count where index="foo" sourcetype="bar" DRA=C27* by _time span=5m
| timechart sum(count) as total_count span=5min
| streamstats current=f window=1 last(total_count) as prev_count
| eval percentage_errors=round(abs(prev_count/total_count)*100,1)
| fillnull value=000
| trendline wma5(percentage_errors) AS trend_percentage
| eval trend_percentage=round(trend_percentage,1)
| fillnull value=0 trend_percentage
| table _time, total_count, prev_count, percentage_errors, trend_percentage
Problems and Concerns
How can I modify my query to account for the variable nature of the DRA code - both in name (Cxx) and in the number of DRA codes returned? I have added 'by' clauses almost everywhere, but have not had success.
Each 5 minute period can see up to 70k events per DRA. Any thoughts on running all of the calculations across all extracted DRAs every 5 minutes?
Any suggestions or comments on my line of thinking are appreciated.
... View more