Solved: Anyone have suggestions? - Alerting with Calculati...

lennys26 · ‎10-13-2022

I am looking for suggestions as to how best to implement an alerting request made by my users.

Summary

A query is run to count the number of events. The time weighted difference (in percentage) between one period of the next will be used to trigger the alert if the threshold is met.

Query

I have a query which I am already using on an existing dashboard. I am using TSTATS to count events, then a REX to group the events based on the first 3 characters for a field (DRA).

| tstats prestats=t count where index="foo" sourcetype="bar*" DRA=C* by DRA, _time 
| rex field=DRA "^(?<pfx>\D\S{2}).*" 
| timechart span=5m count by pfx useother=f limit=0 usenull=f

The groups and the number of these 'groups' will vary, but the result will be similar to the below :

_time	C27	C31	C33
2022-10-12 13:00:00	116	2	70
2022-10-12 13:05:00	287	3	20
2022-10-12 13:10:00	383	6	45
2022-10-12 13:15:00	259	7	41

I suspect the maximum number of DRA codes that we will see will be 25, although I can break this up into different queries and play with some timing and priorities with the running of the searches.

Goal

The goal is to alert when any percentage changes from one period to the next by more than a set percentage. So, for example, in the above, I might want an alert at 13:05 that 'C33' had changed by ~72% from the previous period.

I Have Tried

Using a mix of streamstats, eval and trendline statements, have the following which will alert for a single 'C' code.

| tstats count as count where index="foo" sourcetype="bar" DRA=C27* by _time span=5m
| timechart sum(count) as total_count span=5min
| streamstats current=f window=1 last(total_count) as prev_count 
| eval percentage_errors=round(abs(prev_count/total_count)*100,1) 
| fillnull value=000
| trendline wma5(percentage_errors) AS trend_percentage
| eval trend_percentage=round(trend_percentage,1) 
| fillnull value=0 trend_percentage 
| table _time, total_count, prev_count, percentage_errors, trend_percentage

Problems and Concerns

How can I modify my query to account for the variable nature of the DRA code - both in name (Cxx) and in the number of DRA codes returned? I have added 'by' clauses almost everywhere, but have not had success.
Each 5 minute period can see up to 70k events per DRA. Any thoughts on running all of the calculations across all extracted DRAs every 5 minutes?

Any suggestions or comments on my line of thinking are appreciated.

ITWhisperer · ‎10-13-2022

| untable _time pfx count
| streamstats window=1 current=f global=f sum(count) as previous by pfx
| eval diff = abs(count - previous)
| eval percentage_change = round(100 * diff / previous, 2)

View solution in original post

ITWhisperer · ‎10-13-2022

| untable _time pfx count
| streamstats window=1 current=f global=f sum(count) as previous by pfx
| eval diff = abs(count - previous)
| eval percentage_change = round(100 * diff / previous, 2)

lennys26 · ‎10-14-2022

@ITWhisperer - Brilliant, untable. I am not sure I fully get it and need to read into it more, but I am surprised that I haven't come across this at least a dozen times by now. Thanks.

I have added in the trendline to get trending on the percentage_change, and after some rearranging of the SPL, I have come up with the following which appears to work correctly:

| tstats prestats=t count where index="foo" sourcetype="bar" DRA=C* by DRA, _time 
| rex field=DRA "^(?<pfx>\D\S{2}).*" 
| timechart span=5m count by pfx useother=f limit=0 usenull=f 
| trendline wma5(percentage_change) AS trend_percentage 
| eval trend_percentage=round(trend_percentage,1)  
| untable _time pfx count 
| streamstats window=1 current=f global=f sum(count) as previous by pfx 
| eval diff = abs(count - previous) 
| eval percentage_change = round(100 * diff / previous, 2) 
| table _time, pfx, count, previous, diff, percentage_change

Now to add some final touches -- I may add a streamstats count as marker by pfx to be sure I capture the last/latest event (however this may miss a pfx if it exists in one time period and not another, but I will play with this to see what happens).

Thanks.

Anyone have suggestions? - Alerting with Calculations Against a Lot of Results

alert action

Webinar Recap | Revolutionizing IT Operations: The Transformative Power of AI and ML ...

.conf24 | Registration Open!

ICYMI - Check out the latest releases of Splunk Edge Processor