Alerting

Anyone have suggestions? - Alerting with Calculations Against a Lot of Results

lennys26
Communicator

I am looking for suggestions as to how best to implement an alerting request made by my users. 

Summary

A query is run to count the number of events. The time weighted difference (in percentage) between one period of the next will be used to trigger the alert if the threshold is met.

Query

I have a query which I am already using on an existing dashboard. I am using TSTATS to count events, then a REX to group the events based on the first 3 characters for a field (DRA). 

 

 

| tstats prestats=t count where index="foo" sourcetype="bar*" DRA=C* by DRA, _time 
| rex field=DRA "^(?<pfx>\D\S{2}).*" 
| timechart span=5m count by pfx useother=f limit=0 usenull=f

 

 

 

The groups and the number of these 'groups' will vary, but the result will be similar to the below :

_time C27 C31 C33
2022-10-12 13:00:00 116 2 70
2022-10-12 13:05:00 287 3 20
2022-10-12 13:10:00 383 6 45
2022-10-12 13:15:00 259 7 41

I suspect the maximum number of DRA codes that we will see will be 25, although I can break this up into different queries and play with some timing and priorities with the running of the searches.

Goal

The goal is to alert when any percentage changes from one period  to the next by more than a set percentage.  So, for example, in the above, I might want an alert at 13:05 that 'C33' had changed by ~72% from the previous period.

I Have Tried

Using a mix of streamstats, eval and trendline statements, have the following which will alert for a single 'C' code.

 

 

| tstats count as count where index="foo" sourcetype="bar" DRA=C27* by _time span=5m
| timechart sum(count) as total_count span=5min
| streamstats current=f window=1 last(total_count) as prev_count 
| eval percentage_errors=round(abs(prev_count/total_count)*100,1) 
| fillnull value=000
| trendline wma5(percentage_errors) AS trend_percentage
| eval trend_percentage=round(trend_percentage,1) 
| fillnull value=0 trend_percentage 
| table _time, total_count, prev_count, percentage_errors, trend_percentage

 

 

 

Problems and Concerns

  1. How can I modify my query to account for the variable nature of the DRA code - both in name (Cxx) and in the number of DRA codes returned? I have added 'by' clauses almost everywhere, but have not had success.
  2. Each 5 minute period can see up to 70k events per DRA. Any thoughts on running all of the calculations across all extracted DRAs every 5 minutes?

Any suggestions or comments on my line of thinking are appreciated.

 

Labels (1)
Tags (2)
0 Karma
1 Solution

ITWhisperer
SplunkTrust
SplunkTrust
| untable _time pfx count
| streamstats window=1 current=f global=f sum(count) as previous by pfx
| eval diff = abs(count - previous)
| eval percentage_change = round(100 * diff / previous, 2)

View solution in original post

ITWhisperer
SplunkTrust
SplunkTrust
| untable _time pfx count
| streamstats window=1 current=f global=f sum(count) as previous by pfx
| eval diff = abs(count - previous)
| eval percentage_change = round(100 * diff / previous, 2)

lennys26
Communicator

@ITWhisperer - Brilliant, untable. I am not sure I fully get it and need to read into it more, but I am surprised that I haven't come across this at least a dozen times by now. Thanks.

I have added in the trendline to get trending on the percentage_change, and after some rearranging of the SPL, I have come up with the following which appears to work correctly:

 

 

 

| tstats prestats=t count where index="foo" sourcetype="bar" DRA=C* by DRA, _time 
| rex field=DRA "^(?<pfx>\D\S{2}).*" 
| timechart span=5m count by pfx useother=f limit=0 usenull=f 
| trendline wma5(percentage_change) AS trend_percentage 
| eval trend_percentage=round(trend_percentage,1)  
| untable _time pfx count 
| streamstats window=1 current=f global=f sum(count) as previous by pfx 
| eval diff = abs(count - previous) 
| eval percentage_change = round(100 * diff / previous, 2) 
| table _time, pfx, count, previous, diff, percentage_change

 

 

Now to add some final touches -- I may add a streamstats count as marker by pfx to  be sure I capture the last/latest event (however this may miss a pfx if it exists in one time period and not another, but I will play with this to see what happens). 

Thanks.

 

0 Karma
Get Updates on the Splunk Community!

Take Your Breath Away with Splunk Risk-Based Alerting (RBA)

WATCH NOW!The Splunk Guide to Risk-Based Alerting is here to empower your SOC like never before. Join Haylee ...

SignalFlow: What? Why? How?

What is SignalFlow? Splunk Observability Cloud’s analytics engine, SignalFlow, opens up a world of in-depth ...

Federated Search for Amazon S3 | Key Use Cases to Streamline Compliance Workflows

Modern business operations are supported by data compliance. As regulations evolve, organizations must ...