Splunk Search

Detect abnormal failure events based on machine learning?

sassens1
Path Finder

Hello,

I'd like to build a search that triggers on a spike in my authentication agent failure events, but I do not want to set a hard threshold.
I started playing with trendline, but I don't know if I'm on the right path.

index=paloalto event_id=globalprotectgateway-auth-fail
| timechart span=10m count
| trendline sma7(count)

On an average day I see 20-30 failures per hour across 10K agents, but if there is an issue on the firewall, for example, I can get more than a thousand events in less than 10 minutes, and I'd like to be alerted as soon as possible. On the other hand, the number of hosts will change over time, and there is less activity on weekends.
How can I achieve this?

muebel
SplunkTrust

Hi sassens, streamstats is good for this sort of thing. In general, I'd advise checking out the Machine Learning Toolkit for more examples of solutions https://splunkbase.splunk.com/app/2890/

But in this case, something like this might work to give you outliers. Adjust the multiplier to increase or decrease the sensitivity:

index=paloalto event_id=globalprotectgateway-auth-fail
| timechart count span=10m
| streamstats window=12 avg(count) as avg, stdev(count) as stdev
| eval multiplier = 2
| eval lower_bound = avg - (stdev * multiplier)
| eval upper_bound = avg + (stdev * multiplier)
| fields - multiplier stdev
| eval isOutlier = if(count > upper_bound OR count < lower_bound, 5, 0)

You'll want to also adjust the span and window, and overall search timeframe, but this is the general structure of these types of searches.
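
For alerting on spikes only, one possible variation of the search above is sketched here. It assumes you only care about the upper bound, and it adds `current=f` so that the bucket being tested is left out of its own baseline (otherwise a large spike inflates the very average and standard deviation it is compared against). The window of 12 and the multiplier of 2 are just the values from the search above, not tuned recommendations:

```
index=paloalto event_id=globalprotectgateway-auth-fail
| timechart count span=10m
| streamstats current=f window=12 avg(count) as avg, stdev(count) as stdev
| eval multiplier = 2
| eval upper_bound = avg + (stdev * multiplier)
| eval isOutlier = if(count > upper_bound, 1, 0)
| where isOutlier=1
```

The first row has no previous buckets to build a baseline from, so its bounds are null and it is never flagged.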

Please let me know if this answers your question!

jliu034
New Member

Hi, I have a further question about this. What if I have a group-by in the timechart in your example? Is it possible to identify the outliers for each individual http_status in this case?

index=paloalto event_id=globalprotectgateway-auth-fail
| timechart count span=10m by http_status
| streamstats window=12 avg(count) as avg, stdev(count) as stdev
| eval multiplier = 2
| eval lower_bound = avg - (stdev * multiplier)
| eval upper_bound = avg + (stdev * multiplier)
| fields - multiplier stdev
| eval isOutlier = if(count > upper_bound OR count < lower_bound, 5, 0)
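
A note on why the search above may not work as written: `timechart ... by http_status` produces one column per status value rather than a single `count` field, so `streamstats` has nothing named `count` to work on. One possible approach, as a sketch only, is to keep the data in long form with `bin` and `stats`, then compute the rolling statistics per status. It assumes `http_status` is extracted on these events; `global=f` is what tells `streamstats` to keep a separate window for each `by` value:

```
index=paloalto event_id=globalprotectgateway-auth-fail
| bin _time span=10m
| stats count by _time http_status
| streamstats global=f window=12 avg(count) as avg, stdev(count) as stdev by http_status
| eval multiplier = 2
| eval lower_bound = avg - (stdev * multiplier)
| eval upper_bound = avg + (stdev * multiplier)
| fields - multiplier stdev
| eval isOutlier = if(count > upper_bound OR count < lower_bound, 5, 0)
```

Unlike `timechart`, `stats` does not emit rows for buckets with zero events, so for a rarely seen status the window covers the last 12 non-empty buckets rather than the last two hours.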


sassens1
Path Finder

Hi,

Very nice. I was working on exactly the same thing and playing with "Detect numeric outliers" in the ML toolkit. Here is my search:

index=paloalto sourcetype=pan:system event_id=globalprotectgateway-auth-fail
| timechart span=10m count
| eventstats median("count") as median p25("count") as p25 p75("count") as p75
| eval IQR=(p75-p25)
| eval upperBound=(median+IQR*4)
| eval isOutlier=if('count' > upperBound, 1, 0)
| where isOutlier=1

Now, if I create an alert that runs, say, every 10 minutes over the last 4 hours, it works well for detecting the spike (count > upperBound), but I need to use throttling so that I'm not spammed with alerts if the count continues to rise.
Later, at some point, the search will run again and may trigger on an old spike, because the average or median value will be low again by then and the upper bound will have dropped below the count of the old spike.
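
One way around the old-spike problem might be to evaluate only the most recent bucket on each run, so that a spike from a few hours ago can no longer match. This is a sketch under assumptions, not something I have validated: it assumes the alert runs on a 10-minute cron schedule with a time range that ends on a bucket boundary, so that the last row from `timechart` is a complete bucket:

```
index=paloalto sourcetype=pan:system event_id=globalprotectgateway-auth-fail
| timechart span=10m count
| eventstats median(count) as median p25(count) as p25 p75(count) as p75
| eval IQR = p75 - p25
| eval upperBound = median + IQR * 4
| eval isOutlier = if(count > upperBound, 1, 0)
| tail 1
| where isOutlier=1
```

If the time range ends mid-bucket instead, the last row is a partial count, which can delay detection by one run but should not raise a false alert. Throttling is still needed for a spike that stays high across several runs.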

I'm going to try your search to see if its accuracy and behavior are better than mine.

muebel
SplunkTrust

Sounds good 😄

Please accept my answer if it helped in any way.