Solved: Detect abnormal failure event based on machine learning?

sassens1 · ‎02-22-2017

Hello,

I'd like to build a search that triggers on a spike in my authentication agent failure events, but I do not want to set a hard threshold.
I started playing with trendline, but I don't know if I'm on the right path.

index=paloalto event_id=globalprotectgateway-auth-fail | timechart span=10m count | trendline sma7(count)

On an average day I have 20-30 failures per hour across 10K agents, but if there is an issue on the firewall, for example, I can get more than a thousand events in less than 10 minutes, and I'd like to be alerted ASAP. On the other hand, the number of hosts will change over time, and there is less activity during weekends.
How can I achieve this?

muebel · ‎02-22-2017

Hi sassens, streamstats is good for this sort of thing. In general, I'd advise checking out the Machine Learning Toolkit for more examples of solutions https://splunkbase.splunk.com/app/2890/

But in this case, something like this might work to give you outliers. Adjust the multiplier to increase or decrease the sensitivity:

index=paloalto event_id=globalprotectgateway-auth-fail
| timechart count span=10m
| streamstats window=12 avg(count) as avg, stdev(count) as stdev
| eval multiplier = 2
| eval lower_bound = avg - (stdev * multiplier)
| eval upper_bound = avg + (stdev * multiplier)
| fields - multiplier stdev
| eval isOutlier = if(count > upper_bound OR count < lower_bound, 5, 0)

You'll want to also adjust the span and window, and overall search timeframe, but this is the general structure of these types of searches.
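
For the alerting case in the question, a possible variation is sketched below (not tested against this data). It assumes two changes to the search above: current=f keeps the bucket being evaluated out of its own rolling average, so a sudden spike is compared only with the 12 buckets before it, and the lower bound is dropped because only spikes matter here. The multiplier of 3 is an arbitrary starting point.

```spl
index=paloalto event_id=globalprotectgateway-auth-fail
| timechart span=10m count
| streamstats current=f window=12 avg(count) as avg stdev(count) as stdev
| eval multiplier = 3
| eval upper_bound = avg + (stdev * multiplier)
| eval isOutlier = if(count > upper_bound, 1, 0)
| where isOutlier = 1
```

The first bucket has no preceding rows, so avg is null there and the row is never flagged.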

Please let me know if this answers your question!

sassens1 · ‎02-22-2017

Hi,

Very nice, I was working on exactly the same thing and playing with the "Detect numeric outliers" feature of the ML Toolkit. Here is my search:

index=paloalto sourcetype=pan:system event_id=globalprotectgateway-auth-fail
| timechart span=10m count
| eventstats median("count") as median p25("count") as p25 p75("count") as p75
| eval IQR=(p75-p25)
| eval upperBound=(median+IQR*4)
| eval isOutlier=if('count' > upperBound, 1, 0)
| where isOutlier=1

Now, if I create an alert that runs, say, every 10 minutes over the last 4 hours, it will work well to detect the spike (count > upperBound), but I need to use throttling so that I am not spammed by alerts if the count continues to rise.
Later, the search will run again and may trigger on an old spike, because by then the average or median will be low again and upperBound will be lower than the count of the old spike.

I'm going to try your search to see if its accuracy and behavior are better than mine.
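
On the old-spike concern above, one possible way to keep a past spike from re-triggering is to compute the bounds over the whole 4h window but test only the most recent bucket. This is a sketch under assumptions: the alert is scheduled every 10 minutes, aligned to the 10-minute boundaries, with a time range such as earliest=-4h@m latest=@m, so that the last row produced by timechart is the bucket that just ended.

```spl
index=paloalto sourcetype=pan:system event_id=globalprotectgateway-auth-fail
| timechart span=10m count
| eventstats median(count) as median p25(count) as p25 p75(count) as p75
| eval IQR = p75 - p25
| eval upperBound = median + IQR * 4
| tail 1
| where count > upperBound
```

Because only the last bucket is tested, an earlier spike in the window can no longer fire the alert on a later run. Throttling would then only need to cover a spike that stays above the bound for several consecutive runs.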

jliu034 · ‎03-15-2020

Guys, I have a further question regarding this. What if I have a group-by in the timechart in your example? Is it possible to identify the outliers for each individual http_status in this case?

index=paloalto event_id=globalprotectgateway-auth-fail
| timechart count span=10m by http_status
| streamstats window=12 avg(count) as avg, stdev(count) as stdev
| eval multiplier = 2
| eval lower_bound = avg - (stdev * multiplier)
| eval upper_bound = avg + (stdev * multiplier)
| fields - multiplier stdev
| eval isOutlier = if(count > upper_bound OR count < lower_bound, 5, 0)
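
One possible approach, as a sketch that is not from the replies in this thread: after timechart ... by http_status there is no single count field, only one column per status, so streamstats has nothing to aggregate. Converting the columns back to rows with untable and adding a by clause to streamstats may give per-status bounds. global=f is used because, with window set, streamstats otherwise keeps one shared window across all by-groups.

```spl
index=paloalto event_id=globalprotectgateway-auth-fail
| timechart span=10m count by http_status
| untable _time http_status count
| streamstats global=f window=12 avg(count) as avg stdev(count) as stdev by http_status
| eval multiplier = 2
| eval lower_bound = avg - (stdev * multiplier)
| eval upper_bound = avg + (stdev * multiplier)
| fields - multiplier stdev
| eval isOutlier = if(count > upper_bound OR count < lower_bound, 5, 0)
```

Each result row is then one http_status in one 10-minute bucket, with bounds computed from that status's own previous buckets. Note that timechart groups less frequent values into OTHER once its series limit is exceeded, so that limit may need adjusting if there are many distinct statuses.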

muebel · ‎02-22-2017

Sounds good 😄

Please accept my answer if it helped in any way.
