Solved: Splunk Machine Learning Toolkit: detecting both upper and lower outliers

lradics · 07-17-2017

Hello!

I have numeric data which has values primarily in the range of 500-1,000, with acceptable values being in the range of 500-10,000. I also have a number of outliers below 500 (ranging from 3 to 499), and some outliers above 10,000 (the most noticeable being as high as 1,000,000). I would like to use the Machine Learning Toolkit to detect all the outliers (both those too high, and those too low), ideally to set up some sort of alert.

My base search is pretty straightforward:

index=xxx source="xxx" reactionTime!=-1 reactionTime=* user=* | dedup ID

I tried using the built-in Detect Numeric Outliers assistant, but the higher outliers threw it off (even if I excluded the absolute highest), so it couldn't reliably mark values below 500 as outliers.

More recently I've been working with the OneClassSVM algorithm; however, it seems that no matter what I do (I've tried playing around with all the parameters I can), it only marks the bottom nu percent of my data as outliers - completely ignoring the too-high values.

Is there any way to detect both upper and lower outliers for my data, either with one of the abovementioned algorithms, or through some other method altogether?

Thank you!

cmerriman · 07-17-2017

There are a few statistical ways to detect outliers; however, you can't save them as a model, at least not to my knowledge.

There is the anomalydetection command, which can save you a lot of time compared with typing out the equivalent SPL yourself.
It has three methods: zscore is based on standard deviation, histogram (the default) flags events whose values have an unusually low probability, and iqr is, well, IQR 🙂

https://docs.splunk.com/Documentation/SplunkCloud/6.6.0/SearchReference/Anomalydetection
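
As a starting point, here is a minimal sketch of the iqr method applied to the reactionTime field from the question, modelled on the iqr example in the Splunk documentation for this command. The option values (param=4 in particular) are illustrative and not tuned for this data. uselower=true asks the method to look for outliers below the median as well as above it, which is off by default; with action=tf and mark=true, outlying values are adjusted and prefixed with 000 rather than having their events removed.

```
index=xxx source="xxx" reactionTime!=-1 reactionTime=* user=*
| dedup ID
| anomalydetection method=iqr action=tf param=4 uselower=true mark=true reactionTime
```

Lowering param tightens the bounds, so it is worth experimenting with if too few low values are picked up.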

jasuchung · 03-12-2024

Is there a way to define outliers using only the upper bound?
I only want to flag values that are strikingly high, not ones that are too low.

lradics · 07-17-2017

That's a pity about not being able to save as a model; I was hoping to be able to train whatever method I ended up using.

I've been playing with the various forms of the anomalydetection command, and none of them are doing quite what I want, at least so far - they're all marking the highest values as outliers, but none of them do anything with the lower outliers. Do you know of a specific parameter that does that? And I'll keep exploring it, to see what I can do...

cmerriman · 07-17-2017

That's likely because the computed lower bound is negative.

The SPL below is similar to the IQR method, and you might be able to tweak the lower_bound eval to get it where you want it. It breaks the data into hourly counts, then uses the overall median and IQR to set the bounds: 2 IQRs above and 2 IQRs below the median. You can play with those multipliers (as well as the other parts, obviously) to fit your needs.

|timechart span=1h count
|eventstats median(count) as median p25(count) as p25 p75(count) as p75
|eval IQR=p75-p25
|eval lower_bound=median-(IQR*2) 
|eval upper_bound=median+(IQR*2) 
|eval isOutlier=if(count>upper_bound OR count<lower_bound,10,0)
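
The search above works on hourly event counts. A sketch of the same idea applied directly to the reactionTime field from the original base search might look like the following. The lower_mult and upper_mult fields are hypothetical names introduced here so that each bound can be tuned independently, and the multiplier of 2 is simply carried over from the search above; isOutlier is set to 1 instead of 10 only for readability.

```
index=xxx source="xxx" reactionTime!=-1 reactionTime=* user=*
| dedup ID
| eventstats median(reactionTime) as median p25(reactionTime) as p25 p75(reactionTime) as p75
| eval IQR=p75-p25
| eval lower_mult=2, upper_mult=2
| eval lower_bound=median-(IQR*lower_mult)
| eval upper_bound=median+(IQR*upper_mult)
| eval isOutlier=if(reactionTime>upper_bound OR reactionTime<lower_bound,1,0)
| where isOutlier=1
```

Removing the reactionTime<lower_bound condition would flag only the high values. If lower_bound still falls below the smallest values of interest, reducing lower_mult is the first thing to try.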

lradics · 07-18-2017

That looks promising - thank you! I'll work with that and see what I can get.

skoelpin · 07-17-2017

Hello @lradics

Are you familiar with the predict command?

https://docs.splunk.com/Documentation/SplunkCloud/6.6.0/SearchReference/Predict

lradics · 07-17-2017

Hi @skoelpin,

I am familiar with the command - how would you suggest I use it? I don't want to forecast values for missing data; I'm looking instead to detect outliers in existing data. Is there a way to do that with predict?

jcvytla · 03-27-2018

Hello @lradics

I'm also working on a similar problem and would appreciate your help in working through the solution.
