Splunk Machine Learning Toolkit: detecting both upper and lower outliers

lradics
Path Finder

Hello!

I have numeric data which has values primarily in the range of 500-1,000, with acceptable values being in the range of 500-10,000. I also have a number of outliers below 500 (ranging from 3 to 499), and some outliers above 10,000 (the most noticeable being as high as 1,000,000). I would like to use the Machine Learning Toolkit to detect all the outliers (both those too high, and those too low), ideally to set up some sort of alert.

My base search is pretty straightforward:

index=xxx source="xxx" reactionTime!=-1 reactionTime=* user=* | dedup ID

I tried using the built-in Detect Numeric Outliers assistant, but the higher outliers threw it off (even if I excluded the absolute highest), so it couldn't reliably mark values below 500 as outliers.

More recently I've been working with the OneClassSVM algorithm; however, it seems that no matter what I do (I've tried playing around with all the parameters I can), it only marks the bottom nu percent of my data as outliers - completely ignoring the too-high values.

Is there any way to detect both upper and lower outliers in my data, either with one of the algorithms mentioned above or through some other method altogether?

Thank you!

cmerriman
Super Champion

There are a few statistical ways to detect outliers; however, you can't save them as a model, at least not to my knowledge.

There is the anomalydetection command, which can save you a lot of time compared with typing out the SPL that would compute them.
Its method option takes zscore (standard deviation based), histogram (probability based, the default), or iqr, which is, well, IQR 🙂

https://docs.splunk.com/Documentation/SplunkCloud/6.6.0/SearchReference/Anomalydetection
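
As a rough sketch of how the command could be attached to the base search from the question: with method=iqr, anomalydetection takes the same options as the outlier command, and by default that command only looks above the median. The uselower=true option asks it to look below the median as well. The param=4 multiplier is only a starting value, and the exact option behavior should be checked against the reference page for your Splunk version.

```spl
index=xxx source="xxx" reactionTime!=-1 reactionTime=* user=* | dedup ID
| anomalydetection method=iqr action=tf param=4 uselower=true mark=true reactionTime
```

With action=tf the outlying values are truncated to the threshold rather than dropped, and mark=true prefixes them so they can be picked out afterwards. Switching to action=rm removes the outlying events instead.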

jasuchung
Explorer

Is there a way to define outliers using only the upper bound?
I only want to flag values that spike upward, not the low ones.

lradics
Path Finder

That's a pity about not being able to save as a model; I was hoping to be able to train whatever method I ended up using.

I've been playing with the various forms of the anomalydetection command, and none of them are doing quite what I want, at least so far - they're all marking the highest values as outliers, but none of them do anything with the lower outliers. Do you know of a specific parameter that does that? And I'll keep exploring it, to see what I can do...

cmerriman
Super Champion

It's likely because the computed lower bound is negative: the very high values stretch the spread so far that the lower threshold drops below zero, and no positive value can fall under it.

This SPL is similar to the IQR method, and you might be able to tweak the lower_bound eval to get it where you want it. It breaks the data into hourly counts and then uses the overall median and quartiles of those counts to find the outliers. The bounds are set to 2 IQRs above and below the median. You can play with those (as well as the other parts, obviously) to fit your needs.

| timechart span=1h count
| eventstats median(count) as median p25(count) as p25 p75(count) as p75
| eval IQR=p75-p25
| eval lower_bound=median-(IQR*2)
| eval upper_bound=median+(IQR*2)
| eval isOutlier=if(count>upper_bound OR count<lower_bound,10,0)
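
The search above works on hourly event counts. A variant of the same idea applied directly to the reactionTime field might look like the sketch below. It assumes every reactionTime is greater than zero and takes a base-10 logarithm first, so that values spanning 3 to 1,000,000 are compared on a scale where the lower bound cannot go negative. The logRT and med field names and the multiplier of 2 are illustrative and would need tuning against real data.

```spl
index=xxx source="xxx" reactionTime!=-1 reactionTime=* user=* | dedup ID
| eval logRT=log(reactionTime, 10)
| eventstats median(logRT) as med p25(logRT) as p25 p75(logRT) as p75
| eval IQR=p75-p25
| eval lower_bound=med-(IQR*2)
| eval upper_bound=med+(IQR*2)
| eval isOutlier=if(logRT>upper_bound OR logRT<lower_bound, 1, 0)
```

The two multipliers do not have to match, so the upper bound can be made looser than the lower one. To flag only high values, drop the logRT<lower_bound clause from the last eval.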

lradics
Path Finder

That looks promising - thank you! I'll work with that and see what I can get.

skoelpin
SplunkTrust

Hello @lradics

Are you familiar with the predict command?

https://docs.splunk.com/Documentation/SplunkCloud/6.6.0/SearchReference/Predict

lradics
Path Finder

Hi @skoelpin,

I am familiar with the command - how would you suggest I use it? I don't want to forecast values for missing data; I'm looking instead to detect outliers in existing data. Is there a way to do that with predict?
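
One way predict might be bent toward outlier detection, offered only as a sketch and not as something proposed in this thread: predict can write the upper and lower limits of its confidence interval to named fields, and an observed value outside that band can be treated as suspicious. Because predict needs a time series, the example aggregates to an hourly median first, so it flags unusual hours and not individual events. The rt, predicted, upper_bound and lower_bound field names are illustrative.

```spl
index=xxx source="xxx" reactionTime!=-1 reactionTime=* user=* | dedup ID
| timechart span=1h median(reactionTime) as rt
| predict rt as predicted upper95=upper_bound lower95=lower_bound
| where isnotnull(rt)
| eval isOutlier=if(rt>upper_bound OR rt<lower_bound, 1, 0)
```

The where clause discards the future rows that predict appends, along with any hours that have no data.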

jcvytla
New Member

Hello @lradics

I'm also working on a similar problem, and I'd appreciate your help in working through the solution.
