I have numeric data which has values primarily in the range of 500-1,000, with acceptable values being in the range of 500-10,000. I also have a number of outliers below 500 (ranging from 3 to 499), and some outliers above 10,000 (the most noticeable being as high as 1,000,000). I would like to use the Machine Learning Toolkit to detect all the outliers (both those too high, and those too low), ideally to set up some sort of alert.
My base search is pretty straightforward:
index=xxx source="xxx" reactionTime!=-1 reactionTime=* user=* | dedup ID
I tried using the built-in Detect Numeric Outliers assistant, but the higher outliers threw it off (even if I excluded the absolute highest), so it couldn't reliably mark values below 500 as outliers.
More recently I've been working with the OneClassSVM algorithm; however, it seems that no matter what I do (I've tried playing around with all the parameters I can), it only marks the bottom nu percent of my data as outliers - completely ignoring the too-high values.
Is there any way to detect both upper and lower outliers for my data, either with one of the abovementioned algorithms, or through some other method altogether?
I am familiar with the command - how would you suggest I use it? I don't want to forecast values for missing data; I'm looking instead to detect outliers in existing data. Is there a way to do that with
I'm also working on a similar problem. I'd appreciate help working through the solution.
There are a few mathematical ways to detect outliers; however, you can't save them as a model, at least not to my knowledge.
There is the anomalydetection command, which can save you a lot of time compared with typing out the SPL that would implement them.
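zscore is the standard deviation method, iqr is, well, IQR 🙂, and histogram flags values whose estimated probability is unusually low (as far as I can tell it is not a median absolute deviation method).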
That's a pity about not being able to save as a model; I was hoping to be able to train whatever method I ended up using.
I've been playing with the various forms of the anomalydetection command, and none of them are doing quite what I want, at least so far - they're all marking the highest values as outliers, but none of them do anything with the lower outliers. Do you know of a specific parameter that does that? And I'll keep exploring it, to see what I can do...
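It's likely because the computed lower bound is negative. The very high outliers inflate the spread (standard deviation or IQR), so the center minus a multiple of that spread falls below zero, and no positive value can ever be lower than it.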
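To see the effect in isolation, here is a minimal sketch that runs against made-up values rather than the real index. The nine sample numbers are an assumption chosen only to mimic the shape described in the question (a few low values, a typical cluster, and one value at 1,000,000), and the 2-standard-deviation multiplier is likewise just an illustrative choice.

```
| makeresults count=1
| eval reactionTime=split("3,250,600,700,800,900,1000,5000,1000000", ",")
| mvexpand reactionTime
| eval reactionTime=tonumber(reactionTime)
| eventstats avg(reactionTime) as mean stdev(reactionTime) as sd
| eval lower_bound=mean-(2*sd), upper_bound=mean+(2*sd)
| eval isOutlier=if(reactionTime>upper_bound OR reactionTime<lower_bound,1,0)
| table reactionTime mean sd lower_bound upper_bound isOutlier
```

With data shaped like this, the single huge value drags both the mean and the standard deviation up so far that lower_bound ends up well below zero, so 3 and 250 can never be flagged.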
This is similar SPL to the IQR method, and you might be able to tweak the lower_bound eval to get it where you want it. It breaks the data into hourly counts, then computes the overall median and IQR of those counts to find the outliers. The upper and lower bounds are set to 2 IQRs above and below the median. You can play with those (as well as the other parts, obviously) to fit your needs.
| timechart span=1h count
| eventstats median(count) as median p25(count) as p25 p75(count) as p75
| eval IQR=p75-p25
| eval lower_bound=median-(IQR*2)
| eval upper_bound=median+(IQR*2)
| eval isOutlier=if(count>upper_bound OR count<lower_bound,10,0)
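The search above works on hourly event counts. If the goal is to flag individual reactionTime values instead, one possible adaptation is sketched below. It is not something the anomalydetection command or the MLTK assistant does for you: the log transform and the logRT field name are my own assumptions, added because the values span several orders of magnitude (3 to 1,000,000) and a log scale keeps the high outliers from swamping the low ones. It assumes reactionTime is always greater than zero once the -1 values are filtered out.

```
index=xxx source="xxx" reactionTime!=-1 reactionTime=* user=*
| dedup ID
| eval logRT=log(reactionTime)
| eventstats median(logRT) as median p25(logRT) as p25 p75(logRT) as p75
| eval IQR=p75-p25
| eval lower_bound=median-(IQR*2)
| eval upper_bound=median+(IQR*2)
| eval isOutlier=if(logRT>upper_bound OR logRT<lower_bound,1,0)
```

The multiplier of 2 is carried over from the search above and will almost certainly need tuning. If most values sit between 500 and 1,000, a tight multiplier may flag acceptable values below 10,000, so widen the upper multiplier (or use separate multipliers for the two bounds) until the flags match what you consider acceptable.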