How to remove max outliers from timechart?

plucas_splunk
Splunk Employee

I'm time-charting public transit vehicle "layover" time. ("Layover" is how long a driver takes a break upon reaching the end of a line before resuming driving in the opposite direction.)

My query and timechart work fine: I'm doing a stacked-bar of Max. Layover over Avg. Layover. However, some max. values are outliers (that could be caused by any number of reasons that aren't relevant here). I'd like to remove the outliers.

Every example I've seen of doing this using the outlier command is of the form:

... | eval layover=(end-start)/60 | timechart span=1d eval(round(max(layover),0)) as "Max. Layover" eval(round(avg(layover),0)) as "Avg. Layover" | outlier action=rm

The problem with that is that outlier max values are entirely removed from the chart leaving no max value at all for certain bars. That's not what I want.

In order to compute the max. layover in the first place, Splunk takes all the layover values, sorts them, then takes the largest value.

What I want is to do that, but if the largest value is an outlier, remove only that value and instead use the next-most max. value; then repeat (i.e., if that value is also an outlier, remove that value too; etc.).

So I tried this instead:

... | eval layover=(end-start)/60 | outlier action=rm layover | timechart ...

That seems to work (I don't see outlier values and no bars have missing values). But is it (a) correct and (b) the optimal way to solve my problem?

woodcock
Esteemed Legend

I am assuming that the real problem is that the huge outliers are dwarfing the wiggle of the "normal" data, so it all looks essentially like zero and makes the line chart useless as a visual cue. There is a GREAT way to handle this: change your Y-axis scale from linear to log. You will be very pleased! Edit your panel, click the paintbrush icon -> Y-Axis -> Scale -> Log.

plucas_splunk
Splunk Employee

No, that's not the real problem. The real problem is in my question: the outlier max value "hides" the "real" max. value. Simply chopping off tall bars doesn't solve the problem. I want to remove the outlier layover values entirely.

DalJeanis
SplunkTrust

You've come to one of the most interesting subjects in data analytics, and you've asked two different questions that often will have at least two different answers. "Is it optimal" is extremely problematic, since you then have to define what exactly you are optimizing.

"Is it correct?" isn't quite as much of a problem, but it still boils down to "How well does it match what I am TRYING to achieve?"

I'd suggest a third and a fourth question to substitute for the above two: "Is it reasonable?" and "Is it defensible?"

It's entirely reasonable to throw out outliers when calculating the statistics of a normal or natural process, assuming you can positively identify outliers. However, from your description, I don't think they really ARE outliers, so much as errors in the data collection specification. You are trying to collect information on the time required to accomplish turnaround, and if someone isn't trying to DO that, then the statistic you are collecting isn't the one that you are trying to collect.

Unfortunately, you DON'T know whether the two extra hours in any given case were the driver visiting with her old boyfriend (perhaps irrelevant) or the driver held up in a depot queue for inspection or waiting for traffic to clear from an exit port before proceeding (probably relevant).

Presumably, the typical times at any given station are going to vary from the typical times at another station, so OVERALL times may not even be the right strategy to identify outliers. It depends entirely on what you plan to achieve with the analysis.
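If typical layovers do differ by station, one way to sketch per-station outlier removal is to compute quartiles per station with eventstats and filter on an IQR-based fence. This assumes each event carries a field named station (a hypothetical name; substitute your own). The 2.5 multiplier is only a starting point chosen to match the outlier command's default param; it is not a claim that this reproduces what outlier does internally.

```spl
... | eval layover=(end-start)/60
| eventstats perc25(layover) as p25 perc75(layover) as p75 by station
| eval iqr=p75-p25
| where layover >= p25 - 2.5*iqr AND layover <= p75 + 2.5*iqr
| timechart span=1d eval(round(max(layover),0)) as "Max. Layover" eval(round(avg(layover),0)) as "Avg. Layover"
```

Because the fence is computed per station, a layover that is normal at a busy terminal is not judged against the quartiles of a quiet one.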

Which comes to the most important factor. Whatever you decide, document your decision and what makes that a reasonable strategy, and put it into a brief memo to your data consumer. Perhaps also into a footnote on the report itself, or into an option (perhaps hidden) to view the report WITH outliers included. Making sure that your decision is both defensible and transparent is the long-run best strategy to establish credibility with your data consumers, which is the long-run most useful definition of both "correct" and "optimal".
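As one possible way to keep the outliers visible, the search could flag them instead of dropping them, then chart the raw maximum next to the filtered one. This is a rough sketch: the isOutlier field, the 2.5 multiplier, and the series names are illustrative assumptions, not something from the original search.

```spl
... | eval layover=(end-start)/60
| eventstats perc25(layover) as p25 perc75(layover) as p75
| eval iqr=p75-p25
| eval isOutlier=if(layover < p25 - 2.5*iqr OR layover > p75 + 2.5*iqr, 1, 0)
| timechart span=1d max(layover) as "Max. Layover (all)" max(eval(if(isOutlier=0, layover, null()))) as "Max. Layover (outliers removed)"
```

A data consumer can then see exactly which days were affected by the filtering and by how much.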

mattymo
Splunk Employee

Great answer, definitely a situation where presentation of the analysis will be important, and another reason MLTK is a great way to attack this!

- MattyMo

mattymo
Splunk Employee

Like most things in Splunk, there are many ways to achieve your desired outcomes. Generally "the best" way is the one that works in the time you have!

Personally I would recommend checking out the Machine Learning Toolkit. https://splunkbase.splunk.com/app/2890/

It contains a "Detect Numerical Outliers" assistant that will allow you to apply some common algorithms (std dev, abs mean dev, etc) which will let you experiment until you find one that works the best for your situation.

This will give you much more control over the definition of "outlier", will allow you to validate the results, and will also provide you with some awesome new visualizations, like the outliers chart!

The greatest part is that it will spit out the SPL used to identify the outliers, which you could then use to eliminate the outliers (based on the isOutlier field) and then timechart the values that remain.

Here is an example where I used median absolute deviation to identify spikes in data ingest in one of my indexes:

Here is the SPL it spit out:
(you will mainly be interested in everything after the timechart. Just substitute your base search and field values accordingly.)

index=`meta_woot_summary` sourcetype=meta_woot orig_sourcetype!=stash orig_sourcetype=* orig_host=* orig_index=* 
| timechart limit=20 span=30m sum(count) as totalEvents by orig_index 
| eventstats median("n00blab") as median 
| eval absDev=(abs('n00blab'-median)) 
| eventstats median(absDev) as medianAbsDev 
| eval lowerBound=(median-medianAbsDev*2.500000), upperBound=(median+medianAbsDev*2.500000) 
| eval isOutlier=if('n00blab' < lowerBound OR 'n00blab' > upperBound, 1, 0) 
| table _time, "n00blab", lowerBound, upperBound, isOutlier

The idea is that you will look across all your data points (or a defined window) and settle in on a multiplier that accurately identifies your outliers...then you can add a where condition or subsearch to remove all outliers, then timechart them!

For example, you could just append:

| where isOutlier=0 
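Applied to the layover question, the same median-absolute-deviation logic could run on the raw events before the timechart, so that only the outlying layover values are dropped rather than whole bars. This is a rough sketch reusing the fields from the original question. The 2.5 multiplier is simply the value from the generated SPL above and would need tuning, and the med field name is an assumption.

```spl
... | eval layover=(end-start)/60
| eventstats median(layover) as med
| eval absDev=abs(layover-med)
| eventstats median(absDev) as medianAbsDev
| eval lowerBound=med-medianAbsDev*2.5, upperBound=med+medianAbsDev*2.5
| eval isOutlier=if(layover < lowerBound OR layover > upperBound, 1, 0)
| where isOutlier=0
| timechart span=1d eval(round(max(layover),0)) as "Max. Layover" eval(round(avg(layover),0)) as "Avg. Layover"
```

One caveat: if more than half of the layover values are identical, medianAbsDev is 0 and every value that differs from the median gets flagged, so check the bounds before trusting the chart.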

Give it a try! I am confident it will provide a great workbench for this and many more of your use cases and will show you some awesome SPL tricks to add to your Splunk superhero bag of tricks!

- MattyMo

somesoni2
Revered Legend

Have you considered using the perc<X>(Y) function? You could use perc99(layover) instead of max(layover) to get the 99th-percentile value (or perc98(layover) for the 98th percentile).

... | eval layover=(end-start)/60 | timechart span=1d eval(round(perc99(layover),0)) as "Max. Layover" eval(round(avg(layover),0)) as "Avg. Layover"

More details on the function here: http://docs.splunk.com/Documentation/Splunk/6.5.2/SearchReference/CommonStatsFunctions

plucas_splunk
Splunk Employee

How do I decide what is the "correct" value of X? (I'd rather have the computer decide based on the actual data.)

Is my use of outlier wrong?

DalJeanis
SplunkTrust

Nope. Your use of outlier is fine. Unless you plan to analyze your data pretty thoroughly, I'd just use it a single time, at default.

Alternatively, you could back into the desired value for param by deciding which events you think OUGHT to be excluded, and then calculating how many IQRs (interquartile ranges) they are above the 75th percentile.
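A minimal sketch of that calculation, assuming the layover field from the original question. The p25, p75, iqr, and iqrsAboveP75 names are made up for illustration.

```spl
... | eval layover=(end-start)/60
| eventstats perc25(layover) as p25 perc75(layover) as p75
| eval iqr=p75-p25
| eval iqrsAboveP75=round((layover-p75)/iqr, 2)
| sort - layover
| table _time layover p75 iqr iqrsAboveP75
```

Scan the top of the table for the smallest iqrsAboveP75 among the events you want excluded, then try a param just below it, e.g. outlier action=rm param=<N> layover. Check the result against the chart, since how outlier anchors its threshold internally may not match this calculation exactly.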

You can also experiment with rm vs tf. With a higher number of high outliers, using tf might more accurately represent the underlying shape of your data, which is what it's all about.
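For comparison, a minimal sketch of the tf variant, reusing the search from the original question. As I understand the outlier command, action=tf truncates an outlying value to the outlier threshold rather than removing it, so the value still contributes to the average.

```spl
... | eval layover=(end-start)/60
| outlier action=tf layover
| timechart span=1d eval(round(max(layover),0)) as "Max. Layover" eval(round(avg(layover),0)) as "Avg. Layover"
```

Running this next to the action=rm version over the same time range should show how much the choice moves both the max and the average bars.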

somesoni2
Revered Legend

Since your action is to remove the event altogether, invoking the outlier command before timechart also affects your average values, which in your case seems to be correct.
