Box Plot isn't calculating outliers correctly

mikev
Path Finder

We have been using the box plot (successfully, we thought). Then a request came through to identify all upper (extreme) outliers. When I tried to determine the math behind the box plot calculations, it came to light that the calculations don't work out correctly when the data contains a lot of low numbers. Here is one set of numbers that does not work:

0
0
0
0
1
1
2
2
2
5
7
9
9
11
12
17
18
46
62
74
103
136
188
193

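In the above instance, the box plot put the upper whisker at 18, so all numbers above 18 are flagged as outliers. Performing the math by hand, the upper limit (Q3 + 1.5 × IQR) comes to about 122, so the upper whisker should sit at 103 and only the values above 103 would be outliers.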

Here is a set of numbers that works correctly:

0
0
0
7
13
14
16
45
62
76
119
126
136
145
151
154
156
164
176
232
298
334
382
472

This was tagged correctly with the upper whisker coming in at 382, so the only outlier is 472.
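For reference, the hand calculation for both sample sets can be reproduced with a small JavaScript sketch. It assumes quartiles are computed by linear interpolation between neighbouring values (which is how `d3.quantile` is understood to behave) and that fences sit at 1.5 × IQR beyond the quartiles. The names `quantile`, `tukeySummary`, `lowSet` and `highSet` are illustrative and are not part of the app.

```javascript
// Quantile by linear interpolation between neighbouring values of an
// ascending numeric array (assumed to mirror d3.quantile's behaviour).
function quantile(sorted, p) {
  const pos = (sorted.length - 1) * p; // fractional zero-based position
  const lower = Math.floor(pos);
  const frac = pos - lower;
  return frac
    ? sorted[lower] + frac * (sorted[lower + 1] - sorted[lower])
    : sorted[lower];
}

// Tukey-style summary: fences at k * IQR beyond the quartiles, whiskers at
// the most extreme values that still lie inside the fences.
function tukeySummary(values, k = 1.5) {
  const sorted = values.map(Number).sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const reach = (q3 - q1) * k;
  const lowerFence = q1 - reach;
  const upperFence = q3 + reach;
  const inside = sorted.filter((v) => v >= lowerFence && v <= upperFence);
  return {
    q1,
    q3,
    lowerFence,
    upperFence,
    lowerWhisker: inside[0],
    upperWhisker: inside[inside.length - 1],
    outliers: sorted.filter((v) => v < lowerFence || v > upperFence),
  };
}

const lowSet = [0, 0, 0, 0, 1, 1, 2, 2, 2, 5, 7, 9, 9, 11, 12, 17, 18, 46,
  62, 74, 103, 136, 188, 193];
const highSet = [0, 0, 0, 7, 13, 14, 16, 45, 62, 76, 119, 126, 136, 145, 151,
  154, 156, 164, 176, 232, 298, 334, 382, 472];

console.log(tukeySummary(lowSet));
// q1 = 1.75, q3 = 50, upperFence = 122.375, upperWhisker = 103,
// outliers = [136, 188, 193]
console.log(tukeySummary(highSet));
// q1 = 15.5, q3 = 167, upperFence = 394.25, upperWhisker = 382,
// outliers = [472]
```

Under these assumptions the second set agrees with what the box plot showed, while the first set gives 103 rather than 18.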

I know that Satoshi Kawasaki only bundled the box plot into the visualizations package, but if anyone has any idea why it is not working as it should, that would be great. I did look into the JS, but I am not a JavaScript coder. To me it looks good: I can follow what it does, I'm just not sure why it is not working as it should for the sample data.

Thanks, Mike

skawasaki_splun
Splunk Employee

Here is the code that calculates the IQR (not written by me):

boxplot.js

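// Returns a function to compute the interquartile range.
// More precisely, the returned function takes a sorted array d (with its
// quartiles attached) and returns the indices [i, j] of the lowest and
// highest values that lie within k * IQR of the first and third quartiles.
function iqr(k) {
    return function(d, i) {
        var q1 = d.quartiles[0],
            q3 = d.quartiles[2],
            iqr = (q3 - q1) * k,   // k times the interquartile range
            i = -1,
            j = d.length;
        // Advance i past every value below the lower limit.
        while (d[++i] < q1 - iqr)
            ;
        // Move j back past every value above the upper limit.
        while (d[--j] > q3 + iqr)
            ;
        return [i, j];
    };
}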

d3.box.js

box.quartiles = function(x) {
    if (!arguments.length) return quartiles;
    quartiles = x;
    return box;
};

and

function boxQuartiles(d) {
    return [
        d3.quantile(d, .25),
        d3.quantile(d, .5),
        d3.quantile(d, .75)
    ];
}

So it looks like the first and third quartiles form the bottom and top of the box, and the bottom and upper whiskers are the lowest and highest values that lie within k × IQR of those quartiles. https://en.wikipedia.org/wiki/Quartile

These files are found under $SPLUNK_HOME/etc/apps/custom_vizs/appserver/static/components/boxplot
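To make the behaviour of the `iqr(k)` helper more concrete, here is a hedged, standalone sketch that runs the same loop on the first sample set. It assumes the array is sorted numerically and that the quartiles come from linear interpolation (as `d3.quantile` is understood to do). The `quantile` helper and the variable names are illustrative stand-ins, not the bundled code.

```javascript
// Same logic as the iqr(k) helper quoted above, with the inner variable
// renamed to `reach` so it does not shadow the function name.
function iqr(k) {
  return function (d) {
    const q1 = d.quartiles[0];
    const q3 = d.quartiles[2];
    const reach = (q3 - q1) * k;
    let i = -1;
    let j = d.length;
    while (d[++i] < q1 - reach); // skip values below the lower limit
    while (d[--j] > q3 + reach); // skip values above the upper limit
    return [i, j];
  };
}

// Illustrative stand-in for d3.quantile: linear interpolation on a sorted array.
function quantile(sorted, p) {
  const pos = (sorted.length - 1) * p;
  const lower = Math.floor(pos);
  const frac = pos - lower;
  return frac
    ? sorted[lower] + frac * (sorted[lower + 1] - sorted[lower])
    : sorted[lower];
}

// The first sample set from the question, sorted numerically.
const d = [0, 0, 0, 0, 1, 1, 2, 2, 2, 5, 7, 9, 9, 11, 12, 17, 18, 46, 62, 74,
  103, 136, 188, 193].sort((a, b) => a - b);
d.quartiles = [quantile(d, 0.25), quantile(d, 0.5), quantile(d, 0.75)];

const [lo, hi] = iqr(1.5)(d);
const whiskers = [d[lo], d[hi]];
const outliers = d.filter((_, idx) => idx < lo || idx > hi);

console.log(d.quartiles); // [1.75, 9, 50]
console.log([lo, hi]);    // [0, 20]
console.log(whiskers);    // [0, 103]
console.log(outliers);    // [136, 188, 193]
```

The helper returns index positions, not values, so the whisker values are read back from the sorted array at those positions.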

jeffland
SplunkTrust

First off, what the whiskers represent is not precisely defined. According to Wikipedia, one version is that the whiskers represent the minimum and maximum of your values, which is apparently not your case. Alternatively, you could be looking for the Tukey standard, which is something like "the lowest/highest value within 1.5 IQR" of the quartiles. In that case, the math for your first sample is p25=2, p75=62, iqr=60, lowerWhisker=0 and upperWhisker=136:

| stats count | fields - count | eval foo="0,0,0,0,1,1,2,2,2,5,7,9,9,11,12,17,18,46,62,74,103,136,188,193" | makemv delim="," foo | mvexpand foo
| eventstats p25(foo) as p25 p75(foo) as p75 | eval iqr=p75-p25 | eval lowerWhisker=if(foo>=(p25-iqr*1.5), foo, null()) | eval upperWhisker=if(foo<=(p75+iqr*1.5), foo, null()) | eventstats min(lowerWhisker) as lowerWhisker max(upperWhisker) as upperWhisker | eval status=case(foo>=p25 AND foo<=p75, "in box", foo<lowerWhisker OR foo>upperWhisker, "outlier", 1=1, "in whiskers")

This doesn't seem to match either of your results (the one from boxplot and your own math).
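One possible reason the numbers differ is the percentile rule itself. The sketch below is an illustration only: it assumes a simple rank-based rule (take the element at zero-based index floor(p × n) of the sorted data), chosen because it happens to reproduce p25=2 and p75=62 for this sample. It is not claimed to be Splunk's actual algorithm, and `rankPercentile` is a hypothetical helper name.

```javascript
// Rank-based percentile: pick an existing element, no interpolation.
// ASSUMPTION: this rule was chosen only because it reproduces the p25/p75
// values quoted above for this sample.
function rankPercentile(sorted, p) {
  const idx = Math.min(sorted.length - 1, Math.floor(p * sorted.length));
  return sorted[idx];
}

const foo = [0, 0, 0, 0, 1, 1, 2, 2, 2, 5, 7, 9, 9, 11, 12, 17, 18, 46, 62,
  74, 103, 136, 188, 193].sort((a, b) => a - b);

const p25 = rankPercentile(foo, 0.25);
const p75 = rankPercentile(foo, 0.75);
const iqrValue = p75 - p25;

// Tukey limits at 1.5 * IQR, whiskers at the extreme values inside them.
const lowerLimit = p25 - 1.5 * iqrValue;
const upperLimit = p75 + 1.5 * iqrValue;
const inside = foo.filter((v) => v >= lowerLimit && v <= upperLimit);
const lowerWhisker = inside[0];
const upperWhisker = inside[inside.length - 1];

console.log({ p25, p75, iqrValue });        // p25 = 2, p75 = 62, iqrValue = 60
console.log({ lowerLimit, upperLimit });    // -88 and 152
console.log({ lowerWhisker, upperWhisker }); // 0 and 136
```

With interpolated quartiles the same data gives 1.75 and 50 instead, which moves the upper limit from 152 down to 122.375 and the upper whisker from 136 to 103. Small samples with many repeated low values are sensitive to this choice.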
Then there's the option that the whiskers represent one standard deviation around the median, which would result in median=9, stdev=59.220235 and thus lowerWhisker=-50.220235 (or 0, depending on your preference) and upperWhisker=68.220235:

| stats count | fields - count | eval foo="0,0,0,0,1,1,2,2,2,5,7,9,9,11,12,17,18,46,62,74,103,136,188,193" | makemv delim="," foo | mvexpand foo 
| eventstats median(foo) as median stdev(foo) as stdev | eval lowerWhisker=median-stdev | eval upperWhisker=median+stdev
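The same arithmetic can be checked outside Splunk with a short sketch. It assumes `stdev` means the sample standard deviation (dividing by n − 1), which is the variant that reproduces the figure quoted above. The helper names are illustrative.

```javascript
// Median of an ascending numeric array.
function median(sorted) {
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Sample standard deviation (divides by n - 1).
function sampleStdev(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const squares = values.reduce((sum, v) => sum + (v - mean) ** 2, 0);
  return Math.sqrt(squares / (values.length - 1));
}

const foo = [0, 0, 0, 0, 1, 1, 2, 2, 2, 5, 7, 9, 9, 11, 12, 17, 18, 46, 62,
  74, 103, 136, 188, 193].sort((a, b) => a - b);

const med = median(foo);
const sd = sampleStdev(foo);
const lowerWhisker = med - sd;
const upperWhisker = med + sd;

console.log(med);                     // 9
console.log(sd.toFixed(4));           // "59.2202"
console.log(lowerWhisker.toFixed(4)); // "-50.2202"
console.log(upperWhisker.toFixed(4)); // "68.2202"
```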

These are not your values either, so my guess is that your case uses some arbitrary definition of what the whiskers are supposed to be. Wikipedia mentions, for example, the 9th/91st percentiles. There are many funny ways to make your statistics look how you want them to.
I think your problem is not with the box plot; it is with how you "do the math" and apply it to the visual representation. There are multiple ways to do it, and you can't point to any one of them as "correct".

mikev
Path Finder

I like the write-up too. From everything I can determine (I dug into the JS code as well), the box plot follows the 25/50/75 rule with outliers/extremes at IQR*1.5. It is just puzzling why it works consistently with higher numbers, while sets of numbers with a lot of zeros or very low values always seem to come out incorrect. I appreciate the responses from Jeff and Satoshi; I just wish I could figure out the rhyme or reason why it is not correct. I'm surely not going to tell the boss the numbers he has been looking at are not correct 😉 Thanks for the tip on makeresults vs. stats count. As there is no solution, I can't select either as the answer, but I do appreciate the comments, keep 'em coming.

skawasaki_splun
Splunk Employee

@mikev If it's helpful then just upvote the answer :-).

skawasaki_splun
Splunk Employee

Nice analysis! Quick tip: It's better to use | makeresults instead of | stats count :-).

jeffland
SplunkTrust

Thanks for the tip. I've already started using that, but old habits die hard...
