Solved: Using outlier with grouping

caffein · ‎05-11-2012

How can I use outlier with grouping? For instance, if I group my data by country, I would like to remove outliers from each group's data, not from the population as a whole. This matters because some countries generate far more events than others, which skews the data. A data point might look like an outlier for the total population and still be relatively normal for that particular country. Is there a way to do this?

Here's an example to help clarify:

user | country | foo | bar
---------------------------
1    | us      | 1   | 10
2    | us      | 2   | 12
3    | us      | 21  | 12
4    | ca      | 20  | 13
5    | ca      | 21  | 11

Ultimate output desired:
country | avg(foo) | avg(bar)
-----------------------------
us      | 1.5      | 11
ca      | 20.5     | 12

From the above, user 3's foo value is an outlier among US users, but it would be a normal value for CA users. What I would like to be able to do is detect that user 3 is an outlier and discard that data, but keep the values for users 4 and 5 intact. Also, in my data there would be a lot more US events, which would cause almost all the CA values to look like outliers.

Ayn · ‎07-06-2012

A pretty similar question (it seems to me, at least) was posted a couple of days ago. The question, answer, and following discussion might give you some inspiration on how to achieve your goal: http://splunk-base.splunk.com/answers/52107/how-do-i-remove-data-read-anomalies

caffein · ‎07-06-2012

Oh and just to put down how I think this would work:
[search] | eventstats median(foo) as medfoo, stdev(foo) as stdfoo by country | where abs(foo - medfoo)<stdfoo | stats avg(foo) by country
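To sanity-check the per-country logic above outside Splunk, here is a minimal Python sketch run against the sample table from the question. It is not SPL and not a Splunk API: the function name `per_group_averages` and the tuple layout are made up for illustration, and Python's `median` and `stdev` (sample standard deviation) may not match Splunk's results exactly on small or even-sized groups.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Sample rows from the question: (user, country, foo, bar)
rows = [
    (1, "us", 1, 10),
    (2, "us", 2, 12),
    (3, "us", 21, 12),
    (4, "ca", 20, 13),
    (5, "ca", 21, 11),
]


def per_group_averages(rows):
    """Drop per-country outliers in foo, then average foo and bar per country.

    A row is kept when abs(foo - group median) < group standard deviation,
    mirroring the eventstats/where idea. Hypothetical helper for illustration.
    """
    # Group rows by country so statistics are computed per group,
    # not over the whole population.
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)

    result = {}
    for country, members in groups.items():
        foos = [r[2] for r in members]
        if len(foos) < 2:
            # stdev needs at least two values; keep tiny groups unfiltered.
            kept = members
        else:
            med = median(foos)
            sd = stdev(foos)
            kept = [r for r in members if abs(r[2] - med) < sd]
        if kept:  # skip groups where every row was filtered out
            result[country] = (
                mean(r[2] for r in kept),
                mean(r[3] for r in kept),
            )
    return result


if __name__ == "__main__":
    for country, (avg_foo, avg_bar) in per_group_averages(rows).items():
        print(country, avg_foo, avg_bar)
```

On the sample data this drops user 3 from the US group and keeps both CA users, which matches the desired output table in the question.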

caffein · ‎07-06-2012

I think that will do the trick. Thanks!

araitz · ‎07-06-2012

Oops, sorry, I just remembered this issue. Let me think about this again.

caffein · ‎05-17-2012

Essentially I want to say
[search]...|outlier by country|table country, avg(foo), avg(bar)

I know outlier doesn't support "by", but that's basically what I'm going for.

caffein · ‎05-17-2012

Not really. My other question is about finding various box-and-whisker plot values for a whole set of data. In this question I'm asking how I can loop through a set of groups and remove outliers within each group, rather than from the population as a whole.

araitz · ‎05-17-2012

It seems like you have already solved this problem per your other question. Please correct me if I am mistaken.
