How can I use outlier with grouping? For instance, if I group my data by country, I would like to remove outliers from each group's data, not from the population as a whole. This matters because some countries generate far more events than others, which skews the data: a data point might look like an outlier for the total population but be relatively normal for that particular country. Is there a way to do this?
Here's an example to help clarify:
user | country | foo | bar
---------------------------
1 | us | 1 | 10
2 | us | 2 | 12
3 | us | 21 | 12
4 | ca | 20 | 13
5 | ca | 21 | 11
Ultimate output desired:
country | avg(foo) | avg(bar)
-----------------------------
us | 1.5 | 11
ca | 20.5 | 12
From the above, US user 3's foo value is an outlier, but it would be a normal value for CA users. I would like to detect that user 3 is an outlier and discard that row, but keep the values for users 4 and 5 intact. Also, my real data contains far more US events, which would make almost all the CA values look like outliers against the whole population.
A pretty similar question (as it seems to me at least) was posted a couple of days ago, and the question/answer/following discussion perhaps might help you get some inspiration on how to achieve your goal? http://splunk-base.splunk.com/answers/52107/how-do-i-remove-data-read-anomalies
Oh and just to put down how I think this would work:
[search] | eventstats median(foo) as medfoo, stdev(foo) as stdfoo by country | where abs(foo - medfoo) < stdfoo | stats avg(foo), avg(bar) by country
I think that will do the trick. Thanks!
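To sanity-check the per-country median/stdev filter against the sample table above, here is a minimal Python sketch (not Splunk). The function name per_group_averages and the row layout are made up for illustration, and it assumes the sample standard deviation; treat it as a rough check of the arithmetic rather than a reference implementation.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Sample rows copied from the example table in the question.
rows = [
    {"user": 1, "country": "us", "foo": 1, "bar": 10},
    {"user": 2, "country": "us", "foo": 2, "bar": 12},
    {"user": 3, "country": "us", "foo": 21, "bar": 12},
    {"user": 4, "country": "ca", "foo": 20, "bar": 13},
    {"user": 5, "country": "ca", "foo": 21, "bar": 11},
]

def per_group_averages(rows, field="foo"):
    """Hypothetical helper: drop per-country outliers in `field`, then average."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["country"]].append(row)

    result = {}
    for country, members in groups.items():
        if len(members) < 2:
            # A sample standard deviation needs at least two rows;
            # this sketch simply skips smaller groups.
            continue
        values = [m[field] for m in members]
        med = median(values)
        sd = stdev(values)  # assumption: sample standard deviation
        # Keep rows whose distance from the group median is under one stdev.
        kept = [m for m in members if abs(m[field] - med) < sd]
        if kept:
            result[country] = (
                mean([k["foo"] for k in kept]),
                mean([k["bar"] for k in kept]),
            )
    return result

averages = per_group_averages(rows)
for country, (avg_foo, avg_bar) in averages.items():
    print(country, avg_foo, avg_bar)
```

On the sample data, user 3 is the only row dropped, so the averages match the desired output table. Note that a one-standard-deviation cutoff is fairly aggressive on larger groups; widening it (for example, two standard deviations) is a tuning choice.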
Oops, sorry, I just remembered this issue. Let me think about this again.
Essentially I want to say
[search]...|outlier by country|table country, avg(foo), avg(bar)
I know outlier doesn't support "by", but that's basically what I'm going for.
Not really. My other question is about finding various box-and-whisker plot values for a whole set of data. In this question I'm asking how I can loop through a set of groups and remove outliers within each group, rather than from the population as a whole.
It seems like you have already solved this problem per your other question. Please correct me if I am mistaken.