I'm trying to get an accurate percentile representation from a dataset of hourly metrics, excluding outliers. The dataset consists of user sessions by group of machines for each hour where there's a production and a DR set of machines. On occasion, to validate DR, those machines are used as production so when those occasions occur, they drastically skew the percentiles of an otherwise low number of DR sessions in use.
Data would be like so.....
Environment-Group | Day | Hour | Session Count | Environment-Group | Day | Hour | Session Count | ||
Prod-A | Monday | 8:00 | 1000 | DR-A | Monday | 8:00 | 10 | ||
Prod-A | Monday | 12:00 | 1500 | DR-A | Monday | 12:00 | 25 | ||
Prod-A | Monday | 16:00 | 1300 | DR-A | Monday | 16:00 | 15 | ||
Prod-A | Tuesday | 8:00 | 1050 | DR-A | Tuesday | 8:00 | 20 | ||
Prod-A | Tuesday | 12:00 | 1600 | DR-A | Tuesday | 12:00 | 30 | ||
Prod-A | Tuesday | 16:00 | 1400 | DR-A | Tuesday | 16:00 | 25 | ||
Prod-A | Wednesday | 8:00 | 500 | Outliers-low | DR-A | Wednesday | 8:00 | 500 | Outliers-high |
Prod-A | Wednesday | 12:00 | 800 | Outliers-low | DR-A | Wednesday | 12:00 | 800 | Outliers-high |
Prod-A | Wednesday | 16:00 | 600 | Outliers-low | DR-A | Wednesday | 16:00 | 600 | Outliers-high |
Prod-A | Thursday | 8:00 | 1000 | DR-A | Thursday | 8:00 | 15 | ||
Prod-A | Thursday | 12:00 | 1500 | DR-A | Thursday | 12:00 | 50 | ||
Prod-A | Thursday | 16:00 | 1300 | DR-A | Thursday | 16:00 | 30 |
For this data, I might have 30 days of data where each hourly metric is below 50 for a DR group but for 1 or two days in the month it might be in the hundreds or thousands and I'm trying to represent what the consumption looks like for the month, without skewing the numbers with a DR test event.
Ideally I'd like to omit the top and bottom 1, 2 or 3 percent, then get percentiles from the remaining values.
The link below shows an excel example of this type calculation, excluding top & bottom values from percentiles.
Using the Percentile function while excluding outliers : excel (reddit.com)
=PERCENTILE.INC(IF((Values>Min)*(Values<Max),Values),Percentile)
Thanks,
Jim
As @ITWhisperer says, choose your outlier detection and then it's just SPL
| eventstats p5("Session Count") as p5, p95("Session Count") as p95
| where 'Session Count'>p5 AND 'Session Count'<p95
so this example will calculate 5th and 95th percentile and filter out all those below 5th and above 95th
or you could do it on fixed values or standard deviation or...
How do you decide what an outlier is? For example, you could take the mean and standard deviation by environment group and hour, then see how many standard deviations away from the mean each value is.
| eventstats mean('Session Count') as mean stdev('Session Count') as stdev by 'Environment Group' Hour
| eval stdev_from_mean=abs('Session Count'-mean)/stdev
| where stdev_from_mean < 1