I'm trying to get an accurate percentile representation from a dataset of hourly metrics, excluding outliers. The dataset consists of user sessions by group of machines for each hour where there's a production and a DR set of machines. On occasion, to validate DR, those machines are used as production so when those occasions occur, they drastically skew the percentiles of an otherwise low number of DR sessions in use.
Data would be like so.....
Environment-Group
Day
Hour
Session Count
Environment-Group
Day
Hour
Session Count
Prod-A
Monday
8:00
1000
DR-A
Monday
8:00
10
Prod-A
Monday
12:00
1500
DR-A
Monday
12:00
25
Prod-A
Monday
16:00
1300
DR-A
Monday
16:00
15
Prod-A
Tuesday
8:00
1050
DR-A
Tuesday
8:00
20
Prod-A
Tuesday
12:00
1600
DR-A
Tuesday
12:00
30
Prod-A
Tuesday
16:00
1400
DR-A
Tuesday
16:00
25
Prod-A
Wednesday
8:00
500
Outliers-low
DR-A
Wednesday
8:00
500
Outliers-high
Prod-A
Wednesday
12:00
800
Outliers-low
DR-A
Wednesday
12:00
800
Outliers-high
Prod-A
Wednesday
16:00
600
Outliers-low
DR-A
Wednesday
16:00
600
Outliers-high
Prod-A
Thursday
8:00
1000
DR-A
Thursday
8:00
15
Prod-A
Thursday
12:00
1500
DR-A
Thursday
12:00
50
Prod-A
Thursday
16:00
1300
DR-A
Thursday
16:00
30
For this data, I might have 30 days of data where each hourly metric is below 50 for a DR group but for 1 or two days in the month it might be in the hundreds or thousands and I'm trying to represent what the consumption looks like for the month, without skewing the numbers with a DR test event.
Ideally I'd like to omit the top and bottom 1, 2 or 3 percent, then get percentiles from the remaining values.
The link below shows an excel example of this type calculation, excluding top & bottom values from percentiles.
Using the Percentile function while excluding outliers : excel (reddit.com) =PERCENTILE.INC(IF((Values>Min)*(Values<Max),Values),Percentile)
Thanks,
Jim
... View more