Splunk Search

How to exclude top N values or percent from percentile calculations?

JM_dataguy
New Member

I'm trying to get an accurate percentile representation from a dataset of hourly metrics, excluding outliers.  The dataset consists of user sessions by group of machines for each hour where there's a production and a DR set of machines.  On occasion, to validate DR, those machines are used as production so when those occasions occur, they drastically skew the percentiles of an otherwise low number of DR sessions in use.

Data would be like so.....

Environment-Group Day Hour Session Count   Environment-Group Day Hour Session Count  
Prod-A            Monday 8:00 1000   DR-A Monday 8:00 10  
Prod-A            Monday 12:00 1500   DR-A Monday 12:00 25  
Prod-A            Monday 16:00 1300   DR-A Monday 16:00 15  
Prod-A            Tuesday 8:00 1050   DR-A Tuesday 8:00 20  
Prod-A            Tuesday 12:00 1600   DR-A Tuesday 12:00 30  
Prod-A            Tuesday 16:00 1400   DR-A Tuesday 16:00 25  
Prod-A            Wednesday 8:00 500 Outliers-low DR-A Wednesday 8:00 500 Outliers-high
Prod-A            Wednesday 12:00 800 Outliers-low DR-A Wednesday 12:00 800 Outliers-high
Prod-A            Wednesday 16:00 600 Outliers-low DR-A Wednesday 16:00 600 Outliers-high
Prod-A            Thursday 8:00 1000   DR-A Thursday 8:00 15  
Prod-A            Thursday 12:00 1500   DR-A Thursday 12:00 50  
Prod-A            Thursday 16:00 1300   DR-A Thursday 16:00 30  

 

For this data, I might have 30 days of data where each hourly metric is below 50 for a DR group but for 1 or two days in the month it might be in the hundreds or thousands and I'm trying to represent what the consumption looks like for the month, without skewing the numbers with a DR test event.

Ideally I'd like to omit the top and bottom 1, 2 or 3 percent, then get percentiles from the remaining values.

The link below shows an excel example of this type calculation, excluding top & bottom values from percentiles.

Using the Percentile function while excluding outliers : excel (reddit.com)
=PERCENTILE.INC(IF((Values>Min)*(Values<Max),Values),Percentile)

 

Thanks,

Jim

0 Karma

bowesmana
SplunkTrust
SplunkTrust

As @ITWhisperer says, choose your outlier detection and then it's just SPL 

| eventstats p5("Session Count") as p5, p95("Session Count") as p95
| where 'Session Count'>p5 AND 'Session Count'<p95

so this example will calculate 5th and 95th percentile and filter out all those below 5th and above 95th

or you could do it on fixed values or standard deviation or...

0 Karma

ITWhisperer
SplunkTrust
SplunkTrust

How do you decide what an outlier is? For example, you could take the mean and standard deviation by environment group and hour, then see how many standard deviations away from the mean each value is.

| eventstats mean('Session Count') as mean stdev('Session Count') as stdev by 'Environment Group' Hour
| eval stdev_from_mean=abs('Session Count'-mean)/stdev
| where stdev_from_mean < 1

 

0 Karma
Get Updates on the Splunk Community!

Best Strategies to Optimize Observability Costs

 Join us on Tuesday, May 6, 2025, at 11 AM PDT / 2 PM EDT for an insightful session on optimizing ...

Fueling your curiosity with new Splunk ILT and eLearning courses

At Splunk Education, we’re driven by curiosity—both ours and yours! That’s why we’re committed to delivering ...

Splunk AI Assistant for SPL 1.1.0 | Now Personalized to Your Environment for Greater ...

Splunk AI Assistant for SPL has transformed how users interact with Splunk, making it easier than ever to ...