Splunk Search

## count average for sorted 90% of the events (tricky)

Path Finder

This is a little tricky. I know there is already a function that can give me a percentile (perc90), but I don't want to discard events by comparing against the percentile value; I want to discard a percentage of the events by count. For example, with the sorted values 1,2,3,4,5,6,6,6,7,8,9, suppose the Xth percentile of the dataset is 5 (by value). If I discard everything greater than 5, all three 6s get discarded, but what I want is to sort and average the first X percent of the total count, which might be 1,2,3,4,5,6,6. Any suggestions on how to do this? I already have the search below, which trims by value; I need to figure out how to sort and then head X percent by count.

```
... | eventstats perc90(response_time) as response_time_90p | where response_time < response_time_90p | stats avg(response_time)
```

UPDATE:

Thanks for the prompt response. Sorry if I wasn't clear. Say I have these response times:

1) Data: 1,1,3,3,4,4,4,5,6,6,6,9,9,9,9,7,7,7,7,8,8,8,8,2,2

2) Sorted data: 1,1,2,2,3,3,4,4,4,5,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9

3) Discard the highest 10% of the data and keep 90% by volume, not by value. I'm left with 1,1,2,2,3,3,4,4,4,5,6,6,6,7,7,7,7,8,8,8,8,9,9 (CORRECT)

4) Take the average of that 90%

Do you see the difference? If I use perc90(data) on the series above I get 9, and if I then average the values less than 9 I get 1,1,2,2,3,3,4,4,4,5,6,6,6,7,7,7,7,8,8,8,8 (WRONG), missing the two 9s that the CORRECT series keeps. Please compare the WRONG and CORRECT series. I hope that explains the issue.

Thanks, Amit
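
Here is a minimal, self-contained sketch of the by-value trim on exactly this sample data; the `makeresults` scaffolding and the `avg_by_value` field name are illustrative additions, not part of the original search:

```
| makeresults count=25
| streamstats count as n
| eval response_time = tonumber(mvindex(split("1,1,3,3,4,4,4,5,6,6,6,9,9,9,9,7,7,7,7,8,8,8,8,2,2", ","), n-1))
| eventstats perc90(response_time) as response_time_90p
| where response_time < response_time_90p
| stats avg(response_time) as avg_by_value
```

Assuming perc90 returns 9 here, as the poster observes, the by-value trim averages the 21 values below 9, giving 107/21 ≈ 5.10, whereas averaging the 23-value CORRECT series gives 125/23 ≈ 5.43. That gap is the discrepancy described above.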

Splunk Employee

This isn't that hard to do, but it's very hard to parallelize, i.e., it map-reduces very inefficiently compared with averaging below a percentile. May I ask what the mathematical basis is for preferring this computation over simply averaging everything below the 90th percentile? The two would only produce different results on artificial and pathological data sets, particularly if you measure your response times with sufficient granularity.

```
... | sort 0 response_time | eventstats count as ttl | streamstats global=t current=t count as pos | where pos<(0.9*ttl) | stats avg(response_time)
```
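
As a sanity check against the sample data from the question, the same pipeline can be fed from generated events; this harness is an illustrative sketch, not part of the original answer:

```
| makeresults count=25
| streamstats count as n
| eval response_time = tonumber(mvindex(split("1,1,3,3,4,4,4,5,6,6,6,9,9,9,9,7,7,7,7,8,8,8,8,2,2", ","), n-1))
| sort 0 response_time
| eventstats count as ttl
| streamstats global=t current=t count as pos
| where pos<(0.9*ttl)
| stats avg(response_time) as avg_by_count
```

Two caveats: sort truncates at 10,000 results by default, hence the `sort 0` above to lift that limit; and with 25 events the cutoff 0.9*ttl is 22.5, so `pos<(0.9*ttl)` keeps 22 events, one fewer than the 23-value series worked out in the question. If you want the boundary rounded up instead, something like `where pos<=ceiling(0.9*ttl)` would keep 23.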

Splunk Employee

If I understand correctly, you calculate the arithmetic mean ("average") of all values that are below the (e.g.) 90th percentile? It looks like you already know how to do that. What I can't quite follow is what you want to do instead, i.e., what you mean by "sort and then head x percentage".

SplunkTrust

I'm having difficulty understanding: in your example, what does "average X percentage of total which means may be 1,2,3,4,5,6,6" mean? Maybe add more examples?
