Splunk Search

count average for sorted 90% of the events (tricky)

Path Finder

This is a little tricky. I already know there's a function that can give me a percentile (perc90), but I don't want to discard events by the percentile *value*; I want to discard them by *count*. For example, the sorted values are: 1,2,3,4,5,6,6,6,7,8,9. Let's say the x-th percentile of the dataset is 5 (by value). If I discard everything greater than 5, all three 6s get discarded. What I actually want is to sort and then average the first X percent of the total by count, which might be 1,2,3,4,5,6,6. Any suggestions on how to do this? I already have the search query below, which filters by value; I need to work out how to sort and then head the top X percent by count.

.. | eventstats perc90(response_time) as response_time_90p | where response_time < response_time_90p | stats avg(response_time)

UPDATE:

Thanks for the prompt response. Sorry if I wasn't clear. Let's say I have these response times:

1) data = 1,1,3,3,4,4,4,5,6,6,6,9,9,9,9,7,7,7,7,8,8,8,8,2,2

2) Sorted data: 1,1,2,2,3,3,4,4,4,5,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9

3) Discard the highest 10% of the data, keeping 90% by volume and not by value. I'm left with 1,1,2,2,3,3,4,4,4,5,6,6,6,7,7,7,7,8,8,8,8,9,9 (CORRECT)

4) Take the average of that 90%

Do you see the difference? If I use perc90(data) on the series above I get 9, and if I then average the values less than 9, I get 1,1,2,2,3,3,4,4,4,5,6,6,6,7,7,7,7,8,8,8,8 (WRONG) and miss out on the last two nines. Please compare the WRONG and CORRECT series. I hope I've explained the issue.
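To make the difference concrete, here is a rough sketch in plain Python (not SPL, purely for illustration) of the two averages described above, run on the sample data from the post: trimming by event count versus filtering by percentile value.

```python
# Sample response times from the post (25 events).
data = [1,1,3,3,4,4,4,5,6,6,6,9,9,9,9,7,7,7,7,8,8,8,8,2,2]

s = sorted(data)

# Desired behavior: drop the highest 10% of *events* (2 of 25 here)
# and keep 90% by volume.
keep = len(s) - int(0.10 * len(s))            # 23 events kept
by_count = s[:keep]
avg_by_count = sum(by_count) / len(by_count)  # ~5.43 (the CORRECT series)

# Percentile-value filter: perc90 of this series is 9 (per the post),
# so filtering response_time < 9 also drops the two 9s that should
# have been kept.
p90 = 9
by_value = [x for x in s if x < p90]
avg_by_value = sum(by_value) / len(by_value)  # ~5.10 (the WRONG series)
```

On this data set the two approaches keep 23 and 21 events respectively, which is exactly the CORRECT/WRONG discrepancy described above.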

Thanks, Amit


Splunk Employee

This isn't that hard to do, but it's very hard to parallelize; that is, it map-reduces poorly compared with averaging everything below a percentile value. May I ask what the mathematical basis is for preferring this computation over simply averaging everything below the 90th percentile? The two would only produce different results on artificial and pathological data sets, particularly if you measure your response times with sufficient granularity.

... | sort 0 response_time | eventstats count as ttl | streamstats global=t current=t count as pos | where pos<(0.9*ttl) | stats avg(response_time)
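The pipeline's logic can be sketched in plain Python (hypothetical, not SPL): sort the events, number each one, keep those whose position falls below 90% of the total count, then average.

```python
# Sample response times from the thread (25 events).
response_times = [1,1,3,3,4,4,4,5,6,6,6,9,9,9,9,7,7,7,7,8,8,8,8,2,2]

ordered = sorted(response_times)               # | sort 0 response_time
ttl = len(ordered)                             # | eventstats count as ttl
kept = [x for pos, x in enumerate(ordered, start=1)
        if pos < 0.9 * ttl]                    # | streamstats count as pos
                                               # | where pos<(0.9*ttl)
avg = sum(kept) / len(kept)                    # | stats avg(response_time)
```

Note that the strict `<` comparison keeps 22 of the 25 sample events, one fewer than the 23 the asker listed in step 3; the comparison and rounding at the boundary can be adjusted depending on whether the cutoff event should be included.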


Splunk Employee

If I understand correctly, you want to calculate the arithmetic mean ("average") of all values below the (e.g.) 90th percentile? It looks like you already know how to do that. However, I can't quite follow what you want to do next, i.e., what you mean by "sort and then head x percentage".


SplunkTrust

I'm having difficulty understanding: in your example, what does "average X percentage of total which means may be 1,2,3,4,5,6,6" mean? Maybe add more examples?
