count average for sorted 90% of the events (tricky...

amitsehgal · ‎04-22-2011

This is little tricky.....I already know there's a method available which can give me percentile...perc90... but what i would want is not discard the events by value of a percentile but number of events of a percentile....e.g. sorted values are : 1,2,3,4,5,6,6,6,7,8,9.Let's say x percentile of dataset is 5 (by value)... and if i discard greater than 5. it would discard all the three 6. but what i would want is sort and calculate the average X percentage of total which means may be 1,2,3,4,5,6,6. Is there any suggestions how to do this ? Already have search query as below which gives by value...need to find how i can fit the sort and than head x percentage by count ???

.. | eventstats perc90(response_time) as response_time_90p | where response_time < response_time_90p | stats avg(response_time)

UPDATE:

Thanks for the prompt response. Sorry if I'm not clear. Let's say if i have response times:

1) data= 1,1,3,3,4,4,4,5,6,6,6,9,9,9,9,7,7,7,7,8,8,8,8,2,2

2) Sort data :1,1,2,2,3,3,4,4,4,5,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9

3) Discard Highest 10% of data and keep 90% by volume and not by value. I'm left with 1,1,2,2,3,3,4,4,4,5,6,6,6,7,7,7,7,8,8,8,8,9,9 (CORRECT)

4) take an average of the 90%

Did you see the difference if i use perc90(data) on above series I get '9' and If i will calculate average of series less than value 9 i would get 1,1,2,2,3,3,4,4,4,5,6,6,6,7,7,7,7,8,8,8,8 (WRONG) and would miss out on last two nines. Please compare the series WRONG and CORRECT. I hope i'm able to explain the issue.

Thanks, Amit

gkanapathy · ‎04-25-2011

This isn't that hard to do, but it's very hard to parallelize, i.e., it's very ineffective to map-reduce, compared with averaging below a percentile. May I ask what the mathematical basis for preferring this computation to simply averaging everything below the 90th percentile would be? Especially since they would only have different results when considering artificial and pathological data sets, particularly if you measure your response times with sufficient granularity.

... | sort response_time | eventstats count as ttl | streamstats global=t current=t count as pos | where pos<(0.9*ttl) | stats avg(response_time)

amitsehgal · ‎04-25-2011

Thanks for the prompt response. Sorry if I'm not clear. Let's say if i have response times: 1) data= 1,1,3,3,4,4,4,5,6,6,6,9,9,9,9,7,7,7,7,8,8,8,8,2,2 2) Sort data :1,1,2,2,3,3,4,4,4,5,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9 3) Discard Highest 10% of data and keep 90% by volume and not by value. I'm left with 1,1,2,2,3,3,4,4,4,5,6,6,6,7,7,7,7,8,8,8,8,9,9 (CORRECT) 4) take an average of the 90%

Did you see the difference if i use perc90(data) on above series I get '9' and If i will calculate average of series less than value 9 i would get 1,1,2,2,3,3,4,4,4,5,6,6,6,7,7,7,7,8,8,8,8 (WRONG) and would miss out on last two nines. Please compare the series WRONG and CORRECT. I hope i'm able to explain the issue.

Thanks, Amit

gkanapathy · ‎04-23-2011

If I understand correctly, you calculate the arithmetic mean ("average") of all values that are below the (e.g.) 90th percentile? And it looks like you know how to do that. However, I can't really understand what you then want to do, i.e., what you mean by "sort and then head x percentage"?

sideview · ‎04-23-2011

I'm having difficulty understanding - in your example what does the "average X percentage of total which means may be 1,2,3,4,5,6,6" mean? Maybe add more examples?

count average for sorted 90% of the events (tricky)

Announcing Scheduled Export GA for Dashboard Studio

Extending Observability Content to Splunk Cloud

More Control Over Your Monitoring Costs with Archived Metrics GA in US-AWS!