Splunk Search

Find the average of the first 99% range of data

Path Finder

My search currently gives me some statistics about response times, including total count, average, min, max, and the 99th percentile value (I'm assuming this is computed over the ordered data), based on one of the fields from the query.

I need to find the average for the first 99% of my data (by count) per query name, so my logic was to subtract the minimum value from the 99th percentile value and divide that by 99% of the total count. But when I run this, I often get values outside the range between minimum and maximum for certain queries, e.g. Average: 15533.5, Min: 9076, Max: 24737, *99% Average: 479.0*, which clearly wouldn't make sense even if the numbers are heavily skewed.

This was my initial search attempt:

index=* ...
| stats count AS "Query Count", avg(ResponseTime) AS "AvgRespTime", min(ResponseTime) AS "MinRespTime", max(ResponseTime) AS "MaxRespTime", p99(ResponseTime) AS "99PercRespTime" by queryName
| eval AvgRespTime=round(AvgRespTime,1), 99PercAvgRespTime=('99PercRespTime'-'MinRespTime')/(0.99*'Query Count'), 99PercAvgRespTime=round('99PercAvgRespTime',1)
| sort - 99PercAvgRespTime

*EDIT: I've seen a flaw in my initial idea: I'm taking single values rather than sums, which is how you actually get averages (oops). What I'd ideally want is to subtract the sum of the value(s) at and above the 99th percentile from the total sum, and divide that by the total count minus the count of queries at and above the 99th percentile.*
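In SPL, that sum-based logic could be sketched something like this (field names such as topFlag and trimmedAvg are illustrative, not from the original search):

index=* ...
| eventstats perc99(ResponseTime) as p99Resp by queryName
| eval topFlag=if(ResponseTime>=p99Resp,1,0)
| stats count as totalCount, sum(ResponseTime) as totalSum,
        sum(topFlag) as topCount, sum(eval(if(topFlag=1,ResponseTime,0))) as topSum
        by queryName
| eval trimmedAvg=round((totalSum-topSum)/(totalCount-topCount),1)

Here eventstats attaches the per-queryName 99th percentile to every event, so the sums of the events at and above it can be removed from the totals before dividing.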

(1) Is it even possible to accurately get the average of the first 99% of the data? My logic was that by removing the last 1% of queries (so if there were 1000 queries, the 10 with the highest response times), I could get this average, but what should the query be? I considered taking a trimmed 99% average, but I assume that would remove 0.5% from the lower end as well as the upper end, which is what I've been asked to avoid.

(2) Is there a way to show skewness of data based off statistics on Splunk? If possible, I'd want to show a graph or at least data showing this if possible.
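As far as I know, 'stats' has no built-in skewness function, but a standardized third moment can be computed with 'eventstats' and 'eval' (a sketch; field names are illustrative):

index=* ...
| eventstats avg(ResponseTime) as mu, stdev(ResponseTime) as sigma by queryName
| eval z3=pow((ResponseTime-mu)/sigma,3)
| stats avg(z3) as skewness by queryName

A positive value indicates a right-skewed distribution (a long upper tail); sorting or charting the skewness field per queryName would then show which queries are most affected by outliers.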

1 Solution

Path Finder

I ended up filtering to just the values less than or equal to the 99th percentile value for 'x':

index=* ... "OK"
| eventstats perc99(ResponseTime) as p99Resp by x
| where ResponseTime<=p99Resp
| stats min(ResponseTime) as p98ResponseMin, max(ResponseTime) as p98ResponseMax, avg(ResponseTime) as p98AvgRespTime by x

For the average, I imagine using an 'eval' to format the value may be necessary.
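For example, assuming the same one-decimal rounding used in the initial search (a guess at what's meant here):

| eval p98AvgRespTime=round(p98AvgRespTime,1)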



SplunkTrust

It's not very difficult: just use 'eventstats' to get the 99th percentile and then act accordingly.

your query that gets the data  by queryName
| fields ResponseTime queryName
| eventstats perc99(ResponseTime) as p99Resp by queryName 
| where ResponseTime<=p99Resp
| rename COMMENT as "the above gives you only those records which are below the 99th percentile"

| rename COMMENT as "you can now analyze them by themselves"
| stats count as P98Count, avg(ResponseTime) as p98ResponseAvg, stdev(ResponseTime) as p98ResponseStdev,
        min(ResponseTime) as p98ResponseMin, max(ResponseTime) as p98ResponseMax
        by queryName

...where p98 is the population below the 99th percentile.

...or...

your query that gets the data  by queryName
| fields ResponseTime queryName
| eventstats perc99(ResponseTime) as p99Resp by queryName 
| eval p00Unit=1
| eval p98Unit=if(ResponseTime<=p99Resp,1,0)
| eval p99Unit=if(ResponseTime<=p99Resp,0,1)
| eval p00Response=ResponseTime
| eval p98Response=if(ResponseTime<=p99Resp,ResponseTime,null())
| eval p99Response=if(ResponseTime<=p99Resp,null(),ResponseTime)
| stats sum(p00Unit) as P00Count, avg(p00Response) as p00ResponseAvg, stdev(p00Response) as p00ResponseStdev,
        min(p00Response) as p00ResponseMin, max(p00Response) as p00ResponseMax,
        sum(p98Unit) as P98Count, avg(p98Response) as p98ResponseAvg, stdev(p98Response) as p98ResponseStdev, 
        min(p98Response) as p98ResponseMin, max(p98Response) as p98ResponseMax,
        sum(p99Unit) as P99Count, avg(p99Response) as p99ResponseAvg, stdev(p99Response) as p99ResponseStdev, 
        min(p99Response) as p99ResponseMin, max(p99Response) as p99ResponseMax
        by queryName

...where p00 is the entire population, p98 is the population below the 99th percentile, and p99 is the population above the 99th percentile.

Path Finder

I tweaked the first query so it would work with the initial search. Running a comparison of the query counts, I can see that whilst some of the query names with fewer queries don't lose any, some queries are losing more than 1% from the top (e.g. the query with the highest count initially has a count of 152638, but when running your version it falls to 149949, a drop of around 1.8%). Is this expected?

Also, is there a way I can still display the 100% values for certain items, i.e. Query Count, Min and Max, but show the 99% Average Response as well? I'd like to show that although some queries may take longer (as outliers), the majority (99%) of searches have a low response time, but it's more of a preference.


SplunkTrust

@sepkarimpour - yes, that result is mathematically possible, especially if the top end of the data has lots of clumping. To make up a simple example, suppose the data were one each of the numbers from 1 to 97, then three 100s. The 99th percentile is 100, so only 97% of the data lie strictly below the 99th percentile.
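This can be reproduced with a quick test search (a sketch using 'makeresults'; the exact counts may vary with Splunk's percentile estimation):

| makeresults count=100
| streamstats count as n
| eval ResponseTime=if(n<=97, n, 100)
| eventstats perc99(ResponseTime) as p99Resp
| eval strictlyBelow=if(ResponseTime<p99Resp, 1, 0)
| stats sum(strictlyBelow) as countBelowP99, count as total

When values are clumped at the top like this, the share of events on either side of the 99th-percentile value can differ noticeably from 99%/1%.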

Yes, you can do all kinds of things like that easily. I already did that for you in the last example, where p00 is the whole data, p98 is the data below the 99th percentile, and p99 is the top 1% above it.

You can also use the percentile aggregate functions p25(), p50()/median(), and p75() against the p00 data if you'd like, to show the quartiles.
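For example, the quartiles could be added to the whole-population aggregations in the last example's stats call (a sketch, extending the p00 fields):

| stats sum(p00Unit) as P00Count, avg(p00Response) as p00ResponseAvg,
        p25(p00Response) as p00ResponseQ1, median(p00Response) as p00ResponseMedian, p75(p00Response) as p00ResponseQ3
        by queryName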


SplunkTrust

@sepkarimpour - what do you mean by "first 99%"? Do you mean the data that is below the 99th percentile?


Path Finder

I think so. I want to eliminate the top 1% of queries that take the longest time, in order to show that all the other queries below this have a low response time. The general aim is to show either that this removes the tail-end outliers that could be affecting the average, or that these queries have a higher response time in general, so we can assess what could be causing this.


Esteemed Legend

Does it have to be "the last" portion? If not, check out the (newish) Event Sampling feature, which is a blue link on the left side towards the middle, under the histogram/timeline and above the Events/Patterns/Statistics/Visualization tabs.


Path Finder

That's what I've been asked to provide - using 'Event Sampling' would show a random set of results, and given that some of the query names have single-digit counts while others have thousands, that wouldn't work for this.


SplunkTrust

Have you looked through "finding and removing outliers"? That also references the outlier command, which may be of use too.


Path Finder

I'm currently checking the 'outlier' command - I'm not sure how you can set the upper bound to 99%, though. My concern is that when running the query now I can see multiple query names, all with different counts, so I'm not sure how to set a limit. I'll try using this and see if I can get an answer.
