Solved: Re: How to group by percentile?

gtran · ‎04-14-2014

Hi,
I'm new to Splunk and I'm quite stuck on how to group users by percentile.

Each user has the option of paying for services and I want to group these users by their payment percentile. So if the max anyone has cumulatively paid is $100, they would show up in the 99th percentile while the 50th percentile would be someone who paid $50 or more. Based on that grouping I want to know the average number of sessions they've logged in my app.

It'd look something like this:

Percentile	Avg Payment	Sessions
99%	99	50
80%	82	44
50%	60	25
0%	0	60

I would be grateful for any help on how to achieve this. Thanks!

martin_mueller · ‎04-14-2014

Here's what you do, assuming your data has an event per session with a user ID and the amount paid for that session:

| inputlookup sessions.csv
| stats count as sessions sum(amount) as total by user
| eventstats p99(total) as p99 p80(total) as p80 p50(total) as p50
| eval Percentile = case(total >= p99, "99%", total >= p80, "80%", total >= p50, "50%", 1=1, "0%")
| stats avg(total) as average avg(sessions) as sessions by Percentile

The inputlookup loads my dummy data, see below.
The stats calculates the number of sessions and total amount paid per user.
The eventstats uses those totals to calculate where the percentile borders are.
The eval groups the users into their percentile buckets, without duplication - if you're in the 99% bucket you're not also in all the other buckets.
The final stats computes the average total amount and average number of sessions per percentile bucket.

Sample data looks like this:

session,user,amount 1,a,10 2,a,15 3,a,20 4,b,0 5,c,100 6,d,10 7,e,20 8,e,10 9,f,1 10,g,5 11,h,6 12,i,7 13,j,8 14,k,9 15,l,10 16,m,11 17,n,12 18,o,13 19,p,14 20,q,15

Output based on that sample data:

Percentile average sessions 0% 5.142857 1 50% 11.666667 1 80% 30.000000 2 99% 100.000000 1

To cross-check, let's look at the 80% bucket. I've got 17 users, so the lower-bound 80% bucket should be the top four people. The top one user is already in the 99% bucket, so we're looking at 2, 3, and 4 - they're a, e, and q and spent 45, 30, and 15 respectively giving you an average of 30.

View solution in original post

martin_mueller · ‎04-14-2014

Here's what you do, assuming your data has an event per session with a user ID and the amount paid for that session:

| inputlookup sessions.csv
| stats count as sessions sum(amount) as total by user
| eventstats p99(total) as p99 p80(total) as p80 p50(total) as p50
| eval Percentile = case(total >= p99, "99%", total >= p80, "80%", total >= p50, "50%", 1=1, "0%")
| stats avg(total) as average avg(sessions) as sessions by Percentile

The inputlookup loads my dummy data, see below.
The stats calculates the number of sessions and total amount paid per user.
The eventstats uses those totals to calculate where the percentile borders are.
The eval groups the users into their percentile buckets, without duplication - if you're in the 99% bucket you're not also in all the other buckets.
The final stats computes the average total amount and average number of sessions per percentile bucket.

Sample data looks like this:

session,user,amount 1,a,10 2,a,15 3,a,20 4,b,0 5,c,100 6,d,10 7,e,20 8,e,10 9,f,1 10,g,5 11,h,6 12,i,7 13,j,8 14,k,9 15,l,10 16,m,11 17,n,12 18,o,13 19,p,14 20,q,15

Output based on that sample data:

Percentile average sessions 0% 5.142857 1 50% 11.666667 1 80% 30.000000 2 99% 100.000000 1

To cross-check, let's look at the 80% bucket. I've got 17 users, so the lower-bound 80% bucket should be the top four people. The top one user is already in the 99% bucket, so we're looking at 2, 3, and 4 - they're a, e, and q and spent 45, 30, and 15 respectively giving you an average of 30.

martin_mueller · ‎04-14-2014

That yields this table for my dummy data:

Percentile average sessions 99% 100 1 80% 47.500000 1.750000 50% 26 1.300000 0% 17.411765 1.176471

As a cross-check, running this:

| inputlookup sessions.csv
| stats count as sessions sum(amount) as total by user
| stats avg(total)

yields an average total per user of 17.411765, just as predicted by the 0% value 🙂

martin_mueller · ‎04-14-2014

Sure, it just adds a bit of post-processing:

| inputlookup sessions.csv
| stats count as sessions sum(amount) as total by user
| eventstats p99(total) as p99 p80(total) as p80 p50(total) as p50
| eval Percentile = case(total >= p99, "99%", total >= p80, "80%", total >= p50, "50%", 1=1, "0%")
| stats count sum(total) as total sum(sessions) as sessions by Percentile
| sort - Percentile
| streamstats sum(*) as running_*
| eval average = running_total / running_count
| eval sessions = running_sessions / running_count
| table Percentile average sessions

gtran · ‎04-14-2014

Thanks so much for this explanation. I think my biggest mistake was using stats to calculate the actual percentile values instead of eventstats. Is there any way for a user to be included in multiple groups? So the top user would be in the 99%, 80% and 50%?

somesoni2 · ‎04-14-2014

Could you post some sample events?

linu1988 · ‎04-14-2014

will it be

....|eval percentile=case(Payment=="50$" AND Payment=="100$","50%",Payment="100$","100%",cases..)|stats count(Sessions) as Session by percentile,Payment

gtran · ‎04-14-2014

It is a bit confusing... The top 50% would be both the person who paid $50 and the person who paid $100 while the 99th percentile is the top 1% so just the guy who paid $100.

I have not looked into rangemap yet, thanks for the tip!

linu1988 · ‎04-14-2014

so does that mean 99% is the person who has paid 100$ and the one paid 50$ they are in 50%? little confusing. You could try

eval per=case

or

rangemap

i guess that is the case here.

How to group by percentile?

Data Management Digest – December 2025

Index This | What is broken 80% of the time by February?

Unlock Faster Time-to-Value on Edge and Ingest Processor with New SPL2 Pipeline ...

Join the Conversation

How to group by percentile?

Data Management Digest – December 2025

Index This | What is broken 80% of the time by February?

Unlock Faster Time-to-Value on Edge and Ingest Processor with New SPL2 Pipeline ...