Hi,
I'm new to Splunk and I'm quite stuck on how to group users by percentile.
Each user has the option of paying for services and I want to group these users by their payment percentile. So if the max anyone has cumulatively paid is $100, they would show up in the 99th percentile while the 50th percentile would be someone who paid $50 or more. Based on that grouping I want to know the average number of sessions they've logged in my app.
It'd look something like this:
Percentile | Avg Payment | Sessions |
99% | 99 | 50 |
80% | 82 | 44 |
50% | 60 | 25 |
0% | 0 | 60 |
I would be grateful for any help on how to achieve this. Thanks!
Here's what you do, assuming your data has an event per session with a user ID and the amount paid for that session:
| inputlookup sessions.csv
| stats count as sessions sum(amount) as total by user
| eventstats p99(total) as p99 p80(total) as p80 p50(total) as p50
| eval Percentile = case(total >= p99, "99%", total >= p80, "80%", total >= p50, "50%", 1=1, "0%")
| stats avg(total) as average avg(sessions) as sessions by Percentile
The inputlookup loads my dummy data, see below.
The stats calculates the number of sessions and total amount paid per user.
The eventstats uses those totals to calculate where the percentile borders are.
The eval groups the users into their percentile buckets, without duplication - if you're in the 99% bucket you're not also in all the other buckets.
The final stats computes the average total amount and average number of sessions per percentile bucket.
Sample data looks like this:
session,user,amount
1,a,10
2,a,15
3,a,20
4,b,0
5,c,100
6,d,10
7,e,20
8,e,10
9,f,1
10,g,5
11,h,6
12,i,7
13,j,8
14,k,9
15,l,10
16,m,11
17,n,12
18,o,13
19,p,14
20,q,15
Output based on that sample data:
Percentile average sessions
0% 5.142857 1
50% 11.666667 1
80% 30.000000 2
99% 100.000000 1
To cross-check, let's look at the 80% bucket. I've got 17 users, so the lower-bound 80% bucket should be the top four people. The top one user is already in the 99% bucket, so we're looking at 2, 3, and 4 - they're a, e, and q and spent 45, 30, and 15 respectively giving you an average of 30.
Here's what you do, assuming your data has an event per session with a user ID and the amount paid for that session:
| inputlookup sessions.csv
| stats count as sessions sum(amount) as total by user
| eventstats p99(total) as p99 p80(total) as p80 p50(total) as p50
| eval Percentile = case(total >= p99, "99%", total >= p80, "80%", total >= p50, "50%", 1=1, "0%")
| stats avg(total) as average avg(sessions) as sessions by Percentile
The inputlookup loads my dummy data, see below.
The stats calculates the number of sessions and total amount paid per user.
The eventstats uses those totals to calculate where the percentile borders are.
The eval groups the users into their percentile buckets, without duplication - if you're in the 99% bucket you're not also in all the other buckets.
The final stats computes the average total amount and average number of sessions per percentile bucket.
Sample data looks like this:
session,user,amount
1,a,10
2,a,15
3,a,20
4,b,0
5,c,100
6,d,10
7,e,20
8,e,10
9,f,1
10,g,5
11,h,6
12,i,7
13,j,8
14,k,9
15,l,10
16,m,11
17,n,12
18,o,13
19,p,14
20,q,15
Output based on that sample data:
Percentile average sessions
0% 5.142857 1
50% 11.666667 1
80% 30.000000 2
99% 100.000000 1
To cross-check, let's look at the 80% bucket. I've got 17 users, so the lower-bound 80% bucket should be the top four people. The top one user is already in the 99% bucket, so we're looking at 2, 3, and 4 - they're a, e, and q and spent 45, 30, and 15 respectively giving you an average of 30.
That yields this table for my dummy data:
Percentile average sessions
99% 100 1
80% 47.500000 1.750000
50% 26 1.300000
0% 17.411765 1.176471
As a cross-check, running this:
| inputlookup sessions.csv
| stats count as sessions sum(amount) as total by user
| stats avg(total)
yields an average total per user of 17.411765, just as predicted by the 0% value 🙂
Sure, it just adds a bit of post-processing:
| inputlookup sessions.csv
| stats count as sessions sum(amount) as total by user
| eventstats p99(total) as p99 p80(total) as p80 p50(total) as p50
| eval Percentile = case(total >= p99, "99%", total >= p80, "80%", total >= p50, "50%", 1=1, "0%")
| stats count sum(total) as total sum(sessions) as sessions by Percentile
| sort - Percentile
| streamstats sum(*) as running_*
| eval average = running_total / running_count
| eval sessions = running_sessions / running_count
| table Percentile average sessions
Thanks so much for this explanation. I think my biggest mistake was using stats to calculate the actual percentile values instead of eventstats. Is there any way for a user to be included in multiple groups? So the top user would be in the 99%, 80% and 50%?
Could you post some sample events?
will it be
....|eval percentile=case(Payment=="50$" AND Payment=="100$","50%",Payment="100$","100%",cases..)|stats count(Sessions) as Session by percentile,Payment
It is a bit confusing... The top 50% would be both the person who paid $50 and the person who paid $100 while the 99th percentile is the top 1% so just the guy who paid $100.
I have not looked into rangemap yet, thanks for the tip!
so does that mean 99% is the person who has paid 100$ and the one paid 50$ they are in 50%? little confusing. You could try
eval per=case
or
rangemap
i guess that is the case here.