Splunk Search

How to write a search to calculate a weighed percentile?

jambalaya_rice
Engager

Hi all,

I would like to calculate the weighted percentile defined as the following:

Value,Weight
1,3
2,2
3,1
4,4
5,5
6,6
7,7
8,8
9,3
10,2

In the above example, if I calculate the 25th weighted percentile, I would calculate that the sum of the total weights are 41 and so the 25th percentile would be the value corresponding to the 10th or 11th in weight which would be 4 or 5 in this case. Similarly the 75th weighted percentile would be corresponding to 30/31 which in this case would be 8. Any suggestions would be helpful.

Thanks,
Andrew

0 Karma
1 Solution

Richfez
SplunkTrust
SplunkTrust

Fun!

So, I'm SURE this can be tweaked out a bit, but assuming your value and weight fields show up correctly...

Your Base Search Here ...
| reverse
| streamstats sum(weight) as runningWeight
| search [ search Your Base Search Here AGAIN 
    | stats sum(weight) AS totalWeight 
    | eval search="runningWeight<=".totalWeight*.25 ]
| top 1 value

First line is your base search, for instance mine's something like source=mytemptextfile.txt index=mytempindex.
Second line reverses the results so they'll be in the right order to do the math on.
The third line gets you a running total of your weights. If you run the search up to this point you'll see what it does.
The fourth line starts a search ... with a subsearch which continues for the next two lines. In this put in the ENTIRE base search again, just like in the first line.
The fifth line just grabs a total weight, right? just a stats sum(blah) over all the lines in your data.
Line six is where some magic happens. We eval a new field called "search" that equals , well, literally it equals search="runningWeight<=".totalWeight*.25, which since it ends the subsearch feeds that string back as a substitution into the start of the search/subsearch (line 4).

Line seven's just to grab the top one. Leave this off to see how it culls your data up to that point.

So, a short example. Suppose you have the numbers you have. The subsearch's stats will end up 41, so totalWeight is 41. Since totalWeight and .25 are both numbers, the math gets done in the subsearch when building the new search string, right? That means that subsearch in this case would collapse out to ...

Your Base Search Here ...
| reverse
| streamstats sum(weight) as runningWeight
| search runningWeight<=10.25
| top 1 value

So if you leave off the top 1 value at the end you'll see you get all the rows back that are 10.25 or under for the value. The "top 1 value" then just chops off all but the biggest of those, so your value is 4.

To change the percentage, you can change it in the subsearch ( totalWeight*0.33 or whatever).

View solution in original post

Richfez
SplunkTrust
SplunkTrust

Fun!

So, I'm SURE this can be tweaked out a bit, but assuming your value and weight fields show up correctly...

Your Base Search Here ...
| reverse
| streamstats sum(weight) as runningWeight
| search [ search Your Base Search Here AGAIN 
    | stats sum(weight) AS totalWeight 
    | eval search="runningWeight<=".totalWeight*.25 ]
| top 1 value

First line is your base search, for instance mine's something like source=mytemptextfile.txt index=mytempindex.
Second line reverses the results so they'll be in the right order to do the math on.
The third line gets you a running total of your weights. If you run the search up to this point you'll see what it does.
The fourth line starts a search ... with a subsearch which continues for the next two lines. In this put in the ENTIRE base search again, just like in the first line.
The fifth line just grabs a total weight, right? just a stats sum(blah) over all the lines in your data.
Line six is where some magic happens. We eval a new field called "search" that equals , well, literally it equals search="runningWeight<=".totalWeight*.25, which since it ends the subsearch feeds that string back as a substitution into the start of the search/subsearch (line 4).

Line seven's just to grab the top one. Leave this off to see how it culls your data up to that point.

So, a short example. Suppose you have the numbers you have. The subsearch's stats will end up 41, so totalWeight is 41. Since totalWeight and .25 are both numbers, the math gets done in the subsearch when building the new search string, right? That means that subsearch in this case would collapse out to ...

Your Base Search Here ...
| reverse
| streamstats sum(weight) as runningWeight
| search runningWeight<=10.25
| top 1 value

So if you leave off the top 1 value at the end you'll see you get all the rows back that are 10.25 or under for the value. The "top 1 value" then just chops off all but the biggest of those, so your value is 4.

To change the percentage, you can change it in the subsearch ( totalWeight*0.33 or whatever).

jambalaya_rice
Engager

thanks a lot for the help rich! this kind of breaks when I would like to return P25, P50, P75 and plot it with Value,Value1,Value2,Weight. Any suggestions for this? your help is greatly appreciated 🙂

0 Karma

Richfez
SplunkTrust
SplunkTrust

I don't have my test box here at home to try anything and I can't think of an elegant way to do what you want. I can think of a simple way, though - append. This will make your search 3x as long because basically you'll just run the set of searches three times, once for each with append in between.

Something like (I removed some extra carriage returns for fear of triggering some sort of TL;DR response...) 🙂

Your Base Search Here ...
| reverse | streamstats sum(weight) as runningWeight
| search [ search Your Base Search Here AGAIN 
    | stats sum(weight) AS totalWeight 
    | eval search="runningWeight<=".totalWeight*.25 ] | top 1 value
| append [ Your Base Search Here ...
  | reverse | streamstats sum(weight) as runningWeight
  | search [ search Your Base Search Here AGAIN 
    | stats sum(weight) AS totalWeight 
    | eval search="runningWeight<=".totalWeight*.5 ] | top 1 value ]
| append [Your Base Search Here ...
  | reverse | streamstats sum(weight) as runningWeight
  | search [ search Your Base Search Here AGAIN 
    | stats sum(weight) AS totalWeight 
    | eval search="runningWeight<=".totalWeight*.75 ] | top 1 value ]

You can completely pull the three pieces apart and run them separately for testing and proving they're working independently - append is just a way to hook together multiple searches. Hopefully I didn't get the square brackets out of order.

0 Karma

jambalaya_rice
Engager

thanks a lot rich 🙂

0 Karma

ppablo
Retired

Hi @jambalay_rice

Did the answer by @rich7177 solve your question? If yes, please don't forget to resolve the post by clicking "Accept" directly below his answer. Also, be sure to upvote the answer and/or comment if they helped you come up with your final search.

0 Karma
Get Updates on the Splunk Community!

Enterprise Security Content Update (ESCU) | New Releases

In December, the Splunk Threat Research Team had 1 release of new security content via the Enterprise Security ...

Why am I not seeing the finding in Splunk Enterprise Security Analyst Queue?

(This is the first of a series of 2 blogs). Splunk Enterprise Security is a fantastic tool that offers robust ...

Index This | What are the 12 Days of Splunk-mas?

December 2024 Edition Hayyy Splunk Education Enthusiasts and the Eternally Curious!  We’re back with another ...