Hi all,
I would like to calculate the weighted percentile defined as the following:
Value,Weight
1,3
2,2
3,1
4,4
5,5
6,6
7,7
8,8
9,3
10,2
In the above example, if I calculate the 25th weighted percentile, I would calculate that the sum of the total weights are 41 and so the 25th percentile would be the value corresponding to the 10th or 11th in weight which would be 4 or 5 in this case. Similarly the 75th weighted percentile would be corresponding to 30/31 which in this case would be 8. Any suggestions would be helpful.
Thanks,
Andrew
Fun!
So, I'm SURE this can be tweaked out a bit, but assuming your value
and weight
fields show up correctly...
Your Base Search Here ...
| reverse
| streamstats sum(weight) as runningWeight
| search [ search Your Base Search Here AGAIN
| stats sum(weight) AS totalWeight
| eval search="runningWeight<=".totalWeight*.25 ]
| top 1 value
First line is your base search, for instance mine's something like source=mytemptextfile.txt index=mytempindex
.
Second line reverses the results so they'll be in the right order to do the math on.
The third line gets you a running total of your weights. If you run the search up to this point you'll see what it does.
The fourth line starts a search ...
with a subsearch which continues for the next two lines. In this put in the ENTIRE base search again, just like in the first line.
The fifth line just grabs a total weight, right? just a stats sum(blah)
over all the lines in your data.
Line six is where some magic happens. We eval a new field called "search" that equals , well, literally it equals search="runningWeight<=".totalWeight*.25
, which since it ends the subsearch feeds that string back as a substitution into the start of the search/subsearch (line 4).
Line seven's just to grab the top one. Leave this off to see how it culls your data up to that point.
So, a short example. Suppose you have the numbers you have. The subsearch's stats
will end up 41, so totalWeight is 41. Since totalWeight and .25 are both numbers, the math gets done in the subsearch when building the new search string, right? That means that subsearch in this case would collapse out to ...
Your Base Search Here ...
| reverse
| streamstats sum(weight) as runningWeight
| search runningWeight<=10.25
| top 1 value
So if you leave off the top 1 value
at the end you'll see you get all the rows back that are 10.25 or under for the value. The "top 1 value" then just chops off all but the biggest of those, so your value is 4.
To change the percentage, you can change it in the subsearch ( totalWeight*0.33
or whatever).
Fun!
So, I'm SURE this can be tweaked out a bit, but assuming your value
and weight
fields show up correctly...
Your Base Search Here ...
| reverse
| streamstats sum(weight) as runningWeight
| search [ search Your Base Search Here AGAIN
| stats sum(weight) AS totalWeight
| eval search="runningWeight<=".totalWeight*.25 ]
| top 1 value
First line is your base search, for instance mine's something like source=mytemptextfile.txt index=mytempindex
.
Second line reverses the results so they'll be in the right order to do the math on.
The third line gets you a running total of your weights. If you run the search up to this point you'll see what it does.
The fourth line starts a search ...
with a subsearch which continues for the next two lines. In this put in the ENTIRE base search again, just like in the first line.
The fifth line just grabs a total weight, right? just a stats sum(blah)
over all the lines in your data.
Line six is where some magic happens. We eval a new field called "search" that equals , well, literally it equals search="runningWeight<=".totalWeight*.25
, which since it ends the subsearch feeds that string back as a substitution into the start of the search/subsearch (line 4).
Line seven's just to grab the top one. Leave this off to see how it culls your data up to that point.
So, a short example. Suppose you have the numbers you have. The subsearch's stats
will end up 41, so totalWeight is 41. Since totalWeight and .25 are both numbers, the math gets done in the subsearch when building the new search string, right? That means that subsearch in this case would collapse out to ...
Your Base Search Here ...
| reverse
| streamstats sum(weight) as runningWeight
| search runningWeight<=10.25
| top 1 value
So if you leave off the top 1 value
at the end you'll see you get all the rows back that are 10.25 or under for the value. The "top 1 value" then just chops off all but the biggest of those, so your value is 4.
To change the percentage, you can change it in the subsearch ( totalWeight*0.33
or whatever).
thanks a lot for the help rich! this kind of breaks when I would like to return P25, P50, P75 and plot it with Value,Value1,Value2,Weight. Any suggestions for this? your help is greatly appreciated 🙂
I don't have my test box here at home to try anything and I can't think of an elegant way to do what you want. I can think of a simple way, though - append. This will make your search 3x as long because basically you'll just run the set of searches three times, once for each with append in between.
Something like (I removed some extra carriage returns for fear of triggering some sort of TL;DR response...) 🙂
Your Base Search Here ...
| reverse | streamstats sum(weight) as runningWeight
| search [ search Your Base Search Here AGAIN
| stats sum(weight) AS totalWeight
| eval search="runningWeight<=".totalWeight*.25 ] | top 1 value
| append [ Your Base Search Here ...
| reverse | streamstats sum(weight) as runningWeight
| search [ search Your Base Search Here AGAIN
| stats sum(weight) AS totalWeight
| eval search="runningWeight<=".totalWeight*.5 ] | top 1 value ]
| append [Your Base Search Here ...
| reverse | streamstats sum(weight) as runningWeight
| search [ search Your Base Search Here AGAIN
| stats sum(weight) AS totalWeight
| eval search="runningWeight<=".totalWeight*.75 ] | top 1 value ]
You can completely pull the three pieces apart and run them separately for testing and proving they're working independently - append
is just a way to hook together multiple searches. Hopefully I didn't get the square brackets out of order.
thanks a lot rich 🙂
Hi @jambalay_rice
Did the answer by @rich7177 solve your question? If yes, please don't forget to resolve the post by clicking "Accept" directly below his answer. Also, be sure to upvote the answer and/or comment if they helped you come up with your final search.