I need help on doing cumulative percentiles, such as p90, over a period of time. This is different from rolling averages or taking the p90 of individual spans of time.
For example, I'm trying to calculate the cumulative, rolling p90 over a month. Here's a table that illustrates what I want:
Day # | P90 Value Calculation |
2020-07-01 (Day 1) | p90 of all events on day 1 |
2020-07-02 (Day 2) | p90 of all events on day 1 and day 2 |
2020-07-03 (Day 3) | p90 of all events from day 1 to day 3 |
... | ... |
2020-07-31 (Day 31) | p90 of all events during the month (day 1 - 31) |
If I wanted to do P90 for individual days, it's very simple: base_query | timechart span=1d p90(latency). Unfortunately, that's not what I want.
There are 100,000+ events each day. I tried using streamstats but didn't get the expected results. I suspect that this is because streamstats only has a maximum window size of 10,000 events. My failed approach to using streamstats was to take the cumulative p90 of every event over a time period, and then return the final event of each day in that period:
Can anyone show me how to do a cumulative p90 in Splunk? I'm using an imperfect workaround of cumulative weighted average p90 values, but it's just an approximation and I'd like to have the real deal instead.
Not sure if this will give you what you want, but you could create a summary index each night giving you the 90th percentile for the days to date and then your query, which I am assuming will be in a dashboard somewhere, will just fetch the data from the summary index
base_query earliest=@mon latest=@d
| stats p90(latency) as p90latency
| addinfo
| eval _time=info_max_time-1
| collect index=p90_latency_summary
The collect statement can either be done this way, or you can just enable summary indexing. Note that max_time -1 will set the timestamp of the event in the summary index to 23:59:59 on the previous day.
Anyway, as well as being very fast to show the search it's simply
index=p90_latency_summary earliest=@mon latest=@d
| table _time p90latency
If you want to clean up the time somewhat, you can always do a timechart span=1d rather than have the time as 23:59:59.
Just wondering what you show on the 1st of the month though...
Hope this helps
That appears to be one better way to work around the issue, but it's not quite the answer I'm looking for. Ideally, I'd be able to do this as an inline search so that the results can be on demand and up-to-date.
I'll give it a spin though, maybe updating the frequency of the summary index could also be helpful.
Yes, ultimately the results will be on a dashboard panel.
I had a similar goal some years ago and ended up going summary indexes. There is another similar approach, using summary indexes, you could use.
The sistats command, intended for summary indexes, will store percentile calculated field, so that you can then recalculate percentiles across a different time span, i.e. it will store all the unique values and counts for a percentile calculated field, so that a new percentile can be calculated from a different time span bin.
Depending on the cardinality of your data and the number of different data points that would need to be stored, especially if you are splitting by some field, it can end up storing a lot of data.
However, if you calculated a _daily_ sistats percentile and summarise/store that, then for the month-to-date percentile, I believe you can calculate the percentile from that summary just by specifying a different bin. This would have a slight advantage in that the daily searches would be faster as they only search 1 day of data, but you would still have the ability to calculate percentiles over _any_ rolling interval rather than the fixed start of the month.
Anyway, hopefully food for thought.