Hello,
I have some data for which I calculate hourly avg/max/stdev into a summary index, then calculate daily summary rollups of the same stats based on the hourly rollups. (Running daily rollups directly on the raw data is infeasible because the data set is too large.)
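For concreteness, the two rollups look roughly like this (index and field names are simplified placeholders; daily stdev handling omitted):

index=mydata
| bucket span=1h _time
| stats avg(value) as avg_value max(value) as max_value stdev(value) as stdev_value by _time
| collect index=summary_hourly

index=summary_hourly
| bucket span=1d _time
| stats avg(avg_value) as avg_value max(max_value) as max_value by _time
| collect index=summary_daily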
I've been asked to add p95 to these rollups as well. Syntactically it's easy to add p95(value) to my rollups, but mathematically I wondered what this actually calculates. The hourly rollup has the 95th percentile value from the raw data, but what would the daily rollup contain: the 95th percentile of those 24 hourly p95 values? Or does it do some kind of magic and approximate the 95th percentile from the (rolled-up and invisible) raw values? Basically, how close to the "true" raw p95 value is a p95 that's based on hourly summary data?
As much as I'd like to think that Splunk is magic, it has no feature to recalculate from raw data if the search doesn't run on the raw data.
If you do an hourly summary with p95(value) in it, a daily summary on those hourly summaries is the 95th percentile of your 24 data points per day. Maybe you can do a p95(value) at the end of each day for that day? Because with min and max (and avg, as long as every hourly bucket holds the same number of raw events), the summaries should be okay to use in subsequent summaries. I'm not sure what it's called, but I think there is a mathematical word for this "stability" over multiple iterations... if anyone knows, feel free to comment.
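To illustrate with made-up numbers: take two hours of values {1, 2, 9} and {3, 4, 5}. The max of the hourly maxes, max(9, 5) = 9, is exactly the max over all six raw values. But the median of the hourly medians is median(2, 4) = 3, while the true median of {1, 2, 3, 4, 5, 9} is 3.5; the intermediate aggregation throws away the information a percentile needs.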
Anyhow, min/max are okay, but percentiles will differ. How much this deviates from the "actual" value is not easily predictable. Take this example:
index=_internal earliest=-1h latest=now kbps="*"
| bucket span=1m _time as time_buckets_minutes
| bucket span=10m _time as time_buckets_10m
| eventstats p70(kbps) as p70_minute by time_buckets_minutes
| eventstats p70(kbps) as p70_10m by time_buckets_10m
| eventstats p70(p70_minute) as p70_10m_from_p70_minute by time_buckets_10m
| table _time time_buckets_* kbps p70_*
| fieldformat time_buckets_minutes=strftime(time_buckets_minutes, "%F %T")
| fieldformat time_buckets_10m=strftime(time_buckets_10m, "%F %T")
It shows the raw sample data (_time, kbps), two bucketed time spans to simulate your summary ranges (one per minute, one per ten minutes), and then the 70th percentile of kbps per minute (p70_minute), per ten minutes (p70_10m), and once more per ten minutes based on the per-minute values (p70_10m_from_p70_minute). (I chose the 70th percentile for the sake of easier comprehension.) If you compare p70_10m with p70_10m_from_p70_minute, you'll see the differences. If you append
| eval diff=abs(p70_10m - p70_10m_from_p70_minute)
| stats avg(kbps) p10(diff) p90(diff) avg(diff)
to the above search, you get some statistics about the differences. But this may be deceiving, because the size of the deviation depends entirely on the raw data.
I'd suggest doing a summary over the whole day's raw data after each day to calculate that day's percentiles.
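A minimal sketch of such a scheduled search, reusing the made-up index and field names from the question (run it shortly after midnight for the previous day):

index=mydata earliest=-1d@d latest=@d
| stats p95(value) as p95_value
| collect index=summary_daily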
Yeah, I kind of figured there wasn't any magic. We have the same problem with average, that we're calculating an average of averages, but I knew about that one. I don't think running any daily search over the raw data will fly, but I'll see if people are okay with the reduced accuracy of the p* values over the hourly data. Thanks.
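For the average, at least, there's an exact fix: have the hourly rollup store sum and count instead of (or alongside) the average, then divide at the daily level. Something like this, with made-up names again:

index=mydata
| bucket span=1h _time
| stats sum(value) as sum_value count as cnt_value by _time
| collect index=summary_hourly

index=summary_hourly
| bucket span=1d _time
| stats sum(sum_value) as total sum(cnt_value) as cnt by _time
| eval avg_value=total/cnt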