I have a couple of questions and issues regarding this method of CPU metric calculation.

On the question front: is there a reason it was chosen to make 1 CPU (or 1000 millicores) directly equivalent to 100% utilization, regardless of the limits on the pods themselves? I agree that comparing against all CPUs available on the host machine would be silly for pod-level utilization metrics, but why the rigid assumption that 1 CPU would always be the limit? Could the cluster agent use the available metadata on pod limits to define the true utilization within the spec of its workload? Using the example in this post, 4630m utilization against a limit of 8000m is really ~58%, not 463%, when taken in the context of the expected headroom for that pod. That same 4630m would be far more concerning if the limit were 4700m, yet it would be reported as 463% either way with the current setup.

In effect, this system has a couple of observable downsides with the current AppD setup. First, if you use these metrics as written against the default node health hardware CPU utilization health rules that live as boilerplate in applications, anything that uses more than 1 CPU (even with plenty of headroom relative to its limit) will alert as too high, because those are static thresholds expecting a 0-100 range. You could disable those and create custom rules, but that doesn't scale to a large number of differing workloads that all have differing limits. Second, on a similar note, the server CPU page in the context of an application node has a maximum of 100%, so any value over that will not display. In the example above, a 463% CPU utilization would simply show up on this page as 100%, so even more context about the real resource headroom is lost.
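To illustrate the calculation I'm suggesting, here is a rough sketch (this is just my own pseudocode of the idea, not the cluster agent's actual API; the function name and the fallback-to-1000m behavior are assumptions on my part):

```python
from typing import Optional


def cpu_utilization_pct(usage_millicores: float, limit_millicores: Optional[float]) -> float:
    """Report CPU utilization relative to the pod's own limit when one is set.

    Falls back to the current 1000m (= 1 CPU) baseline when no limit exists,
    since "percent of what?" still has to mean something for unbounded pods.
    """
    baseline = limit_millicores if limit_millicores else 1000.0
    return usage_millicores / baseline * 100.0


# Example from the post: 4630m of usage
print(cpu_utilization_pct(4630, 8000))  # ~57.9% -- plenty of headroom
print(cpu_utilization_pct(4630, 4700))  # ~98.5% -- the genuinely concerning case
print(cpu_utilization_pct(4630, None))  # 463.0% -- the current behavior
```

With something like this, the 8000m-limit pod and the 4700m-limit pod would no longer report the identical 463%, and the values would stay in a 0-100 range that the default health rules and the server CPU page already expect.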