I have a couple of questions and issues regarding this method of CPU metric calculation.

On the question front: is there a reason it was chosen to make 1 CPU (or 1000 millicores) directly equivalent to 100% utilization, regardless of the limits on the pods themselves? I agree that comparing against all CPUs available on the host machine would be silly for pod-level utilization metrics, but why the rigid assumption that 1 CPU would always be the limit? Could the cluster agent use the available metadata on pod limits to define the true utilization within the spec of its workload? Using the example in this post, 4630m utilization against a limit of 8000m is really ~58%, not 463%, when taken in the context of the expected headroom for that pod. That same 4630m would be far more concerning if the limit were 4700m, yet it would be reported as 463% either way with the current setup.

In effect, this system has a couple of observable downsides with the current AppD setup. First, if you use these metrics as written against the default node health hardware CPU utilization health rules that live as boilerplate in applications, anything that uses more than 1 CPU (even with plenty of headroom relative to its limit) will alert as too high, because those are static thresholds expecting a 0-100 range. You could disable those and create custom rules, but that doesn't scale to a large number of differing workloads that all have differing limits. Second, on a similar note, the server CPU page in the context of an application node has a maximum of 100%, so any value over that will not display. In the example above, a 463% CPU utilization would simply show up on this page as 100%, so even more context about the real resource headroom is lost.
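To illustrate the calculation I'm suggesting, here is a rough sketch (this is just my own pseudocode of the idea, not the cluster agent's actual API; the function name and the fallback-to-1000m behavior are assumptions on my part):

```python
from typing import Optional


def cpu_utilization_pct(usage_millicores: float, limit_millicores: Optional[float]) -> float:
    """Report CPU utilization relative to the pod's own limit when one is set.

    Falls back to the current 1000m (= 1 CPU) baseline when no limit exists,
    since "percent of what?" still has to mean something for unbounded pods.
    """
    baseline = limit_millicores if limit_millicores else 1000.0
    return usage_millicores / baseline * 100.0


# Example from the post: 4630m of usage
print(cpu_utilization_pct(4630, 8000))  # ~57.9% -- plenty of headroom
print(cpu_utilization_pct(4630, 4700))  # ~98.5% -- the genuinely concerning case
print(cpu_utilization_pct(4630, None))  # 463.0% -- the current behavior
```

With something like this, the 8000m-limit pod and the 4700m-limit pod would no longer report the identical 463%, and the values would stay in a 0-100 range that the default health rules and the server CPU page already expect.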