[KEP-4205] Concerns on using CPU PSI pressure to taint nodes #5062
Comments
/sig node |
So step 1 will be to expose metrics for PSI. Our hope is to bring KEP-4205 to alpha in 1.33 by exposing PSI metrics. Thank you for raising these concerns at this stage. |
I think that the issue I described here is specific to the CPU limit, due to how cgroups enforce it with throttling that, in PSI's eyes, is indistinguishable from a stall caused by CPU contention. |
Thanks for the investigation and writeup @tiraboschi! I am wondering: if, instead of using the top-level PSI, we use `kubepods.slice` and only look at the `full` line, does that give more useful information? We could also maybe take `system.slice` into account to make sure system processes have enough CPU to do their work. |
We can easily reproduce it.
At node level we see:
So the CPUs are idle for 0.1% of the time. Now in terms of PSI pressure we see:
but we are not able to tell if 83% on average in the last 10 seconds on the `some` line is due to real CPU contention or just to throttling,
and the same consideration as above applies for the `full` line.
So here we somehow see that the CPU contention due to k8s pods is causing some performance impact. |
CPU stalling information almost seems like it'd be most useful in the opposite direction: for a VPA or other controller to see that there is unused CPU and scale up pods before another pod needs it. I imagine that once we have PSI metrics in and can start collecting the data, we could find some level of pressure we want a node to stay at to be fully utilized. |
Phase 2 of KEP-4205 (PSI Based Node Conditions) is proposing to utilize the node level PSI metric to set node condition and node taints.
We conducted some investigation, observing it on a real cluster, and we concluded that it could be really risky.
We can state that we cannot (yet?) directly/solely use PSI metrics at node level to identify nodes under "pressure" and taint them.
At least not regarding CPU pressure when we have pods with stringent CPU limits.
This is because PSI is currently not able to distinguish pressure caused by contention for a scarce resource from pressure due to CPU throttling against the limit that the user explicitly asked for.
As per the kernel documentation, the pressure interface exposed under `/proc/pressure/` (and in `cpu.pressure` for cgroups) looks something like this (values are illustrative):
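```
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
```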
where the “some” line indicates the share of time in which at least some tasks are stalled on a given resource.
The “full” line indicates the share of time in which all non-idle tasks were stalled simultaneously; CPU full is undefined at the system level, but has been reported since 5.13, so it is set to zero for backward compatibility.
So basically for CPU we have only the "some" line at node level, and even a single misconfigured pod is already some.
Let's try, for instance, with a simple pod running `stress-ng` with 8 parallel CPU stressors, but with the container limited to 0.02 cores (20 millicores).
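A minimal pod reproducing this setup could look like the following sketch; the pod name and image are placeholders (any image that ships `stress-ng` will do), not taken from the original test:

```shell
# Sketch of a reproduction pod: 8 CPU stressors, hard-limited to 20 millicores.
# Pod name and image are placeholders, not from the original report.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: psi-cpu-stress
spec:
  restartPolicy: Never
  containers:
  - name: stress
    image: docker.io/colinianking/stress-ng:latest   # any image providing stress-ng works
    command: ["stress-ng", "--cpu", "8", "--timeout", "10m"]
    resources:
      limits:
        cpu: "20m"
EOF
```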
Kubernetes will translate `spec.resources.limits.cpu: "20m"` to `2000 100000` in `cpu.max` (a quota of 2000µs of CPU time every 100000µs period, i.e. 2% of one core) for the cgroup slice of the test pod. Now if we check the CPU pressure for that pod (reading its cgroup slice) we will find something like:
Which is basically correct and accurate when reported at the workload level, since that container is getting far less CPU than it needs and so it is under significant pressure.
Now the issue is how to read this at node level: we can safely assume that the node is definitely not overloaded, since the problematic pod is getting only a small amount of CPU due to the throttling.
But when we look at the CPU pressure at Kubelet level we see something like:
since "some" of the slices under the Kubelet slice are at that high pressure.
And the same at node level:
(exactly the same reading it from `/proc`), since "some" (in our corner case just our problematic test pod, but formally still some) of the slices running on that node are under considerable CPU pressure.
So, although this is absolutely correct according to the pressure interface as reported by the kernel (since at least one pod, so "some", was really suffering due to the lack of CPU), we shouldn't really take any action based on that.
In our exaggerated test corner case, the lack of CPU was only caused by CPU throttling and not really by resource contention with other neighbors. In this specific case, tainting the node to prevent scheduling additional load there will provide no benefits.
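For completeness, here is a sketch of how the CPU pressure discussed above can be read at the various levels, assuming cgroup v2 with the systemd cgroup driver; paths may differ on other setups, and this is not necessarily the exact tooling used for the test:

```shell
# Node-level CPU pressure, i.e. what phase 2 of KEP-4205 would act on
cat /proc/pressure/cpu

# CPU pressure aggregated over all pod workloads managed by the kubelet
cat /sys/fs/cgroup/kubepods.slice/cpu.pressure

# CPU pressure of system daemons, for comparison
cat /sys/fs/cgroup/system.slice/cpu.pressure

# CPU pressure and quota of the test pod's own slice
# (substitute the pod UID; adjust the path for the pod's QoS class)
POD_SLICE=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod_uid>.slice
cat "$POD_SLICE/cpu.pressure"
cat "$POD_SLICE/cpu.max"   # "2000 100000" for a 20m CPU limit
```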