[KEP-4205] Concerns on using CPU PSI pressure to taint nodes #5062

Open

tiraboschi opened this issue Jan 21, 2025 · 8 comments
Labels
sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@tiraboschi

Phase 2 of KEP-4205 (PSI Based Node Conditions) proposes to use the node-level PSI metric to set node conditions and node taints.

We conducted some investigation, observing this on a real cluster, and we can conclude that it could be really risky.
We cannot (yet?) directly or solely use node-level PSI metrics to identify nodes under "pressure" and taint them.

At least not for CPU pressure, when we have pods with stringent CPU limits.
This is because PSI is currently unable to distinguish pressure caused by contention on a scarce resource from pressure due to CPU throttling against the limit that the user explicitly asked for.
According to the kernel documentation, the pressure interface looks like this:

some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

where the "some" line indicates the share of time in which at least some tasks are stalled on a given resource.
CPU "full" is undefined at the system level, but it has been reported since kernel 5.13, so it is set to zero for backward compatibility.
So for CPU at node level we effectively only have the "some" line, and even a single misconfigured pod already counts as "some".
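
For reference, parsing this interface is straightforward. Below is a minimal sketch in Go (the helper names are made up for illustration); it only assumes the some/full line format documented above and works the same for /proc/pressure/* and for cgroup *.pressure files:

package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// psiLine holds the averages of one "some" or "full" line of a PSI file.
type psiLine struct {
    Avg10, Avg60, Avg300 float64
    Total                uint64
}

// parsePressure parses the documented format:
//   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
//   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
func parsePressure(path string) (map[string]psiLine, error) {
    raw, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    out := map[string]psiLine{}
    for _, line := range strings.Split(strings.TrimSpace(string(raw)), "\n") {
        fields := strings.Fields(line)
        if len(fields) != 5 {
            continue
        }
        var l psiLine
        for _, kv := range fields[1:] {
            parts := strings.SplitN(kv, "=", 2)
            switch parts[0] {
            case "avg10":
                l.Avg10, _ = strconv.ParseFloat(parts[1], 64)
            case "avg60":
                l.Avg60, _ = strconv.ParseFloat(parts[1], 64)
            case "avg300":
                l.Avg300, _ = strconv.ParseFloat(parts[1], 64)
            case "total":
                l.Total, _ = strconv.ParseUint(parts[1], 10, 64)
            }
        }
        out[fields[0]] = l // keyed by "some" or "full"
    }
    return out, nil
}

func main() {
    psi, err := parsePressure("/proc/pressure/cpu")
    if err != nil {
        panic(err)
    }
    fmt.Printf("some avg10=%.2f full avg10=%.2f\n", psi["some"].Avg10, psi["full"].Avg10)
}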

Let's try, for instance, with a simple pod running stress-ng with 8 parallel CPU stressors, but with the container limited to 0.02 cores (20 millicores).

apiVersion: v1
kind: Pod
metadata:
  name: stress-ng
spec:
  containers:
    - name: stress-ng
      image: quay.io/tiraboschi/stress-ng
      command: ['/usr/bin/stress-ng', '--temp-path', '/var/tmp/', '--cpu', '8']
      resources:
        requests:
          cpu: "10m"
        limits:
          cpu: "20m"

Kubernetes will translate spec.resources.limits.cpu: "20m" to
2000 100000 in cpu.max for the cgroup slice of the test pod.
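
As a sanity check on that translation, the arithmetic is simply the millicore value converted to a CFS quota over the default 100ms period. A tiny illustrative sketch in Go, not the kubelet's actual code:

package main

import "fmt"

// milliCPUToQuota mirrors the conversion described above: a CPU limit in
// millicores becomes a CFS quota in microseconds over the default
// 100000us (100ms) period, so "20m" -> "2000 100000" in cpu.max.
// Illustrative only, not the kubelet's implementation.
func milliCPUToQuota(milliCPU int64) (quota, period int64) {
    period = 100000
    quota = milliCPU * period / 1000
    return quota, period
}

func main() {
    q, p := milliCPUToQuota(20)           // limits.cpu: "20m"
    fmt.Printf("cpu.max: %d %d\n", q, p)  // prints: cpu.max: 2000 100000
}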

Now if we check the CPU pressure for that pod (reading its cgroup slice) we will find something like:

sh-5.1# cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5d95f454_57f6_4fc3_b5d4_6c7cf673b0cd.slice/cpu.pressure
some avg10=99.00 avg60=98.93 avg300=98.83 total=109943161035
full avg10=99.00 avg60=98.93 avg300=98.83 total=109935133476

This is basically correct and accurate when reported at the workload level, since that container is getting far less CPU than it needs and so it is under significant pressure.
Now the issue is how to read this at node level: we can safely assume that the node is definitely not overloaded, since the problematic pod is only getting a small amount of CPU due to the throttling.

But when we look at the CPU pressure at Kubelet level we see something like:

sh-5.1# cat /sys/fs/cgroup/kubepods.slice/cpu.pressure
some avg10=88.89 avg60=89.90 avg300=92.84 total=678445993193
full avg10=3.13 avg60=4.25 avg300=17.54 total=645991129996

since "some" of the slices under kubepods.slice are under that high pressure.

And the same at node level:

sh-5.1# cat /sys/fs/cgroup/cpu.pressure
some avg10=85.02 avg60=88.95 avg300=92.77 total=670759972464
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

(exactly the same as reading it from /proc/pressure/cpu), since some of the slices running on that node (in our corner case just our problematic test pod, but formally still "some") are under considerable CPU pressure.

So, although this is absolutely correct according to the pressure interface as reported by the kernel (since at least one pod, so "some", was really suffering from a lack of CPU), we shouldn't take any action based on it.
In our exaggerated test corner case, the lack of CPU was caused only by CPU throttling and not by resource contention with other neighbours. In this specific case, tainting the node to prevent scheduling additional load there provides no benefit.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 21, 2025
@tiraboschi tiraboschi changed the title Concerns on using CPU PSI pressure to taint nodes [KEP-4205] Concerns on using CPU PSI pressure to taint nodes Jan 21, 2025
@tiraboschi
Author

/sig node
/wg node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jan 21, 2025
@k8s-ci-robot
Contributor

@tiraboschi: The label(s) wg/node cannot be applied, because the repository doesn't have them.

In response to this:

/sig node
/wg node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 21, 2025
@fabiand

fabiand commented Jan 21, 2025

cc @mrunalp @kannon92

@kannon92
Contributor

So step 1 will be to expose metrics for PSI. Our hope is to bring KEP-4205 to alpha in 1.33 by exposing PSI metrics.

Thank you for raising these concerns.
Do your concerns also include memory/IO PSI pressure? We don't currently taint nodes based on "CPU pressure", so this would be a net-new feature for Kubernetes.

@tiraboschi
Author

I think the issue I described here is specific to the CPU limit, due to how cgroup enforces it with throttling, which in PSI's eyes is indistinguishable from a stall caused by CPU contention.
A memory limit declared on a pod is instead enforced by the kernel with the OOM killer, which can kill containers exceeding their memory limit.
But PSI memory pressure tells us something else: it is not going to tell us how much memory a container (cgroup slice) is using relative to its memory limit, but how much time a cgroup slice is simply waiting to get the memory it needs, for instance due to swap-in/swap-out, page cache refaults, or page reclaims.
So in the case of a single pod suffering from memory "slowness", I expect that other pods on the same node are also going to be affected, and in that case the "some" line in the PSI interface is still relevant.
IO is similar to memory, and we do not have IO limits on pods.
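
To make the distinction concrete, here is a small illustrative sketch in Go (the cgroup path is a hypothetical example): memory.current and memory.max answer "how much memory is this cgroup using relative to its limit", while memory.pressure answers "how much time does this cgroup spend stalled waiting for memory", which is what PSI reports:

package main

import (
    "fmt"
    "os"
    "strings"
)

// Hypothetical pod cgroup path, for illustration only.
const podCgroup = "/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice"

// read returns the trimmed content of a cgroup file, or "n/a" if unreadable.
func read(path string) string {
    raw, err := os.ReadFile(path)
    if err != nil {
        return "n/a"
    }
    return strings.TrimSpace(string(raw))
}

func main() {
    // "How much memory am I using, relative to my limit?"
    // -> memory.current vs memory.max (the limit the user asked for).
    fmt.Println("memory.current:", read(podCgroup+"/memory.current"))
    fmt.Println("memory.max:    ", read(podCgroup+"/memory.max"))

    // "How much time am I stalled waiting for memory (reclaim, refaults, swap)?"
    // -> memory.pressure, which is what PSI reports.
    fmt.Println("memory.pressure:")
    fmt.Println(read(podCgroup + "/memory.pressure"))
}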

@haircommander
Contributor

Thanks for the investigation and writeup @tiraboschi ! I am wondering: if, instead of using the top-level PSI, we use kubepods.slice and only look at the full line, does that give more useful information? We could also maybe take the system.slice into account to make sure system processes have enough CPU to do their tasks.

@tiraboschi
Author

The full line on the kubepods.slice will not help, due to the dual issue: according to the kernel documentation, the "full" line indicates the share of time in which *all* non-idle tasks are stalled on a given resource simultaneously. But some of our pods are expected to get some CPU cycles at any given time, so we basically never expect all the pods to be completely stuck at the same time.
The system.slice could instead be somehow interesting.

We can easily reproduce it.
This time let's use a ReplicaSet (with enough replicas to overload the node given its hardware) and a node selector, to make sure that all of the pods land on the same labelled node.
This time let's also avoid setting resources.limits.cpu, to be sure that all of the reported pressure is really due to CPU contention and not to cgroup throttling.
Here I'm going with 24 CPU stressors in 6 pods on a node with 16 cores:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: stress-rs
  labels:
    stress: cpu
spec:
  replicas: 6
  selector:
    matchLabels:
      stress: cpu
  template:
    metadata:
      labels:
        stress: cpu
    spec:
      containers:
      - name: stress-ng
        image: quay.io/tiraboschi/stress-ng
        command: ['/usr/bin/stress-ng', '--temp-path', '/var/tmp/', '--cpu', '4']
        resources:
          requests:
            cpu: "10m"
      nodeSelector:
        stress: cpu

At node level we see:

sh-5.1# top -b -n 1 | grep Cpu
%Cpu(s): 96.4 us,  2.7 sy,  0.0 ni,  0.1 id,  0.0 wa,  0.6 hi,  0.3 si,  0.0 st

So the CPUs are idle for only 0.1% of the time.

Now in terms of PSI pressure we see:

  • at node scope:
sh-5.1# cat /proc/pressure/cpu 
some avg10=83.55 avg60=81.64 avg300=57.07 total=752284265521
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

but we are not able to tell whether an 83% average over the last 10 seconds on "some" slice is due to contention or to throttling.
And the full line is 0 by design.

  • at kubepods.slice instead:
sh-5.1# cat /sys/fs/cgroup/kubepods.slice/cpu.pressure
some avg10=83.72 avg60=81.85 avg300=80.40 total=758528777000
full avg10=2.86 avg60=2.76 avg300=2.72 total=693555595647

and the same consideration as above applies to the "some" line.
On the full line we see only a bare 2.86%, because at any given time at least one pod is getting its CPU quota.

  • on the system.slice:
some avg10=10.62 avg60=10.44 avg300=10.33 total=23801467697
full avg10=9.48 avg60=9.32 avg300=9.17 total=12122500001

So here we can somehow see that the CPU contention caused by the k8s pods has some performance impact on system processes.
But is 10% significant enough to take any remediation action?
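
To make the trade-off concrete, here is a rough sketch in Go of the kind of heuristic discussed above: look at the kubepods.slice full line together with the system.slice pressure, and ignore the node-level some line. The thresholds and the decision rule are made-up assumptions for illustration, not anything proposed by KEP-4205:

package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// avg10 extracts the avg10 value of the requested line ("some" or "full")
// from a cgroup v2 pressure file. Minimal parsing, I/O errors only.
func avg10(path, which string) (float64, error) {
    raw, err := os.ReadFile(path)
    if err != nil {
        return 0, err
    }
    for _, line := range strings.Split(strings.TrimSpace(string(raw)), "\n") {
        fields := strings.Fields(line)
        if len(fields) == 5 && fields[0] == which {
            return strconv.ParseFloat(strings.TrimPrefix(fields[1], "avg10="), 64)
        }
    }
    return 0, fmt.Errorf("line %q not found in %s", which, path)
}

func main() {
    // Hypothetical thresholds; real values would have to come from data
    // gathered once PSI metrics are actually exposed.
    const kubepodsFullThreshold = 25.0 // % of time *all* pods are stalled
    const systemSomeThreshold = 40.0   // % of time some system.slice task is stalled

    kubepodsFull, err := avg10("/sys/fs/cgroup/kubepods.slice/cpu.pressure", "full")
    if err != nil {
        panic(err)
    }
    systemSome, err := avg10("/sys/fs/cgroup/system.slice/cpu.pressure", "some")
    if err != nil {
        panic(err)
    }

    // The point of the discussion above: the node-level "some" line alone can
    // be driven to ~100% by a single throttled pod, so this sketch ignores it.
    if kubepodsFull > kubepodsFullThreshold || systemSome > systemSomeThreshold {
        fmt.Println("node looks genuinely CPU pressured: a taint might be justified")
    } else {
        fmt.Println("pressure may just be per-pod throttling: do not taint")
    }
}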

@haircommander
Contributor

CPU stalling information almost seems like it would be most useful in the opposite direction: for a VPA or another controller to see that there is unused CPU and scale pods up before another pod needs it. I imagine that once we have PSI metrics in and we can start collecting data, we could find the level of pressure we want a node to stay at to be fully utilized.
