[KEP-4205] Concerns on using CPU PSI pressure to taint nodes #5062

Open

tiraboschi opened this issue Jan 21, 2025 · 8 comments
Labels
sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@tiraboschi

Phase 2 of KEP-4205 (PSI Based Node Conditions) proposes to use the node-level PSI metric to set node conditions and node taints.

We conducted some investigation, observing this on a real cluster, and we can conclude that it could be really risky.
We cannot (yet?) directly or solely use node-level PSI metrics to identify nodes under "pressure" and taint them.

At least not for CPU pressure, when we have pods with stringent CPU limits.
This is because PSI is currently unable to distinguish pressure caused by contention on a scarce resource from pressure due to CPU throttling against the limit that the user explicitly asked for.
According to the kernel documentation, the pressure interface looks like this:

some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

where the "some" line indicates the share of time in which at least some tasks are stalled on a given resource.
CPU "full" is undefined at the system level, but it has been reported since kernel 5.13, so it is set to zero for backward compatibility.
So for CPU at node level we effectively only have the "some" line, and even a single misconfigured pod already counts as "some".
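
For reference, parsing this interface is straightforward. Below is a minimal sketch in Go (the helper names are made up for illustration); it only assumes the some/full line format documented above and works the same for /proc/pressure/* and for cgroup *.pressure files:

package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// psiLine holds the averages of one "some" or "full" line of a PSI file.
type psiLine struct {
    Avg10, Avg60, Avg300 float64
    Total                uint64
}

// parsePressure parses the documented format:
//   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
//   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
func parsePressure(path string) (map[string]psiLine, error) {
    raw, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    out := map[string]psiLine{}
    for _, line := range strings.Split(strings.TrimSpace(string(raw)), "\n") {
        fields := strings.Fields(line)
        if len(fields) != 5 {
            continue
        }
        var l psiLine
        for _, kv := range fields[1:] {
            parts := strings.SplitN(kv, "=", 2)
            switch parts[0] {
            case "avg10":
                l.Avg10, _ = strconv.ParseFloat(parts[1], 64)
            case "avg60":
                l.Avg60, _ = strconv.ParseFloat(parts[1], 64)
            case "avg300":
                l.Avg300, _ = strconv.ParseFloat(parts[1], 64)
            case "total":
                l.Total, _ = strconv.ParseUint(parts[1], 10, 64)
            }
        }
        out[fields[0]] = l // keyed by "some" or "full"
    }
    return out, nil
}

func main() {
    psi, err := parsePressure("/proc/pressure/cpu")
    if err != nil {
        panic(err)
    }
    fmt.Printf("some avg10=%.2f full avg10=%.2f\n", psi["some"].Avg10, psi["full"].Avg10)
}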

Let's try, for instance, with a simple pod running stress-ng with 8 parallel CPU stressors, but with the container limited to 0.02 cores (20 millicores).

apiVersion: v1
kind: Pod
metadata:
  name: stress-ng
spec:
  containers:
    - name: stress-ng
      image: quay.io/tiraboschi/stress-ng
      command: ['/usr/bin/stress-ng', '--temp-path', '/var/tmp/', '--cpu', '8']
      resources:
        requests:
          cpu: "10m"
        limits:
          cpu: "20m"

Kubernetes will translate spec.resources.limits.cpu: "20m" to
2000 100000 in cpu.max for the cgroup slice of the test pod.
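
As a sanity check on that translation, the arithmetic is simply the millicore value converted to a CFS quota over the default 100ms period. A tiny illustrative sketch in Go, not the kubelet's actual code:

package main

import "fmt"

// milliCPUToQuota mirrors the conversion described above: a CPU limit in
// millicores becomes a CFS quota in microseconds over the default
// 100000us (100ms) period, so "20m" -> "2000 100000" in cpu.max.
// Illustrative only, not the kubelet's implementation.
func milliCPUToQuota(milliCPU int64) (quota, period int64) {
    period = 100000
    quota = milliCPU * period / 1000
    return quota, period
}

func main() {
    q, p := milliCPUToQuota(20)           // limits.cpu: "20m"
    fmt.Printf("cpu.max: %d %d\n", q, p)  // prints: cpu.max: 2000 100000
}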

Now if we check the CPU pressure for that pod (reading its cgroup slice) we will find something like:

sh-5.1# cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5d95f454_57f6_4fc3_b5d4_6c7cf673b0cd.slice/cpu.pressure
some avg10=99.00 avg60=98.93 avg300=98.83 total=109943161035
full avg10=99.00 avg60=98.93 avg300=98.83 total=109935133476

This is basically correct and accurate when reported at the workload level, since that container is getting far less CPU than it needs and so it is under significant pressure.
Now the issue is how to read this at node level: we can safely assume that the node is definitely not overloaded, since the problematic pod is only getting a small amount of CPU due to the throttling.

But when we look at the CPU pressure at Kubelet level we see something like:

sh-5.1# cat /sys/fs/cgroup/kubepods.slice/cpu.pressure
some avg10=88.89 avg60=89.90 avg300=92.84 total=678445993193
full avg10=3.13 avg60=4.25 avg300=17.54 total=645991129996

since "some" of the slices under kubepods.slice are under that high pressure.

And the same at node level:

sh-5.1# cat /sys/fs/cgroup/cpu.pressure
some avg10=85.02 avg60=88.95 avg300=92.77 total=670759972464
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

(exactly the same as reading it from /proc/pressure/cpu), since some of the slices running on that node (in our corner case just our problematic test pod, but formally still "some") are under considerable CPU pressure.

So, although this is absolutely correct according to the pressure interface as reported by the kernel (since at least one pod, so "some", was really suffering from a lack of CPU), we shouldn't take any action based on it.
In our exaggerated test corner case, the lack of CPU was caused only by CPU throttling and not by resource contention with other neighbours. In this specific case, tainting the node to prevent scheduling additional load there provides no benefit.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 21, 2025
@tiraboschi tiraboschi changed the title Concerns on using CPU PSI pressure to taint nodes [KEP-4205] Concerns on using CPU PSI pressure to taint nodes Jan 21, 2025
@tiraboschi
Author

/sig node
/wg node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jan 21, 2025
@k8s-ci-robot
Contributor

@tiraboschi: The label(s) wg/node cannot be applied, because the repository doesn't have them.

In response to this:

/sig node
/wg node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 21, 2025
@fabiand

fabiand commented Jan 21, 2025

cc @mrunalp @kannon92

@kannon92
Contributor

So step 1 will be to expose metrics for PSI. Our hope is to bring KEP-4205 to alpha in 1.33 by exposing PSI metrics.

Thank you for raising these concerns.
Do your concerns also include memory/IO PSI pressure? We don't currently taint nodes based on "CPU pressure", so this would be a net-new feature for Kubernetes.

@tiraboschi
Author

I think the issue I described here is specific to the CPU limit, due to how cgroup enforces it with throttling, which in PSI's eyes is indistinguishable from a stall caused by CPU contention.
A memory limit declared on a pod is instead enforced by the kernel with the OOM killer, which can kill containers exceeding their memory limit.
But PSI memory pressure tells us something else: it is not going to tell us how much memory a container (cgroup slice) is using relative to its memory limit, but how much time a cgroup slice is simply waiting to get the memory it needs, for instance due to swap-in/swap-out, page cache refaults, or page reclaims.
So in the case of a single pod suffering from memory "slowness", I expect that other pods on the same node are also going to be affected, and in that case the "some" line in the PSI interface is still relevant.
IO is similar to memory, and we do not have IO limits on pods.
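
To make the distinction concrete, here is a small illustrative sketch in Go (the cgroup path is a hypothetical example): memory.current and memory.max answer "how much memory is this cgroup using relative to its limit", while memory.pressure answers "how much time does this cgroup spend stalled waiting for memory", which is what PSI reports:

package main

import (
    "fmt"
    "os"
    "strings"
)

// Hypothetical pod cgroup path, for illustration only.
const podCgroup = "/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice"

// read returns the trimmed content of a cgroup file, or "n/a" if unreadable.
func read(path string) string {
    raw, err := os.ReadFile(path)
    if err != nil {
        return "n/a"
    }
    return strings.TrimSpace(string(raw))
}

func main() {
    // "How much memory am I using, relative to my limit?"
    // -> memory.current vs memory.max (the limit the user asked for).
    fmt.Println("memory.current:", read(podCgroup+"/memory.current"))
    fmt.Println("memory.max:    ", read(podCgroup+"/memory.max"))

    // "How much time am I stalled waiting for memory (reclaim, refaults, swap)?"
    // -> memory.pressure, which is what PSI reports.
    fmt.Println("memory.pressure:")
    fmt.Println(read(podCgroup + "/memory.pressure"))
}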

@haircommander
Contributor

Thanks for the investigation and writeup @tiraboschi ! I am wondering: if, instead of using the top-level PSI, we use kubepods.slice and only look at the full line, does that give more useful information? We could also maybe take the system.slice into account to make sure system processes have enough CPU to do their tasks.

@tiraboschi
Author

The full line on the kubepods.slice will not help, due to the dual issue: according to the kernel documentation, the "full" line indicates the share of time in which *all* non-idle tasks are stalled on a given resource simultaneously. But some of our pods are expected to get some CPU cycles at any given time, so we basically never expect all the pods to be completely stuck at the same time.
The system.slice could instead be somehow interesting.

We can easily reproduce it.
This time let's use a ReplicaSet (with enough replicas to overload the node given its hardware) and a node selector, to make sure that all of the pods land on the same labelled node.
This time let's also avoid setting resources.limits.cpu, to be sure that all of the reported pressure is really due to CPU contention and not to cgroup throttling.
Here I'm going with 24 CPU stressors in 6 pods on a node with 16 cores:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: stress-rs
  labels:
    stress: cpu
spec:
  replicas: 6
  selector:
    matchLabels:
      stress: cpu
  template:
    metadata:
      labels:
        stress: cpu
    spec:
      containers:
      - name: stress-ng
        image: quay.io/tiraboschi/stress-ng
        command: ['/usr/bin/stress-ng', '--temp-path', '/var/tmp/', '--cpu', '4']
        resources:
          requests:
            cpu: "10m"
      nodeSelector:
        stress: cpu

At node level we see:

sh-5.1# top -b -n 1 | grep Cpu
%Cpu(s): 96.4 us,  2.7 sy,  0.0 ni,  0.1 id,  0.0 wa,  0.6 hi,  0.3 si,  0.0 st

So the CPUs are idle for only 0.1% of the time.

Now in terms of PSI pressure we see:

  • at node scope:
sh-5.1# cat /proc/pressure/cpu 
some avg10=83.55 avg60=81.64 avg300=57.07 total=752284265521
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

but we are not able to tell whether an 83% average over the last 10 seconds on "some" slice is due to contention or to throttling.
And the full line is 0 by design.

  • at kubepods.slice instead:
sh-5.1# cat /sys/fs/cgroup/kubepods.slice/cpu.pressure
some avg10=83.72 avg60=81.85 avg300=80.40 total=758528777000
full avg10=2.86 avg60=2.76 avg300=2.72 total=693555595647

and the same consideration as above applies to the "some" line.
On the full line we see only a bare 2.86%, because at any given time at least one pod is getting its CPU quota.

  • on the system.slice:
some avg10=10.62 avg60=10.44 avg300=10.33 total=23801467697
full avg10=9.48 avg60=9.32 avg300=9.17 total=12122500001

So here we can somehow see that the CPU contention caused by the k8s pods has some performance impact on system processes.
But is 10% significant enough to take any remediation action?
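
To make the trade-off concrete, here is a rough sketch in Go of the kind of heuristic discussed above: look at the kubepods.slice full line together with the system.slice pressure, and ignore the node-level some line. The thresholds and the decision rule are made-up assumptions for illustration, not anything proposed by KEP-4205:

package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// avg10 extracts the avg10 value of the requested line ("some" or "full")
// from a cgroup v2 pressure file. Minimal parsing, I/O errors only.
func avg10(path, which string) (float64, error) {
    raw, err := os.ReadFile(path)
    if err != nil {
        return 0, err
    }
    for _, line := range strings.Split(strings.TrimSpace(string(raw)), "\n") {
        fields := strings.Fields(line)
        if len(fields) == 5 && fields[0] == which {
            return strconv.ParseFloat(strings.TrimPrefix(fields[1], "avg10="), 64)
        }
    }
    return 0, fmt.Errorf("line %q not found in %s", which, path)
}

func main() {
    // Hypothetical thresholds; real values would have to come from data
    // gathered once PSI metrics are actually exposed.
    const kubepodsFullThreshold = 25.0 // % of time *all* pods are stalled
    const systemSomeThreshold = 40.0   // % of time some system.slice task is stalled

    kubepodsFull, err := avg10("/sys/fs/cgroup/kubepods.slice/cpu.pressure", "full")
    if err != nil {
        panic(err)
    }
    systemSome, err := avg10("/sys/fs/cgroup/system.slice/cpu.pressure", "some")
    if err != nil {
        panic(err)
    }

    // The point of the discussion above: the node-level "some" line alone can
    // be driven to ~100% by a single throttled pod, so this sketch ignores it.
    if kubepodsFull > kubepodsFullThreshold || systemSome > systemSomeThreshold {
        fmt.Println("node looks genuinely CPU pressured: a taint might be justified")
    } else {
        fmt.Println("pressure may just be per-pod throttling: do not taint")
    }
}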

@haircommander
Contributor

CPU stalling information almost seems like it would be most useful in the opposite direction: for a VPA or another controller to see that there is unused CPU and scale pods up before another pod needs it. I imagine that once we have PSI metrics in and we can start collecting data, we could find the level of pressure we want a node to stay at to be fully utilized.
