update to latest base image with containerd 2.0.2, ensure containerd is ready before importing images #3848
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: BenTheElder. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/hold
/retest
All of CI passed on the first try (ignoring a failure to schedule the CI workload itself, which is unrelated: a timing mismatch between autoscaling and the job scheduler). However, when building a node image locally:
Debugging this, my current suspicion is that we need to wait for containerd to be ready; it takes longer to start, around 1s on a fairly large cloud VM:
If I start containerd (v1.7.24) in the previous base image like this, to simulate the build process:
These times are pretty representative across repeated attempts: containerd 2.0.2 takes about 3x as long to start as v1.7.24 (NOTE: these are the arm64 cross-builds on an amd64 host). Circled back in #3828 (comment).
To replicate:
docker run -d --entrypoint=sleep --name="test-old-base" --platform=linux/arm64 --security-opt=seccomp=unconfined docker.io/kindest/base:v20241212-9f82dd49 infinity
docker run -d --entrypoint=sleep --name="test-new-base" --platform=linux/arm64 --security-opt=seccomp=unconfined docker.io/kindest/base:v20250117-f528b021 infinity
docker exec -it test-old-base containerd
docker exec -it test-new-base containerd
(then watch for the log lines like "containerd successfully booted in 0.325773s" and "containerd successfully booted in 0.972910s")
Update: this is probably not worth discussing upstream, because 2.0.2 still boots in < 0.07s with amd64 on an amd64 host. It is, however, consistently slower than 1.7.24; something with arm64 under qemu must be even more pathological. I think let's add image pull retries plus waiting for containerd to start. We only need to do this for pulling, not imports (we do all pulling first), and retries are a nice idea anyhow to handle transient network issues.
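For illustration, a minimal Go sketch of the two pieces proposed here (wait for containerd to come up, retry the pulls). This is not kind's actual pkg/build code; the helper names waitForReady and pullWithRetry are made up for the example.

```go
package build

import (
	"fmt"
	"time"
)

// waitForReady polls a readiness check (for example, probing the containerd
// socket inside the build container) until it succeeds or the timeout expires.
func waitForReady(isReady func() bool, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if isReady() {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("containerd did not become ready within %s", timeout)
}

// pullWithRetry retries an image pull a fixed number of times with a constant
// delay between attempts, which also helps with transient network failures.
func pullWithRetry(pull func() error, attempts int, delay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = pull(); err == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("pull failed after %d attempts: %w", attempts, err)
}
```

Constant-delay retry is the simplest shape; exponential backoff would also work but is probably overkill for a handful of pulls during an image build.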
to put this on their radar
On further digging, I don't think this is a valid upstream issue: it's only present under emulation (arm64 on an amd64 host), or at least I don't currently have access to an arm64 host to verify. On amd64, 2.x is technically slower than 1.7, but with a much smaller difference (both in absolute and relative terms).
/lgtm unhold once you feel we are ready
Iterating on retry/wait-for-ready as a fix; we need the images to build multi-arch. The patch I just pushed is sufficient to get a working build, but it's just a quick hack.
(force-pushed a1ce616 to 0beeffb)
(force-pushed 0beeffb to 18f8445)
OK, it both waits for containerd to be ready AND implements retry on image pull now. Next I'm pushing a test node image we can try in the GitHub Actions (now that arm64 builds successfully).
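Roughly, the combined flow looks like the following; again a sketch building on the hypothetical helpers above (containerdResponding and pullImage are stand-in callbacks, not kind's real API), intended to sit in the same illustrative package as the earlier snippet.

```go
// pullAllImages ties the two sketches together: wait for containerd first,
// then pull every required image with retries. Imports would only start
// after all pulls succeed.
func pullAllImages(images []string, containerdResponding func() bool, pullImage func(string) error) error {
	if err := waitForReady(containerdResponding, 30*time.Second, 100*time.Millisecond); err != nil {
		return err
	}
	for _, image := range images {
		img := image // capture for the retry closure
		if err := pullWithRetry(func() error { return pullImage(img) }, 3, time.Second); err != nil {
			return err
		}
	}
	return nil
}
```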
Everything seems to work locally, and all the GitHub Actions are passing with the node image pushed to staging, so I'll promote the node image and I think we can proceed, barring review comments.
(force-pushed 5769287 to 33d8c7a)
// Wait for containerd socket to be ready, which may take 1s when running under emulation
We could also consider starting the image importer earlier on, but I would prefer to get the functional fix merged first over further micro-optimization. It won't make any meaningful difference in the 99% case of users building for their host architecture anyhow, and even when it does it will be negligible: basically one sleep in WaitForReady vs 0.
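For reference, one plausible shape for the readiness check being discussed; this is a sketch, not the code in this PR, and the socket path shown in the usage note is simply containerd's default.

```go
package build

import (
	"net"
	"time"
)

// containerdSocketReady reports whether something is actually accepting
// connections on the containerd socket, distinguishing "daemon is up" from
// "socket file exists but nothing is listening yet".
func containerdSocketReady(path string) bool {
	conn, err := net.DialTimeout("unix", path, time.Second)
	if err != nil {
		return false
	}
	_ = conn.Close()
	return true
}
```

A check like this could be plugged into the earlier waitForReady sketch, e.g. polling containerdSocketReady("/run/containerd/containerd.sock") every 100ms with a 30s timeout.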
/hold cancel
/lgtm here we go
contains the base image built from #3828 in https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/post-kind-push-base-image/1880330906284068864
TODO: node image (prow e2es will use this, GitHub Actions will use the default node image only, though I expect we are more likely to catch issues in the full Kubernetes e2e tests anyhow for this particular change)