update to latest base image with containerd 2.0.2, ensure containerd is ready before importing images #3848

BenTheElder · 2025-01-17T19:26:30Z

contains the base image built from #3828 in https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/post-kind-push-base-image/1880330906284068864

TODO: node image (prow e2es will use this, github actions will use the default node image only, though I expect we are more likely to catch issues in the full kubernetes e2e tests anyhow for this particular change)

k8s-ci-robot · 2025-01-17T19:26:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [BenTheElder]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

BenTheElder · 2025-01-17T19:27:24Z

/hold

BenTheElder · 2025-01-17T19:38:08Z

/retest
[pod scheduling timeout, CI never ran]

BenTheElder · 2025-01-17T21:06:25Z

All of CI passed on the first try (ignoring an unrelated failure to schedule the CI workload itself, caused by a timing mismatch between autoscaling and the job scheduler). However, when building a node image locally:

Failed to pull docker.io/kindest/kindnetd:v20241212-9f82dd49 with error: command "docker exec --privileged kind-build-1737147227-770353019 ctr --namespace=k8s.io content fetch --platform=linux/arm64 docker.io/kindest/kindnetd:v20241212-9f82dd49" failed with error: exit status 1
time="2025-01-17T20:53:53Z" level=warning msg="Failed to check deprecations" error="connection error: desc = \"transport: Error while dialing: dial unix:///run/containerd/containerd.sock: timeout\""
ctr: connection error: desc = "transport: Error while dialing: dial unix:///run/containerd/containerd.sock: timeout"
Failed to pull docker.io/kindest/local-path-provisioner:v20241212-8ac705d0 with error: command "docker exec --privileged kind-build-1737147227-770353019 ctr --namespace=k8s.io content fetch --platform=linux/arm64 docker.io/kindest/local-path-provisioner:v20241212-8ac705d0" failed with error: exit status 1
time="2025-01-17T20:53:53Z" level=warning msg="Failed to check deprecations" error="connection error: desc = \"transport: Error while dialing: dial unix:///run/containerd/containerd.sock: timeout\""
ctr: connection error: desc = "transport: Error while dialing: dial unix:///run/containerd/containerd.sock: timeout"
Failed to pull docker.io/kindest/local-path-helper:v20241212-8ac705d0 with error: command "docker exec --privileged kind-build-1737147227-770353019 ctr --namespace=k8s.io content fetch --platform=linux/arm64 docker.io/kindest/local-path-helper:v20241212-8ac705d0" failed with error: exit status 1
time="2025-01-17T20:53:53Z" level=warning msg="Failed to check deprecations" error="connection error: desc = \"error reading server preface: read unix @->/run/containerd/containerd.sock: use of closed network connection\""
ctr: connection error: desc = "error reading server preface: read unix @->/run/containerd/containerd.sock: use of closed network connection"

Debugging: my current suspicion is that we need to wait for containerd to be ready, since it now takes longer to start.

Around 1s on a fairly large cloud VM:

INFO[2025-01-17T20:58:38.686355016Z] containerd successfully booted in 0.956085s

If I start containerd (v1.7.24) in the previous base image like this to simulate the build process, it boots in about 0.36s:

docker run -d --entrypoint=sleep --name="test-old-base" --platform=linux/arm64 --security-opt=seccomp=unconfined docker.io/kindest/base:v20241212-9f82dd49 infinity
docker exec -it test-old-base containerd

INFO[2025-01-17T21:03:28.470803043Z] containerd successfully booted in 0.357321s

These times are pretty representative of repeated attempts; containerd 2.0.2 takes roughly 3x as long to start as v1.7.24.

(NOTE: these are arm64 cross-builds on an amd64 host)

circled back in #3828 (comment)

BenTheElder · 2025-01-17T21:17:58Z

To replicate:

start both versions

docker run -d --entrypoint=sleep --name="test-old-base" --platform=linux/arm64 --security-opt=seccomp=unconfined docker.io/kindest/base:v20241212-9f82dd49 infinity
docker run -d --entrypoint=sleep --name="test-new-base" --platform=linux/arm64 --security-opt=seccomp=unconfined docker.io/kindest/base:v20250117-f528b021 infinity

try starting containerd in each of these:

docker exec -it test-old-base containerd

docker exec -it test-new-base containerd

(then watch for the log line like "containerd successfully booted in 0.325773s" and "containerd successfully booted in 0.972910s")

BenTheElder · 2025-01-17T21:28:23Z

Update: this is probably not worth discussing upstream, because 2.0.2 still boots in < 0.07s with an amd64 image on an amd64 host. It is, however, consistently slower than 1.7.24. Something with arm64 qemu must be even more pathological.

I think we should add image pull retries plus waiting for containerd to start. We only need to do this for pulling, not imports (we do all pulling first); retries are a nice idea anyhow to handle transient network issues.

aojea · 2025-01-18T12:06:36Z

> Update: this is probably not worth discussing upstream, because 2.0.2 is still < 0.07s with amd64 + amd64 host. It is however consistently longer than 1.7.24. Something with arm64 qemu must be even more pathological.
>
> I think let's add image pull retries + waiting for it to start. We only need to do this for pulling, not imports (we do all pulling first), which is a nice idea anyhow to handle transient network issues.

cc: @samuelkarp @AkihiroSuda

to put this on their radar

BenTheElder · 2025-01-21T18:55:44Z

I don't think this is a valid upstream issue. On further digging, it's only present under emulation (arm64 on an amd64 host), or at least I don't currently have access to an arm64 host to verify otherwise. On amd64, 2.x is technically slower than 1.7, but with a much smaller difference (both in absolute and relative terms).

aojea · 2025-01-22T09:19:17Z

/lgtm

unhold once you feel we are ready

BenTheElder · 2025-01-22T23:22:18Z

> unhold once you feel we are ready

iterating on retry/wait for ready as a fix, we need the images to build multi-arch.

the patch I just pushed is sufficient to get a working build but it's just a quick hack.

ctr --connect-timeout doesn't seem to work: if I run ctr --connect-timeout=30s --timeout=30s info without containerd running, it fails just as quickly as without those flags (after what appears to be 1s, though the usage text suggests it should be 0s).

BenTheElder · 2025-01-22T23:38:19Z

OK, it both waits for containerd to be ready AND implements retry on image pull now.

Next I'm pushing a test node image we can try in the github actions (now that arm64 builds successfully)

BenTheElder · 2025-01-22T23:47:51Z

Everything seems to work locally, and all the github actions are passing with the node image pushed to staging, so I'll promote the node image and I think we can proceed, barring review comments.

BenTheElder · 2025-01-23T00:02:42Z

pkg/build/nodeimage/buildcontext.go

+			})
+		}
+	}
+	// Wait for containerd socket to be ready, which may take 1s when running under emulation


we could also consider starting the image importer earlier on, but I would prefer to get the functional fix merged first over further micro-optimization. it won't make any meaningful difference in the 99% case of users building for their host architecture anyhow, and even when it does it will be negligible, basically one sleep in WaitForReady vs 0.

BenTheElder · 2025-01-23T00:06:32Z

/hold cancel

aojea · 2025-01-23T06:59:08Z

/lgtm

here we go

update to latest base image with containerd 2.0.2

d66a745

k8s-ci-robot requested a review from aojea January 17, 2025 19:26

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 17, 2025

k8s-ci-robot requested a review from stmcginnis January 17, 2025 19:26

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 17, 2025

BenTheElder mentioned this pull request Jan 17, 2025

update containerd to v2.0.2 #3828

Merged

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 17, 2025

BenTheElder mentioned this pull request Jan 21, 2025

test: upgrade nerdctl to v2 #3850

Open

k8s-ci-robot assigned aojea Jan 22, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 22, 2025

retry image pulls once, fail if that doesn't work

b594068

BenTheElder force-pushed the containerd202 branch 2 times, most recently from a1ce616 to 0beeffb Compare January 22, 2025 23:33

wait for containerd to be ready

18f8445

BenTheElder force-pushed the containerd202 branch from 0beeffb to 18f8445 Compare January 22, 2025 23:36

TEST: v1.32.1 node image

33d8c7a

BenTheElder force-pushed the containerd202 branch from 5769287 to 33d8c7a Compare January 22, 2025 23:49

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 23, 2025

BenTheElder changed the title ~~update to latest base image with containerd 2.0.2~~ update to latest base image with containerd 2.0.2, ensure containerd is ready before importing images Jan 23, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 23, 2025

k8s-ci-robot merged commit bc142d0 into kubernetes-sigs:main Jan 23, 2025
29 of 30 checks passed
