Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cloud-init blocking on WSL1 #508

Merged
merged 10 commits into from
Dec 18, 2024
Merged

Fix cloud-init blocking on WSL1 #508

merged 10 commits into from
Dec 18, 2024

Conversation

CarlosNihelton
Copy link
Collaborator

@CarlosNihelton CarlosNihelton commented Dec 16, 2024

cloud-init status --wait blocks forever if cloud-init won't ever start. Known cases where that may happen:

  1. the cloud-init systemd units being masked or disabled by default in the rootfs (not our case)
  2. systemd not enabled by default in the rootfs image (not our case either)
  3. the distro instance being registered as a WSL 1 instance (we don't control that :) )

We're not considering that properly. With the patch bb72652, we check if both the cloud-init.service unit is enabled (which covers 1) as well as systemd not offline, which covers 2. and 3.
The very same check was implemented in the new script that will replace this launch in the new WSL format via this PR: ubuntu/wsl-setup#17.

I also added an end-to-end test case that sets WSL1 as the default setting globally and attempts to register the instance. Without that patch the test would timeout. See 472a2d8.

UDENG-5629.


CI had two issues blocking me:

  1. releases-info, a Go binary used to select which rootfses to build the application packages from, had outdated considerations about what CPC publishes under https://cloud-images.ubuntu.com/wsl:
  • Jammy file naming conventions are now the same as Noble's.
  • The CPC publisher no longer strips the "ubuntu" suffix on both releases.

Look at https://cloud-images.ubuntu.com/wsl/jammy/current/:

...
[   ] SHA256SUMS.gpg                                      2024-12-13 15:01  833
[   ] ubuntu-jammy-wsl-amd64-ubuntu.rootfs.tar.gz         2024-12-12 21:42  324M  File system image and Kernel packed
[   ] ubuntu-jammy-wsl-amd64-ubuntu22.04lts.rootfs.tar.gz 2024-12-12 21:42  324M  File system image and Kernel packed
[TXT] ubuntu-jammy-wsl-amd64-wsl.manifest                 2024-12-12 21:42   17K  Package manifest file
[   ] ubuntu-jammy-wsl-arm64-ubuntu.rootfs.tar.gz         2024-12-12 21:45  308M  File system image and Kernel packed
[   ] ubuntu-jammy-wsl-arm64-ubuntu22.04lts.rootfs.tar.gz 2024-12-12 21:45  308M  File system image and Kernel packed
[TXT] ubuntu-jammy-wsl-arm64-wsl.manifest                 2024-12-12 21:45   17K  Package manifest file
...

The images contain a suffix "ubuntu22.04lts".
Also, that naming convention used to be from 24.04 onwards.
But as we can clearly see 22.04 and 24.04 are following the same naming convention
(see https://cloud-images.ubuntu.com/wsl/noble/current/):

[   ] SHA256SUMS.gpg                                      2024-12-13 05:37  833
[   ] ubuntu-noble-wsl-amd64-ubuntu.rootfs.tar.gz         2024-12-12 20:21  347M  File system image and Kernel packed
[   ] ubuntu-noble-wsl-amd64-ubuntu24.04lts.rootfs.tar.gz 2024-12-12 20:21  347M  File system image and Kernel packed
[TXT] ubuntu-noble-wsl-amd64-wsl.manifest                 2024-12-12 20:20   16K  Package manifest file
[   ] ubuntu-noble-wsl-arm64-ubuntu.rootfs.tar.gz         2024-12-12 20:23  332M  File system image and Kernel packed
[   ] ubuntu-noble-wsl-arm64-ubuntu24.04lts.rootfs.tar.gz 2024-12-12 20:23  332M  File system image and Kernel packed
[TXT] ubuntu-noble-wsl-arm64-wsl.manifest                 2024-12-12 20:23   16K  Package manifest file

  1. Every now and then the PowerShell script that downloads the rootfs and validate the checksums failed with a shallow stack trace of an attempt to index into a null array.
    That was not really helpful. After investing some time on this I could finally learn that we implemented the wrong regex pattern (two spaces instead of a space and a raw star/asterisk character).

The CPC publisher no longer strips the "ubuntu" prefix.
Look at https://cloud-images.ubuntu.com/wsl/jammy/current/:

..
[   ] SHA256SUMS.gpg                                      2024-12-13 15:01  833
[   ] ubuntu-jammy-wsl-amd64-ubuntu.rootfs.tar.gz         2024-12-12 21:42  324M  File system image and Kernel packed
[   ] ubuntu-jammy-wsl-amd64-ubuntu22.04lts.rootfs.tar.gz 2024-12-12 21:42  324M  File system image and Kernel packed
[TXT] ubuntu-jammy-wsl-amd64-wsl.manifest                 2024-12-12 21:42   17K  Package manifest file
[   ] ubuntu-jammy-wsl-arm64-ubuntu.rootfs.tar.gz         2024-12-12 21:45  308M  File system image and Kernel packed
[   ] ubuntu-jammy-wsl-arm64-ubuntu22.04lts.rootfs.tar.gz 2024-12-12 21:45  308M  File system image and Kernel packed
[TXT] ubuntu-jammy-wsl-arm64-wsl.manifest                 2024-12-12 21:45   17K  Package manifest file
...

The images contain a suffix "ubuntu22.04lts".
Also, that naming convention used to be from 24.04 onwards.
But as we can clearly see 22.04 and 24.04 are following the same naming convention
(see https://cloud-images.ubuntu.com/wsl/noble/current/):

[   ] SHA256SUMS.gpg                                      2024-12-13 05:37  833
[   ] ubuntu-noble-wsl-amd64-ubuntu.rootfs.tar.gz         2024-12-12 20:21  347M  File system image and Kernel packed
[   ] ubuntu-noble-wsl-amd64-ubuntu24.04lts.rootfs.tar.gz 2024-12-12 20:21  347M  File system image and Kernel packed
[TXT] ubuntu-noble-wsl-amd64-wsl.manifest                 2024-12-12 20:20   16K  Package manifest file
[   ] ubuntu-noble-wsl-arm64-ubuntu.rootfs.tar.gz         2024-12-12 20:23  332M  File system image and Kernel packed
[   ] ubuntu-noble-wsl-arm64-ubuntu24.04lts.rootfs.tar.gz 2024-12-12 20:23  332M  File system image and Kernel packed
[TXT] ubuntu-noble-wsl-arm64-wsl.manifest                 2024-12-12 20:23   16K  Package manifest file
Every now and then we see something like:
C:\actions-runner\_work\WSL\WSL\.github\actions\download-rootfs\download-rootfs.ps1 : Cannot index into a null array.
At C:\actions-runner\_work\_temp\aa03e702-b864-4685-b25e-83a2e2853d79.ps1:31 char:3
+   & C:\actions-runner\_work\WSL\WSL\./.github/actions/download-rootfs ...
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [download-rootfs.ps1], RuntimeException
    + FullyQualifiedErrorId : NullArray,download-rootfs.ps1

The trace was too shallow to figure out what was that about.
After putting some time to run this script locally I finally saw wheer it was coming from:
- The pattern was incorrect.
- When attempting to test the images checksum we'd never have a match.
- Luckily we don't test that at every CI run (to make things harder).
As a additional layer of precaution, I'm checking the size of the matched collection.
That's the only surprising behaviour affecting WSL1.
Also, if systemd is disabled (or the cloud-int.service unit),
running cloud-init --wait wouild block forever.
While I'd like the cloud-init command to properly handle the cases where it cannot even start,
this is a simple enough fix on our side.
If systemd is disabled by /etc/wsl.conf,
systemctl is-enabled --quiet cloud-init.service
would still report the unit as enabled,
so we need a broader check.
If this works as expected, one test case should run under WSL1.
Without the current patch that would block forever.
With the launcher patch, cloud-init status --wait won't even run, so no hangs.
@CarlosNihelton CarlosNihelton changed the title [WIP] Fix clud-init blocking on WSL1 Dec 17, 2024
@CarlosNihelton CarlosNihelton changed the title Fix clud-init blocking on WSL1 Fix cloud-init blocking on WSL1 Dec 17, 2024
@CarlosNihelton CarlosNihelton marked this pull request as ready for review December 17, 2024 14:58
@CarlosNihelton CarlosNihelton requested a review from a team as a code owner December 17, 2024 14:58
The withWSL1 test case parameter was never set,
diff got lost during rebase :)
@CarlosNihelton CarlosNihelton self-assigned this Dec 17, 2024
Copy link

@ashuntu ashuntu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me now :)

@CarlosNihelton
Copy link
Collaborator Author

That was weird:

The self-hosted runner: wsl-ci lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error. Source

We don't need to shutdown it again later, when restoring WSL 2, because we already hooked up
a test Cleanup to do so.
The VM was crashing. There are quite a few events like the one below that happened yesterday and today.

![image](https://github.com/user-attachments/assets/f6160c34-08e1-4fdd-9b3d-9f79a76001e3)

I resized it to the [second VM size compatible with GitHub large Windows runners](https://docs.github.com/en/actions/using-github-hosted-runners/using-larger-runners/about-larger-runners#specifications-for-general-larger-runners) (which are available by default now for open source projects) so that, if we were to migrate this out of our self-hosted runners we wouldn't be surprised with a downgrade. I then resized to the first size, as the change didn't solve the crash. That seems outside of our control. Even with more powerfull specs than my laptop, the crash still happened in the VM but didn't in my computer.

I'll turn the WSL1 test case off for now and park a card to investigate this later.
@CarlosNihelton
Copy link
Collaborator Author

The VM was crashing. There are quite a few events like the one below that happened yesterday and today.

image

I resized it to the second VM size compatible with GitHub large Windows runners (which are available by default now for open source projects) so that, if we were to migrate this out of our self-hosted runners we wouldn't be surprised with a downgrade. I then resized to the first size, as the change didn't solve the crash. That seems outside of our control. Even with more powerfull specs than my laptop, the crash still happened in the VM but didn't in my computer.

I'll turn the WSL1 test case off for now and park a card to investigate this later.

Copy link

@ashuntu ashuntu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting issue with the VMs, something to investigate later. Everything else looks good

@CarlosNihelton CarlosNihelton merged commit 76ef2f6 into main Dec 18, 2024
7 checks passed
@CarlosNihelton CarlosNihelton deleted the fix-on-wsl1 branch December 18, 2024 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants