Backport commits from main #1036

HomayoonAlimohammadi · 2025-02-04T15:30:28Z

Overview

This PR adds the commits from main that we want backported to release-1.32.

These commits include everything in main since release-1.32 was branched, except the commits that were already backported and the following:

…ap/join-cluster`. (#863)

* fix: ensure containerd-related directories removed on failed `bootstrap/join-cluster`

  `k8sd` automatically sets up some directories with the appropriate ownership/permissions to be used by containerd in the early stages of the `bootstrap` and `join-cluster` commands.

  In the classic (non-strict) version of the k8s-snap, these containerd directories are system-wide (e.g. `/etc/containerd`, `/run/containerd`, etc). Should any of the other setup steps fail after the containerd directories were set up, the directories would still remain on disk and thus lead to a 'partial installation' on the host system.

  This patch ensures that `k8s` will automatically remove any containerd-related directories which were created in the event of the `bootstrap` / `join-cluster` commands failing.

  Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>

* fix: ensure containerd Base Dir lockfile is never accidentally deleted.

  The containerd Base Dir is the special path all other containerd-related paths on the snap are derived from. Under classic confinement and default settings, this path defaults to the host's root (`/`), and thus extreme care must be taken to not accidentally include it in k8sd's cleanup routine or the k8s-snap's remove hook.

  Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>

---------

Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>

* Deduplicate github actions

  We have multiple github actions that run e2e tests and share a significant amount of logic. We'll add reusable actions, making the workflows much easier to maintain.

* Fix flaky microk8s test

  As part of the test cleanup, we're removing the k8s snap, ensuring that its services and mounts go away. One of the tests installs microk8s, which interferes with the k8s snap cleanup assertions. We'll fix this flaky test by removing the microk8s snap.

* Fix flaky ingress test

  get_external_service_ip returns an empty string, however the test asserts that the ip is not None and proceeds with the curl:

  2024-12-12 11:28:46 DEBUG Execute command ['curl', '', '-H', 'Host: foo.bar.com'] in instance k8s-integration-530bc4-37

  We'll update the assertion and catch empty strings as well. At the same time, we'll increase the timeouts to reduce test flakiness.

* Merge nightly test and cron job

  The nightly job is also a cron job that executes daily, so it makes sense to merge those two workflows.

* Fix nightly job tag
* Pass test flavor
* Include all namespaces in inspection reports

  The moonray job is failing, however we only have logs from the "default" and "kube-system" namespaces. This change will collect logs from all k8s namespaces.

* Apply flavor patches before running the tests

  We'll need to apply the strict/moonray patches not only when building the snap, but also when running the tests.

* Skip broken test

  test_containerd_path_cleanup_on_failed_init holds an open port and expects the bootstrap to fail, however that won't be the case when using the lxd harness. We'll skip this test for now.

* Revert "Include all namespaces in inspection reports"

  This reverts commit 5020f39.

* Address PR feedback
* cover 1.32 as part of the nightly tests
* get go version from go.mod
* update step names
* add some TODOs
* make lxd channel configurable
* bump ubuntu versions
* add get-e2e-tags dependencies
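The "Fix flaky ingress test" item describes an assertion that accepted an empty string because it only checked for None. A minimal sketch of the stricter check follows. The helper names `has_usable_ip` and `build_curl_command` are hypothetical and are not taken from the repository's test suite.

```python
from typing import List, Optional


def has_usable_ip(ip: Optional[str]) -> bool:
    """Return True only when ip is a non-empty, non-blank string."""
    return bool(ip and ip.strip())


def build_curl_command(ip: Optional[str], host: str) -> List[str]:
    """Build the curl invocation, refusing to run with a missing address."""
    if not has_usable_ip(ip):
        # Covers both None and "" (the empty argument seen in the debug log).
        raise ValueError("external service IP is missing or empty")
    return ["curl", ip.strip(), "-H", f"Host: {host}"]
```

Checking truthiness rather than `is not None` is what lets the test keep polling instead of issuing a request with an empty URL.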

The LocalHarness is a harness used for running integration tests on the local machine where the tests are directly invoked (be it via `tox`, `pytest`, etc). It presents numerous limitations (can't run any multi-node tests) and poses a lot of potential risks (cleanup failing in case of fatal errors in the test fixtures) which outweigh most of its convenience benefits (especially when compared to the LXD substrate).

This patch completely removes the LocalHarness and all references to it from the documentation, making LXD the new default.

Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>

* Add 1.32 charms release notes
* Update navigation and channel
* Update charm channel to 1.32 stable
* Reorganize the navigation side bar to be able to see both snap and charm release notes

---------

Co-authored-by: nhennigan <niamh.hennigan@canonical.com>

* Ensure lxd is installed before attempting snap refresh

  This change checks if the lxd snap is installed before running `snap refresh lxd`, preventing failures when lxd is missing. If lxd is not found, it installs the snap using the specified channel. This is required because the LXD snap is not shipped by default in 24.04 anymore.

* use newgrp instead of sg, use sudo --user

Currently, if the k8sd/v1alpha/lifecycle/skip-stop-services-on-remove annotation is set, we're not stopping the Kubernetes-related services, but we're still removing their certificates and containerd-related paths. This ends up paralyzing services like kubelet, which might have to do Pod evictions, blocking them from finishing their job and leaving CAPI unable to complete its downscaling or upgrade operations. We should remove those certificates only if we're also stopping the services.
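The rule stated above, that certificates are removed only when the services are also stopped, can be sketched as a small decision function. This is an illustration only: the step names are hypothetical, and treating the mere presence of the annotation as "set" is an assumption rather than the behavior confirmed by the k8sd source.

```python
from typing import List, Mapping

SKIP_STOP_SERVICES_ANNOTATION = "k8sd/v1alpha/lifecycle/skip-stop-services-on-remove"


def removal_steps(annotations: Mapping[str, str]) -> List[str]:
    """Return the (hypothetical) cleanup steps to run when a node is removed."""
    # Assumption: the annotation counts as "set" whenever the key is present.
    if SKIP_STOP_SERVICES_ANNOTATION in annotations:
        # Services keep running, so they still need their certificates.
        return []
    return ["stop-services", "remove-certificates"]
```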

* Enable cluster-config.load-balancer.l2-mode by default

  We'll change the default value of cluster-config.load-balancer.l2-mode, enabling it by default.

* Bump k8s-snap-api version
* Update unit test
* Update test_smoke
* update expected l2 mode
* bump the timeout

* Document k8s-snap installation on dev environments

  We'll recommend that users use a clean virtual machine or LXD container when trying out k8s-snap.

  At the same time, we'll document common problems that can arise when installing k8s-snap directly on the development machine along with possible workarounds:

  * docker and containerd conflicts
  * fixing the "FORWARD" rules
  * custom containerd base dir
  * changing ip addresses
  * listening on "localhost"

  Other changes:

  * move dqlite docs to a separate reference page and add an example on how to connect to k8s-dqlite
  * fix the release note on "containerd-base-dir", it's a bootstrap config yaml entry, not a cli parameter
  * document k8sd sql commands

* Remove 'strict' reference
* Update k8sd sql section
* Fix linter error (>80 characters)

When bootstrapping or starting a cluster, we wait for the k8sd server to be fully ready before interacting with it. However, there are edge cases—such as during a snap refresh—where the snap attempts to interact with the CLI (e.g., to configure snap settings) while the database is still initializing. In these scenarios, immediate failure is unnecessary. The k8sd client now retries such requests, ensuring smoother operation. This behavior applies only to specific edge cases where it is known that the microcluster database will eventually become available.
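The retry behavior described above can be sketched as follows. The actual k8sd client is written in Go, so this Python version only illustrates the idea. `DatabaseNotReadyError`, `retry_until_ready`, and the default attempt count and delay are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DatabaseNotReadyError(Exception):
    """Hypothetical stand-in for 'the database is still initializing'."""


def retry_until_ready(
    call: Callable[[], T],
    attempts: int = 5,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Run call(), retrying only while the database is not ready yet."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except DatabaseNotReadyError:
            if attempt == attempts:
                raise  # Give up: the database never became available.
            sleep(delay_seconds)
    raise AssertionError("unreachable")
```

Only the "not ready" error is retried. Any other exception propagates immediately, which matches the note that the retry applies to specific edge cases rather than to every failure.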

* Include debug symbols

  We'll include golang and dqlite debug symbols even for release builds. This increases the snap size by 30MB, however it allows us to investigate core dumps.

* Generate core dumps

  In order to effectively investigate k8s-snap crashes, especially ones caused by external C libraries such as dqlite, we'll need core dumps.

  This change will:

  * use GOTRACEBACK="crash"
  * adjust the core dump limit

* Collect core dumps

  * Add inspect.sh --core-dump-dir parameter
    * default: /var/crash
    * collect core dumps found at the specified location
  * add core dump dir and pattern as e2e test settings
    * TEST_CORE_DUMP_PATTERN
    * TEST_CORE_DUMP_DIR
  * configure core dumps as part of the e2e instance initialization
  * update the "exec" helpers to allow stdout redirection (">"), just like the k8s-dqlite e2e tests

* Remove leftover -Wno-suggest-attribute=noreturn

* Add k8s inspect command

  We're adding a "k8s inspect" command that will invoke the "inspect.sh" script, aiming to improve the user experience.

  Note that we'll avoid parsing the arguments twice since that's unnecessary and would complicate the process of adding new parameters.

* fix formatting
* Include auto-generated docs
* Fix spell check warning
* Fix mock snap
* add --core-dump-dir param to help string

* Log a message if the cluster is uninitialized

  Users coming from microk8s may not be used to having to bootstrap the cluster. We'll check k8sd errors and if the message contains "Database is not yet initialized", we'll ask the users to either bootstrap a new cluster or join an existing one.

  We're adding this check to the query function of the k8sd client so that all the k8s commands may benefit from it.

  Implements: KU-2481

* Clean up the log messages
* Copy the "bootstrapped" check to each individual command

  To improve the error messages and avoid having too many nested errors, we'll have each individual command check if the cluster was initialized.

  That being considered, we'll make the k8sd client error less verbose.

* Fix unit tests, updating k8sd mock
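The check described in the first item amounts to a substring match on the k8sd error followed by a friendlier hint. A minimal sketch, assuming the marker string quoted in the commit message, is shown below. The function name `explain_k8sd_error` and the wording of the hint are hypothetical, and the real check lives in the Go CLI.

```python
UNINITIALIZED_MARKER = "Database is not yet initialized"


def explain_k8sd_error(message: str) -> str:
    """Map the raw k8sd error to an actionable hint when the cluster is uninitialized."""
    if UNINITIALIZED_MARKER in message:
        # Hypothetical wording; the commit only says users are asked to
        # bootstrap a new cluster or join an existing one.
        return (
            "The node is not part of a cluster yet. "
            "Bootstrap a new cluster or join an existing one."
        )
    # Any other error is passed through unchanged.
    return message
```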

* inspect.sh: avoid logging an error if there are no core dumps

  The inspect.sh script will log an error if the core dump dir is empty. We'll add a check to improve the user experience.

    INFO: Copy dmesg entries
    INFO: Collecting core dumps from /var/crash. Size: 4.0K /var/crash
    cp: cannot stat '/var/crash/*': No such file or directory
    Collecting snap and related information

* inspect.sh: Replace backticks

  https://www.shellcheck.net/wiki/SC2006
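The guard added to inspect.sh is a shell check, but the idea carries over to any language: look for entries before copying, so an empty directory is not reported as an error. A Python sketch of that guard follows. `collect_core_dumps` and the `core-dumps` subdirectory name are hypothetical and do not mirror the script's layout.

```python
import shutil
from pathlib import Path
from typing import List


def collect_core_dumps(core_dump_dir: str, report_dir: str) -> List[str]:
    """Copy core dumps into the report, doing nothing when there are none."""
    source = Path(core_dump_dir)
    if not source.is_dir():
        return []
    dumps = sorted(p for p in source.iterdir() if p.is_file())
    if not dumps:
        # Empty directory: skip the copy instead of failing like `cp dir/*`.
        return []
    target = Path(report_dir) / "core-dumps"
    target.mkdir(parents=True, exist_ok=True)
    for dump in dumps:
        shutil.copy2(dump, target / dump.name)
    return [dump.name for dump in dumps]
```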

…ps (#1032)

* tests: enrich node configuration controller tests with signed configmaps

  Signed-off-by: Reza Abbasalipour <reza.abbasalipour@canonical.com>

* fix: change configmap to trigger restart in case of valid signature

  Signed-off-by: Reza Abbasalipour <reza.abbasalipour@canonical.com>

* chore: try with different configs with invalid signature to make sure they are not applied

  Signed-off-by: Reza Abbasalipour <reza.abbasalipour@canonical.com>

* tests: add a test case to update node configuration controller to account for signed configmaps

  Signed-off-by: Reza Abbasalipour <reza.abbasalipour@canonical.com>

* fix: improve test case names

  Signed-off-by: Reza Abbasalipour <reza.abbasalipour@canonical.com>

---------

Signed-off-by: Reza Abbasalipour <reza.abbasalipour@canonical.com>

bschimke95

LGTM, nice little PR

* capi docs: add intermediate ca how-to

  We're adding a guide that shows how intermediate CAs can be generated using HashiCorp Vault and passed to CAPI using management cluster secrets.

* Address PR comments
* Address comments
* address PR comments
* avoid using more than 80 characters per line, this is likely to upset linters
* Add link to an article that describes Vault cert-manager integration

* Fix custom containerd paths

  For some reason, there are two almost identical kubelet containerd flags and we have to set both:

  ```
  $ snap logs k8s.kubelet -n 30000 | grep FLAG | grep containerd
  FLAG: --container-runtime-endpoint="/home/ubuntu/containerd/k8s-containerd/run/containerd/containerd.sock"
  FLAG: --containerd="/run/containerd/containerd.sock"
  FLAG: --containerd-namespace="k8s.io"
  ```

  ```
  $ kubelet -h | grep containerd
  --container-runtime-endpoint string The endpoint of container runtime service.
  --containerd string containerd endpoint
  ```

  This change will:

  * pass the missing containerd flag
  * update the e2e test for custom containerd paths to check if the cluster actually becomes available after bootstrap
  * update the dev doc to enable the net, dns and local storage features
  * do the same for the e2e test

* Update unit tests

bschimke95 and others added 30 commits February 4, 2025 19:04

Add 1.32 to supported releases (#883)

5ac2694

docs: minor amendments to the DISA doc. (#884)

448b7fe

Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>

[main] Update component versions (#887)

b5a6e74

Co-authored-by: neoaggelos <1888650+neoaggelos@users.noreply.github.com>

Unit test ValidateNodeTokenAccessHandler (#880)

d73fa08

Unit test ValidateCAPIAuthTokenAccessHandler (#885)

8e19555

Remove 1.30 branch from automatic updates (#893)

20489a1

Exit early if node name is not provided in CLI (#895)

498f8c2

[main] Update component versions (#896)

cb16ca4

How-to Openstack Integration (#855)

94e972a

Restructure the CIS and DISA STIG hardening guides (#890)

d730525

* Restructure the CIS and DISA STIG hardening guides
* Fix spelling errors

---------

Co-authored-by: Etienne Audet-Cobello <etienne.audet-cobello@canonical.com>
Co-authored-by: nhennigan <niamh.hennigan@canonical.com>

Bump golang.org/x/crypto from v0.28.0 to v0.31.0 (#909)

87b1161

CAPI docs: Remove manual provider config for clusterctl (#911)

d27f885

Fix dualstack yaml intendation (#914)

91e0ae9

1.32 release docs (#899)

95283ac

* 1.32 release docs for snap

move files under src so edit button works (#917)

5c73f7e

It was reported in issue #537 that the "edit this page" button did not work. Previous work fixed most pages, but the about and community pages were still affected. This PR fixes the button on those pages.

Add Terraform Documentation for k8s, k8s-worker charms (#920)

3fb8932

Microcluster schema change warning (#922)

8150a91

Update etcd guide config option (#918)

46faf6a

* Update etcd guide config option

  As highlighted in issue #905, this configuration option was not updated when datastore was changed to bootstrap-datastore.

integration: fix all linter warnings in test harness. (#898)

4b338d7

Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>

Fix containerd-related path cleanup on failed bootstrap/join and asso…

ffb7c5a

…ciated test. (#910) --------- Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com> Co-authored-by: Berkay Tekin Oz <ozberkaytekin@gmail.com>

Bump golang.org/x/net to v0.33.0 (#919)

cab2375

fix broken links (#924)

2186013

Fix broken links and add ignore links to custom_conf due to known sphinx issue with anchors causing false positives

Retry seed loading in case snap is not ready yet (#925)

e2fb867

Run "tox -e format", bumping copyright headers (#931)

2abc5a0

We need to bump the copyright headers as expected by the tox "format" job: Copyright 2025 Canonical, Ltd.

Include system information in the inspect.sh script (#921)

17121d6

When investigating the state of a Kubernetes node, it may be useful to check a few host resources and their availability, to root cause potential memory / disk pressure / other system related issues.

claudiubelu and others added 22 commits February 4, 2025 19:28

[Docs] fix headers naming style (#1005)

b31e948

* update titles

  Update page titles and headers according to the style guide, with correct capitalization and the imperative form of the verbs.

bump to v4 (#1003)

936b0b6

Bump microcluster version (#1010)

6b2ea8d

Review docs pages (#1011)

de94b6d

Fix config "show" command (#1012)

0bd68a5

The command "kubectl config show" does not exist; the correct command is "kubectl config view".

[Docs] Move two-node HA page to /moonray (#1006)

197f10d

* Move two-node HA to moonray

  This is more of a POC than a fully supported feature. It was done on request for moonray, so the page is moving there.

Add warning for dual stack ingress (#1008)

fe3be59

Split tutorial for CAPI, add capi troubleshooting pages (#1019)

d75c2b2

Fix typo in setup-image.sh (#1024)

3e159a0

Fix typo that causes the script to fail

[Docs] add dqlite configuration to troubleshooting page (#1022)

95f809d

* add dqlite configuration to troubleshooting page

Co-authored-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>

[Docs] Remove /src dir (#1018)

2552a65

tests: collect all inspection reports before cleaning up nodes (#1034)

21ab752

This change will collect all the inspection reports before initiating the node cleanup process. Otherwise we interfere with the observed cluster, potentially breaking it, which can impede the debugging process.

fix: Use ControlPlaneJoinConfig certificates during Control Plane J…

00750cc

…oin (#1029) Use certificates from join config while a new control plane is joining.

HomayoonAlimohammadi requested a review from a team as a code owner February 4, 2025 15:30

bschimke95 approved these changes Feb 4, 2025

bschimke95 and others added 6 commits February 5, 2025 11:19

Remove obsolete loadbalancer feature check (#1037)

3076b96

Historically, we used the Cilium loadbalancer for the loadbalancer feature. With the move to MetalLB, the dependency on the network feature is no longer required.

Switch to cilium native routing mode for ipv6 only setup (#1007)

4010b78

Fix pre-release update with same risk-level (#1047)

c823f02

--------- Co-authored-by: Mateo Florido <mateo.florido@canonical.com>

ci: Fix spell checking (#1050)

d942178
