Backport commits from main #1036

Open
wants to merge 117 commits into
base: release-1.32

@HomayoonAlimohammadi HomayoonAlimohammadi commented Feb 4, 2025

Overview

This PR backports commits from main to release-1.32.

These commits include everything in main since release-1.32 was branched, except the commits that were already backported and the following:

bschimke95 and others added 30 commits February 4, 2025 19:04
Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>
Co-authored-by: neoaggelos <1888650+neoaggelos@users.noreply.github.com>
…ap/join-cluster`. (#863)

* fix: ensure containerd-related directories removed on failed `bootstrap/join-cluster`

`k8sd` automatically sets up some directories with the appropriate
ownership/permissions to be used by containerd in the early stages
of the `bootstrap` and `join-cluster` commands.

In the classic (non-strict) version of the k8s-snap, these
containerd directories are system-wide (e.g. `/etc/containerd`,
`/run/containerd`, etc).

Should any of the other setup steps fail after the containerd
directories were set up, the directories would remain on disk,
leaving a 'partial installation' on the host system.

This patch ensures that `k8s` will automatically remove any
containerd-related directories which were created in the event of
the `bootstrap` / `join-cluster` commands failing.

Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>
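
A minimal sketch of the cleanup-on-failure idea described above, written in Python for illustration only (k8sd's actual implementation is not shown in this PR page). The function name `setup_with_cleanup` and its parameters are hypothetical. It tracks only the directories created by the current run and removes them if a later step raises.

```python
import shutil
import tempfile
from pathlib import Path


def setup_with_cleanup(dirs, steps):
    """Create dirs, run steps, and remove the dirs we created if a step fails."""
    created = []
    try:
        for d in dirs:
            path = Path(d)
            if not path.exists():
                # Note: intermediate parents made by mkdir are not tracked
                # in this sketch, only the requested directory itself.
                path.mkdir(parents=True)
                created.append(path)
        for step in steps:
            step()
    except Exception:
        # Remove in reverse creation order, then re-raise the original error.
        for path in reversed(created):
            shutil.rmtree(path, ignore_errors=True)
        raise
```

Directories that already existed before the call are left alone, so a failed run does not delete anything it did not create.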

* fix: ensure containerd Base Dir lockfile is never accidentally deleted.

The containerd Base Dir is the special path all other containerd-related
paths on the snap are derived from.

Under classic confinement and default settings, this path defaults
to the host's root (`/`), and thus extreme care must be taken to
not accidentally include it in k8sd's cleanup routine or the
k8s-snap's remove hook.

Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>
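
To illustrate the kind of guard this calls for, here is a hedged Python sketch. The helper `removable_paths` is hypothetical and not taken from k8sd. It filters a list of cleanup candidates so that the base dir itself is never returned, and only paths strictly inside it are.

```python
from pathlib import PurePosixPath


def removable_paths(base_dir, candidates):
    """Return candidates strictly inside base_dir, never base_dir itself."""
    base = PurePosixPath(base_dir)
    safe = []
    for candidate in candidates:
        path = PurePosixPath(candidate)
        if path == base:
            continue  # never delete the base dir (it may be "/")
        if base in path.parents:
            safe.append(str(path))
    return safe
```

With the default base dir of `/`, passing `["/", "/etc/containerd"]` keeps only `/etc/containerd`.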

---------

Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>
* Restructure the CIS and DISA STIG hardening guides

* Fix spelling errors

---------

Co-authored-by: Etienne Audet-Cobello <etienne.audet-cobello@canonical.com>
Co-authored-by: nhennigan <niamh.hennigan@canonical.com>
* Deduplicate github actions

We have multiple github actions that run e2e tests and share a
significant amount of logic.

We'll add reusable actions, making the workflows much easier to
maintain.

* Fix flaky microk8s test

As part of the test cleanup, we're removing the k8s snap, ensuring
that its services and mounts go away.

One of the tests installs microk8s, which interferes with the k8s
snap cleanup assertions.

We'll fix this flaky test by removing the microk8s snap.

* Fix flaky ingress test

get_external_service_ip returns an empty string, however
the test asserts that the ip is not None and proceeds with the
curl:

  2024-12-12 11:28:46 DEBUG Execute command ['curl', '', '-H', 'Host: foo.bar.com']
  in instance k8s-integration-530bc4-37

We'll update the assertion and catch empty strings as well.
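
A small sketch of why the original check passed: an empty string is not `None`, so `assert ip is not None` succeeds. The helper name `require_service_ip` below is hypothetical and only illustrates a stricter check.

```python
def require_service_ip(ip):
    """Reject None as well as empty or whitespace-only strings."""
    if ip is None or not ip.strip():
        raise AssertionError(f"no external service IP found: {ip!r}")
    return ip
```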

At the same time, we'll increase the timeouts to reduce test
flakiness.

* Merge nightly test and cron job

The nightly job is also a cron job that executes daily, so it
makes sense to merge those two workflows.

* Fix nightly job tag

* Pass test flavor

* Include all namespaces in inspection reports

The moonray job is failing, however we only have logs from the
"default" and "kube-system" namespaces.

This change will collect logs from all k8s namespaces.

* Apply flavor patches before running the tests

We'll need to apply the strict/moonray patches not only when
building the snap, but also when running the tests.

* Skip broken test

test_containerd_path_cleanup_on_failed_init holds an open port
and expects the bootstrap to fail, however that won't be the case
when using the lxd harness.

We'll skip this test for now.

* Revert "Include all namespaces in inspection reports"

This reverts commit 5020f39.

* Address PR feedback

* cover 1.32 as part of the nightly tests
* get go version from go.mod
* update step names
* add some TODOs
* make lxd channel configurable
* bump ubuntu versions
* add get-e2e-tags dependencies
The LocalHarness is a harness used for running integration tests
on the local machine where the tests are directly invoked
(be it via `tox`, `pytest`, etc).

It presents numerous limitations (can't run any multi-node tests)
and poses a lot of potential risks (cleanup failing in case of
fatal errors in the test fixtures) which outweigh most of its
convenience benefits (especially when compared to the LXD
substrate).

This patch completely removes the LocalHarness and all references to it
from the documentation, making LXD the new default.

Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>
* 1.32 release docs for snap
It was reported in issue #537 that the "edit this page" button did not work. Previous work fixed most pages, but the about and community pages were still affected. This PR fixes that functionality.
* Update etcd guide config option

As highlighted in issue #905, this configuration option was not updated when datastore was changed to bootstrap-datastore.
Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>
…ciated test. (#910)

---------

Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>
Co-authored-by: Berkay Tekin Oz <ozberkaytekin@gmail.com>
Fix broken links and add ignored links to custom_conf, due to a known Sphinx issue with anchors causing false positives
* Add 1.32 charms release notes
* Update navigation and channel
* Update charm channel to 1.32 stable
* Reorganize the navigation side bar to be able to see both snap and charm release notes

---------

Co-authored-by: nhennigan <niamh.hennigan@canonical.com>
* Ensure lxd is installed before attempting snap refresh

This change checks if the lxd snap is installed before running `snap refresh lxd`,
preventing failures when lxd is missing. If lxd is not found,
it installs the snap using the specified channel.
This is required because the LXD snap is not shipped by default in 24.04 anymore.
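
The decision described above can be sketched as a pure function. This is an illustration in Python, not the workflow's actual script. The helper name `lxd_snap_command` and the channel value in the usage example are hypothetical, and passing `--channel` on refresh is an assumption.

```python
def lxd_snap_command(installed_snaps, channel):
    """Pick `snap refresh` if lxd is installed, otherwise `snap install`."""
    if "lxd" in installed_snaps:
        # Assumption: the refresh also tracks the requested channel.
        return ["snap", "refresh", "lxd", "--channel", channel]
    return ["snap", "install", "lxd", "--channel", channel]
```

For example, `lxd_snap_command({"core22"}, "5.21/stable")` yields an install command, since lxd is missing from the set of installed snaps.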

* use newgrp instead of sg, use sudo --user
We need to bump the copyright headers as expected by the tox
"format" job:

   Copyright 2025 Canonical, Ltd.
When investigating the state of a Kubernetes node, it may be useful to
check a few host resources and their availability, to root cause
potential memory / disk pressure / other system related issues.
claudiubelu and others added 22 commits February 4, 2025 19:28
Currently, if the k8sd/v1alpha/lifecycle/skip-stop-services-on-remove
annotation is set, we're not stopping the Kubernetes-related services,
but we're still removing its certificates and containerd-related paths.
This will end up paralyzing services like kubelet, which might have to
do Pod evictions, blocking it from finishing its job, and resulting in
CAPI not being able to complete its downscaling or upgrade operations.

We should remove those certificates only if we're also stopping the
services.
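
A hedged sketch of the intended rule, in Python for illustration. The function `removal_plan` is hypothetical, and the assumption that the annotation is enabled by the string value "true" is ours, not something stated in this commit message.

```python
SKIP_STOP_ANNOTATION = "k8sd/v1alpha/lifecycle/skip-stop-services-on-remove"


def removal_plan(annotations):
    """Decide which cleanup actions run when a node is removed."""
    # Assumption: the annotation counts as set when its value is "true".
    stop_services = annotations.get(SKIP_STOP_ANNOTATION) != "true"
    return {
        "stop_services": stop_services,
        # Certificates are removed only when the services are stopped too.
        "remove_certificates": stop_services,
    }
```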
* Enable cluster-config.load-balancer.l2-mode by default

We'll change the default value of cluster-config.load-balancer.l2-mode,
enabling it by default.

* Bump k8s-snap-api version

* Update unit test

* Update test_smoke

* update expected l2 mode
* bump the timeout
* update titles

Update page titles and headers according to the style guide - with correct capitalization and also the imperative version of the verbs
* Document k8s-snap installation on dev environments

We'll recommend that users use a clean virtual machine
or LXD container when trying out k8s-snap.

At the same time, we'll document common problems that can
arise when installing k8s-snap directly on the development
machine along with possible workarounds:

* docker and containerd conflicts
  * fixing the "FORWARD" rules
  * custom containerd base dir
* changing ip addresses
  * listening on "localhost"

Other changes:
* move dqlite docs to a separate reference page and add an example
  on how to connect to k8s-dqlite
* fix the release note on "containerd-base-dir", it's a bootstrap
  config yaml entry, not a cli parameter

* document k8sd sql commands

* Remove 'strict' reference

* Update k8sd sql section

* Fix linter error (>80 characters)
When bootstrapping or starting a cluster, we wait for the k8sd server to be fully ready before interacting with it.
However, there are edge cases—such as during a snap refresh—where the snap attempts to interact with the CLI
(e.g., to configure snap settings) while the database is still initializing.

In these scenarios, immediate failure is unnecessary.
The k8sd client now retries such requests, ensuring smoother operation.

This behavior applies only to specific edge cases where it is known that the microcluster database will eventually become available.
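
The retry behavior can be sketched as follows. This is a Python illustration under stated assumptions, not the k8sd client's Go code. The exception type `DatabaseNotReady`, the function `retry_until_ready`, and the attempt and delay defaults are all hypothetical.

```python
import time


class DatabaseNotReady(Exception):
    """Hypothetical error raised while the database is still initializing."""


def retry_until_ready(request, attempts=5, delay=0.1, sleep=time.sleep):
    """Call request(), retrying only on DatabaseNotReady, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except DatabaseNotReady:
            if attempt == attempts:
                raise  # give up and surface the last error
            sleep(delay)
```

Any other exception propagates immediately, which matches the idea that only the known "database will eventually become available" case is retried.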
The command "kubectl config show" does not exist; the correct command is "kubectl config view".
* Move two-node HA to moonray

This is more of a POC than a fully supported feature. It was done at the request of moonray, so we're moving it there.
* Include debug symbols

We'll include golang and dqlite debug symbols even for release
builds.

This increases the snap size by 30MB, however it allows us to
investigate core dumps.

* Generate core dumps

In order to effectively investigate k8s-snap crashes, especially
ones caused by external C libraries such as dqlite, we'll need
core dumps.

This change will:

* use GOTRACEBACK="crash"
* adjust the core dump limit

* Collect core dumps

* Add inspect.sh --core-dump-dir parameter
   * default: /var/crash
* collect core dumps found at the specified location
* add core dump dir and pattern as e2e test settings
   * TEST_CORE_DUMP_PATTERN
   * TEST_CORE_DUMP_DIR
* configure core dumps as part of the e2e instance initialization
* update the "exec" helpers to allow stdout redirection (">"),
  just like the k8s-dqlite e2e tests

* Remove leftover -Wno-suggest-attribute=noreturn
* Add k8s inspect command

We're adding a "k8s inspect" command that will invoke the "inspect.sh"
script, aiming to improve the user experience.

Note that we'll avoid parsing the arguments twice since that's
unnecessary and would complicate the process of adding new
parameters.

* fix formatting

* Include auto-generated docs

* Fix spell check warning

* Fix mock snap

* add --core-dump-dir param to help string
* Log a message if the cluster is uninitialized

Users coming from microk8s may not be used to having to bootstrap
the cluster.

We'll check k8sd errors and if the message contains "Database is
not yet initialized", we'll ask the users to either bootstrap
a new cluster or join an existing one.

We're adding this check to the query function of the k8sd client
so that all the k8s commands may benefit from it.

Implements: KU-2481
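
A minimal sketch of the message check, in Python for illustration (the real check lives in the Go k8sd client). The helper `explain_k8sd_error` and the wording of the hint are hypothetical. Only the matched substring and the `bootstrap` / `join-cluster` command names come from the text above.

```python
NOT_INITIALIZED = "Database is not yet initialized"


def explain_k8sd_error(message):
    """Return a friendlier hint when k8sd reports an uninitialized database."""
    if NOT_INITIALIZED in message:
        return (
            "The node is not part of a cluster yet. "
            "Run `k8s bootstrap` to create a new cluster "
            "or `k8s join-cluster` to join an existing one."
        )
    return message
```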

* Clean up the log messages

* Copy the "bootstrapped" check to each individual command

To improve the error messages and avoid having too many nested
errors, we'll have each individual command check if the cluster
was initialized.

That being considered, we'll make the k8sd client error less verbose.

* Fix unit tests, updating k8sd mock
Fix typo that causes the script to fail
* add dqlite configuration to troubleshooting page

Co-authored-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
* inspect.sh: avoid logging an error if there are no core dumps

The inspect.sh script will log an error if the core dump dir
is empty. We'll add a check to improve the user experience.

   INFO:  Copy dmesg entries
   INFO:  Collecting core dumps from /var/crash. Size: 4.0K    /var/crash
   cp: cannot stat '/var/crash/*': No such file or directory
   Collecting snap and related information
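
The guard can be sketched like this. inspect.sh is a shell script, so this Python version is only an illustration of the logic, and the function name `collect_core_dumps` is hypothetical. The copy is skipped when the directory is missing or empty.

```python
import shutil
import tempfile
from pathlib import Path


def collect_core_dumps(core_dump_dir, dest_dir):
    """Copy core dumps to dest_dir; return [] quietly when there are none."""
    source = Path(core_dump_dir)
    dumps = sorted(p for p in source.glob("*") if p.is_file()) if source.is_dir() else []
    if not dumps:
        return []  # nothing to copy, so there is no error to log
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(p, dest)).name for p in dumps]
```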

* inspect.sh: Replace backticks

https://www.shellcheck.net/wiki/SC2006
…ps (#1032)

* tests: enrich node configuration controller tests with signed configmaps

Signed-off-by: Reza Abbasalipour <reza.abbasalipour@canonical.com>

* fix: change configmap to trigger restart in case of valid signature

Signed-off-by: Reza Abbasalipour <reza.abbasalipour@canonical.com>

* chore: try with different configs with invalid signature to make sure they are not applied

Signed-off-by: Reza Abbasalipour <reza.abbasalipour@canonical.com>

* tests: add a test case to update node configuration controller to account for signed configmaps

Signed-off-by: Reza Abbasalipour <reza.abbasalipour@canonical.com>

* fix: improve test case names

Signed-off-by: Reza Abbasalipour <reza.abbasalipour@canonical.com>

---------

Signed-off-by: Reza Abbasalipour <reza.abbasalipour@canonical.com>
This change will collect all the inspection reports before
initiating the node cleanup process. Otherwise we interfere
with the observed cluster, potentially breaking it, which can
impede the debugging process.
…oin (#1029)

Use certificates from join config while a new control plane is joining.
@HomayoonAlimohammadi HomayoonAlimohammadi requested a review from a team as a code owner February 4, 2025 15:30

@bschimke95 bschimke95 left a comment

LGTM, nice little PR

bschimke95 and others added 6 commits February 5, 2025 11:19
Historically, we used the Cilium loadbalancer for the loadbalancer
feature. With the move to MetalLB, the dependency on the network feature
is no longer required.
* capi docs: add intermediate ca how-to

We're adding a guide that shows how intermediate CAs can be
generated using HashiCorp Vault and passed to CAPI using
management cluster secrets.

* Address PR comments

* Address comments

* address PR comments
* avoid using more than 80 characters per line, since this is likely
  to upset linters

* Add link to an article that describes Vault cert-manager integration
* Fix custom containerd paths

For some reason, there are two almost identical kubelet containerd
flags and we have to set both:

```
$ snap logs k8s.kubelet -n 30000 | grep FLAG | grep containerd
      FLAG: --container-runtime-endpoint="/home/ubuntu/containerd/k8s-containerd/run/containerd/containerd.sock"
      FLAG: --containerd="/run/containerd/containerd.sock"
      FLAG: --containerd-namespace="k8s.io"
```

```
$ kubelet -h | grep containerd
      --container-runtime-endpoint string    The endpoint of container runtime service.
      --containerd string                    containerd endpoint
```

This change will:

* pass the missing containerd flag
* update the e2e test for custom containerd paths to check if the cluster
  actually becomes available after bootstrap
* update the dev doc to enable the net, dns and local storage features
  * do the same for the e2e test
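
To make the flag pairing concrete, here is a hedged Python sketch (the snap itself generates these arguments in Go). The helper `kubelet_containerd_args` is hypothetical, and it assumes both flags should point at the same socket under the custom base dir, with the `run/containerd/containerd.sock` suffix taken from the log above.

```python
from pathlib import PurePosixPath


def kubelet_containerd_args(base_dir):
    """Derive both kubelet containerd flags from one containerd base dir."""
    # Assumption: both flags take the same socket path.
    sock = str(PurePosixPath(base_dir) / "run/containerd/containerd.sock")
    return {
        "--container-runtime-endpoint": sock,
        "--containerd": sock,
    }
```

Deriving both values from a single variable avoids the mismatch shown in the log, where `--containerd` still pointed at the system-wide socket.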

* Update unit tests
---------

Co-authored-by: Mateo Florido <mateo.florido@canonical.com>