Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support version in watcher cleanup rollback #4176

Conversation

pchila
Copy link
Member

@pchila pchila commented Feb 1, 2024

What does this PR do?

This is a subset of a bigger PR #3950, so CI is not expected to be green, especially regarding the integration tests.

This PR adds new fields in the update marker so that the upgrade watcher Cleanup and Rollback processes can operate correctly with the new elastic-agent-<version>-<hash>.

There is some refactoring to make Cleanup() and Rollback() testable that involves injecting paths rather than obtaining through paths.* functions: this is needed to be able to write unit tests without calling paths.SetTop() which already created flakiness and introduced race conditions in our tests in the past

Why is it important?

Now the agent installation is correctly cleaned up leaving only the new or old version of agent installed, since the watcher is reading the correct paths from the new update marker

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Author's Checklist

  • [ ]

How to test this PR locally

Testing this PR is mostly the same as the test procedure for #4174, but now the watcher cleans the installation correctly
After the end of grace period the new agent directory should be the only left inside /opt/Elastic/Agent/data.
If we want to trigger a rollback, it's enough to kill the new agent a few times and then only the old agent installation will be left inside /opt/Elastic/Agent/data.

Known issues:

  • This has been tested only on Linux. MacOS and Windows fixes will come in a follow-up PR (we need small modifications to make this work mostly related to OS-dependent agent paths).

Related issues

Use cases

Screenshots

Logs

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@pchila pchila added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Elastic-Agent Label for the Agent team skip-changelog labels Feb 1, 2024
@pchila pchila self-assigned this Feb 1, 2024
@pchila pchila requested a review from a team as a code owner February 1, 2024 13:17
@pchila pchila requested review from michalpristas and blakerouse and removed request for a team February 1, 2024 13:17
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Copy link
Contributor

mergify bot commented Feb 2, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b support-version-in-watcher-cleanup-rollback upstream/support-version-in-watcher-cleanup-rollback
git merge upstream/feature/add-agent-version-to-installation-directory-name
git push upstream support-version-in-watcher-cleanup-rollback

@pchila pchila force-pushed the support-version-in-watcher-cleanup-rollback branch from 0ec9173 to b17c713 Compare February 2, 2024 16:48
Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new rollback and cleanup tests are excellent thanks, just one question on handling the "old" marker file format as well.

Comment on lines +505 to +506
err := markUpgrade(log, paths.DataFrom(topDir), newAgentVersion, newAgentHash, newAgentVersionedHome, oldAgentVersion, oldAgentHash, oldAgentVersionedHome, nil, nil)
require.NoError(t, err, "error writing fake update marker")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always creates the marker with the new fields you added, in the case where we come from an old agent, shouldn't it omit them to make the test 100% accurate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a boolean parameter for that in createUpdateMarker that blanks out the new fields to simulate an old upgrade marker just a few lines above this comment
https://github.com/elastic/elastic-agent/pull/4176/files/bf2e849b16f8852a59593e45b8ada9095e190ad1..b17c7132e2b6cd2ff157835379c833435f325fc4#diff-cc4058ffab0db853a72dc782e00a1a7c3c7b7314175bec4538b1f32e69332c66R497-R503

@pchila pchila mentioned this pull request Feb 3, 2024
3 tasks
Copy link
Contributor

mergify bot commented Feb 3, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b support-version-in-watcher-cleanup-rollback upstream/support-version-in-watcher-cleanup-rollback
git merge upstream/feature/add-agent-version-to-installation-directory-name
git push upstream support-version-in-watcher-cleanup-rollback

@pchila pchila force-pushed the support-version-in-watcher-cleanup-rollback branch from b17c713 to d04001c Compare February 3, 2024 16:00
@pchila pchila requested a review from cmacknz February 3, 2024 16:18
@ycombinator
Copy link
Contributor

Tested this PR locally (on MacOS).

Before upgrade

$ /opt/homebrew/bin/tree . -L 3
.
|-- LICENSE.txt
|-- NOTICE.txt
|-- README.md
|-- co.elastic.elastic-agent.err.log
|-- co.elastic.elastic-agent.out.log
|-- data
|   |-- agent.lock
|   `-- elastic-agent-8.13.0-SNAPSHOT-d04001
|       |-- components
|       |-- elastic-agent
|       |-- elastic-agent.app
|       |-- logs
|       |-- manifest.yaml
|       `-- package.version
|-- elastic-agent -> data/elastic-agent-8.13.0-SNAPSHOT-d04001/elastic-agent.app/Contents/MacOS/elastic-agent
|-- elastic-agent.reference.yml
|-- elastic-agent.yml
|-- fleet.enc
|-- fleet.enc.lock
`-- otel.yml

During upgrade

$ /opt/homebrew/bin/tree . -L 3
.
|-- LICENSE.txt
|-- NOTICE.txt
|-- README.md
|-- co.elastic.elastic-agent.err.log
|-- co.elastic.elastic-agent.out.log
|-- data
|   |-- agent.lock
|   |-- elastic-agent-8.13.0-SNAPSHOT-d04001
|   |   |-- components
|   |   |-- elastic-agent
|   |   |-- elastic-agent.app
|   |   |-- logs
|   |   |-- manifest.yaml
|   |   |-- package.version
|   |   `-- run
|   |-- elastic-agent-8.13.0-repackaged-SNAPSHOT-d04001
|   |   |-- components
|   |   |-- elastic-agent
|   |   |-- elastic-agent.app
|   |   |-- logs
|   |   |-- manifest.yaml
|   |   |-- package.version
|   |   `-- run
|   `-- tmp
|       |-- Hk6rvk9TDibMPcDvpl0jkLE-qDsHWVYL.sock
|       |-- akSPbdqgaHaTY0_J01-dsfYK6JpMz2zn.sock
|       |-- iThI_df0cBKC6YUNGGlKscMkOfz3FBH3.sock
|       `-- xTEtpJ7117ppc6OYvJCaYHbDW8mLjXGe.sock
|-- elastic-agent -> /Library/Elastic/Agent/data/elastic-agent-8.13.0-repackaged-SNAPSHOT-d04001/elastic-agent.app/Contents/MacOS/elastic-agent
|-- elastic-agent.reference.yml
|-- elastic-agent.sock
|-- elastic-agent.yml
|-- fleet.enc
|-- fleet.enc.lock
|-- otel.yml
`-- watcher.lock

After upgrade

$ pwd
/Library/Elastic/Agent
$ /opt/homebrew/bin/tree . -L 3
.
|-- LICENSE.txt
|-- NOTICE.txt
|-- README.md
|-- co.elastic.elastic-agent.err.log
|-- co.elastic.elastic-agent.out.log
|-- data
|   |-- agent.lock
|   |-- elastic-agent-8.13.0-repackaged-SNAPSHOT-d04001
|   |   |-- components
|   |   |-- elastic-agent
|   |   |-- elastic-agent.app
|   |   |-- logs
|   |   |-- manifest.yaml
|   |   |-- package.version
|   |   `-- run
|   `-- tmp
|       |-- Hk6rvk9TDibMPcDvpl0jkLE-qDsHWVYL.sock
|       |-- akSPbdqgaHaTY0_J01-dsfYK6JpMz2zn.sock
|       |-- iThI_df0cBKC6YUNGGlKscMkOfz3FBH3.sock
|       `-- xTEtpJ7117ppc6OYvJCaYHbDW8mLjXGe.sock
|-- elastic-agent -> /Library/Elastic/Agent/data/elastic-agent-8.13.0-repackaged-SNAPSHOT-d04001/elastic-agent.app/Contents/MacOS/elastic-agent
|-- elastic-agent.reference.yml
|-- elastic-agent.sock
|-- elastic-agent.yml
|-- fleet.enc
|-- fleet.enc.lock
|-- otel.yml
`-- watcher.lock

Directory cleanup (of old Agent files) is happening correctly after upgrade. The only thing that looks a bit odd is that the watcher.lock file is still present after the upgrade. But I'm not sure if this is also the case on main. Checking that now...

@ycombinator
Copy link
Contributor

The only thing that looks a bit odd is that the watcher.lock file is still present after the upgrade. But I'm not sure if this is also the case on main. Checking that now...

This issue exists when upgrading from 8.12.0 -> 8.13.0-SNAPSHOT built from main. It also exists when upgrading from 8.11.3 -> 8.12.0 and 8.10.0 -> 8.11.3. So it's definitely not an issue introduced by this PR. Moving on with code review...

@@ -81,7 +81,7 @@ func unzip(log *logger.Logger, archivePath, dataDir string) (UnpackResult, error
return UnpackResult{}, fmt.Errorf("parsing package manifest: %w", err)
}
pm.mappings = manifest.Package.PathMappings
versionedHome = path.Clean(pm.Map(manifest.Package.VersionedHome))
versionedHome = filepath.Clean(pm.Map(manifest.Package.VersionedHome))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Copy link
Contributor

@ycombinator ycombinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one general nitpick about table-based testing. PR changes LGTM — love that we are now passing in topDir/dataDir and the unit testing that enabled.

Copy link

Quality Gate passed Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

6 New issues
0 Security Hotspots
50.8% 50.8% Coverage on New Code
0.0% 0.0% Duplication on New Code

See analysis details on SonarQube

@pchila pchila merged commit a831764 into elastic:feature/add-agent-version-to-installation-directory-name Feb 5, 2024
8 of 9 checks passed
pchila added a commit that referenced this pull request Feb 5, 2024
* extract directory paths to allow unit testing (instead of nested paths.* calls)
* introduce tests for watcher cleanup and rollback
* close data directory after cleanup
pchila added a commit that referenced this pull request Feb 6, 2024
* extract directory paths to allow unit testing (instead of nested paths.* calls)
* introduce tests for watcher cleanup and rollback
* close data directory after cleanup
pchila added a commit that referenced this pull request Feb 6, 2024
* extract directory paths to allow unit testing (instead of nested paths.* calls)
* introduce tests for watcher cleanup and rollback
* close data directory after cleanup
pchila added a commit that referenced this pull request Feb 7, 2024
* extract directory paths to allow unit testing (instead of nested paths.* calls)
* introduce tests for watcher cleanup and rollback
* close data directory after cleanup
pchila added a commit that referenced this pull request Feb 7, 2024
* extract directory paths to allow unit testing (instead of nested paths.* calls)
* introduce tests for watcher cleanup and rollback
* close data directory after cleanup
pchila added a commit that referenced this pull request Feb 7, 2024
* extract directory paths to allow unit testing (instead of nested paths.* calls)
* introduce tests for watcher cleanup and rollback
* close data directory after cleanup
pchila added a commit that referenced this pull request Feb 7, 2024
* extract directory paths to allow unit testing (instead of nested paths.* calls)
* introduce tests for watcher cleanup and rollback
* close data directory after cleanup
pchila added a commit that referenced this pull request Feb 7, 2024
* extract directory paths to allow unit testing (instead of nested paths.* calls)
* introduce tests for watcher cleanup and rollback
* close data directory after cleanup
pchila added a commit that referenced this pull request Feb 8, 2024
* extract directory paths to allow unit testing (instead of nested paths.* calls)
* introduce tests for watcher cleanup and rollback
* close data directory after cleanup
pchila added a commit that referenced this pull request Feb 12, 2024
* extract directory paths to allow unit testing (instead of nested paths.* calls)
* introduce tests for watcher cleanup and rollback
* close data directory after cleanup
pchila added a commit that referenced this pull request Feb 12, 2024
* extract directory paths to allow unit testing (instead of nested paths.* calls)
* introduce tests for watcher cleanup and rollback
* close data directory after cleanup
cmacknz added a commit that referenced this pull request Feb 12, 2024
* Use package manifest for install (#4173)

* use version in path for rpm and deb
* install elastic-agent remapping paths

* Use package manifests for upgrade (#4174)

* map paths when upgrading using .tar.gz packages
* use structured output from unpack to identify old and new agents directories
* copy state directories and rotate symlink correctly

* Support version in watcher cleanup rollback (#4176)

* extract directory paths to allow unit testing (instead of nested paths.* calls)
* introduce tests for watcher cleanup and rollback
* close data directory after cleanup

* Reintroduce upgrade version check (#4177)

* Fix shutdownCallback() directories
* reintroduce version check in upgrade

* Use watcher that supports paths in update marker (#4178)

* Invoke agent watcher capable of handling new paths

* Adapt TestShutdownCallback to run multiple testcases

* Fix shutdownCallback to work with version in path

* Fix MacOS relink step

* Add commit hash to manifest

* document manifest-aware upgrade
---------

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>

* small fixes based on PR feedback

* Add wait for watcher (#4229)

* Fix rollbackInstall order of execution and handle errors
* Add wait for watcher up and running

---------

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
@pchila pchila deleted the support-version-in-watcher-cleanup-rollback branch February 13, 2024 07:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request skip-changelog Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants