Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use package manifests for upgrade #4174

Conversation

pchila
Copy link
Member

@pchila pchila commented Feb 1, 2024

What does this PR do?

This is a subset of a bigger PR #3950, so CI is not expected to be green, especially regarding the integration tests.

This PR uses the package manifest included in elastic-agent packages to map the package contents onto disk when upgrading . In the context of #2579 the path mapping is used to install the agent we are upgrading to in a directory like data/elastic-agent-<package version>-<hash>.

The upgrade process now looks for the package manifest file and copies the files in the archive after mapping the name using the path mappings in the manifest. This means that also the new agent will end up in data/elastic-agent-<package version>-<hash>

Why is it important?

This PR is needed to ensure that new versions of agents get installed to a new path containing the version from a package data/elastic-agent-<package version>-<hash> that still uses the legacy structure data/elastic-agent-<hash> for backward compatibility. In practice it allows for upgrading to an agent version built from the same commit as the existing agent as long as the package version is different (the goal of #2579)

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Author's Checklist

  • [ ]

How to test this PR locally

To test this PR locally, we package elastic-agent twice from the same commit specifying a different AGENT_PACKAGE_VERSION like:

EXTERNAL=true SNAPSHOT=true PLATFORMS="linux/amd64" PACKAGES="tar.gz"  mage package

and

AGENT_PACKAGE_VERSION=8.13.0-repackaged EXTERNAL=true SNAPSHOT=true PLATFORMS="linux/amd64" PACKAGES="tar.gz"  mage package

Install the first agent package and verify that it's healthy and it runs from data/elastic-agent-8.13.0-SNAPSHOT-<hash> directory.
Upgrade this agent using the second package

elastic-agent upgrade 8.13.0-repackaged-SNAPSHOT --source-uri file://<directory where the package is located on disk> --skip-verify

Upgrade will unpack and store the new agent in data/elastic-agent-8.13.0-repackaged-<hash> and rotate the links.
Verify the agent directory structure contains the 2 distinct directories (you have to do it before the grace period ends).

Known issues:

  • When the grace period expires the watcher will break the agent install since it will clean anything that does not look like elastic-agent-<hash>, this is addressed already in a follow-up PR
  • An extra elastic-agent-<hash> directory is created during upgrade due to using the wrong path during copy of component states. This is addressed in a follow-up PR
  • This has been tested only on Linux. MacOS and Windows fixes will come in a follow-up PR (we need small modifications to make this work mostly related to OS-dependent agent paths).
  • To clean up the broken agent using uninstall (it can be done manually as well but I find this more convenient):
    1. remove the broken symlink in /opt/Elastic/Agent
    2. use the extracted package to force install the agent again ./elastic-agent install --force
    3. remove using uninstall sudo elastic-agent uninstall

Related issues

Use cases

Screenshots

Logs

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@pchila pchila added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Elastic-Agent Label for the Agent team skip-changelog labels Feb 1, 2024
@pchila pchila self-assigned this Feb 1, 2024
@pchila pchila requested a review from a team as a code owner February 1, 2024 09:04
@pchila pchila requested review from michalpristas and blakerouse and removed request for a team February 1, 2024 09:04
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@pchila pchila requested review from ycombinator and cmacknz February 1, 2024 09:04
}
if err == nil {
// load manifest
defer manifestFile.Close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here would be nice to release manifest handler sooner

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider adding helper which parses it and returns structure as this is recurring or pathmapper constructor with filepath as arg

Copy link
Member Author

@pchila pchila Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already fixed in #4177 using a different approach by reading metadata from package before start extracting

metadata, err := getPackageMetadataFromTar(archivePath)

}

if err := copyActionStore(u.log, newHash); err != nil {
//if strings.HasPrefix(release.Commit(), newHash) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this handled

Copy link
Member Author

@pchila pchila Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's in PR #4177

60357be#diff-42a05457d0720cf1e61083ad5d04a0b2956e6cfa0e0d018f2f815327cb0f1d2dR190-R220

with allowing installation of different versions with the same hash the check is now that the triplet (version, snapshot, hash) must be different between the 2 versions

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to have an integration test that intentionally tries to upgrade with a hash collision to test this?

Could you clone the package built for the current commit, edit the package version to be different, and then perform the upgrade?

You may not even need to clone it, you could install, then edit the package version, and tell the installed agent to upgrade to itself.

@pchila
Copy link
Member Author

pchila commented Feb 2, 2024

@cmacknz

Would it be possible to have an integration test that intentionally tries to upgrade with a hash collision to test this?

Could you clone the package built for the current commit, edit the package version to be different, and then perform the upgrade?

You may not even need to clone it, you could install, then edit the package version, and tell the installed agent to upgrade to itself.

I would need a new version both in package version and in the manifest (both version and mappings), probably I would also need to change the package name and top directory to make sure that it's identified as a different version by both the testing framework and the agent upgrade command, that's a lot of boilerplate code and it's hacky.

At this point an integration test would fail as the cleanup/rollback is still broken (we could try it once we have the whole code in the last PR or in a new one)

Copy link
Contributor

mergify bot commented Feb 2, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b use-package-manifests-for-upgrade upstream/use-package-manifests-for-upgrade
git merge upstream/feature/add-agent-version-to-installation-directory-name
git push upstream use-package-manifests-for-upgrade

@cmacknz
Copy link
Member

cmacknz commented Feb 2, 2024

I would need a new version both in package version and in the manifest (both version and mappings), probably I would also need to change the package name and top directory to make sure that it's identified as a different version by both the testing framework and the agent upgrade command, that's a lot of boilerplate code and it's hacky.

At this point an integration test would fail as the cleanup/rollback is still broken (we could try it once we have the whole code in the last PR or in a new one)

Agree it will be a bit hacky, at the same time this entire PR series has the ultimate goal of making upgrades work despite hash collisions, so I think we should have a test to prove that the entire end to end upgrade process works in this situation.

We don't need to do it as part of this PR. We can use #2780 to track the work since it is essentially that issue, we can reword it if necessary.

@pchila pchila force-pushed the use-package-manifests-for-upgrade branch from 73cd67d to bf2e849 Compare February 2, 2024 16:41
@pchila pchila requested a review from michalpristas February 2, 2024 17:25
@ycombinator
Copy link
Contributor

Followed the instructions in "how to test this PR locally" and upgrade was successful. Output of elastic-agent status before and after upgrade looks good, including healthy status and upgrade details. File/folder naming/structure in Agent installation location looks good too. 👍

Reviewing code now...

Comment on lines +97 to +106
var currentDir string
if currentVersionedHome != "" {
currentDir, err = filepath.Rel("data", currentVersionedHome)
if err != nil {
return fmt.Errorf("extracting elastic-agent path relative to data directory from %s: %w", currentVersionedHome, err)
}
} else {
currentDir = fmt.Sprintf("%s-%s", agentName, currentHash)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you extract this logic into its own function, you could write a unit test with a couple of test cases around it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return changeSymlinkInternal(log, symlinkPath, newPath)
}

func changeSymlinkInternal(log *logger.Logger, symlinkPath, newTarget string) error {
Copy link
Contributor

@ycombinator ycombinator Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: why not just changeSymlink? The internal aspect is enforced by the fact that the function is unexported.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function replaces ChangeSymlink in #4178 https://github.com/elastic/elastic-agent/pull/4178/files/6fd94ced07902a2a9baf79ce3e695251b8f63468..b60b518fcaa18ccac4e8801ff1f07090adc0690d#diff-e3bd2c5b20d7e38241b1e80a7fd53c3e2aba0aa4cc1a8d15ced461a13e588a45R22

At this point in time I still had references to the old ChangeSymlink that I wanted to remove so I thought of giving changeSymlinkInternal a very specific name ;)

err := os.RemoveAll(filepath.Join(paths.Top(), versionedHome))
if err != nil && !errors.Is(err, fs.ErrNotExist) {
// TODO should this be a warning or an error ?
log.Warnw("error rolling back install", "error.message", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think warning is fine. It's better than before this PR when we weren't even logging anything if os.RemoveAll returned an error.

Also, from my understanding, this function is being called when one of the upgrade steps fails and we want to clean up the target agent's files. The current (pre-upgrade) Agent is still running fine at this point and the callers of this function will return an error about the upgrade step that failed anyway. So I think just a warning here is fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd vote error so we can also see it happening, but it also isn't fatal, so technically warning is fine.

I think the error message should note that the error was specifically from os.RemoveAll. That doesn't mean that changing the symlink failed, that is the fatal part of rollbackInstall. We could change this to "failed to remove %directory contents" instead.

If this does fail though, does the agent directory that was left behind just stick around forever? If it does that is something for us to follow up on separately.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmacknz
Any leftover directories other than the one we want to keep gets removed during watcher Cleanup() or Rollback() calls: this is the reason why at this stage watcher breaks the agent install (it still removes anything that is not elastic-agent-<hash we want to keep>).

About the error/warning and the message, it has been changed in PR #4178 https://github.com/elastic/elastic-agent/pull/4178/files/6fd94ced07902a2a9baf79ce3e695251b8f63468..b60b518fcaa18ccac4e8801ff1f07090adc0690d#diff-42a05457d0720cf1e61083ad5d04a0b2956e6cfa0e0d018f2f815327cb0f1d2dR377-R383
where the logs are both at warn level and the message explicitly say what went wrong

return fmt.Errorf("stat() directory %q: %w", dstPath, err)
} else {
// set the appropriate permissions
err = os.Chmod(dstPath, f.Mode().Perm()&0o770)
Copy link
Contributor

@ycombinator ycombinator Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: consistency.

Suggested change
err = os.Chmod(dstPath, f.Mode().Perm()&0o770)
err = os.Chmod(dstPath, f.Mode().Perm()&0770)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also looks like we could calculate f.Mode().Perm()&0770) once and just reuse in each of the following lines.

Copy link
Member Author

@pchila pchila Feb 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is consistency with the old code, I tried to be consistent in using 0o770 format in my changes. Not sure if there's a guideline about one or the other. I will change it even if I find the 0o prefix more explicit for representing an octal

// remove any world permissions from the directory/file
_ = os.MkdirAll(filepath.Dir(path), f.Mode()&0770)
f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, f.Mode()&0770)
_ = os.MkdirAll(filepath.Dir(dstPath), 0770)
Copy link
Contributor

@ycombinator ycombinator Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a subtle change from before this PR, where we are now no longer doing a logical AND between f.Mode() and 0770. Just want to double check this change was intentional and not a miss.

Copy link
Member

@cmacknz cmacknz Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this logic around directory permissions needs some more comments.

I think the intent is that when we create the file we potentially create the containing directory as 0770 to make sure it exists, because we aren't guaranteed to have seen the directory yet when iterating over the contents.

In the case where we get to the directory that is above this, the intent is to set the permissions the way they were in the zip file, overwriting the permissions set here.

If we already saw the directory in the IsDir case above, this call does nothing per the documentation of MkdirAll.

There was a bug here and/or in the tar installer where we were create the directory with the permissions of the first file we saw that was in that directory, instead of applying the permissions of the directory as they were in the archive.

Copy link
Member Author

@pchila pchila Feb 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is the same fix as the one done for .tar.gz extraction in #4100.

As Craig said, we create (not existing) containing directories for the file that we are extracting with 0770 here, then when we encounter them again while extracting, we set the correct permissions.
In the old code we set the same permissions we have for the file to its containing directories which poses a problem for access, modification and traversal since the file may be read-only or not executable.

Fun fact: the bug became apparent the moment I tried to run some unit tests on this code because we don't run those as root so permissions matter.

I will add more comments to explain why we had to change this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comments in 9f2a4cc

return "", err
}

hash = string(hashBytes[:hashLen])
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment here that we already loaded the hash, similar to what you did in the zip code path? Thanks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 9f2a4cc

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM beside a few small things. Holding approval pending resolution of Shaunak's comments. If he approves feel free to merge.

func (tc *tarCloser) Close() error {
var err error
if tc.gzipReader != nil {
err = multierror.Append(err, tc.gzipReader.Close())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: couldn't this be errors.Join()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it can be, this was taken from the existing code which used multierror package. I will change it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 9945a14

@@ -169,46 +170,65 @@ func (u *Upgrader) Upgrade(ctx context.Context, version string, sourceURI string
}

det.SetState(details.StateExtracting)

newHash, err := u.unpack(version, archivePath, paths.Data())
// TODO: add check that no archive files end up in the current versioned home
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this also addressed later?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an idea for an additional protection of the current agent install against the issue we currently have on main where we extract first, ask questions check the hash later to see if the version we just extracted is the same as the current one. In the end I didn't implement this because for correctly generated packages the new version check is better than what was in place before

For details about the new version check see https://github.com/elastic/elastic-agent/pull/4174/files/ee9979ca0194eedb5c3d8e04e36389e5754ec287#r1474739456

err := os.RemoveAll(filepath.Join(paths.Top(), versionedHome))
if err != nil && !errors.Is(err, fs.ErrNotExist) {
// TODO should this be a warning or an error ?
log.Warnw("error rolling back install", "error.message", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd vote error so we can also see it happening, but it also isn't fatal, so technically warning is fine.

I think the error message should note that the error was specifically from os.RemoveAll. That doesn't mean that changing the symlink failed, that is the fatal part of rollbackInstall. We could change this to "failed to remove %directory contents" instead.

If this does fail though, does the agent directory that was left behind just stick around forever? If it does that is something for us to follow up on separately.

@pchila pchila requested review from cmacknz and ycombinator February 3, 2024 09:22
Copy link

Quality Gate passed Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

10 New issues
0 Security Hotspots
56.9% 56.9% Coverage on New Code
0.0% 0.0% Duplication on New Code

See analysis details on SonarQube

Copy link
Contributor

@ycombinator ycombinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@pchila pchila merged commit df122b7 into elastic:feature/add-agent-version-to-installation-directory-name Feb 3, 2024
8 of 9 checks passed
pchila added a commit that referenced this pull request Feb 5, 2024
* map paths when upgrading using .tar.gz packages
* use structured output from unpack to identify old and new agents directories
* copy state directories and rotate symlink correctly
pchila added a commit that referenced this pull request Feb 6, 2024
* map paths when upgrading using .tar.gz packages
* use structured output from unpack to identify old and new agents directories
* copy state directories and rotate symlink correctly
pchila added a commit that referenced this pull request Feb 6, 2024
* map paths when upgrading using .tar.gz packages
* use structured output from unpack to identify old and new agents directories
* copy state directories and rotate symlink correctly
pchila added a commit that referenced this pull request Feb 7, 2024
* map paths when upgrading using .tar.gz packages
* use structured output from unpack to identify old and new agents directories
* copy state directories and rotate symlink correctly
pchila added a commit that referenced this pull request Feb 7, 2024
* map paths when upgrading using .tar.gz packages
* use structured output from unpack to identify old and new agents directories
* copy state directories and rotate symlink correctly
pchila added a commit that referenced this pull request Feb 7, 2024
* map paths when upgrading using .tar.gz packages
* use structured output from unpack to identify old and new agents directories
* copy state directories and rotate symlink correctly
pchila added a commit that referenced this pull request Feb 7, 2024
* map paths when upgrading using .tar.gz packages
* use structured output from unpack to identify old and new agents directories
* copy state directories and rotate symlink correctly
pchila added a commit that referenced this pull request Feb 7, 2024
* map paths when upgrading using .tar.gz packages
* use structured output from unpack to identify old and new agents directories
* copy state directories and rotate symlink correctly
pchila added a commit that referenced this pull request Feb 8, 2024
* map paths when upgrading using .tar.gz packages
* use structured output from unpack to identify old and new agents directories
* copy state directories and rotate symlink correctly
pchila added a commit that referenced this pull request Feb 12, 2024
* map paths when upgrading using .tar.gz packages
* use structured output from unpack to identify old and new agents directories
* copy state directories and rotate symlink correctly
pchila added a commit that referenced this pull request Feb 12, 2024
* map paths when upgrading using .tar.gz packages
* use structured output from unpack to identify old and new agents directories
* copy state directories and rotate symlink correctly
cmacknz added a commit that referenced this pull request Feb 12, 2024
* Use package manifest for install (#4173)

* use version in path for rpm and deb
* install elastic-agent remapping paths

* Use package manifests for upgrade (#4174)

* map paths when upgrading using .tar.gz packages
* use structured output from unpack to identify old and new agents directories
* copy state directories and rotate symlink correctly

* Support version in watcher cleanup rollback (#4176)

* extract directory paths to allow unit testing (instead of nested paths.* calls)
* introduce tests for watcher cleanup and rollback
* close data directory after cleanup

* Reintroduce upgrade version check (#4177)

* Fix shutdownCallback() directories
* reintroduce version check in upgrade

* Use watcher that supports paths in update marker (#4178)

* Invoke agent watcher capable of handling new paths

* Adapt TestShutdownCallback to run multiple testcases

* Fix shutdownCallback to work with version in path

* Fix MacOS relink step

* Add commit hash to manifest

* document manifest-aware upgrade
---------

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>

* small fixes based on PR feedback

* Add wait for watcher (#4229)

* Fix rollbackInstall order of execution and handle errors
* Add wait for watcher up and running

---------

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
@pchila pchila deleted the use-package-manifests-for-upgrade branch February 13, 2024 07:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request skip-changelog Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants