-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use package manifests for upgrade #4174
Use package manifests for upgrade #4174
Conversation
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
Pinging @elastic/elastic-agent (Team:Elastic-Agent) |
} | ||
if err == nil { | ||
// load manifest | ||
defer manifestFile.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here would be nice to release manifest handler sooner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider adding helper which parses it and returns structure as this is recurring or pathmapper constructor with filepath as arg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already fixed in #4177 using a different approach by reading metadata from package before start extracting
metadata, err := getPackageMetadataFromTar(archivePath) |
} | ||
|
||
if err := copyActionStore(u.log, newHash); err != nil { | ||
//if strings.HasPrefix(release.Commit(), newHash) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is this handled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's in PR #4177
60357be#diff-42a05457d0720cf1e61083ad5d04a0b2956e6cfa0e0d018f2f815327cb0f1d2dR190-R220
with allowing installation of different versions with the same hash the check is now that the triplet (version, snapshot, hash) must be different between the 2 versions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to have an integration test that intentionally tries to upgrade with a hash collision to test this?
Could you clone the package built for the current commit, edit the package version to be different, and then perform the upgrade?
You may not even need to clone it, you could install, then edit the package version, and tell the installed agent to upgrade to itself.
I would need a new version both in package version and in the manifest (both version and mappings), probably I would also need to change the package name and top directory to make sure that it's identified as a different version by both the testing framework and the agent upgrade command, that's a lot of boilerplate code and it's hacky. At this point an integration test would fail as the cleanup/rollback is still broken (we could try it once we have the whole code in the last PR or in a new one) |
This pull request is now in conflicts. Could you fix it? 🙏
|
Agree it will be a bit hacky, at the same time this entire PR series has the ultimate goal of making upgrades work despite hash collisions, so I think we should have a test to prove that the entire end to end upgrade process works in this situation. We don't need to do it as part of this PR. We can use #2780 to track the work since it is essentially that issue, we can reword it if necessary. |
73cd67d
to
bf2e849
Compare
Followed the instructions in "how to test this PR locally" and upgrade was successful. Output of Reviewing code now... |
var currentDir string | ||
if currentVersionedHome != "" { | ||
currentDir, err = filepath.Rel("data", currentVersionedHome) | ||
if err != nil { | ||
return fmt.Errorf("extracting elastic-agent path relative to data directory from %s: %w", currentVersionedHome, err) | ||
} | ||
} else { | ||
currentDir = fmt.Sprintf("%s-%s", agentName, currentHash) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you extract this logic into its own function, you could write a unit test with a couple of test cases around it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are unit tests added in #4176 where the cleanup and rollback were fixed https://github.com/elastic/elastic-agent/pull/4176/files/bf2e849b16f8852a59593e45b8ada9095e190ad1..b17c7132e2b6cd2ff157835379c833435f325fc4#diff-cc4058ffab0db853a72dc782e00a1a7c3c7b7314175bec4538b1f32e69332c66
return changeSymlinkInternal(log, symlinkPath, newPath) | ||
} | ||
|
||
func changeSymlinkInternal(log *logger.Logger, symlinkPath, newTarget string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: why not just changeSymlink
? The internal aspect is enforced by the fact that the function is unexported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function replaces ChangeSymlink
in #4178 https://github.com/elastic/elastic-agent/pull/4178/files/6fd94ced07902a2a9baf79ce3e695251b8f63468..b60b518fcaa18ccac4e8801ff1f07090adc0690d#diff-e3bd2c5b20d7e38241b1e80a7fd53c3e2aba0aa4cc1a8d15ced461a13e588a45R22
At this point in time I still had references to the old ChangeSymlink that I wanted to remove so I thought of giving changeSymlinkInternal a very specific name ;)
err := os.RemoveAll(filepath.Join(paths.Top(), versionedHome)) | ||
if err != nil && !errors.Is(err, fs.ErrNotExist) { | ||
// TODO should this be a warning or an error ? | ||
log.Warnw("error rolling back install", "error.message", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think warning is fine. It's better than before this PR when we weren't even logging anything if os.RemoveAll
returned an error.
Also, from my understanding, this function is being called when one of the upgrade steps fails and we want to clean up the target agent's files. The current (pre-upgrade) Agent is still running fine at this point and the callers of this function will return an error about the upgrade step that failed anyway. So I think just a warning here is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd vote error so we can also see it happening, but it also isn't fatal, so technically warning is fine.
I think the error message should note that the error was specifically from os.RemoveAll
. That doesn't mean that changing the symlink failed, that is the fatal part of rollbackInstall. We could change this to "failed to remove %directory contents" instead.
If this does fail though, does the agent directory that was left behind just stick around forever? If it does that is something for us to follow up on separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmacknz
Any leftover directories other than the one we want to keep gets removed during watcher Cleanup() or Rollback() calls: this is the reason why at this stage watcher breaks the agent install (it still removes anything that is not elastic-agent-<hash we want to keep>
).
About the error/warning and the message, it has been changed in PR #4178 https://github.com/elastic/elastic-agent/pull/4178/files/6fd94ced07902a2a9baf79ce3e695251b8f63468..b60b518fcaa18ccac4e8801ff1f07090adc0690d#diff-42a05457d0720cf1e61083ad5d04a0b2956e6cfa0e0d018f2f815327cb0f1d2dR377-R383
where the logs are both at warn level and the message explicitly say what went wrong
return fmt.Errorf("stat() directory %q: %w", dstPath, err) | ||
} else { | ||
// set the appropriate permissions | ||
err = os.Chmod(dstPath, f.Mode().Perm()&0o770) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: consistency.
err = os.Chmod(dstPath, f.Mode().Perm()&0o770) | |
err = os.Chmod(dstPath, f.Mode().Perm()&0770) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also looks like we could calculate f.Mode().Perm()&0770)
once and just reuse in each of the following lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is consistency with the old code, I tried to be consistent in using 0o770
format in my changes. Not sure if there's a guideline about one or the other. I will change it even if I find the 0o
prefix more explicit for representing an octal
// remove any world permissions from the directory/file | ||
_ = os.MkdirAll(filepath.Dir(path), f.Mode()&0770) | ||
f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, f.Mode()&0770) | ||
_ = os.MkdirAll(filepath.Dir(dstPath), 0770) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a subtle change from before this PR, where we are now no longer doing a logical AND between f.Mode()
and 0770
. Just want to double check this change was intentional and not a miss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this logic around directory permissions needs some more comments.
I think the intent is that when we create the file we potentially create the containing directory as 0770 to make sure it exists, because we aren't guaranteed to have seen the directory yet when iterating over the contents.
In the case where we get to the directory that is above this, the intent is to set the permissions the way they were in the zip file, overwriting the permissions set here.
If we already saw the directory in the IsDir
case above, this call does nothing per the documentation of MkdirAll
.
There was a bug here and/or in the tar installer where we were create the directory with the permissions of the first file we saw that was in that directory, instead of applying the permissions of the directory as they were in the archive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is the same fix as the one done for .tar.gz
extraction in #4100.
As Craig said, we create (not existing) containing directories for the file that we are extracting with 0770 here, then when we encounter them again while extracting, we set the correct permissions.
In the old code we set the same permissions we have for the file to its containing directories which poses a problem for access, modification and traversal since the file may be read-only or not executable.
Fun fact: the bug became apparent the moment I tried to run some unit tests on this code because we don't run those as root so permissions matter.
I will add more comments to explain why we had to change this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added comments in 9f2a4cc
return "", err | ||
} | ||
|
||
hash = string(hashBytes[:hashLen]) | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment here that we already loaded the hash, similar to what you did in the zip code path? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 9f2a4cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM beside a few small things. Holding approval pending resolution of Shaunak's comments. If he approves feel free to merge.
func (tc *tarCloser) Close() error { | ||
var err error | ||
if tc.gzipReader != nil { | ||
err = multierror.Append(err, tc.gzipReader.Close()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: couldn't this be errors.Join()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it can be, this was taken from the existing code which used multierror
package. I will change it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in 9945a14
@@ -169,46 +170,65 @@ func (u *Upgrader) Upgrade(ctx context.Context, version string, sourceURI string | |||
} | |||
|
|||
det.SetState(details.StateExtracting) | |||
|
|||
newHash, err := u.unpack(version, archivePath, paths.Data()) | |||
// TODO: add check that no archive files end up in the current versioned home |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this also addressed later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was an idea for an additional protection of the current agent install against the issue we currently have on main where we extract first, ask questions check the hash later to see if the version we just extracted is the same as the current one. In the end I didn't implement this because for correctly generated packages the new version check is better than what was in place before
For details about the new version check see https://github.com/elastic/elastic-agent/pull/4174/files/ee9979ca0194eedb5c3d8e04e36389e5754ec287#r1474739456
err := os.RemoveAll(filepath.Join(paths.Top(), versionedHome)) | ||
if err != nil && !errors.Is(err, fs.ErrNotExist) { | ||
// TODO should this be a warning or an error ? | ||
log.Warnw("error rolling back install", "error.message", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd vote error so we can also see it happening, but it also isn't fatal, so technically warning is fine.
I think the error message should note that the error was specifically from os.RemoveAll
. That doesn't mean that changing the symlink failed, that is the fatal part of rollbackInstall. We could change this to "failed to remove %directory contents" instead.
If this does fail though, does the agent directory that was left behind just stick around forever? If it does that is something for us to follow up on separately.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
df122b7
into
elastic:feature/add-agent-version-to-installation-directory-name
* map paths when upgrading using .tar.gz packages * use structured output from unpack to identify old and new agents directories * copy state directories and rotate symlink correctly
* map paths when upgrading using .tar.gz packages * use structured output from unpack to identify old and new agents directories * copy state directories and rotate symlink correctly
* map paths when upgrading using .tar.gz packages * use structured output from unpack to identify old and new agents directories * copy state directories and rotate symlink correctly
* map paths when upgrading using .tar.gz packages * use structured output from unpack to identify old and new agents directories * copy state directories and rotate symlink correctly
* map paths when upgrading using .tar.gz packages * use structured output from unpack to identify old and new agents directories * copy state directories and rotate symlink correctly
* map paths when upgrading using .tar.gz packages * use structured output from unpack to identify old and new agents directories * copy state directories and rotate symlink correctly
* map paths when upgrading using .tar.gz packages * use structured output from unpack to identify old and new agents directories * copy state directories and rotate symlink correctly
* map paths when upgrading using .tar.gz packages * use structured output from unpack to identify old and new agents directories * copy state directories and rotate symlink correctly
* map paths when upgrading using .tar.gz packages * use structured output from unpack to identify old and new agents directories * copy state directories and rotate symlink correctly
* map paths when upgrading using .tar.gz packages * use structured output from unpack to identify old and new agents directories * copy state directories and rotate symlink correctly
* map paths when upgrading using .tar.gz packages * use structured output from unpack to identify old and new agents directories * copy state directories and rotate symlink correctly
* Use package manifest for install (#4173) * use version in path for rpm and deb * install elastic-agent remapping paths * Use package manifests for upgrade (#4174) * map paths when upgrading using .tar.gz packages * use structured output from unpack to identify old and new agents directories * copy state directories and rotate symlink correctly * Support version in watcher cleanup rollback (#4176) * extract directory paths to allow unit testing (instead of nested paths.* calls) * introduce tests for watcher cleanup and rollback * close data directory after cleanup * Reintroduce upgrade version check (#4177) * Fix shutdownCallback() directories * reintroduce version check in upgrade * Use watcher that supports paths in update marker (#4178) * Invoke agent watcher capable of handling new paths * Adapt TestShutdownCallback to run multiple testcases * Fix shutdownCallback to work with version in path * Fix MacOS relink step * Add commit hash to manifest * document manifest-aware upgrade --------- Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co> * small fixes based on PR feedback * Add wait for watcher (#4229) * Fix rollbackInstall order of execution and handle errors * Add wait for watcher up and running --------- Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
What does this PR do?
This is a subset of a bigger PR #3950, so CI is not expected to be green, especially regarding the integration tests.
This PR uses the package manifest included in elastic-agent packages to map the package contents onto disk when upgrading . In the context of #2579 the path mapping is used to install the agent we are upgrading to in a directory like
data/elastic-agent-<package version>-<hash>
.The upgrade process now looks for the package manifest file and copies the files in the archive after mapping the name using the path mappings in the manifest. This means that also the new agent will end up in
data/elastic-agent-<package version>-<hash>
Why is it important?
This PR is needed to ensure that new versions of agents get installed to a new path containing the version from a package
data/elastic-agent-<package version>-<hash>
that still uses the legacy structuredata/elastic-agent-<hash>
for backward compatibility. In practice it allows for upgrading to an agent version built from the same commit as the existing agent as long as the package version is different (the goal of #2579)Checklist
[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files[ ] I have added an entry in./changelog/fragments
using the changelog tool[ ] I have added an integration test or an E2E testAuthor's Checklist
How to test this PR locally
To test this PR locally, we package elastic-agent twice from the same commit specifying a different AGENT_PACKAGE_VERSION like:
and
Install the first agent package and verify that it's healthy and it runs from
data/elastic-agent-8.13.0-SNAPSHOT-<hash>
directory.Upgrade this agent using the second package
Upgrade will unpack and store the new agent in
data/elastic-agent-8.13.0-repackaged-<hash>
and rotate the links.Verify the agent directory structure contains the 2 distinct directories (you have to do it before the grace period ends).
Known issues:
elastic-agent-<hash>
, this is addressed already in a follow-up PRelastic-agent-<hash>
directory is created during upgrade due to using the wrong path during copy of component states. This is addressed in a follow-up PR/opt/Elastic/Agent
./elastic-agent install --force
sudo elastic-agent uninstall
Related issues
Use cases
Screenshots
Logs
Questions to ask yourself