Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error uninstalling agent: [...] missing matching binary for [...] #4167

Open
strawgate opened this issue Jan 31, 2024 · 15 comments
Open

Error uninstalling agent: [...] missing matching binary for [...] #4167

strawgate opened this issue Jan 31, 2024 · 15 comments
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team

Comments

@strawgate
Copy link
Contributor

strawgate commented Jan 31, 2024

For confirmed bugs, please report:

  • Version: 8.13
  • Operating System: Windows Server
  • Steps to Reproduce:

CI is occasionally failing.

Agent installed on Windows via MSI with --delay-enroll --url=https://placeholder:443 --enrollment-token=token and then uninstalled with -f

Fails to uninstall and prints   | **Error: error uninstalling agent: error uninstalling components: failed to detect inputs and outputs: missing matching binary for C:\Program Files\Elastic\Agent\data\elastic-agent-9c40d6\components\apm-server.spec.yml**

More log context:

WARNING: Running command: c:\\Program Files\\Elastic\\Agent\\elastic-agent.exe uninstall -f
--
  | {debug 2024-01-31 13:24:21.567682 +0000 GMT m=+0.075581901 processes Error fetching PID info for 0, skipping: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
  | error fetching status: OpenProcess failed for pid=0: The parameter is incorrect. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:173 }
  | {debug 2024-01-31 13:24:21.567682 +0000 GMT m=+0.075581901 processes Error fetching PID info for 4, skipping: GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:173 }
  | {debug 2024-01-31 13:24:21.567682 +0000 GMT m=+0.075581901 processes Non fatal error fetching PID some info for 116, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | {debug 2024-01-31 13:24:21.567682 +0000 GMT m=+0.075581901 processes Non fatal error fetching PID some info for 364, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | {debug 2024-01-31 13:24:21.567682 +0000 GMT m=+0.075581901 processes Non fatal error fetching PID some info for 496, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | {debug 2024-01-31 13:24:21.567682 +0000 GMT m=+0.075581901 processes Non fatal error fetching PID some info for 572, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | {debug 2024-01-31 13:24:21.567682 +0000 GMT m=+0.075581901 processes Non fatal error fetching PID some info for 580, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | {debug 2024-01-31 13:24:21.571989 +0000 GMT m=+0.079888801 processes Non fatal error fetching PID some info for 712, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | {debug 2024-01-31 13:24:21.5901418 +0000 GMT m=+0.098041301 processes Non fatal error fetching PID some info for 3136, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | {debug 2024-01-31 13:24:21.6018655 +0000 GMT m=+0.109764701 processes Non fatal error fetching PID some info for 2008, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | Error: error uninstalling agent: error uninstalling components: failed to detect inputs and outputs: missing matching binary for C:\Program Files\Elastic\Agent\data\elastic-agent-9c40d6\components\apm-server.spec.yml
  | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.13/fleet-troubleshooting.html
  |  
  | Agent uninstall return code:1

Example CI Failure: https://buildkite.com/elastic/elastic-stack-installers/builds/3671#018d5f93-affb-4019-a534-b53fcf8fe1e9

It would be nice to figure out why this is failing but second best would be making this not fatally prevent uninstallation

@strawgate strawgate added the bug Something isn't working label Jan 31, 2024
@pierrehilbert pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Jan 31, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@pierrehilbert
Copy link
Contributor

Have you been able to reproduce this on Windows by using the zip installer and not the MSI one?
It would help to understand where the problem is coming from

@strawgate
Copy link
Contributor Author

I have not, the CI is just using the MSI installer, I have not used the Zip recently

@cmacknz
Copy link
Member

cmacknz commented Jan 31, 2024

We've never seen this failure in CI without the MSI.

The error is coming from the uninstall logic in

// Uninstall components first
if err := uninstallComponents(context.Background(), cfgFile, uninstallToken, log, pt); err != nil {
// If service status was running it was stopped to uninstall the components.
// If the components uninstall failed start the service again
if status == service.StatusRunning {
if startErr := svc.Start(); startErr != nil {
return aerrors.New(
err,
fmt.Sprintf("failed to restart service (%s), after failed components uninstall: %v", paths.ServiceName, startErr),
aerrors.M("service", paths.ServiceName))
}
}
return fmt.Errorf("error uninstalling components: %w", err)
}

Specifically it is coming from

specs, err := component.LoadRuntimeSpecs(paths.Components(), platform)
if err != nil {
return fmt.Errorf("failed to detect inputs and outputs: %w", err)
}

This leads us to

for path, spec := range specFiles {
binaryName := filepath.Base(path[:len(path)-len(specGlobPattern)+1])
binaryPath := path[:len(path)-len(specGlobPattern)+1]
if platform.OS == Windows {
binaryPath += ".exe"
}
if !opt.skipBinaryCheck {
info, err := os.Stat(binaryPath)
if errors.Is(err, os.ErrNotExist) {
return RuntimeSpecs{}, fmt.Errorf("missing matching binary for %s", path)
} else if err != nil {

This error means that apm-server.spec.yml exists but apm-server.exe does not.

The exact same function call that returns this error is made right when the agent starts up, so we must have had apm-server.exe at some point and now we don't.

specs, err := component.LoadRuntimeSpecs(paths.Components(), platform)
if err != nil {
return nil, nil, nil, fmt.Errorf("failed to detect inputs and outputs: %w", err)
}
log.With("inputs", specs.Inputs()).Info("Detected available inputs and outputs")

@cmacknz
Copy link
Member

cmacknz commented Jan 31, 2024

The exact same function call that returns this error is made right when the agent starts up, so we must have had apm-server.exe at some point and now we don't.

Alternatively the agent was never installed or is not running correctly and uninstall is failing with the same error.

@cmacknz
Copy link
Member

cmacknz commented Jan 31, 2024

A little earlier in the buidlkite logs I see an msiexec error for an install attempt but I can't tell exactly what caused it (garbage enrollment token and URL?):

VERBOSE: Invoking msiexec.exe /i C:\Users\buildkite\esi\bin\out\elastic\elastic-agent-8.13.0-SNAPSHOT-windows-x86_64.msi /qn /norestart INSTALLARGS="--delay-enroll --url=https://placeholder:443 --enrollment-token=token" /l*v C:\buildkite-agent\builds\bk-agent-prod-gcp-1706705535567506444\elastic\elastic-stack-installers\src\agent-qa\logs\20240131-012004-i.log

...
--
  | WARNING: Elastic Agent uninstall returned:
  | WARNING: Failed to parse msi log for errors
  | WARNING: Dumping full MSI Log
  | === Verbose logging started: 1/31/2024  13:20:04  Build type: SHIP UNICODE 5.00.10011.00  Calling process: C:\Windows\system32\msiexec.exe ===
  | MSI (c) (5C:5C) [13:20:04:649]: Resetting cached policy values
  | MSI (c) (5C:5C) [13:20:04:649]: Machine policy value 'Debug' is 0


Action start 13:20:05: UpgradeAction.
--
  | SFXCA: Extracting custom action to temporary directory: C:\Windows\Installer\MSI3432.tmp-\
  | SFXCA: Binding to CLR version v4.0.30319
  | Calling custom action BeatPackageCompiler!Elastic.PackageCompiler.Beats.AgentCustomAction.UpgradeAction
  | Detected an agent upgrade via MSI, which is not supported. Aborting.
  | CustomAction UpgradeAction returned actual error code 1603 (note this may not be 100% accurate if translation happened inside sandbox)
  | Action ended 13:20:05: UpgradeAction. Return value 3.
  | Action ended 13:20:05: INSTALL. Return value 3.
=== Logging stopped: 1/31/2024  13:20:05 ===
--
  | MSI (s) (38:E8) [13:20:05:260]: User policy value 'DisableRollback' is 0
  | MSI (s) (38:E8) [13:20:05:260]: Machine policy value 'DisableRollback' is 0
  | MSI (s) (38:E8) [13:20:05:260]: Incrementing counter to disable shutdown. Counter after increment: 0
  | MSI (s) (38:E8) [13:20:05:260]: Note: 1: 1402 2: HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Installer\Rollback\Scripts 3: 2
  | MSI (s) (38:E8) [13:20:05:260]: Note: 1: 1402 2: HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Installer\Rollback\Scripts 3: 2
  | MSI (s) (38:E8) [13:20:05:260]: Decrementing counter to disable shutdown. If counter >= 0, shutdown will be denied.  Counter after decrement: -1
  | MSI (s) (38:E8) [13:20:05:260]: Destroying RemoteAPI object.
  | MSI (s) (38:B4) [13:20:05:260]: Custom Action Manager thread ending.
  | MSI (c) (5C:5C) [13:20:05:260]: Decrementing counter to disable shutdown. If counter >= 0, shutdown will be denied.  Counter after decrement: -1
  | MSI (c) (5C:5C) [13:20:05:260]: MainEngineThread is returning 1603
  | === Verbose logging stopped: 1/31/2024  13:20:05 ===
  |  
  |  
  | VERBOSE: msiexec reports error 1603 = fatal error occurred during installation.

@cmacknz
Copy link
Member

cmacknz commented Jan 31, 2024

Do our tests check that elastic-agent status works and returns a healthy agent? I wonder if this is a problem with the install that uninstall detects, instead of an uninstall related bug.

@strawgate
Copy link
Contributor Author

strawgate commented Jan 31, 2024

Do our tests check that elastic-agent status works and returns a healthy agent? I wonder if this is a problem with the install that uninstall detects, instead of an uninstall related bug.

The tests use a garbage fleet url and enrollment token and then start the agent service and verify that the agent creates a log and that fleet enrollment fails with failed to perform delayed enrollment: fail to enroll: fail to execute request to fleet-server: lookup placeholder: no such host

Does fleet enrollment happen after verifying components?

There is a bug at the moment where the name of the test does not print before the test starts (will fix this in a PR today) but there are several tests which purposefully fail the installation to verify it fails when we expect it to fail.

This has occurred once in CI, I'll see if I can reproduce outside of CI.

@cmacknz
Copy link
Member

cmacknz commented Jan 31, 2024

Does fleet enrollment happen after verifying components?

Yes No, for both elastic-agent enroll and elastic-agent install --delay-enroll. In the delay enroll case the delayed enrollment is done quite early:

cfg, err = tryDelayEnroll(ctx, l, cfg, override)
if err != nil {
err = errors.New(err, "failed to perform delayed enrollment")
l.Error(err)
return err
}

Loading inputs happens when application.New() is called later on in the same file.

coord, configMgr, composable, err := application.New(ctx, l, baseLogger, logLvl, agentInfo, rex, tracer, testingMode, fleetInitTimeout, configuration.IsFleetServerBootstrap(cfg.Fleet), runAsOtel, modifiers...)
if err != nil {
return err
}
defer composable.Close()

@jlind23
Copy link
Contributor

jlind23 commented Feb 1, 2024

Do our tests check that elastic-agent status works and returns a healthy agent? I wonder if this is a problem with the install that uninstall detects, instead of an uninstall related bug.

You mean that a not properly installed agent can not be properly uninstalled? Shall we really fail the uninstall if an .exe that is supposed to be removed is not present anymore?

@pierrehilbert
Copy link
Contributor

I don't think we should, this is not a great flow for our users. I would rather be more tolerant when we are uninstalling the Agent but about it but if I remember well, we are using the same part of the code in both the installation and the uninstallation.

@cmacknz
Copy link
Member

cmacknz commented Feb 1, 2024

Yes the uninstall ideally would tolerate this, but also there should not be a code path anywhere in the agent or MSI that leads to the apm-server binary file being removed from disk unexpectedly.

Let's focus on figuring out what happened there first. This is the first time I have ever seen this error, and it correlates exactly with the introduction of our MSI, which is certainly suspicious.

@FlorianHeigl
Copy link

FlorianHeigl commented Feb 5, 2024

Do our tests check that elastic-agent status works and returns a healthy agent? I wonder if this is a problem with the install that uninstall detects, instead of an uninstall related bug.

You mean that a not properly installed agent can not be properly uninstalled? Shall we really fail the uninstall if an .exe that is supposed to be removed is not present anymore?

thou shalt not fail
(IOW an uninstaller should stick to the FCC rule (do not cause interference, do not get disturbed by interference)
(IOW you shouldn't have bugs, but you mustn't have a non-working uninstall. how it happened is less important for the SW lifecycle than not being able to handle it)

@leandrojmp
Copy link
Contributor

Hello,

Just curious, could this happen on 8.12.1 as well? I'm having some issues to uninstall an Agent on Windows that for some reason ended up on a broken install and now we cannot uninstall it nor install again.

I have a support ticket open, but I'm still waiting for some feedback from support.

@cmacknz
Copy link
Member

cmacknz commented Mar 5, 2024

Just curious, could this happen on 8.12.1 as well? I'm having some issues to uninstall an Agent on Windows that for some reason ended up on a broken install and now we cannot uninstall it nor install again.

I suppose it is possible, because we never actually found the root cause for this problem and therefore there was no fix for it. Likely instead of fixing it we should just consider this non-fatal as suggested in #4167 (comment) but we haven't done that yet.

The error here requires that the contents of the components sub-directory of the agent be removed before elastic-agent uninstall is executed. Likely something other than the elastic-agent uninstall command would have to remove parts of the agent installation directory to hit this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

No branches or pull requests

7 participants