Windows uninstall can fail with Error: failed to remove installation directory or Access is denied #3342

Closed
amolnater-qasource opened this issue Sep 1, 2023 · 42 comments · Fixed by #4108
Assignees
Labels
bug Something isn't working impact:medium QA:Validated Validated by the QA Team Team:Elastic-Agent Label for the Agent team

Comments

@amolnater-qasource

Kibana Build details:

VERSION: 8.10.0 BC3
BUILD: 66261
COMMIT: 56348fa0ed0719679e24d6c58dc3dbee03928c4e
Artifact Link: https://staging.elastic.co/8.10.0-091ff659/downloads/beats/elastic-agent/elastic-agent-8.10.0-windows-x86_64.zip

Host OS and Browser version: Windows, All

Preconditions:

  1. 8.10.0 BC3 Kibana cloud environment should be available.
  2. Windows agent should be installed.

Steps to reproduce:

  1. Navigate to CLI.
  2. Under Elastic\Agent run .\elastic-agent.exe uninstall.
  3. Observe Error: failed to remove installation directory when the uninstall command runs.
  4. Observe that a log file is retained in the installation directory.

Expected Result:
No errors should be observed on running agent uninstall command.

Screen Recording: (two attachments, not reproduced here)

@amolnater-qasource amolnater-qasource added bug Something isn't working Team:Elastic-Agent Label for the Agent team impact:medium labels Sep 1, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@amolnater-qasource
Author

@manishgupta-qasource Please review.

@amolnater-qasource
Author

Uninstall command issue for BC2 reported at elastic/fleet-server#2935

@manishgupta-qasource

Secondary review for this ticket is Done

@jlind23
Contributor

jlind23 commented Sep 1, 2023

@pierrehilbert @cmacknz worth looking at this.

@pierrehilbert
Contributor

@AndersonQ you worked on two PRs related to this problem in the past few days. Can you see where the problem could come from?

@harshitgupta-qasource

Hi Team

While testing on 8.9.2 BC1 build, we observed this issue reproducible there too.

Observations:

  • The error "failed to remove installation directory" is observed when running the agent uninstall command.

Build details:
VERSION: 8.9.2 BC1
BUILD: 64883
COMMIT: 21f3ebd6e951d102f41b0299c35e030c3c9e8eb6
Artifacts: https://staging.elastic.co/8.9.2-e4235bb7/summary-8.9.2.html

Screenshot
image

Please let us know if anything else is required from our end.

Thanks!

@AndersonQ
Member

Hey, sorry for the delay. The problem is the same, but now at least we know the retry is happening and how it's happening.
The retry strategy comes from here: https://github.com/golang/go/blob/master/src/testing/testing.go#L1257. However, I think by now we need to rethink it.

Even though the logs are cut, the problem seems to be with the log file. I wonder if it's the agent or filebeat which is accessing the logfile.

@harshitgupta-qasource do you have those logs? Preferably in text form and in full. This screenshot cut out a rather important part of the error.

@amolnater-qasource
Author

Hi @AndersonQ

Thank you for looking into this.
Please find the details here:
image
After the uninstall completes, the log file below is left behind:
image

Pending Log file in the installed directory:
elastic-agent-20230904-5.ndjson.zip

Please let us know if anything else is required from our end.
Thanks!

@AndersonQ
Member

@amolnater-qasource another question. Can you easily reproduce this error? If yes, could you explain how?

@amolnater-qasource
Author

@AndersonQ yes, it's reproducible every time.

  • We install the agent on Windows 2022 server.
  • On running uninstall command from the installed directory, this error appears in CLI.

Thanks!

@AndersonQ AndersonQ self-assigned this Sep 4, 2023
@AndersonQ
Member

then, it's a blocker for the release.
CC: @fearful-symmetry

@fearful-symmetry
Contributor

Not sure what the process is for declaring a release blocker? I assume @cmacknz would know.

@cmacknz
Member

cmacknz commented Sep 7, 2023

We have already fixed this in the latest BC, we just forgot to update the issue here.

@elastic/fleet-qasource-external please re-test this.

@cmacknz cmacknz closed this as completed Sep 7, 2023
@cmacknz cmacknz added QA:Ready For Testing Code is merged and ready for QA to validate QA:Needs Validation Needs validation by the QA Team labels Sep 7, 2023
@amolnater-qasource
Author

Hi Team,

We have revalidated this issue on the latest 8.10.0 BC6 Kibana cloud environment and made the following observations:

Observations:

  • No errors are observed on running agent uninstall command from the parent directory.

Screenshot:
image

Build details:
VERSION: 8.10.0 BC6
BUILD: 66340
COMMIT: 1b2de6dcb1eb017347a61372e209ec5211242ed5

Hence we are marking this issue as QA:Validated.

Thanks!

@amolnater-qasource amolnater-qasource added QA:Validated Validated by the QA Team and removed QA:Ready For Testing Code is merged and ready for QA to validate QA:Needs Validation Needs validation by the QA Team labels Sep 7, 2023
@harshitgupta-qasource

harshitgupta-qasource commented Oct 6, 2023

Hi @AndersonQ @cmacknz

While testing the 8.10.3 BC1 build, we have found this issue reproducible there.

Observations:

  • We observed Error: failed to remove installation directory when running the agent uninstall command, and a log file is retained in the installation directory.
  • We ran the agent uninstall command from outside the installation directory.

Build details:
VERSION: 8.10.3 BC1
BUILD: 66480
COMMIT: 5aee3c4fba328838fcf0be6a3ff2248a4c0120dd
Artifacts: https://staging.elastic.co/8.10.3-a569781f/summary-8.10.3.html

Screen-shot
image
image

Hence, we are re-opening this issue.
Please let us know if anything else is required from our end.

Thanks!

@harshitgupta-qasource harshitgupta-qasource removed the QA:Validated Validated by the QA Team label Oct 6, 2023
@AndersonQ
Member

I'm looking at it

@amitkanfer
Contributor

This suggestion is low-hanging. Can we try it out?

@strawgate
Contributor

strawgate commented Jan 12, 2024

TL;DR: The uninstall command (run from outside the Agent directory) sometimes creates a log file in the root of the Agent directory. The log file contains a Docker provider startup failure. The file blocks the uninstall from cleaning up the Agent directory, so the uninstall fails.

I do think there are two issues here: one is uninstalling from within the directory, and the other is this: #3342 (comment), which occurs outside of the directory and which I am able to reliably reproduce on GCP. This is also blocking elastic/elastic-stack-installers#220

On GCP I don't see the exact same error, but it's the same part that's failing:

Running command: c:\\Program Files\\Elastic\\Agent\\elastic-agent.exe uninstall -f -v
Stopping service
Successfully stopped service
Stopping upgrade watcher; none found
Removing service
Successfully uninstalled service
Removing install directory
Failed to remove install directory
Failed to uninstall agent
Agent uninstall return code:1

I ran process monitor during a failed uninstall
Screenshot 2024-01-12 at 2 05 36 PM

which points to the same issue as reported here: #3342 (comment)

Looking at the WriteFile event I see a write length of 266 Bytes which implies that Elastic Agent is actually writing something to this log file:
Screenshot 2024-01-12 at 2 38 09 PM

This together seems to imply the elastic-agent uninstall command is writing a log file at the root of the agent folder (CreateFile/WriteFile), keeping the log file open, and then trying to delete the log file (SharingViolation), which is stopping it from being uninstalled. Once the uninstall fails it closes the log file (CloseFile) and exits.

Screenshot 2024-01-12 at 2 06 35 PM

Diving into the uninstall code, it looks like when elastic agent is running without a config that sets the logging directory it defaults to the directory the executable is in:

func uninstallComponents(ctx context.Context, cfgFile string, uninstallToken string, pt *progressbar.ProgressBar) error {
	log, err := logger.NewWithLogpLevel("", logp.ErrorLevel, false)
	if err != nil {
		return fmt.Errorf("error creating logger: %w", err)
	}
	// ... (remainder of the function not shown in this excerpt)

https://github.com/elastic/elastic-agent/blob/main/pkg/core/logger/logger.go#L55-L59

https://github.com/elastic/elastic-agent/blob/main/pkg/core/logger/logger.go#L132-L137

https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/paths/common.go#L183-L185

https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/paths/common.go#L62-L64

https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/paths/common.go#L239C2-L239C2

I wrote a watcher script to try to grab the ndjson log that is generated during the uninstall process:

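# Watch the Agent directory for *.ndjson log files and print their contents.

while ($true) {
    # The directory may not exist yet, or may already have been removed.
    if (-not (Test-Path "C:\Program Files\Elastic\Agent")) {
        continue
    }
    $AgentLogs = @(Get-ChildItem -Path "C:\Program Files\Elastic\Agent" -Filter "*.ndjson" -File)

    if ($AgentLogs.Count -ne 0) {
        foreach ($AgentLog in $AgentLogs) {
            # Use the full path so the read does not depend on the current directory.
            $Content = Get-Content -Path $AgentLog.FullName

            Write-Host $Content
        }
    }
}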

Apparently the "c:\\Program Files\\Elastic\\Agent\\elastic-agent.exe" uninstall -f -v command writes this to the log file on startup:
{"log.level":"info","@timestamp":"2024-01-12T20:16:47.703Z","log.logger":"composable.providers.docker","log.origin":{"file.name":"docker/docker.go","file.line":44},"message":"Docker provider skipped, unable to connect: protocol not available","ecs.version":"1.6.0"}

This causes the log file created in C:\Program Files\Elastic\Agent to remain open, which fails the uninstall. We will probably want to either prevent providers from running when the uninstall command is used or avoid logging to a file during the uninstall process (only log on failure?).

@leehinman
Contributor

> This together seems to imply the elastic-agent uninstall command is writing a log file at the root of the agent folder (CreateFile/WriteFile), keeping the log file open, and then trying to delete the log file (SharingViolation), which is stopping it from being uninstalled. Once the uninstall fails it closes the log file (CloseFile) and exits.

This might be a good reason to look at logging to the EventLog on Windows instead of to a File.

@cmacknz
Member

cmacknz commented Jan 12, 2024

We should be logging the output of the uninstall command to the console and not a file. We can log to the event log, but then we'll need to instruct everyone on how to get the events back out of it for troubleshooting.

@leehinman
Contributor

> We should be logging the output of the uninstall command to the console and not a file. We can log to the event log, but then we'll need to instruct everyone on how to get the events back out of it for troubleshooting.

I like logging to stdout/stderr, but I really think we should do EventLog too. The default buffer size on cmd.exe is 50 lines, so I really don't want to get in the situation where the debug line we need has rolled off and we don't have "permanent" storage somewhere, even if it is a pain to get at. (although elastic-agent diagnostics should be able to pull from EventLog)

@cmacknz
Member

cmacknz commented Jan 15, 2024

My proposal would be:

  1. Change this logger to write to console instead of a file. Also we need to document that uninstall can't log to file at all in the code for this reason.
  2. Create a follow up issue to start writing to the event log and have a command for dumping the contents to a file to make reading it easy.

@leehinman
Contributor

> My proposal would be:
>
> 1. Change this logger to write to console instead of a file. Also we need to document that uninstall can't log to file at all in the code for this reason.

We currently have the progress bar for uninstall on stdout. If we log to stderr this is going to get messy.

A couple of other options:

  1. Write to a file in the system temp dir and print where that file is during uninstall; a successful uninstall could delete it.
  2. Use the Observer Output during uninstall, and only print the logs to stderr when there is an error (we do this with some integration tests).

@cmacknz
Member

cmacknz commented Jan 16, 2024

Initial preference is option 2 because it puts the error where we want to see it. I don't have anything against option 1 though.

@leehinman
Contributor

So I implemented the Observer Output, and it doesn't quite fix the problem: we can't delete the "Agent" directory because we are in the Agent directory. You get the same error on Windows if you try rmdir .

c:\Program Files\Elastic\Agent>elastic-agent.exe uninstall
Elastic Agent will be uninstalled from your system at C:\Program Files\Elastic\Agent. Do you want to continue? [Y/n]:y
[    ] Failed to uninstall agent  [5s] Error: error uninstalling agent: failed to remove installation directory (C:\Program Files\Elastic\Agent): timed out while removing "C:\\Program Files\\Elastic\\Agent". Last error: remove C:\Program Files\Elastic\Agent: The process cannot access the file because it is being used by another process.
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.13/fleet-troubleshooting.html

c:\Program Files\Elastic\Agent>dir
 Volume in drive C has no label.
 Volume Serial Number is 88EE-D48A

 Directory of c:\Program Files\Elastic\Agent

01/17/2024  11:34 PM    <DIR>          .
01/17/2024  11:34 PM    <DIR>          ..
               0 File(s)              0 bytes
               2 Dir(s)  108,212,609,024 bytes free

c:\Program Files\Elastic\Agent>

@strawgate
Contributor

strawgate commented Jan 17, 2024

> So I implemented the Observer Output, and it doesn't quite fix the problem: we can't delete the "Agent" directory because we are in the Agent directory. You get the same error on Windows if you try rmdir .
>
> c:\Program Files\Elastic\Agent>elastic-agent.exe uninstall
> Elastic Agent will be uninstalled from your system at C:\Program Files\Elastic\Agent. Do you want to continue? [Y/n]:y
> [    ] Failed to uninstall agent  [5s] Error: error uninstalling agent: failed to remove installation directory (C:\Program Files\Elastic\Agent): timed out while removing "C:\\Program Files\\Elastic\\Agent". Last error: remove C:\Program Files\Elastic\Agent: The process cannot access the file because it is being used by another process.
> For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.13/fleet-troubleshooting.html
>
> c:\Program Files\Elastic\Agent>dir
>  Volume in drive C has no label.
>  Volume Serial Number is 88EE-D48A
>
>  Directory of c:\Program Files\Elastic\Agent
>
> 01/17/2024  11:34 PM    <DIR>          .
> 01/17/2024  11:34 PM    <DIR>          ..
>                0 File(s)              0 bytes
>                2 Dir(s)  108,212,609,024 bytes free
>
> c:\Program Files\Elastic\Agent>

Yeah, that will always be a (different) problem as the command prompt holds a lock on the directory preventing it from being deleted. I previously made reference to this issue containing 2 issues, the first is that you cannot run uninstall if you hold a lock on the root directory, the second is that agent sometimes opens and locks a log file. You're now running into the first issue now that you've resolved the second.

The solutions for this that I'm aware of are:

  1. Document that you should not invoke the agent uninstall command with the working directory set to any directory within Elastic\Agent
  2. Delete that directory on next startup via your recommendation here: #3342 (comment)
  3. Warn the user when we detect that the uninstall command is running from the agent directory and that we won't be able to delete the root directory.
  4. Warn the user when we detect an active lock on files in the Agent directory

@amitkanfer
Contributor

@leehinman - does it mean that with the current implementation, if a user calls uninstall from a different folder - there's no issue? This unblocks the MSI installer issue we thought we had (as we have full control of where we call the command)

@leehinman
Contributor

> @leehinman - does it mean that with the current implementation, if a user calls uninstall from a different folder - there's no issue? This unblocks the MSI installer issue we thought we had (as we have full control of where we call the command)

I'm taking "current implementation" to mean without the Observer Output, i.e. the version that leaves a log file behind in the agent dir. In that case I think there is a race condition where you could hit the "being used by another process" error on the log file it is generating. With the Observer Output, we shouldn't have that race condition.

I'm going to try number 3 that @strawgate mentioned. I should be able to detect the current working dir during uninstall, and if that is within the agent install path immediately error out before any actual uninstall activity. That way the user would know immediately that it wouldn't work, and they don't end up with something half uninstalled. Unless there are objections?
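
A minimal sketch of that pre-flight check, under stated assumptions: isWithin and checkWorkingDir are hypothetical names, the install path is hard-coded for the demo, and the comparison is purely lexical. Real code on Windows would also need to handle case-insensitive paths and links, which this sketch does not.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isWithin reports whether target is base itself or located beneath it,
// judged lexically from the relative path between the two.
func isWithin(base, target string) bool {
	rel, err := filepath.Rel(base, target)
	if err != nil {
		return false
	}
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

// checkWorkingDir is a hypothetical guard that fails before any uninstall
// step runs when the current working directory is inside installDir.
func checkWorkingDir(installDir string) error {
	cwd, err := os.Getwd()
	if err != nil {
		return fmt.Errorf("unable to determine current working directory: %w", err)
	}
	if isWithin(installDir, cwd) {
		return fmt.Errorf("uninstall must be run from outside %q (current directory: %q)", installDir, cwd)
	}
	return nil
}

func main() {
	// Demo value only; a real command would use its resolved install path.
	if err := checkWorkingDir(`C:\Program Files\Elastic\Agent`); err != nil {
		fmt.Println("Error:", err)
		os.Exit(1)
	}
	fmt.Println("working directory is outside the install path; safe to continue")
}
```

Failing early like this keeps the service and files intact, so the user can change directory and rerun the command instead of being left with a half-removed install.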

@amitkanfer
Contributor

> Unless there are objections?

SGTM

@jlind23
Contributor

jlind23 commented Jan 18, 2024

> Unless there are objections?

Looks like the right way to move forward here.

@strawgate
Contributor

strawgate commented Jan 18, 2024

I think I have a preference for this:

> Delete that directory on next startup via your recommendation here: #3342 (comment)

i.e. have Agent Uninstall delete all the files but have Windows schedule the deletion of the Elastic/Agent directory itself on next startup

But I am okay with the proposed path forward.

@amolnater-qasource
Author

Hi Team,

We have revalidated this issue on the latest 8.13.0-SNAPSHOT Kibana cloud environment and found it fixed now:

Observations:

  • No errors are observed on running agent uninstall command.

Screenshot:
image

Build details:
VERSION: 8.13.0
BUILD: 71179
COMMIT: b4d93fc145c3c09eb1096c610b7cd736f19f6a3a
Artifact Link: https://snapshots.elastic.co/8.13.0-por0bbe1/downloads/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-windows-x86_64.zip

Hence we are marking this issue as QA:Validated.

Thanks!

@amolnater-qasource amolnater-qasource added QA:Validated Validated by the QA Team and removed QA:Ready For Testing Code is merged and ready for QA to validate labels Feb 1, 2024