Beats shutdown issue #6875

Open
bjmcnic opened this issue Feb 14, 2025 · 10 comments

@bjmcnic
Contributor

bjmcnic commented Feb 14, 2025

For confirmed bugs, please report:

  • Version: At least 8.17.1 (likely earlier) through 9.0.0-beta
  • Operating System: Windows Server (tested on 2019)
  • Steps to Reproduce:
    * Set up the Microsoft DNS server
    * Install Elastic Agent
    * Add the Microsoft DNS Server Integration with defaults
    * Stop the Elastic Agent service
    * Notice that the Elastic-DNSServer-Analytical ETW trace is still running:
PS C:\> logman query -ets

Data Collector Set                      Type                          Status
-------------------------------------------------------------------------------
AppModel                                Trace                         Running
Elastic-DNSServer-Analytical            Trace                         Running
DiagLog                                 Trace                         Running
EventLog-Application                    Trace                         Running
EventLog-System                         Trace                         Running
NtfsLog                                 Trace                         Running
...

This is just one specific, testable scenario that points to a potentially broader issue. The integration within the filebeat portion of the agentbeat.exe process has code to stop that trace when the integration is stopped, but that code never runs when Agent stops. The code also does not run when Agent is left running but the integration is removed.

Debugging the Elastic Agent service stop shows that the elastic-agent.exe service process calls NtTerminateProcess() on the agentbeat.exe process hosting the filebeat integration. The agentbeat.exe process never runs its own cleanup code and does not exit cleanly.

It appears that IPC between the elastic-agent.exe process and the subordinate agentbeat.exe process is not occurring in a way that triggers a clean shutdown, at least in this instance.

@bjmcnic bjmcnic added the bug Something isn't working label Feb 14, 2025
@jlind23 jlind23 added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Feb 17, 2025
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@blakerouse
Contributor

On Windows, the Elastic Agent calls proc.Kill, where proc is an *os.Process: https://github.com/elastic/elastic-agent/blob/main/pkg/core/process/cmd.go#L35

Go's os.Process has no Stop method, and on Windows the only signal Go supports sending is Kill, which calls syscall.TerminateProcess.
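
To make the consequence concrete, here is a minimal sketch (not the Elastic Agent code) of a stop helper that tries os.Interrupt first and falls back to Kill. The `stopper` interface, `stopProcess`, `fakeProc`, and `errInterruptUnsupported` are hypothetical names introduced for illustration; *os.Process happens to satisfy the interface. The os package documentation states that sending Interrupt on Windows is not implemented, so on Windows a helper like this always ends up on the Kill path, which gives the child no chance to clean up.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// stopper is a hypothetical interface for this sketch; *os.Process satisfies it.
type stopper interface {
	Signal(sig os.Signal) error
	Kill() error
}

// errInterruptUnsupported simulates the error a platform returns when
// os.Interrupt cannot be delivered (an assumption used only by fakeProc).
var errInterruptUnsupported = errors.New("interrupt not supported (simulated)")

// stopProcess asks the process to exit via os.Interrupt and falls back to
// Kill when the signal cannot be sent. It reports whether the graceful
// path was taken.
func stopProcess(p stopper) (bool, error) {
	if err := p.Signal(os.Interrupt); err == nil {
		return true, nil
	}
	if err := p.Kill(); err != nil {
		return false, fmt.Errorf("kill: %w", err)
	}
	return false, nil
}

// fakeProc stands in for a child process so the sketch runs anywhere.
type fakeProc struct {
	signalErr error
	killed    bool
}

func (f *fakeProc) Signal(os.Signal) error { return f.signalErr }
func (f *fakeProc) Kill() error            { f.killed = true; return nil }

func main() {
	p := &fakeProc{signalErr: errInterruptUnsupported}
	graceful, err := stopProcess(p)
	fmt.Println("graceful:", graceful, "killed:", p.killed, "err:", err)
}
```

With the simulated failure, the helper reports a non-graceful stop and the fake process is marked as killed, which mirrors what the thread describes for Windows.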

@jlind23
Contributor

jlind23 commented Feb 26, 2025

@blakerouse does this mean it is expected behavior and not a bug?

@blakerouse
Contributor

@jlind23 The behavior we have is clearly defined. I do believe there is an open question of whether we should use a different way of shutting down the process on Windows to ensure that the cleanups are run.

If the cleanups are not run, then it is a bug. The cleanups should be run.

@belimawr
Contributor

The documentation for os.Process.Kill states:

> Kill causes the Process to exit immediately. Kill does not wait until the Process has actually exited. This only kills the Process itself, not any other processes it may have started.

It seems we need a more graceful way to shut down the process, so that all the cleanups actually run. As Blake mentioned, Go on Windows does not support sending the interrupt signal, which makes things more complicated there.

@cmacknz
Member

cmacknz commented Feb 26, 2025

We could have the STOPPED expected state in the control protocol imply that the Beat should shut down gracefully. It looks like there is some intent to do this in https://github.com/elastic/beats/blob/c4054adec82dc0a4626fc50e03066bb87d4cd35e/x-pack/libbeat/management/managerV2.go#L543-L577

OpAMP is using https://learn.microsoft.com/en-us/windows/console/generateconsolectrlevent to send syscall.CTRL_BREAK_EVENT as the Windows replacement for SIGINT on Unix.
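
For the receiving side of such an event, the Go os/signal documentation says that on Windows, when Notify is called for os.Interrupt, a Control-C or Control-Break causes os.Interrupt to be sent on the channel instead of terminating the program. The following is a rough sketch of a child process using that to run cleanup before exiting. `shutdownSignal`, `newShutdownChan`, and `waitAndCleanup` are hypothetical names, the cleanup body is a placeholder, and none of this is the Beats implementation. The sender side (the Windows-only GenerateConsoleCtrlEvent call) is not shown.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
)

// shutdownSignal is what this sketch treats as a request to stop. On
// Windows, Control-C and Control-Break are both reported as os.Interrupt.
var shutdownSignal os.Signal = os.Interrupt

// newShutdownChan registers for shutdownSignal and returns the channel
// on which it will be delivered.
func newShutdownChan() chan os.Signal {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, shutdownSignal)
	return sigs
}

// waitAndCleanup blocks until a signal arrives on sigs, then runs cleanup.
func waitAndCleanup(sigs <-chan os.Signal, cleanup func() error) error {
	sig := <-sigs
	if err := cleanup(); err != nil {
		return fmt.Errorf("cleanup after %v failed: %w", sig, err)
	}
	return nil
}

func main() {
	sigs := newShutdownChan()
	defer signal.Stop(sigs)

	// Simulate delivery so the sketch terminates on its own. A real
	// process would block in waitAndCleanup until the event arrives.
	sigs <- shutdownSignal

	err := waitAndCleanup(sigs, func() error {
		fmt.Println("stopping ETW session (placeholder)")
		return nil
	})
	fmt.Println("cleanup error:", err)
}
```

Whether this helps in the Agent case depends on the parent being able to deliver the console event to the child, which the thread has not established.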

@blakerouse
Contributor

> We could have the STOPPED expected state in the control protocol imply that the Beat should shut down gracefully. It looks like there is some intent to do this in https://github.com/elastic/beats/blob/c4054adec82dc0a4626fc50e03066bb87d4cd35e/x-pack/libbeat/management/managerV2.go#L543-L577

@bjmcnic Elastic Agent does tell the component to stop the unit. I would expect it to clean up at that point.

@belimawr
Contributor

belimawr commented Feb 27, 2025

> We could have the STOPPED expected state in the control protocol imply that the Beat should shut down gracefully. It looks like there is some intent to do this in https://github.com/elastic/beats/blob/c4054adec82dc0a4626fc50e03066bb87d4cd35e/x-pack/libbeat/management/managerV2.go#L543-L577

Indeed, there is some intent, but if the Elastic Agent sends a kill signal (like here: https://github.com/elastic/elastic-agent/blob/main/pkg/core/process/cmd.go#L35) before the Beat has had time to fully clean up and exit gracefully, the issue still persists.
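
One way to avoid that race is to give the child a grace period between the stop request and the kill. This is a minimal sketch of the pattern, not a proposal for the actual Agent code: `stopWithGrace`, `defaultGrace`, and the callback parameters are all hypothetical, and how the stop request reaches the Beat (control protocol, console event, or something else) is left to the caller.

```go
package main

import (
	"fmt"
	"time"
)

// defaultGrace is an arbitrary value chosen for this sketch.
const defaultGrace = 50 * time.Millisecond

// stopWithGrace asks a child to stop, waits up to grace for it to exit,
// and kills it only if it has not exited by then. It reports whether
// the kill path was taken.
func stopWithGrace(requestStop func() error, exited <-chan struct{}, kill func() error, grace time.Duration) (bool, error) {
	if err := requestStop(); err != nil {
		// The request could not be delivered, so waiting is pointless.
		if kerr := kill(); kerr != nil {
			return true, fmt.Errorf("kill after failed stop request: %w", kerr)
		}
		return true, nil
	}

	timer := time.NewTimer(grace)
	defer timer.Stop()

	select {
	case <-exited:
		return false, nil // the child cleaned up and exited on its own
	case <-timer.C:
		if kerr := kill(); kerr != nil {
			return true, fmt.Errorf("kill after grace period: %w", kerr)
		}
		return true, nil
	}
}

func main() {
	exited := make(chan struct{})
	requestStop := func() error {
		// Simulate a child that exits shortly after being asked.
		go func() {
			time.Sleep(10 * time.Millisecond)
			close(exited)
		}()
		return nil
	}
	kill := func() error { return nil }

	killed, err := stopWithGrace(requestStop, exited, kill, defaultGrace)
	fmt.Println("killed:", killed, "err:", err)
}
```

In a real supervisor, `exited` would typically be closed by the goroutine that calls Wait on the child process, and the grace period would need to be long enough for the slowest cleanup.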

@cmacknz
Member

cmacknz commented Feb 27, 2025

Also, it's possible there's a bug in the way the Beat shuts itself down, or that it does so in a way that doesn't allow cleanup to happen. One possibility: cancelling the root context causes all in-progress operations to abort, even the ones doing cleanup.
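
The context scenario can be illustrated with a small sketch. It assumes a cleanup step that checks its context before doing work; `stopTrace`, `cleanupContext`, `runShutdown`, and `cleanupTimeout` are hypothetical names, not Beats code. context.WithoutCancel (available since Go 1.21) detaches the cleanup from the root's cancellation, and the timeout keeps it bounded.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cleanupTimeout is an arbitrary bound chosen for this sketch.
const cleanupTimeout = 5 * time.Second

// stopTrace stands in for a cleanup step (such as stopping an ETW
// session) that refuses to run once its context is done.
func stopTrace(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("cleanup aborted: %w", err)
	}
	return nil
}

// cleanupContext returns a context that ignores the parent's
// cancellation but is still limited by timeout.
func cleanupContext(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(parent), timeout)
}

// runShutdown cancels a root context and then attempts cleanup twice:
// once with the cancelled root and once with a detached context.
func runShutdown() (rootErr, detachedErr error) {
	root, cancelRoot := context.WithCancel(context.Background())
	cancelRoot() // shutdown begins: the root context is cancelled

	rootErr = stopTrace(root)

	ctx, cancel := cleanupContext(root, cleanupTimeout)
	defer cancel()
	detachedErr = stopTrace(ctx)
	return rootErr, detachedErr
}

func main() {
	rootErr, detachedErr := runShutdown()
	fmt.Println("with root context:", rootErr)
	fmt.Println("with detached context:", detachedErr)
}
```

The first attempt fails because the root is already cancelled, while the detached attempt succeeds. Whether the Beat's shutdown path actually behaves like the first case is exactly the open question raised above.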

@blakerouse
Contributor

I would expect this to be an issue on Windows for Beats in general. If I were to start the Beat and then stop it from the command line, would it clean up? If I stopped it from Task Manager, would it clean up? If I stopped it from the service manager, would it clean up?
