Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set SkipDaemonRestart in tryDelayEnroll #4042

Merged
merged 2 commits into from
Jan 9, 2024

Conversation

leehinman
Copy link
Contributor

What does this PR do?

This PR skips trying to restart the elastic-agent process during a delayed enroll. Instead the process exits and is restarted by the service manager.

This is necessary because the code to restart the elastic-agent process requires that the control server be up, and it isn't at this point. This restores previous behavior, where the attempt to restart the elastic-agent process was non-fatal.

Why is it important?

bug that prevents elastic-agent from starting when --delay-enroll is used

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Author's Checklist

  • [ ]

How to test this PR locally

Without PR

  1. On Windows, install elastic-agent with --delay-enroll option
  2. attempt to start service with Services control panel
  3. notice in logs that restart fails 4 times and elastic-agent status will fail

With PR

  1. On Windows, install elastic-agent with --delay-enroll option
  2. attempt to start service with Services control panel
  3. elastic-agent will start normally and elastic-agent status will work

Related issues

Use cases

Screenshots

Logs

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@leehinman leehinman requested a review from a team as a code owner January 8, 2024 18:43
@leehinman leehinman added bug Something isn't working backport-v8.12.0 Automated backport with mergify Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Jan 8, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Comment on lines 504 to 506
// enrollCmd daemonReloadWithBackoff is broken, setting
// SkipDaemonRestart to true avoids running that code.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit vague, how is it broken?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made an issue and added link to it in the comment.

@cmacknz
Copy link
Member

cmacknz commented Jan 8, 2024

The test failure is the usual Last error: remove C:\\Program Files\\Elastic\\Agent\\data\\elastic-agent-c08136\\elastic-agent.exe: Access is denied.

I tested this in an Ubuntu VM and it worked as expected:

{"log.level":"info","@timestamp":"2024-01-08T19:29:27.845Z","log.origin":{"file.name":"cmd/run.go","file.line":170},"message":"Elastic Agent started","log":{"source":"elastic-agent"},"process.pid":1773,"agent.version":"8.13.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-01-08T19:29:27.892Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":496},"message":"Starting enrollment to URL: https://2e6c294352624beda2f6830fada574fb.fleet.eastus2.staging.azure.foundit.no:443/","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-01-08T19:29:29.308Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":289},"message":"Elastic Agent has been enrolled; start Elastic Agent","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-01-08T19:29:29.308Z","log.origin":{"file.name":"cmd/run.go","file.line":529},"message":"Successfully performed delayed enrollment of this Elastic Agent.","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

@cmacknz
Copy link
Member

cmacknz commented Jan 8, 2024

I do think we should add an integration test for this, just not sure if it makes sense as follow up or not.

We could modify an existing integration test that enrolls into Fleet to use --delay-enroll instead. For example this is a good candidate:

policy, err := tools.InstallAgentWithPolicy(
ctx,
t,
installOpts,
agentFixture,
info.KibanaClient,
createPolicyReq)
require.NoError(t, err)
t.Logf("created policy: %s", policy.ID)
check.ConnectedToFleet(ctx, t, agentFixture, 5*time.Minute)

All we need to do to test this is start using the --delay-enroll argument at install and then manually start the service. On Linux this is just running systemctl start elastic-agent. We don't need to reboot the VM.

@leehinman thoughts on adding a test here or is it better as a follow up issue to make sure we get the bug fixed in time for 8.12?

@leehinman
Copy link
Contributor Author

@leehinman thoughts on adding a test here or is it better as a follow up issue to make sure we get the bug fixed in time for 8.12?

Let me see if I can get an integration test running by ~noon tomorrow. If not we can merge to make sure we get in before 8.12.

this skips trying to restart the process which is currently broken
because the control daemon hasn't been started yet.  This restores the
previous behavior where the restart failing wasn't fatal.
@leehinman
Copy link
Contributor Author

@leehinman thoughts on adding a test here or is it better as a follow up issue to make sure we get the bug fixed in time for 8.12?

Let me see if I can get an integration test running by ~noon tomorrow. If not we can merge to make sure we get in before 8.12.

Integration test is in PR now. If you comment out the options.SkipDaemonRestart = true change in run.go it fails, and it succeeds with the proposed change. Just Linux for right now.

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test!

@cmacknz
Copy link
Member

cmacknz commented Jan 9, 2024

buildkite test it

1 similar comment
@cmacknz
Copy link
Member

cmacknz commented Jan 9, 2024

buildkite test it

@cmacknz
Copy link
Member

cmacknz commented Jan 9, 2024

https://proxy.golang.org/github.com/magefile/mage/@v/v1.14.0.info": dial tcp: lookup proxy.golang.org: getaddrinfow: This is usually a temporary error during hostname resolution and means that the local server did not receive a response from an authoritative server.

🤦

@cmacknz
Copy link
Member

cmacknz commented Jan 9, 2024

buildkite test it

Copy link

Quality Gate failed Quality Gate failed

Failed conditions

0.0% 0.0% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube

@cmacknz
Copy link
Member

cmacknz commented Jan 9, 2024

Only failure is a known issue, force merging:

Last error: remove C:\\Program Files\\Elastic\\Agent\\data\\elastic-agent-c6ce80\\elastic-agent.exe: Access is denied.

@cmacknz cmacknz merged commit ba0ea0b into elastic:main Jan 9, 2024
mergify bot pushed a commit that referenced this pull request Jan 9, 2024
* set SkipDaemonRestart in tryDelayEnroll

this skips trying to restart the process which is currently broken
because the control daemon hasn't been started yet.  This restores the
previous behavior where the restart failing wasn't fatal.

* add integration test

(cherry picked from commit ba0ea0b)
cmacknz pushed a commit that referenced this pull request Jan 10, 2024
* set SkipDaemonRestart in tryDelayEnroll

this skips trying to restart the process which is currently broken
because the control daemon hasn't been started yet.  This restores the
previous behavior where the restart failing wasn't fatal.

* add integration test

(cherry picked from commit ba0ea0b)

Co-authored-by: Lee E Hinman <57081003+leehinman@users.noreply.github.com>
cmacknz pushed a commit that referenced this pull request Jan 17, 2024
* set SkipDaemonRestart in tryDelayEnroll

this skips trying to restart the process which is currently broken
because the control daemon hasn't been started yet.  This restores the
previous behavior where the restart failing wasn't fatal.

* add integration test

(cherry picked from commit ba0ea0b)

Co-authored-by: Lee E Hinman <57081003+leehinman@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.12.0 Automated backport with mergify bug Something isn't working skip-changelog Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
3 participants