Windows Agent shows missed check-ins and degraded status after NTP Sync and manual time adjustments. #5363

Closed
harshitgupta-qasource opened this issue Aug 27, 2024 · 10 comments
Labels
  • bug: Something isn't working
  • impact:high: Short-term priority; add to current release, or definitely next.
  • QA:Validated: Validated by the QA Team
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team

Comments

@harshitgupta-qasource

Kibana Build details:

VERSION: 8.16.0 SNAPSHOT
BUILD: 77679
COMMIT: 6b091fe3b410eaae9d4805c0a3c0ea6168bf66b0

VERSION: 8.15.1 SNAPSHOT
BUILD: 76479
COMMIT: d3293f748cb6d5a16fcc398cf0253fa2c4cc1552

Host OS: Windows Server 2022

Preconditions:

  1. 8.16.0 SNAPSHOT Kibana Cloud environment should be available.
  2. 8.15.1 / 8.16.0 Windows agent should be installed.

Steps to reproduce:

  1. On the Windows VM, navigate to Windows settings and turn on NTP sync.
  2. Install the elastic-agent on the Windows machine.
  3. Every 10 seconds, manually increase the system time by 1 minute.
  4. Continue this process for 2 minutes.
  5. Observe the logs for missed check-ins and notice the component status changes from Healthy to Degraded.

Expected:
Windows Agent shouldn't display missed check-ins and degraded status after NTP Sync and manual time adjustments.

Screenshot:
8.15.1
Image

8.16.0
Image

Agent Logs:
elastic-agent-diagnostics-2024-08-27T11-06-19Z-00.zip

Feature Ticket:
#5284

@harshitgupta-qasource harshitgupta-qasource added bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Aug 27, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@harshitgupta-qasource
Author

@amolnater-qasource Kindly review

@amolnater-qasource amolnater-qasource changed the title from "[Fleet]: Windows Agent shows missed check-ins and degraded status after NTP Sync and manual time adjustments." to "Windows Agent shows missed check-ins and degraded status after NTP Sync and manual time adjustments." on Aug 27, 2024
@amolnater-qasource

Secondary review for this ticket is done.

@pierrehilbert
Contributor

@leehinman could this be related to your recent changes?

@leehinman
Contributor

Can we re-test?

A single missed check-in isn't a problem. If we get several in a row, we hit lines like:

{"log.level":"error","@timestamp":"2024-08-27T20:20:40.278Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed winlog-default (HEALTHY->FAILED): Failed: pid '3792' missed 3 check-ins and will be killed","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"FAILED"},"unit":{"id":"winlog-default","type":"output","state":"FAILED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}

That is the real problem.
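To separate isolated misses from the "missed 3 check-ins" case when reviewing a diagnostics bundle, a small filter over the JSON log lines can help. The following is only a sketch: the function name `find_missed_checkins`, the `threshold` parameter, and the regular expression are illustrative assumptions based on the single message quoted above, not part of elastic-agent, and lines in any other format are simply skipped.

```python
import json
import re

# Matches the message text quoted in this thread, e.g.
# "pid '3792' missed 3 check-ins and will be killed".
# Other message wordings are an assumption and may need a different pattern.
MISSED_RE = re.compile(r"pid '(\d+)' missed (\d+) check-ins?")


def find_missed_checkins(lines, threshold=3):
    """Return (timestamp, component_id, pid, count) tuples for JSON log
    lines whose message reports at least `threshold` missed check-ins."""
    hits = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line; ignore it
        if not isinstance(entry, dict):
            continue
        match = MISSED_RE.search(str(entry.get("message", "")))
        if match is None:
            continue
        pid, count = int(match.group(1)), int(match.group(2))
        if count >= threshold:
            component = entry.get("component")
            component_id = component.get("id") if isinstance(component, dict) else None
            hits.append((entry.get("@timestamp"), component_id, pid, count))
    return hits


# Sample entry reduced to the fields this sketch reads;
# the values are copied from the log line quoted above.
sample_line = json.dumps({
    "@timestamp": "2024-08-27T20:20:40.278Z",
    "message": "Unit state changed winlog-default (HEALTHY->FAILED): "
               "Failed: pid '3792' missed 3 check-ins and will be killed",
    "component": {"id": "winlog-default", "state": "FAILED"},
})

for hit in find_missed_checkins([sample_line]):
    print(hit)
```

Lowering `threshold` to 1 would also list the single missed check-ins, which on their own are not a concern.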

I tested on 8.14.3 with NTP enabled: I enrolled and started the agent, then ran the following script in an elevated (Run as Administrator) PowerShell session:

# Jump the system clock forward 97 seconds, wait 11 seconds, and repeat until interrupted.
while ($true) {
    Set-Date -Date (Get-Date).AddSeconds(97)
    Start-Sleep -Seconds 11
}

We then see many lines about 1 missed check-in in the logs, and usually within 10 minutes I see "missed 3 check-ins" and the process is restarted.

When I switch to an 8.16.0 build that is newer than commit 129c8c458f231b97d66af860a3dfe2b7a5113d03, I don't see any missed check-ins.

@harshitgupta-qasource
Author

Hi Team,

We have revalidated this issue on the latest 8.16.0 SNAPSHOT and 8.14.3 Kibana Cloud environments, with the observations below.

Observations:

  • We ran the same script in an elevated PowerShell session for both the 8.14.3 and 8.16.0 agents.

  • We observed error logs for both agents: [elastic_agent][error] Component state changed log-default (DEGRADED->FAILED): Failed: pid '1632' missed 3 check-ins and will be killed

  • We have shared the logs below for your reference:

Agent Logs

Screenshots:

  • 8.14.3
    Image

  • 8.16.0
    Image

Build details:
VERSION: 8.16.0
BUILD: 77768
COMMIT: ecec57ca52a3b00c6a2ab2cf36b2ec9a7c4d1981
ARTIFACT: https://snapshots.elastic.co/8.16.0-4647f73c/downloads/beats/elastic-agent/elastic-agent-8.16.0-SNAPSHOT-windows-x86_64.zip

Please let us know if we are missing anything here.

Thanks!

@leehinman
Contributor

The 8.16.0-SNAPSHOT that was used was built from commit 2676a3f372b3aeca96c266fcb18d633a6c416496; this comes from both the .build_hash.txt file and the elastic-agent-2676a3 directory in the package. The problem is that this commit is from Sun Jul 28 17:33:31 2024 +0200, while the commit that has the fix is 129c8c458f231b97d66af860a3dfe2b7a5113d03 from Fri Aug 16 08:52:10 2024 -0500.

So can we re-test with a more recent 8.16.0-SNAPSHOT elastic-agent package?
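The date comparison above can be sketched as follows, as a quick sanity check before re-testing. The helper name `predates` and the format constant are illustrative assumptions; the two date strings are the ones quoted above. Commit dates only hint at ordering: whether a build actually contains the fix depends on git ancestry, which `git merge-base --is-ancestor` can confirm in a checkout of the repository.

```python
from datetime import datetime

# Layout of git's default date output, e.g. "Sun Jul 28 17:33:31 2024 +0200".
# Parsing the day and month names assumes an English/C locale.
GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"


def predates(build_commit_date, fix_commit_date):
    """Return True if the build commit is dated before the fix commit."""
    build = datetime.strptime(build_commit_date, GIT_DATE_FORMAT)
    fix = datetime.strptime(fix_commit_date, GIT_DATE_FORMAT)
    # Both values carry UTC offsets, so the comparison is timezone-aware.
    return build < fix


build_date = "Sun Jul 28 17:33:31 2024 +0200"  # commit 2676a3f3..., found in the tested package
fix_date = "Fri Aug 16 08:52:10 2024 -0500"    # commit 129c8c45..., which has the fix

print(predates(build_date, fix_date))
```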

@harshitgupta-qasource
Author

harshitgupta-qasource commented Sep 3, 2024

Hi Team,

We have re-validated this issue on the latest 8.16.0 SNAPSHOT Kibana Cloud environment, with the observations below.

Observations:

  • No new check-in errors were observed on the latest artifact.
  • After approximately 5 minutes, we stopped the time-forwarding script and then updated the logging level to debug, but no new debug logs were generated.

Agent Logs

UPDATE:

  • We observed that debug logs are generated after some time.

elastic-agent-diagnostics-2024-09-03T13-14-30Z-00 (1).zip

Could you please confirm if this is expected?

Build details:
VERSION: 8.16.0 SNAPSHOT
BUILD: 77913
COMMIT: f2aba4624160124344e98dac19d5eefd83fa79ce

Screenshots:
Image
Image

Kindly let us know if we are missing anything here.

Thanks

@leehinman
Contributor

The diagnostics look normal to me. I think the delay in the logs showing up is because the system time was moved forward, so the log entries are in the future as far as the log viewer is concerned.
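A rough illustration of that explanation, under stated assumptions: the script from earlier in the thread (97-second jumps, 11-second sleeps) ran for about 5 minutes, NTP did not pull the clock back in the meantime, and the log viewer shows a relative window ending at its own "now". The 15-minute window and the `viewer_now` value are hypothetical, as are the helper names; none of this is confirmed viewer behavior.

```python
from datetime import datetime, timedelta, timezone


def estimated_skew(run_seconds, jump_seconds=97, sleep_seconds=11):
    """Approximate how far ahead the host clock ends up, counting one
    jump per completed sleep interval and ignoring any NTP correction."""
    return timedelta(seconds=(run_seconds // sleep_seconds) * jump_seconds)


def visible_in_window(log_time, viewer_now, lookback=timedelta(minutes=15)):
    """Return True if log_time falls inside [viewer_now - lookback, viewer_now]."""
    return viewer_now - lookback <= log_time <= viewer_now


# Hypothetical "now" as seen by the log viewer (real time).
viewer_now = datetime(2024, 9, 3, 13, 0, tzinfo=timezone.utc)

skew = estimated_skew(run_seconds=5 * 60)  # roughly 43 minutes ahead
log_time = viewer_now + skew               # timestamp written by the skewed host

print(skew)
print(visible_in_window(log_time, viewer_now))         # entry is still "in the future"
print(visible_in_window(log_time, viewer_now + skew))  # real time has caught up
```

Under these assumptions, a debug entry written right after the script stops would only fall inside the window once real time catches up with the host clock, which matches the logs appearing "after some time".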

@harshitgupta-qasource
Author

Thanks for the update.
We are closing this issue and marking it as QA:Validated.

@harshitgupta-qasource harshitgupta-qasource added the QA:Validated Validated by the QA Team label Sep 5, 2024