[Fleet]: Sles 15 Linux agents gets unhealthy with Endpoint Security integration on 7.17.27 #6519
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
@amolnater-qasource Kindly review |
Secondary review for this ticket is Done. |
@nfritts @norrietaylor according to the Elastic Agent diag it seems like endpoint is being degraded, can someone on your end take a look please? |
Looks like endpoint reports this error: |
Hmm, I'm not sure that's the root cause. We are for some reason trying to connect to endpoint to get monitoring data the same way we do for Beats, AFAIK endpoint has never exposed a monitoring socket like that. I suspect that log is a symptom of something else. |
My initial thought is that we may have ended up out of sync on pipe/named socket bootstrapping. Endpoint was hoping to merge what is effectively a backport of the change we made for 8.15, which moves the bootstrap process off of a localhost socket, but that PR hasn't merged to 7.17 yet: https://github.com/elastic/endpoint-dev/pull/15344. Has Agent made changes in anticipation of changing the bootstrap? (I did a quick search but didn't see anything that stood out.) If so, then we're out of sync, and either the agent change will have to be reverted or we'll have to get the endpoint change merged before things will work. |
These are the changes merged between 7.17.26 and 7.17.27; I'm not sure what could have caused this. @harshitgupta-qasource this problem was not there in 7.17.26, right? |
@harshitgupta-qasource what was the system integration version you were using? |
@harshitgupta-qasource I tried reproducing this issue using a 7.17.27 deployment and a 7.17.27 BC1 Elastic Agent on Ubuntu 22.04, but I cannot reproduce the agent becoming unhealthy. I created a new empty policy and enrolled an Elastic Agent. After the agent was healthy, I added System integration v1.11.1 as shipped by the 7.17.27 cloud stack. I waited a few minutes for the agent to become unhealthy, but it didn't happen, so I added the Defend integration to the same policy. The agent is still healthy ~20 minutes after the start of my test. How long does it take for the agent to become unhealthy in your test? |
Just adding my 2 cents here: it seems the latest System integration version with support for 7.17 was 1.15.1 (https://github.com/elastic/integrations/pull/3509/files#diff-d4cd9d386b49496970c932d312ae09b5a2acc2c3f85f75a7819064d67634248b), so it could be worth trying an update if necessary. |
Please gather an endpoint diagnostic package from the Ubuntu host and upload it here. Thanks. I suspect that the kernel of the Ubuntu system has moved beyond what 7.17 supports and that endpoint is failing to install event sources. |
Hi Team, sorry for the confusion. We have revalidated this issue with fresh VMs and have the observations below:
So we have updated the ticket accordingly. Screenshot: Agent logs: elastic-agent-diagnostics-2025-01-15T06-12-03Z-00.zip We observed a "command not found" error when running sudo /opt/Elastic/Endpoint/elastic-endpoint diagnostics. Please find the manually collected endpoint logs below: Please let us know if we are missing anything here. |
Tested on SLES15SP5. Captured elastic-agent diagnostics and endpoint logs (please look below ⏬ 🙂). |
Tested on SLES15SP6 x86_64, and I can reproduce the issue of elastic-agent 7.17.27 BC1 going unhealthy when the Elastic Defend integration is added. Captured elastic-agent diagnostics and endpoint logs (elastic-endpoint-logs-sles15sp6.tar.gz). What we can see in the endpoint logs is that there are some errors when processing the new policy:
{"@timestamp":"2025-01-15T08:05:13.499493341Z","agent":{"id":"9d0a45be-3a48-452e-886a-412cf18f4498","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":93,"name":"PerfWatcher.cpp"}}},"message":"PerfWatcher.cpp:93 Failed to write: (r:kprobes/elasticendpoint_TCP_SENDPAGE_RET_probe tcp_sendpage rv=$retval)","process":{"pid":3959,"thread":{"id":4167}}}
{"@timestamp":"2025-01-15T08:05:13.499574564Z","agent":{"id":"9d0a45be-3a48-452e-886a-412cf18f4498","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":93,"name":"PerfWatcher.cpp"}}},"message":"PerfWatcher.cpp:93 Failed to write: (-:kprobes/elasticendpoint_TCP_SENDPAGE_RET_probe)","process":{"pid":3959,"thread":{"id":4167}}}
{"@timestamp":"2025-01-15T08:05:13.70158269Z","agent":{"id":"9d0a45be-3a48-452e-886a-412cf18f4498","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"warning","origin":{"file":{"line":85,"name":"Tux_HostIsolation.cpp"}}},"message":"error talking to the kernel (rtnetlink_send)\n","process":{"pid":3959,"thread":{"id":4167}}}
{"@timestamp":"2025-01-15T08:05:13.762387268Z","agent":{"id":"9d0a45be-3a48-452e-886a-412cf18f4498","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":2032,"name":"Config.cpp"}}},"message":"Config.cpp:2032 Initial configuration application failed","process":{"pid":3959,"thread":{"id":4167}}}
{"@timestamp":"2025-01-15T08:05:13.763514527Z","agent":{"id":"9d0a45be-3a48-452e-886a-412cf18f4498","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":429,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:429 Failed to apply new policy from Agent.","process":{"pid":3959,"thread":{"id":4167}}}
We can find the same errors in the diagnostics attached by @harshitgupta-qasource.
@nfritts @nicholasberlin could you have a look at this please? It seems that endpoint configuration is failing on this specific SLES version. /cc @jlind23 |
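For anyone triaging similar diagnostics, here is a minimal Python sketch for pulling the error-level entries out of endpoint log lines like the ones quoted in this thread. It assumes one JSON object per line with the `@timestamp`, `log.level`, `log.origin.file.name`, and `message` fields seen above. The function name `error_entries` and the `SAMPLE` list are hypothetical and only for illustration; this is not part of any Elastic tooling.

```python
import json


def error_entries(lines):
    """Yield (timestamp, source file, message) for each error-level log line.

    Assumes newline-delimited JSON shaped like the endpoint log lines quoted
    in this thread; lines that are not JSON objects are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        log = entry.get("log", {})
        if log.get("level") != "error":
            continue
        origin = log.get("origin", {}).get("file", {})
        yield entry.get("@timestamp"), origin.get("name"), entry.get("message")


# Two lines copied from the log excerpt in this thread: one warning, one error.
SAMPLE = [
    r'{"@timestamp":"2025-01-15T08:05:13.70158269Z","agent":{"id":"9d0a45be-3a48-452e-886a-412cf18f4498","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"warning","origin":{"file":{"line":85,"name":"Tux_HostIsolation.cpp"}}},"message":"error talking to the kernel (rtnetlink_send)\n","process":{"pid":3959,"thread":{"id":4167}}}',
    r'{"@timestamp":"2025-01-15T08:05:13.762387268Z","agent":{"id":"9d0a45be-3a48-452e-886a-412cf18f4498","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":2032,"name":"Config.cpp"}}},"message":"Config.cpp:2032 Initial configuration application failed","process":{"pid":3959,"thread":{"id":4167}}}',
]

if __name__ == "__main__":
    for timestamp, source, message in error_entries(SAMPLE):
        print(timestamp, source, message)
```

To run it against a real log, pass an open file handle instead of `SAMPLE`, since the function only needs an iterable of lines.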
Thanks for the testing all! Here's a PR to fix the bug: https://github.com/elastic/endpoint-dev/pull/15600 |
FYI, PR was merged, the fix will be in the next release of 7.17 |
Closing this as fixed then. |
Hi Team, Observations:
Build details: Hence, we are marking this issue as QA:Validated. Thanks!! |
Kibana Build details:
Host OS and Browser version: [Sles15]
Preconditions:
Steps to reproduce:
Expected:
Screenshot:



Agents Logs:
elastic-agent-diagnostics-2025-01-13T05-29-48Z-00.zip