[Fleet]: Sles 15 Linux agents gets unhealthy with Endpoint Security integration on 7.17.27 #6519

Closed
harshitgupta-qasource opened this issue Jan 13, 2025 · 19 comments
Labels
bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. QA:Validated Validated by the QA Team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@harshitgupta-qasource

harshitgupta-qasource commented Jan 13, 2025

Kibana Build details:

VERSION: 7.17.27
BUILD 47755
COMMIT 828e49db669c29d8cc4f3a30f6abe5e8f69a4290
Artifact: https://staging.elastic.co/7.17.27-b47ca93f/summary-7.17.27.html#elastic-agent-package

Host OS and Browser version: [Sles15]

Preconditions:

  1. 7.17.27 BC1 Kibana Cloud environment should be available.

Steps to reproduce:

  1. Navigate to the Agents Tab
  2. Add the Endpoint Security integration to the agent and go to the Endpoint tab.
  3. Observe that the Linux agent goes into an unhealthy state.

Expected:

  • Sles 15 Linux agents should be healthy with Endpoint Security integration on 7.17.27

Screenshot:
Image
Image
Image

Agents Logs:

elastic-agent-diagnostics-2025-01-13T05-29-48Z-00.zip

@harshitgupta-qasource harshitgupta-qasource added bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Jan 13, 2025
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@harshitgupta-qasource
Author

@amolnater-qasource Kindly review

@amolnater-qasource

Secondary review for this ticket is Done.

@jlind23
Contributor

jlind23 commented Jan 13, 2025

@nfritts @norrietaylor according to the Elastic Agent diag it seems like endpoint is being degraded, can someone on your end take a look please?

@jlind23
Contributor

jlind23 commented Jan 13, 2025

Looks like endpoint reports this error:
error: 'Get "http://unix/": dial unix /opt/Elastic/Agent/data/tmp/default/endpoint-security/endpoint-security.sock: connect: no such file or directory'
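
This is the error a client gets when it dials a Unix domain socket whose file does not exist. A minimal Python sketch of the same condition follows. It is only an illustration: the agent itself is not written in Python, the helper name `probe_unix_socket` is hypothetical, and the path is the one from the log line above.

```python
import socket


def probe_unix_socket(path: str) -> str:
    """Try to connect to a Unix domain socket and report the outcome."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
        return "connected"
    except FileNotFoundError:
        # Same condition as "connect: no such file or directory".
        return "no such file or directory"
    except ConnectionRefusedError:
        # The socket file exists but nothing is listening on it.
        return "connection refused"
    finally:
        sock.close()


if __name__ == "__main__":
    # Path copied from the error message in this issue.
    path = "/opt/Elastic/Agent/data/tmp/default/endpoint-security/endpoint-security.sock"
    print(probe_unix_socket(path))
```

A result of "no such file or directory" only says that no socket file exists at that path; it does not say why.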

@cmacknz
Member

cmacknz commented Jan 13, 2025

Hmm, I'm not sure that's the root cause. We are for some reason trying to connect to endpoint to get monitoring data the same way we do for Beats, AFAIK endpoint has never exposed a monitoring socket like that. I suspect that log is a symptom of something else.

@nfritts

nfritts commented Jan 14, 2025

My initial thought is that we may have ended up out of sync on pipe/named socket bootstrapping?

Endpoint was hoping to merge (but hasn't merged to 7.17 yet) what is effectively a backport of the change we made for 8.15 to the bootstrap process, moving it off of a localhost socket.

The endpoint PR isn't merged yet: https://github.com/elastic/endpoint-dev/pull/15344

Has Agent made changes in anticipation of changing the bootstrap? (I did a quick search but didn't see anything that stood out) If so, then we're out of sync and either the agent change will have to be reverted or we'll have to get the endpoint change merged before things will work.

@jlind23
Contributor

jlind23 commented Jan 14, 2025

These are the changes merged between 7.17.26 and 7.17.27, not sure what could have caused this.
Image

@harshitgupta-qasource this problem was not there in 7.17.26 right?

@jlind23
Contributor

jlind23 commented Jan 14, 2025

@harshitgupta-qasource what was the system integration version you were using?

@pchila
Member

pchila commented Jan 14, 2025

@harshitgupta-qasource I tried reproducing this issue using a 7.17.27 deployment and a 7.17.27 BC1 elastic agent on Ubuntu 22.04, but I cannot reproduce the agent being unhealthy.

I created a new empty policy and enrolled an elastic agent

After the agent was healthy I added System Integration v. 1.11.1 as shipped by 7.17.27 cloud stack

Image

Waited a few minutes for the agent to become unhealthy, but it didn't happen, so I added the Defend integration to the same policy

Image

Agent is still healthy after ~20 mins from the start of my test.

Image

How long did it take for the agent to become unhealthy in your test?
If I understood correctly, you saw the agent unhealthy with just the System integration, correct?
Is there any difference between my test steps and yours that could lead to a different result?

@marc-gr
Contributor

marc-gr commented Jan 14, 2025

Just adding my 2 cents here: it seems the latest system version with support for 7.17 was 1.15.1 (https://github.com/elastic/integrations/pull/3509/files#diff-d4cd9d386b49496970c932d312ae09b5a2acc2c3f85f75a7819064d67634248b), so it could be worth trying an update if necessary.

@nicholasberlin

Please gather an endpoint diagnostic package from the Ubuntu host.

$ sudo /opt/Elastic/Endpoint/elastic-endpoint diagnostics

And, upload here. Thanks.

I suspect that the kernel of the Ubuntu system has moved beyond the support within 7.17 and it's failing to install event sources.

@harshitgupta-qasource
Author

Hi Team,

Sorry for the confusion.

We have revalidated this issue on fresh VMs and have the following observations:

  • The issue is reproducible on SLES 15 with Endpoint Security and is not reproducible with only the System integration.

  • The issue is also not observed on Ubuntu 20. Ubuntu 22 is not supported, so we have excluded it.

  • It appears that the issue observed in our previous test was due to a faulty VM.

So we have updated the ticket accordingly.
Build details:
VERSION: 7.17.27
BUILD 47755
COMMIT 828e49db669c29d8cc4f3a30f6abe5e8f69a429

Screen-Shot:

  • Agents Tab
    Image

  • Endpoint Security policy
    Image

Agents Logs

elastic-agent-diagnostics-2025-01-15T06-12-03Z-00.zip

We observed a "command not found" error for sudo /opt/Elastic/Endpoint/elastic-endpoint diagnostics.

Please find below manually collected endpoint logs:

Endpoint-logs.zip

Please let us know if we are missing anything here.
Thank you

@harshitgupta-qasource harshitgupta-qasource changed the title from "[Fleet]: Linux agents gets unhealthy with system integration on 7.17.27" to "[Fleet]: Sles 15 Linux agents gets unhealthy with Endpoint Security integration on 7.17.27" Jan 15, 2025
@pkoutsovasilis
Contributor

pkoutsovasilis commented Jan 15, 2025

Tested on SLES15SP5 x86/64 and I can't reproduce the issue with elastic-agent 7.17.27 BC1. Both elastic-agent and elastic-endpoint report Healthy

Image

Captured elastic-agent diagnostics and endpoint logs (elastic-endpoint diagnostics is not a valid command in 7.17.27)
agent-diagnostics-sles15sp5.zip
endpoint-log-sles15sp5.log

Please look below ⏬ 🙂

@pchila
Member

pchila commented Jan 15, 2025

Tested on SLES15SP6 x86_64 and I can reproduce the issue with elastic-agent 7.17.27 BC1 going unhealthy when Elastic Defend integration is added.

Image

Captured elastic-agent diagnostics and endpoint logs (elastic-endpoint diagnostics is not a valid command in 7.17.27)

elastic-endpoint-logs-sles15sp6.tar.gz
diagnostics-elastic-agent-sles15sp6.zip

What we can see in the endpoint logs is that there are some errors when processing the new policy:

{"@timestamp":"2025-01-15T08:05:13.499493341Z","agent":{"id":"9d0a45be-3a48-452e-886a-412cf18f4498","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":93,"name":"PerfWatcher.cpp"}}},"message":"PerfWatcher.cpp:93 Failed to write: (r:kprobes/elasticendpoint_TCP_SENDPAGE_RET_probe tcp_sendpage rv=$retval)","process":{"pid":3959,"thread":{"id":4167}}}
{"@timestamp":"2025-01-15T08:05:13.499574564Z","agent":{"id":"9d0a45be-3a48-452e-886a-412cf18f4498","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":93,"name":"PerfWatcher.cpp"}}},"message":"PerfWatcher.cpp:93 Failed to write: (-:kprobes/elasticendpoint_TCP_SENDPAGE_RET_probe)","process":{"pid":3959,"thread":{"id":4167}}}

{"@timestamp":"2025-01-15T08:05:13.70158269Z","agent":{"id":"9d0a45be-3a48-452e-886a-412cf18f4498","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"warning","origin":{"file":{"line":85,"name":"Tux_HostIsolation.cpp"}}},"message":"error talking to the kernel (rtnetlink_send)\n","process":{"pid":3959,"thread":{"id":4167}}}

{"@timestamp":"2025-01-15T08:05:13.762387268Z","agent":{"id":"9d0a45be-3a48-452e-886a-412cf18f4498","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":2032,"name":"Config.cpp"}}},"message":"Config.cpp:2032 Initial configuration application failed","process":{"pid":3959,"thread":{"id":4167}}}

{"@timestamp":"2025-01-15T08:05:13.763514527Z","agent":{"id":"9d0a45be-3a48-452e-886a-412cf18f4498","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":429,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:429 Failed to apply new policy from Agent.","process":{"pid":3959,"thread":{"id":4167}}}

We can find the same errors in the diagnostics attached by @harshitgupta-qasource
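
The entries above are newline-delimited JSON, so the error-level ones can be pulled out of a larger endpoint log with a few lines of Python. This is a minimal sketch that assumes every entry uses the same fields as the excerpt (`@timestamp`, `log.level`, `log.origin.file.name`, `message`); the function name `error_entries` is hypothetical.

```python
import json


def error_entries(lines):
    """Yield (timestamp, source file, message) for error-level log entries."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # the excerpt has blank lines between entries
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not objects
        log = entry.get("log", {})
        if log.get("level") != "error":
            continue  # warnings and lower levels are ignored
        source = log.get("origin", {}).get("file", {}).get("name", "?")
        yield entry.get("@timestamp", "?"), source, entry.get("message", "")


if __name__ == "__main__":
    # One entry copied from the excerpt above, trimmed to the fields used here.
    sample = [
        '{"@timestamp":"2025-01-15T08:05:13.762387268Z",'
        '"log":{"level":"error","origin":{"file":{"line":2032,"name":"Config.cpp"}}},'
        '"message":"Config.cpp:2032 Initial configuration application failed"}'
    ]
    for ts, source, message in error_entries(sample):
        print(ts, source, message)
```

Passing an open log file instead of `sample` works the same way, since the function accepts any iterable of lines.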

@nfritts @nicholasberlin could you have a look at this please? It seems that endpoint configuration is failing on this specific SLES version.
Let me know if you want to transfer the issue on some other repo.

/cc @jlind23
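
The first two errors in the log show endpoint failing to register a return probe on the kernel function `tcp_sendpage`. One possible reason for such a write to fail is that the function is not present in the running kernel; nothing in this thread confirms that as the cause. A minimal sketch for checking it on a host, assuming `/proc/kallsyms` is readable (the helper name `kernel_has_symbol` is hypothetical):

```python
import os


def kernel_has_symbol(symbol: str, kallsyms_text: str) -> bool:
    """Return True if `symbol` is listed as a symbol name in kallsyms output."""
    for line in kallsyms_text.splitlines():
        # Each line is "<address> <type> <name>", optionally followed by "[module]".
        fields = line.split()
        if len(fields) >= 3 and fields[2] == symbol:
            return True
    return False


if __name__ == "__main__":
    kallsyms = "/proc/kallsyms"
    if os.path.exists(kallsyms):
        with open(kallsyms) as f:
            print(kernel_has_symbol("tcp_sendpage", f.read()))
    else:
        print("no /proc/kallsyms on this system")
```

The match is on the exact symbol name, so a similarly named function such as `tcp_sendpage_locked` does not count.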

@nicholasberlin

Thanks for the testing all!

Here's a PR to fix the bug: https://github.com/elastic/endpoint-dev/pull/15600

@nicholasberlin

FYI, the PR was merged; the fix will be in the next release of 7.17.

@jlind23
Contributor

jlind23 commented Jan 30, 2025

Closing this as fixed then.

@jlind23 jlind23 closed this as completed Jan 30, 2025
@amolnater-qasource amolnater-qasource added the QA:Ready For Testing Code is merged and ready for QA to validate label Jan 31, 2025
@amolnater-qasource

Hi Team,
We have revalidated this issue on the latest 7.17.28-SNAPSHOT and found it fixed.

Observations:

  • Sles15 Linux agent remains healthy with Endpoint Security integration.

Screenshots:
Image
Image

Build details:
Artifact:
https://snapshots.elastic.co/7.17.28-e3615826/downloads/beats/elastic-agent/elastic-agent-7.17.28-SNAPSHOT-linux-x86_64.tar.gz

Hence, we are marking this issue as QA:Validated.

Thanks!!

@amolnater-qasource amolnater-qasource added QA:Validated Validated by the QA Team and removed QA:Ready For Testing Code is merged and ready for QA to validate labels Feb 3, 2025