Windows agent temporarily becomes unhealthy when the output is updated to Kafka, then returns to Healthy. #6800

Open
amolnater-qasource opened this issue Feb 11, 2025 · 4 comments
Labels: bug, impact:medium, Team:Elastic-Agent-Control-Plane

Comments

@amolnater-qasource

Kibana Build details:

VERSION: 9.0.0 beta1 BC2
BUILD: 83474
COMMIT: 88aa3d3604b2c71e998595c3208b3d82cef24d2a

https://staging.elastic.co/9.0.0-beta1-191089b1/summary-9.0.0-beta1.html

Preconditions:

  1. A 9.0.0 beta BC1 Kibana cloud environment should be available.
  2. An agent should be installed.

Steps to reproduce:

  1. Set up a valid Kafka output.
  2. Install a Windows agent with the System and Elastic Defend integrations.
  3. In the agent policy, select the Kafka output as the output for integrations.
  4. Observe that the Windows agent becomes unhealthy.
  5. After some time, the Windows agent returns to Healthy and data is generated under the Kafka output.

Note:

  • The issue is consistently reproducible on the Windows agent.

Expected Result:
The Windows agent should remain healthy when the output is updated to Kafka.

Logs:

elastic-agent-diagnostics-2025-02-11T06-40-01Z-00.zip

Agent json:

EC2AMAZ-DUAF4CI-agent-details.zip

Screenshot:

Image

@amolnater-qasource added the bug, impact:medium, and Team:Elastic-Agent-Control-Plane labels on Feb 11, 2025
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@amolnater-qasource
Author

@muskangulati-qasource Please review.

@jlind23
Contributor

jlind23 commented Feb 11, 2025

This looks like it comes from the metrics monitoring agent unit going unhealthy (HEALTHY->DEGRADED) because it fails to fetch data while endpoint is being stopped:

{"log.level":"warn","@timestamp":"2025-02-11T06:31:18.433Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":699},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (HEALTHY->DEGRADED): Error fetching data for metricset http.json: error making http request: Get \"http://npipe/stats\": open \\\\.\\pipe\\Qz3GgkK41DY59a6_d0X9G3K62V5LvsiW.sock: The system cannot find the file specified.","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2025-02-11T06:31:19.313Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.newConnInfoServer.func1","file.name":"runtime/conn_info_server.go","file.line":56},"message":"failed accept conn info connection: use of closed network connection","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-02-11T06:31:19.313Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.(*serviceRuntime).stop","file.name":"runtime/service.go","file.line":373},"message":"stopping endpoint service runtime","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-02-11T06:31:19.313Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.(*serviceRuntime).stop","file.name":"runtime/service.go","file.line":389},"message":"endpoint service has checked in, send stopping state to service","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-02-11T06:31:19.313Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.(*serviceRuntime).stop","file.name":"runtime/service.go","file.line":397},"message":"uninstall endpoint service","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2025-02-11T06:31:20.420Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.executeCommand.func2","file.name":"runtime/service_command.go","file.line":68},"message":"2025-02-11 06:31:20: info: Main.cpp:273 Process machine 0x8664. Native machine 0x8664.","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2025-02-11T06:31:20.420Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.executeCommand.func2","file.name":"runtime/service_command.go","file.line":68},"message":"2025-02-11 06:31:20: info: Main.cpp:467 Executing uninstall","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2025-02-11T06:31:20.426Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.executeCommand.func2","file.name":"runtime/service_command.go","file.line":68},"message":"2025-02-11 06:31:20: debug: VaultLib.cpp:207 Vault initialized with existing seed file","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2025-02-11T06:31:20.446Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.executeCommand.func2","file.name":"runtime/service_command.go","file.line":68},"message":"2025-02-11 06:31:20: debug: VaultLib.cpp:614 Successfully read vault key: config","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2025-02-11T06:31:20.451Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.executeCommand.func2","file.name":"runtime/service_command.go","file.line":68},"message":"2025-02-11 06:31:20: debug: ECSUtilities.cpp:497 Tamper protection disabled","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2025-02-11T06:31:20.451Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.executeCommand.func2","file.name":"runtime/service_command.go","file.line":68},"message":"2025-02-11 06:31:20: info: InstallLib.cpp:1173 Skipping uninstall token validation as tamper protection is not enabled.","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2025-02-11T06:31:20.451Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.executeCommand.func2","file.name":"runtime/service_command.go","file.line":68},"message":"2025-02-11 06:31:20: debug: Service.cpp:814 PPL is supported. This process is unprotected. (TrustLevelSid: absent)","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2025-02-11T06:31:20.452Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.executeCommand.func2","file.name":"runtime/service_command.go","file.line":68},"message":"2025-02-11 06:31:20: info: Util.cpp:787 Sending service command to facilitate uninstall","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2025-02-11T06:31:20.534Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.executeCommand.func2","file.name":"runtime/service_command.go","file.line":68},"message":"2025-02-11 06:31:20: info: Util.cpp:814 Service command to facilitate uninstall succeeded","context":"command output","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-02-11T06:31:41.155Z","message":"Non-zero metrics in the last 30s","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.origin":{"file.line":192,"file.name":"log/log.go","function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot"},"service.name":"filebeat","monitoring":{"ecs.version":"1.6.0","metrics":{"beat":{"cpu":{"system":{"ticks":562,"time":{"ms":109}},"total":{"ticks":1327,"time":{"ms":203},"value":1327},"user":{"ticks":765,"time":{"ms":94}}},"info":{"ephemeral_id":"30a52344-9f72-4a98-99c2-98825e31b212","uptime":{"ms":362000},"version":"9.0.0-beta1"},"memstats":{"gc_next":75016032,"memory_alloc":67938896,"memory_sys":4456448,"memory_total":128954184,"rss":152424448},"runtime":{"goroutines":60}},"filebeat":{"harvester":{"closed":1,"open_files":2,"running":2,"started":1}},"libbeat":{"config":{"module":{"running":2,"starts":1,"stops":1}},"output":{"events":{"acked":1204,"active":0,"batches":2,"total":1204},"read":{"bytes":997,"errors":2},"write":{"bytes":162075,"latency":{"histogram":{"count":9,"max":752,"mean":316.44444444444446,"median":212,"min":201,"p75":419.5,"p95":752,"p99":752,"p999":752,"stddev":177.6789308531988}}}},"pipeline":{"clients":2,"events":{"active":0,"filtered":34,"published":1203,"total":1237},"queue":{"acked":1204,"added":{"bytes":2208792,"events":1203},"consumed":{"bytes":2210597,"events":1204},"filled":{"bytes":0,"events":0,"pct":0},"max_bytes":0,"max_events":3200,"removed":{"bytes":2210597,"events":1204}}}},"registrar":{"states":{"current":0}},"system":{"handles":{"open":2}}}},"log.logger":"monitoring","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-02-11T06:31:41.363Z","message":"Non-zero metrics in the last 30s","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"log.logger":"monitoring","log.origin":{"file.line":192,"file.name":"log/log.go","function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot"},"service.name":"metricbeat","monitoring":{"ecs.version":"1.6.0","metrics":{"beat":{"cpu":{"system":{"ticks":312},"total":{"ticks":765,"value":765},"user":{"ticks":453}},"info":{"ephemeral_id":"f3c21b9a-9015-4122-b9ec-da68751ea098","uptime":{"ms":361421},"version":"9.0.0-beta1"},"memstats":{"gc_next":75214048,"memory_alloc":36488056,"memory_total":69299080,"rss":123437056},"runtime":{"goroutines":66}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":6,"starts":3,"stops":3}},"output":{"events":{"acked":3,"active":0,"batches":1,"total":3},"read":{"bytes":397,"errors":1},"write":{"bytes":1810,"latency":{"histogram":{"count":7,"max":301,"mean":200.71428571428572,"median":206,"min":73,"p75":213,"p95":301,"p99":301,"p999":301,"stddev":61.64811267117889}}}},"pipeline":{"clients":6,"events":{"active":0,"published":3,"total":3},"queue":{"acked":3,"added":{"bytes":4609,"events":3},"consumed":{"bytes":4609,"events":3},"filled":{"bytes":0,"events":0,"pct":0},"max_bytes":0,"max_events":3200,"removed":{"bytes":4609,"events":3}}}},"metricbeat":{"beat":{"stats":{"consecutive_failures":3,"events":3,"failures":3}}},"registrar":{"states":{"current":0}}}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-02-11T06:31:41.523Z","message":"Non-zero metrics in the last 30s","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log":{"source":"http/metrics-monitoring"},"log.logger":"monitoring","log.origin":{"file.line":192,"file.name":"log/log.go","function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot"},"service.name":"metricbeat","monitoring":{"ecs.version":"1.6.0","metrics":{"beat":{"cpu":{"system":{"ticks":375,"time":{"ms":16}},"total":{"ticks":953,"time":{"ms":32},"value":953},"user":{"ticks":578,"time":{"ms":16}}},"info":{"ephemeral_id":"12ee1d09-a525-462b-a209-80cc51d2d542","uptime":{"ms":360871},"version":"9.0.0-beta1"},"memstats":{"gc_next":76344296,"memory_alloc":39419888,"memory_total":77635856,"rss":126078976},"runtime":{"goroutines":86}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":10,"starts":5,"stops":5}},"output":{"events":{"acked":5,"active":0,"batches":1,"total":5},"read":{"bytes":397,"errors":1},"write":{"bytes":2102,"latency":{"histogram":{"count":7,"max":222,"mean":188.28571428571428,"median":203,"min":71,"p75":212,"p95":222,"p99":222,"p999":222,"stddev":48.331339391195634}}}},"pipeline":{"clients":10,"events":{"active":0,"published":5,"total":5},"queue":{"acked":5,"added":{"bytes":7738,"events":5},"consumed":{"bytes":7738,"events":5},"filled":{"bytes":0,"events":0,"pct":0},"max_bytes":0,"max_events":3200,"removed":{"bytes":7738,"events":5}}}},"metricbeat":{"http":{"json":{"consecutive_failures":5,"events":5,"failures":5}}},"registrar":{"states":{"current":0}},"system":{"handles":{"open":-2}}}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-02-11T06:31:42.597Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":699},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (DEGRADED->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"HEALTHY","old_state":"DEGRADED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-02-11T06:32:11.155Z","message":"Non-zero metrics in the last 30s","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"monitoring":{"ecs.version":"1.6.0","metrics":{"beat":{"cpu":{"system":{"ticks":593,"time":{"ms":31}},"total":{"ticks":1421,"time":{"ms":94},"value":1421},"user":{"ticks":828,"time":{"ms":63}}},"info":{"ephemeral_id":"30a52344-9f72-4a98-99c2-98825e31b212","uptime":{"ms":391999},"version":"9.0.0-beta1"},"memstats":{"gc_next":75255016,"memory_alloc":34824344,"memory_total":131475384,"rss":153874432},"runtime":{"goroutines":62}},"filebeat":{"harvester":{"open_files":2,"running":2}},"libbeat":{"config":{"module":{"running":2}},"output":{"events":{"acked":23,"active":0,"batches":2,"total":23},"read":{"bytes":797,"errors":1},"write":{"bytes":6177,"latency":{"histogram":{"count":11,"max":752,"mean":298.54545454545456,"median":212,"min":201,"p75":373,"p95":752,"p99":752,"p999":752,"stddev":165.1959517643784}}}},"pipeline":{"clients":2,"events":{"active":6,"filtered":4,"published":29,"total":33},"queue":{"acked":23,"added":{"bytes":53070,"events":29},"consumed":{"bytes":42117,"events":23},"filled":{"bytes":10953,"events":6,"pct":0.001875},"max_bytes":0,"max_events":3200,"removed":{"bytes":42117,"events":23}}}},"registrar":{"states":{"current":0}},"system":{"handles":{"open":1}}}},"log.logger":"monitoring","log.origin":{"file.line":192,"file.name":"log/log.go","function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot"},"service.name":"filebeat","ecs.version":"1.6.0"}
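
To check how long a unit stays degraded, the "Unit state changed" lines in the excerpt above can be paired up programmatically. The following is a minimal sketch, not part of the agent or its tooling: it assumes the log is newline-delimited JSON with the `@timestamp` and `message` fields shown above, and the helper names `parse_transitions` and `degraded_windows` are hypothetical.

```python
import json
import re
from datetime import datetime

# Matches the coordinator message format seen in the excerpt above, e.g.
# "Unit state changed <unit-id> (HEALTHY->DEGRADED): <reason>"
TRANSITION = re.compile(r"^Unit state changed (\S+) \((\w+)->(\w+)\): (.*)$", re.S)


def parse_transitions(lines):
    """Yield (timestamp, unit_id, old_state, new_state, reason) per state change."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(record, dict):
            continue
        match = TRANSITION.match(str(record.get("message", "")))
        if match is None or "@timestamp" not in record:
            continue
        # Timestamps in the excerpt look like 2025-02-11T06:31:18.433Z
        ts = datetime.strptime(record["@timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
        yield (ts, *match.groups())


def degraded_windows(transitions):
    """Return (unit_id, start, end, seconds) for each DEGRADED->HEALTHY recovery."""
    started = {}
    windows = []
    for ts, unit, old, new, _reason in transitions:
        if new == "DEGRADED":
            started[unit] = ts
        elif old == "DEGRADED" and new == "HEALTHY" and unit in started:
            start = started.pop(unit)
            windows.append((unit, start, ts, (ts - start).total_seconds()))
    return windows
```

Passing the lines of an agent log file to `degraded_windows(parse_transitions(f))` lists each recovery window. For the two transitions quoted above (06:31:18.433Z and 06:31:42.597Z), the unit `http/metrics-monitoring-metrics-monitoring-agent` was degraded for roughly 24 seconds.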

This looks like some sort of race condition. Is it specific to 9.0.0, or is it also reproducible on other releases?
@leehinman @pkoutsovasilis does this error ring a bell on your end?

@muskangulati-qasource

Secondary review is done for this ticket!
