
Linux agent goes offline after setting an invalid Remote Elasticsearch output and then updating to a valid output. #6784

Open
amolnater-qasource opened this issue Feb 10, 2025 · 11 comments
Labels
bug · impact:high · Team:Elastic-Agent-Control-Plane

Comments

@amolnater-qasource

Kibana Build details:

VERSION: 9.0.0 beta1 BC2
BUILD: 83474
COMMIT: 88aa3d3604b2c71e998595c3208b3d82cef24d2a

https://staging.elastic.co/9.0.0-beta1-191089b1/summary-9.0.0-beta1.html

Preconditions:

  1. A 9.0.0 beta BC1 Kibana cloud environment should be available.
  2. An agent should be installed.

Steps to reproduce:

  1. Set up a valid remote Elasticsearch output.
  2. Install a Linux agent with the System and Elastic Defend integrations.
  3. Select the Remote Elasticsearch output as the output for the integrations under the agent policy.
  4. Remove part of the output URL, e.g. change the port from 443 to 44, to make it invalid (see the API sketch after this list).
  5. Wait for some time and observe that the Linux agent becomes outdated.
  6. Restore 443 and observe that the agent remains permanently offline.
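For anyone scripting this reproduction, the output flip in steps 4 and 6 can be driven through the Kibana Fleet outputs API (PUT /api/fleet/outputs/{outputId}). A minimal Go sketch; the Kibana URL, output ID, API key, and Elasticsearch hostname below are all placeholders, not values from this issue:

// flipoutput.go — a sketch of steps 4 and 6: updating a remote Elasticsearch
// output's host through the Kibana Fleet API. All identifiers below are
// placeholders; verify the endpoint against your Kibana version's Fleet API docs.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func setOutputHost(kibanaURL, outputID, apiKey, host string) error {
	body := fmt.Sprintf(`{"hosts":["%s"]}`, host)
	req, err := http.NewRequest(http.MethodPut,
		kibanaURL+"/api/fleet/outputs/"+outputID, bytes.NewReader([]byte(body)))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("kbn-xsrf", "true") // required by Kibana HTTP APIs
	req.Header.Set("Authorization", "ApiKey "+apiKey)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("output update failed: %s", resp.Status)
	}
	return nil
}

func main() {
	const kibana = "https://my-kibana.example.com" // placeholder
	const output = "<remote-es-output-id>"         // placeholder
	const key = "<api-key>"                        // placeholder
	// Step 4: break the port (443 -> 44), wait, then step 6: restore it.
	for _, host := range []string{
		"https://remote-es.example.com:44",
		"https://remote-es.example.com:443",
	} {
		if err := setOutputHost(kibana, output, key, host); err != nil {
			fmt.Println(err)
		}
	}
}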

Workaround:

  • We manually reboot the VM to get the agent back to Healthy.

Note:

  • In the current testing, the issue is not reproducible on Windows.
  • The Windows agent becomes unhealthy with an invalid output and returns to healthy once the output is valid again.

Expected Result:
The Linux agent should return to Healthy after an invalid Remote Elasticsearch output is updated back to a valid output.

Logs:

elastic-agent-diagnostics-2025-02-10T12-02-16Z-00.zip

Screenshot:

[screenshot]

@amolnater-qasource added the bug, impact:high, and Team:Elastic-Agent-Control-Plane labels on Feb 10, 2025
@elasticmachine (Contributor)

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@amolnater-qasource (Author)

@muskangulati-qasource Please review.

@muskangulati-qasource

Secondary review is done for this ticket!

@cmacknz (Member) commented Feb 10, 2025

It looks like sub-processes missing check-ins caused this:

{"log.level":"warn","@timestamp":"2025-02-10T10:01:44.150Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":681},"message":"Component state changed filestream-monitoring (HEALTHY->DEGRADED): Degraded: pid '1139' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"filestream-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}

{"log.level":"warn","@timestamp":"2025-02-10T10:01:44.150Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":681},"message":"Component state changed beat/metrics-monitoring (HEALTHY->DEGRADED): Degraded: pid '1148' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}

@cmacknz (Member) commented Feb 10, 2025

Actually, this is about the agent being offline. For that, I see connection refused errors: lookup 2f7e5c358a004b91ae7101156ee82b7a.fleet.us-west2.gcp.elastic-cloud.com on [::1]:53: read udp [::1]:38622->[::1]:53: read: connection refused

{"log.level":"warn","@timestamp":"2025-02-10T09:59:45.671Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).doExecute","file.name":"fleet/fleet_gateway.go","file.line":186},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: requester 0/1 to host https://2f7e5c358a004b91ae7101156ee82b7a.fleet.us-west2.gcp.elastic-cloud.com:443/ errored: Post \"https://2f7e5c358a004b91ae7101156ee82b7a.fleet.us-west2.gcp.elastic-cloud.com:443/api/fleet/agents/d480e246-85d8-4d28-8da7-bfbfb6a5ca7a/checkin?\": lookup 2f7e5c358a004b91ae7101156ee82b7a.fleet.us-west2.gcp.elastic-cloud.com on [::1]:53: read udp [::1]:38622->[::1]:53: read: connection refused"},"request_duration_ns":572632,"failed_checkins":1,"retry_after_ns":67258945810,"ecs.version":"1.6.0"}

It does look like it comes back afterwards:

{"log.level":"warn","@timestamp":"2025-02-10T10:00:54.143Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).doExecute","file.name":"fleet/fleet_gateway.go","file.line":219},"message":"Checkin request to fleet-server succeeded after 1 failures","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

@cmacknz (Member) commented Feb 10, 2025

Is this reproducible?

@amolnater-qasource (Author)

Hi @cmacknz

Thank you for looking into this issue.

We have revalidated this issue with two Linux agents and one Windows agent, and we were able to reproduce the issue on all three agents, including the Windows agent.

Screenshot:

[screenshot]

[screenshot]

Logs:
Windows:

elastic-agent-diagnostics-2025-02-11T05-27-02Z-00.zip

Linux 1:

elastic-agent-diagnostics-2025-02-11T05-37-35Z-00.zip

Linux 2:

elastic-agent-diagnostics-2025-02-11T05-37-50Z-00.zip

Please let us know if anything else is required from our end.
Thanks!!

@cmacknz (Member) commented Feb 11, 2025

The agents all look online from the agent's perspective. Can you get me the agent details documents for the offline agents?

I am beginning to wonder if this is a Fleet bug.
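For anyone gathering those documents, the per-agent details can be fetched with the Fleet API's GET /api/fleet/agents/{agentId}. A minimal sketch, with the Kibana URL, agent ID, and API key as placeholders:

// agentdetails.go — a sketch for fetching an agent details document via the
// Fleet API. All values below are placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	const kibana = "https://my-kibana.example.com" // placeholder
	const agentID = "<agent-id>"                   // placeholder
	req, err := http.NewRequest(http.MethodGet,
		kibana+"/api/fleet/agents/"+agentID, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("kbn-xsrf", "true")
	req.Header.Set("Authorization", "ApiKey <api-key>") // placeholder
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// The response includes last_checkin, last_checkin_status, status,
	// and policy_revision for the agent.
	fmt.Println(string(body))
}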

@amolnater-qasource (Author)

Hi @cmacknz
Please find below the elastic-agent.json documents and fresh logs for the Linux agents.

Agent 1:
ip-172-31-19-114-agent-details.zip
[screenshot]
elastic-agent-diagnostics-2025-02-12T07-47-45Z-00.zip

Agent 2:
ip-172-31-88-136-agent-details.zip
[screenshot]
elastic-agent-diagnostics-2025-02-12T07-48-03Z-00.zip

Screenshot:

[screenshot]

Additionally, we observe inconsistencies during these tests:

  • Most of the time, the reported issue reproduces.
  • Sometimes the agent remains unhealthy after the output transitions from invalid back to valid.
  • Sometimes no issue is observed.

Please let us know if anything else is required from our end.
Thanks!!

@cmacknz (Member) commented Feb 12, 2025

Picking out the second agent arbitrarily. Fleet thinks the last checkin was 2025-02-12T06:58:43Z, that the agent is offline, and that it has an outdated policy revision of 47.

  "id": "de610f57-d1b1-4539-9ebb-d917bf31cfb8",
  "type": "PERMANENT",
  "active": true,
  "enrolled_at": "2025-02-11T05:15:06Z",
  "access_api_key_id": "wc1u85QB30oqMm2ZtjCd",
  "policy_id": "1505ddac-69ef-4b18-ac01-4e20c059d016",
  "last_checkin": "2025-02-12T06:58:43Z",
  "last_checkin_status": "online",
  "last_checkin_message": "Running",
  "policy_revision": 47,
  "packages": [],

  "unhealthy_reason": [
    "other"
  ],
  "status": "offline",
  "metrics": {
    "cpu_avg": 0.00172,
    "memory_size_byte_avg": 223114068
  }

The agent diagnostics show it is on revision 47, that it thinks it's still connected to Fleet, and it is still running with no error logs related to the Fleet connection. So it should be online, but Fleet thinks it isn't.

The agent document in .fleet-agents doesn't tell us when the state was changed to offline, and the agent doesn't log or keep metrics about whether it is successfully checking in with Fleet; it only logs errors.
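Fleet derives the offline status from the age of last_checkin (the exact threshold is Fleet-side configuration and no specific value is assumed here), so a quick way to quantify the staleness in the document above is to compute that age directly. A trivial sketch:

// checkinage.go — computes how stale the agent document's last_checkin is,
// using the timestamp from the .fleet-agents excerpt above.
package main

import (
	"fmt"
	"time"
)

func main() {
	lastCheckin, err := time.Parse(time.RFC3339, "2025-02-12T06:58:43Z")
	if err != nil {
		panic(err)
	}
	fmt.Printf("last_checkin was %s ago\n",
		time.Since(lastCheckin).Round(time.Second))
}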

@cmacknz (Member) commented Feb 12, 2025

The last interaction with Fleet for this agent is:

{
  "log.level": "info",
  "@timestamp": "2025-02-12T06:58:01.642Z",
  "log.origin": {
    "function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/actions/handlers.(*PolicyChangeHandler).applyLoggingConfig",
    "file.name": "handlers/handler_action_policy_change.go",
    "file.line": 405
  },
  "message": "Setting fallback log level <nil> from policy",
  "log": {
    "source": "elastic-agent"
  },
  "ecs.version": "1.6.0"
}
{
  "log.level": "info",
  "@timestamp": "2025-02-12T06:58:01.657Z",
  "log.origin": {
    "function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade.(*Upgrader).Reload",
    "file.name": "upgrade/upgrade.go",
    "file.line": 124
  },
  "message": "Source URI changed from \"https://staging.elastic.co/9.0.0-beta1-191089b1/downloads/\" to \"https://staging.elastic.co/9.0.0-beta1-191089b1/downloads/\"",
  "log": {
    "source": "elastic-agent"
  },
  "ecs.version": "1.6.0"
}
{
  "log.level": "info",
  "@timestamp": "2025-02-12T06:58:01.657Z",
  "log.origin": {
    "function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/monitoring/reload.(*ServerReloader).Stop",
    "file.name": "reload/reload.go",
    "file.line": 74
  },
  "message": "Stopping monitoring server",
  "log": {
    "source": "elastic-agent"
  },
  "ecs.version": "1.6.0"
}
{
  "log.level": "info",
  "@timestamp": "2025-02-12T06:58:01.657Z",
  "log.logger": "api",
  "log.origin": {
    "function": "github.com/elastic/elastic-agent-libs/api.(*Server).Start.func1",
    "file.name": "api/server.go",
    "file.line": 90
  },
  "message": "Stats endpoint (127.0.0.1:6791) finished: accept tcp 127.0.0.1:6791: use of closed network connection",
  "log": {
    "source": "elastic-agent"
  },
  "ecs.version": "1.6.0"
}
{
  "log.level": "info",
  "@timestamp": "2025-02-12T06:58:01.657Z",
  "log.origin": {
    "function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/monitoring/reload.(*ServerReloader).Start",
    "file.name": "reload/reload.go",
    "file.line": 54
  },
  "message": "Starting monitoring server with cfg &config.MonitoringConfig{Enabled:true, MonitorLogs:true, MonitorMetrics:true, MetricsPeriod:\"\", FailureThreshold:(*uint)(nil), LogMetrics:true, HTTP:(*config.MonitoringHTTPConfig)(0xc001d226c0), Namespace:\"sles15\", Pprof:(*config.PprofConfig)(nil), MonitorTraces:true, APM:config.APMConfig{Environment:\"\", APIKey:\"\", SecretToken:\"\", Hosts:[]string(nil), GlobalLabels:map[string]string(nil), TLS:config.APMTLS{SkipVerify:false, ServerCertificate:\"\", ServerCA:\"\"}, SamplingRate:(*float32)(nil)}, Diagnostics:config.Diagnostics{Uploader:config.Uploader{MaxRetries:10, InitDur:1000000000, MaxDur:600000000000}, Limit:config.Limit{Interval:60000000000, Burst:1}}}",
  "log": {
    "source": "elastic-agent"
  },
  "ecs.version": "1.6.0"
}
{
  "log.level": "info",
  "@timestamp": "2025-02-12T06:58:01.657Z",
  "log.origin": {
    "function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/monitoring.NewServer.exposeMetricsEndpoint.func1",
    "file.name": "monitoring/server.go",
    "file.line": 96
  },
  "message": "creating monitoring API with cfg api.Config{Enabled:true, Host:\"http://localhost:6791\", Port:6791, User:\"\", SecurityDescriptor:\"\", Timeout:5000000000}",
  "log": {
    "source": "elastic-agent"
  },
  "ecs.version": "1.6.0"
}
{
  "log.level": "info",
  "@timestamp": "2025-02-12T06:58:01.657Z",
  "log.logger": "api",
  "log.origin": {
    "function": "github.com/elastic/elastic-agent-libs/api.(*Server).Start",
    "file.name": "api/server.go",
    "file.line": 85
  },
  "message": "Starting stats endpoint",
  "log": {
    "source": "elastic-agent"
  },
  "ecs.version": "1.6.0"
}
{
  "log.level": "info",
  "@timestamp": "2025-02-12T06:58:01.657Z",
  "log.logger": "api",
  "log.origin": {
    "function": "github.com/elastic/elastic-agent-libs/api.(*Server).Start.func1",
    "file.name": "api/server.go",
    "file.line": 87
  },
  "message": "Metrics endpoint listening on: 127.0.0.1:6791 (configured: http://localhost:6791)",
  "log": {
    "source": "elastic-agent"
  },
  "ecs.version": "1.6.0"
}
{
  "log.level": "info",
  "@timestamp": "2025-02-12T06:58:01.677Z",
  "log.origin": {
    "function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).checkAndLogUpdate",
    "file.name": "coordinator/coordinator.go",
    "file.line": 1614
  },
  "message": "component model updated",
  "log": {
    "source": "elastic-agent"
  },
  "changes": {
    "components": {
      "updated": [
        "endpoint-02d7481a-1e96-4c82-9a2d-83602acbecf2: [(endpoint-02d7481a-1e96-4c82-9a2d-83602acbecf2-17d173ab-7ad1-480b-aac2-3cea51066f58: updated) (endpoint-02d7481a-1e96-4c82-9a2d-83602acbecf2: updated)]",
        "log-02d7481a-1e96-4c82-9a2d-83602acbecf2: [(log-02d7481a-1e96-4c82-9a2d-83602acbecf2: updated) (log-02d7481a-1e96-4c82-9a2d-83602acbecf2-logfile-system-0cc91647-bfef-46ea-b8b9-e4357b847562: updated)]",
        "system/metrics-02d7481a-1e96-4c82-9a2d-83602acbecf2: [(system/metrics-02d7481a-1e96-4c82-9a2d-83602acbecf2-system/metrics-system-0cc91647-bfef-46ea-b8b9-e4357b847562: updated) (system/metrics-02d7481a-1e96-4c82-9a2d-83602acbecf2: updated)]"
      ],
      "count": 7
    },
    "outputs": {}
  },
  "ecs.version": "1.6.0"
}
{
  "log.level": "info",
  "@timestamp": "2025-02-12T06:58:01.677Z",
  "log.origin": {
    "function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).refreshComponentModel",
    "file.name": "coordinator/coordinator.go",
    "file.line": 1407
  },
  "message": "Updating running component model",
  "log": {
    "source": "elastic-agent"
  },
  "ecs.version": "1.6.0"
}
