Linux agent gets offline on setting invalid Remote Elasticsearch output and then updating to valid output. #6784
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
@muskangulati-qasource Please review. |
Secondary review is Done for this ticket! |
Looks like sub-processes missing checkins caused this: {"log.level":"warn","@timestamp":"2025-02-10T10:01:44.150Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":681},"message":"Component state changed filestream-monitoring (HEALTHY->DEGRADED): Degraded: pid '1139' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"filestream-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2025-02-10T10:01:44.150Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":681},"message":"Component state changed beat/metrics-monitoring (HEALTHY->DEGRADED): Degraded: pid '1148' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"} |
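For anyone triaging similar diagnostics, here is a minimal sketch for pulling these state transitions out of an agent log. It assumes the log is newline-delimited JSON with the `component.id`, `component.old_state`, and `component.state` fields seen in the entries above. The helper name `component_transitions` is hypothetical and not part of elastic-agent.

```python
import json


def component_transitions(lines):
    """Yield (timestamp, component id, old state, new state) per state change.

    Assumes one JSON object per line, shaped like the log entries quoted above.
    Lines that are not valid JSON or carry no "component" object are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        component = entry.get("component")
        if not isinstance(component, dict):
            continue
        if "old_state" not in component or "state" not in component:
            continue
        yield (
            entry.get("@timestamp"),
            component.get("id"),
            component["old_state"],
            component["state"],
        )


# Shortened sample entry modelled on the log lines above, plus one junk line.
sample = [
    '{"@timestamp":"2025-02-10T10:01:44.150Z","message":"Component state changed",'
    '"component":{"id":"filestream-monitoring","state":"DEGRADED","old_state":"HEALTHY"}}',
    "not json",
]
transitions = list(component_transitions(sample))
```

In practice the `lines` argument could be an open file handle over one of the log files in the diagnostics archive.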
Actually this is about agent being offline, for that I see
It does look like it comes back afterwards: {"log.level":"warn","@timestamp":"2025-02-10T10:00:54.143Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).doExecute","file.name":"fleet/fleet_gateway.go","file.line":219},"message":"Checkin request to fleet-server succeeded after 1 failures","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"} |
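A rough sketch for finding these recovery messages across a whole log, assuming the message wording matches the single entry quoted above (other agent versions may phrase it differently). The helper name `checkin_recoveries` is hypothetical.

```python
import json
import re

# Pattern taken from the one log message quoted above; treat it as an assumption.
RECOVERY = re.compile(r"Checkin request to fleet-server succeeded after (\d+) failures")


def checkin_recoveries(lines):
    """Yield (timestamp, failure count) for each Fleet check-in recovery entry."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        match = RECOVERY.search(str(entry.get("message", "")))
        if match:
            yield entry.get("@timestamp"), int(match.group(1))


# Shortened sample entry modelled on the log line above.
sample = [
    '{"@timestamp":"2025-02-10T10:00:54.143Z",'
    '"message":"Checkin request to fleet-server succeeded after 1 failures"}',
    '{"@timestamp":"2025-02-10T10:01:00.000Z","message":"unrelated"}',
]
recoveries = list(checkin_recoveries(sample))
```

Note that this only surfaces recoveries after failures. As discussed further down the thread, the agent does not log successful check-ins that had no preceding failure.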
Is this reproducible? |
Hi @cmacknz Thank you for looking into this issue. We have revalidated this issue with 2 Linux agents and 1 Windows agent, and we are able to reproduce it on all 3 agents, including the Windows agent. Screenshot: Logs: elastic-agent-diagnostics-2025-02-11T05-27-02Z-00.zip Linux 1: elastic-agent-diagnostics-2025-02-11T05-37-35Z-00.zip Linux 2: elastic-agent-diagnostics-2025-02-11T05-37-50Z-00.zip Please let us know if anything else is required from our end. |
The agents all look online from the agent's perspective. Can you get me the agent details documents for the offline agents? I am beginning to wonder if this is a Fleet bug. |
Hi @cmacknz Agent 1: Agent 2: Screenshot: Additionally, we do observe inconsistencies during these tests:
Please let us know if anything else is required from our end. |
Picking out the second agent arbitrarily. Fleet thinks the last checkin was "id": "de610f57-d1b1-4539-9ebb-d917bf31cfb8",
"type": "PERMANENT",
"active": true,
"enrolled_at": "2025-02-11T05:15:06Z",
"access_api_key_id": "wc1u85QB30oqMm2ZtjCd",
"policy_id": "1505ddac-69ef-4b18-ac01-4e20c059d016",
"last_checkin": "2025-02-12T06:58:43Z",
"last_checkin_status": "online",
"last_checkin_message": "Running",
"policy_revision": 47,
"packages": [],
"unhealthy_reason": [
"other"
],
"status": "offline",
"metrics": {
"cpu_avg": 0.00172,
"memory_size_byte_avg": 223114068
}
The agent diagnostics show that it is on revision 47, that it thinks it is still connected to Fleet, and that it is still running with no error logs related to the Fleet connection. So it should be online, but Fleet thinks it isn't. The agent document in .fleet-agents doesn't tell us when the state was changed to offline, and the agent doesn't log or expose metrics about whether it is successfully checking in with Fleet. It only logs errors. |
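The mismatch described above can be checked mechanically. Below is a minimal sketch that compares `status` with `last_checkin_status` and reports how long ago `last_checkin` was, using only the fields visible in the document fragment above. The helper names and the reference time passed as `now` are illustrative assumptions, and the sketch does not model how Fleet actually derives `status`.

```python
from datetime import datetime, timezone


def parse_utc(ts):
    """Parse a timestamp such as 2025-02-12T06:58:43Z into an aware datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def checkin_summary(doc, now):
    """Summarise the check-in fields of an agent document (hypothetical helper)."""
    last = parse_utc(doc["last_checkin"])
    return {
        "status": doc.get("status"),
        "last_checkin_status": doc.get("last_checkin_status"),
        "seconds_since_checkin": (now - last).total_seconds(),
        # True when the reported status disagrees with the last check-in status.
        "mismatch": doc.get("status") != doc.get("last_checkin_status"),
    }


# Field values copied from the document fragment above.
doc = {
    "last_checkin": "2025-02-12T06:58:43Z",
    "last_checkin_status": "online",
    "status": "offline",
}
# The reference time is arbitrary and chosen only for illustration.
summary = checkin_summary(doc, now=parse_utc("2025-02-12T07:00:00Z"))
```

Running this against the agent details documents at the moment the agent shows as offline would at least record how stale `last_checkin` was when the mismatch was observed.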
The last interaction with Fleet for this agent is: {
"log.level": "info",
"@timestamp": "2025-02-12T06:58:01.642Z",
"log.origin": {
"function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/actions/handlers.(*PolicyChangeHandler).applyLoggingConfig",
"file.name": "handlers/handler_action_policy_change.go",
"file.line": 405
},
"message": "Setting fallback log level <nil> from policy",
"log": {
"source": "elastic-agent"
},
"ecs.version": "1.6.0"
}
{
"log.level": "info",
"@timestamp": "2025-02-12T06:58:01.657Z",
"log.origin": {
"function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade.(*Upgrader).Reload",
"file.name": "upgrade/upgrade.go",
"file.line": 124
},
"message": "Source URI changed from \"https://staging.elastic.co/9.0.0-beta1-191089b1/downloads/\" to \"https://staging.elastic.co/9.0.0-beta1-191089b1/downloads/\"",
"log": {
"source": "elastic-agent"
},
"ecs.version": "1.6.0"
}
{
"log.level": "info",
"@timestamp": "2025-02-12T06:58:01.657Z",
"log.origin": {
"function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/monitoring/reload.(*ServerReloader).Stop",
"file.name": "reload/reload.go",
"file.line": 74
},
"message": "Stopping monitoring server",
"log": {
"source": "elastic-agent"
},
"ecs.version": "1.6.0"
}
{
"log.level": "info",
"@timestamp": "2025-02-12T06:58:01.657Z",
"log.logger": "api",
"log.origin": {
"function": "github.com/elastic/elastic-agent-libs/api.(*Server).Start.func1",
"file.name": "api/server.go",
"file.line": 90
},
"message": "Stats endpoint (127.0.0.1:6791) finished: accept tcp 127.0.0.1:6791: use of closed network connection",
"log": {
"source": "elastic-agent"
},
"ecs.version": "1.6.0"
}
{
"log.level": "info",
"@timestamp": "2025-02-12T06:58:01.657Z",
"log.origin": {
"function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/monitoring/reload.(*ServerReloader).Start",
"file.name": "reload/reload.go",
"file.line": 54
},
"message": "Starting monitoring server with cfg &config.MonitoringConfig{Enabled:true, MonitorLogs:true, MonitorMetrics:true, MetricsPeriod:\"\", FailureThreshold:(*uint)(nil), LogMetrics:true, HTTP:(*config.MonitoringHTTPConfig)(0xc001d226c0), Namespace:\"sles15\", Pprof:(*config.PprofConfig)(nil), MonitorTraces:true, APM:config.APMConfig{Environment:\"\", APIKey:\"\", SecretToken:\"\", Hosts:[]string(nil), GlobalLabels:map[string]string(nil), TLS:config.APMTLS{SkipVerify:false, ServerCertificate:\"\", ServerCA:\"\"}, SamplingRate:(*float32)(nil)}, Diagnostics:config.Diagnostics{Uploader:config.Uploader{MaxRetries:10, InitDur:1000000000, MaxDur:600000000000}, Limit:config.Limit{Interval:60000000000, Burst:1}}}",
"log": {
"source": "elastic-agent"
},
"ecs.version": "1.6.0"
}
{
"log.level": "info",
"@timestamp": "2025-02-12T06:58:01.657Z",
"log.origin": {
"function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/monitoring.NewServer.exposeMetricsEndpoint.func1",
"file.name": "monitoring/server.go",
"file.line": 96
},
"message": "creating monitoring API with cfg api.Config{Enabled:true, Host:\"http://localhost:6791\", Port:6791, User:\"\", SecurityDescriptor:\"\", Timeout:5000000000}",
"log": {
"source": "elastic-agent"
},
"ecs.version": "1.6.0"
}
{
"log.level": "info",
"@timestamp": "2025-02-12T06:58:01.657Z",
"log.logger": "api",
"log.origin": {
"function": "github.com/elastic/elastic-agent-libs/api.(*Server).Start",
"file.name": "api/server.go",
"file.line": 85
},
"message": "Starting stats endpoint",
"log": {
"source": "elastic-agent"
},
"ecs.version": "1.6.0"
}
{
"log.level": "info",
"@timestamp": "2025-02-12T06:58:01.657Z",
"log.logger": "api",
"log.origin": {
"function": "github.com/elastic/elastic-agent-libs/api.(*Server).Start.func1",
"file.name": "api/server.go",
"file.line": 87
},
"message": "Metrics endpoint listening on: 127.0.0.1:6791 (configured: http://localhost:6791)",
"log": {
"source": "elastic-agent"
},
"ecs.version": "1.6.0"
}
{
"log.level": "info",
"@timestamp": "2025-02-12T06:58:01.677Z",
"log.origin": {
"function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).checkAndLogUpdate",
"file.name": "coordinator/coordinator.go",
"file.line": 1614
},
"message": "component model updated",
"log": {
"source": "elastic-agent"
},
"changes": {
"components": {
"updated": [
"endpoint-02d7481a-1e96-4c82-9a2d-83602acbecf2: [(endpoint-02d7481a-1e96-4c82-9a2d-83602acbecf2-17d173ab-7ad1-480b-aac2-3cea51066f58: updated) (endpoint-02d7481a-1e96-4c82-9a2d-83602acbecf2: updated)]",
"log-02d7481a-1e96-4c82-9a2d-83602acbecf2: [(log-02d7481a-1e96-4c82-9a2d-83602acbecf2: updated) (log-02d7481a-1e96-4c82-9a2d-83602acbecf2-logfile-system-0cc91647-bfef-46ea-b8b9-e4357b847562: updated)]",
"system/metrics-02d7481a-1e96-4c82-9a2d-83602acbecf2: [(system/metrics-02d7481a-1e96-4c82-9a2d-83602acbecf2-system/metrics-system-0cc91647-bfef-46ea-b8b9-e4357b847562: updated) (system/metrics-02d7481a-1e96-4c82-9a2d-83602acbecf2: updated)]"
],
"count": 7
},
"outputs": {}
},
"ecs.version": "1.6.0"
}
{
"log.level": "info",
"@timestamp": "2025-02-12T06:58:01.677Z",
"log.origin": {
"function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).refreshComponentModel",
"file.name": "coordinator/coordinator.go",
"file.line": 1407
},
"message": "Updating running component model",
"log": {
"source": "elastic-agent"
},
"ecs.version": "1.6.0"
} |
Kibana Build details:
https://staging.elastic.co/9.0.0-beta1-191089b1/summary-9.0.0-beta1.html
Preconditions:
Steps to reproduce:
Workaround:
Note:
Expected Result:
Linux agent should become Healthy after setting an invalid Remote Elasticsearch output and then updating it to a valid output.
Logs:
elastic-agent-diagnostics-2025-02-10T12-02-16Z-00.zip
Screenshot: