-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Flaky Test]: TestSetLogLevelFleetManaged – agent never communicated agent-specific log level "debug" to Fleet #5193
Comments
Pinging @elastic/elastic-agent (Team:Elastic-Agent) |
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
another occurrence of it here |
another occurrence of it here |
another occurrence here |
Looking at the test failure from Pavel above, we failed with the following (using JSON logs since those have timestamps to correlate with agent logs): {"Time":"2025-02-28T14:22:47.630203766Z","Output":" log_level_test.go:147: Fleet metadata log level for agent \"241e557a-1e55-4ffd-9aa2-e5f27b576b0a\": \"error\" agent log level: \"debug\"\n"}
{"Time":"2025-02-28T14:22:48.555136985Z","Output":" log_level_test.go:141: \n"}
{"Time":"2025-02-28T14:22:48.555151261Z","Output":" \tError Trace:\t/opt/buildkite-agent/builds/bk-agent-prod-aws-1740750815706106367/elastic/elastic-agent/testing/integration/log_level_test.go:141\n"}
{"Time":"2025-02-28T14:22:48.55515913Z","Output":" \t \t\t\t\t/opt/buildkite-agent/builds/bk-agent-prod-aws-1740750815706106367/elastic/elastic-agent/testing/integration/log_level_test.go:77\n"}
{"Time":"2025-02-28T14:22:48.555163257Z","Output":" \tError: \tCondition never satisfied\n"}
{"Time":"2025-02-28T14:22:48.555169238Z","Output":" \tTest: \tTestSetLogLevelFleetManaged\n"}
{"Time":"2025-02-28T14:22:48.555173398Z","Output":" \tMessages: \tagent never communicated agent-specific log level \"debug\" to Fleet\n"} That is this block, where we send a SETTINGS action with log level set to Debug and then poll for the agent policy to reflect it in Fleet. elastic-agent/testing/integration/log_level_test.go Lines 126 to 149 in d668787
The first thing I noticed is that his has a 30s timeout, so if this is dependent on a checkin happening, just waiting 30s doesn't guarantee that. The checkin interval is 5m. The test fails at
{"log.level":"debug","@timestamp":"2025-02-28T14:22:20.045Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/dispatcher.(*ActionDispatcher).Dispatch","file.name":"dispatcher/dispatcher.go","file.line":165},"message":"Successfully dispatched action: 'id: 9857f355-073e-4272-94af-5ec6fcd41e7c, type: SETTINGS, log_level: debug'","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-02-28T14:22:49.626Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/actions/handlers.(*Settings).Handle","file.name":"handlers/handler_action_settings.go","file.line":44},"message":"handlerSettings: action 'id: cecf373b-d729-4381-abab-c3630498a5e0, type: SETTINGS, log_level: ' received","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-02-28T14:22:50.652Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).Run","file.name":"fleet/fleet_gateway.go","file.line":152},"message":"FleetGateway calling Checkin API","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-02-28T14:22:50.652Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).doExecute","file.name":"fleet/fleet_gateway.go","file.line":182},"message":"Checking started","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-02-28T14:22:50.682Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/actions/handlers.(*Settings).handleLogLevel","file.name":"handlers/handler_action_settings.go","file.line":85},"message":"Settings action done, setting agent log level to policy default error","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"} Looking at all of that, the two concrete things to do are:
|
Yesterday I analyzed the diagnostic dump from https://buildkite.com/elastic/elastic-agent/builds/18036#0195619b-5176-4126-bb4e-65061f96216c/375-1104 and discussed it with @pkoutsovasilis . {"log.level":"debug","@timestamp":"2025-03-04T15:10:46.578Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/fleetapi/acker/lazy.(*Acker).Commit","file.name":"lazy/lazy_acker.go","file.line":83},"message":"lazy acker: ack batch: [id: 6fcd825e-ae0f-4d81-840f-872c609a1f16, type: SETTINGS, log_level: ]","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:46.578Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/fleetapi/acker/fleet.(*Acker).AckBatch","file.name":"fleet/fleet_acker.go","file.line":81},"message":"fleet acker: ackbatch, actions: []fleetapi.Action{(*fleetapi.ActionSettings)(0xc0018b3950)}","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:46.579Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/fleetapi/acker/fleet.(*Acker).AckBatch","file.name":"fleet/fleet_acker.go","file.line":99},"message":"fleet acker: ackbatch, events: []fleetapi.AckEvent{fleetapi.AckEvent{EventType:\"ACTION_RESULT\", SubType:\"ACKNOWLEDGED\", Timestamp:\"2025-03-04T15:10:46.579+00:00\", ActionID:\"6fcd825e-ae0f-4d81-840f-872c609a1f16\", AgentID:\"a1caa943-a790-4e85-bddf-55c4e39bab12\", Message:\"Action \\\"6fcd825e-ae0f-4d81-840f-872c609a1f16\\\" of type \\\"SETTINGS\\\" acknowledged.\", Payload:json.RawMessage(nil), Data:json.RawMessage(nil), ActionInputType:\"\", ActionData:json.RawMessage(nil), ActionResponse:map[string]interface {}(nil), StartedAt:\"\", CompletedAt:\"\", Error:\"\"}}","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:46.579Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/fleetapi/acker/fleet.(*Acker).AckBatch","file.name":"fleet/fleet_acker.go","file.line":110},"message":"1 actions with ids '6fcd825e-ae0f-4d81-840f-872c609a1f16' acknowledging","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:46.579Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/remote.(*Client).Send","file.name":"remote/client.go","file.line":172},"message":"Request method: POST, path: /api/fleet/agents/a1caa943-a790-4e85-bddf-55c4e39bab12/acks, reqID: 01JNGVB6SKZA4MRQCJQZ1KBZN7","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:46.579Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/remote.(*Client).Send","file.name":"remote/client.go","file.line":186},"message":"Creating new request to request URL https://8485e14a0ed85dfa290ba7f51561b40d.fleet.us-west2.gcp.elastic-cloud.com:443/api/fleet/agents/a1caa943-a790-4e85-bddf-55c4e39bab12/acks?","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:47.048Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).Run","file.name":"fleet/fleet_gateway.go","file.line":162},"message":"FleetGateway calling Checkin API","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:47.049Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).doExecute","file.name":"fleet/fleet_gateway.go","file.line":192},"message":"Checking started","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:47.267Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).execute","file.name":"fleet/fleet_gateway.go","file.line":345},"message":"using previously saved ack token: cadbefe9-6aed-4241-860f-9a5ef1233647","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:47.267Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).execute","file.name":"fleet/fleet_gateway.go","file.line":354},"message":"correcting agent loglevel from info to debug using coordinator state","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:47.267Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/remote.(*Client).Send","file.name":"remote/client.go","file.line":172},"message":"Request method: POST, path: /api/fleet/agents/a1caa943-a790-4e85-bddf-55c4e39bab12/checkin, reqID: 01JNGVB7F3QHBAEN9X2GH533Z0","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:47.267Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/remote.(*Client).Send","file.name":"remote/client.go","file.line":186},"message":"Creating new request to request URL https://8485e14a0ed85dfa290ba7f51561b40d.fleet.us-west2.gcp.elastic-cloud.com:443/api/fleet/agents/a1caa943-a790-4e85-bddf-55c4e39bab12/checkin?","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-03-04T15:10:47.396Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/actions/handlers.(*Settings).handleLogLevel","file.name":"handlers/handler_action_settings.go","file.line":85},"message":"Settings action done, setting agent log level to policy default error","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-03-04T15:11:18.196Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).checkAndLogUpdate","file.name":"coordinator/coordinator.go","file.line":1744},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{},"outputs":{}},"ecs.version":"1.6.0"} we can see that:
So the test success is definitely depending on the timing of elastic-agent checkins and how those interleave with handling incoming actions, we cannot set a shorter checkin time as #2257 has not been implemented and #2408 has been closed (this is the necessary parameter we need to send to fleet to change the default duration of the fleet long poll). I am going to open a PR to increase the timeouts for the test assertions to 6 minutes (5 minutes + 20% margin) and those should take care of the timing variations we may encounter |
@cmacknz there is no mistery SETTINGS action that I can see here, it comes from
|
Failing test case
TestSetLogLevelFleetManaged
Error message
agent never communicated agent-specific log level "debug" to Fleet
Build
https://buildkite.com/elastic/elastic-agent-extended-testing/builds/1538#0190cc1c-3e41-42d1-82ba-856491de639e
OS
Linux
Stacktrace and notes
The text was updated successfully, but these errors were encountered: