[Flaky Test]: TestSetLogLevelFleetManaged – agent never communicated agent-specific log level "debug" to Fleet #5193

Closed
rdner opened this issue Jul 24, 2024 · 8 comments · Fixed by #7190
Labels
flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@rdner
Member

rdner commented Jul 24, 2024

Failing test case

TestSetLogLevelFleetManaged

Error message

agent never communicated agent-specific log level "debug" to Fleet

Build

https://buildkite.com/elastic/elastic-agent-extended-testing/builds/1538#0190cc1c-3e41-42d1-82ba-856491de639e

OS

Linux

Stacktrace and notes

=== RUN   TestSetLogLevelFleetManaged
    log_level_test.go:372: Creating Agent policy...
    log_level_test.go:376: Creating Agent enrollment API key...
    log_level_test.go:51: Created policy {ID:367ab2ce-b79d-425f-9a7c-38aaad26ded0 Name:testloglevel-policy-d6a4c63a-6a14-4b29-9414-92359c132692 Namespace:default Description:Test Log Level Policy d6a4c63a-6a14-4b29-9414-92359c132692 MonitoringEnabled:[] DataOutputID: MonitoringOutputID: FleetServerHostID: DownloadSourceID: UnenrollTimeout:0 InactivityTImeout:1209600 AgentFeatures:[] IsProtected:false}
    log_level_test.go:53: Getting default Fleet Server URL...
    fixture_install.go:164: [test TestSetLogLevelFleetManaged] Inside fixture install function
    fixture_install.go:192: [test TestSetLogLevelFleetManaged] Inside fixture installNoPkgManager function
    fetcher.go:95: Using existing artifact elastic-agent-8.16.0-SNAPSHOT-linux-x86_64.tar.gz
    fixture.go:282: Extracting artifact elastic-agent-8.16.0-SNAPSHOT-linux-x86_64.tar.gz to /tmp/TestSetLogLevelFleetManaged1237235037/001
    fixture.go:300: Completed extraction of artifact elastic-agent-8.16.0-SNAPSHOT-linux-x86_64.tar.gz to /tmp/TestSetLogLevelFleetManaged1237235037/001
    fixture.go:906: Components were not modified from the fetched artifact
    fixture.go:657: >> running binary with: [/tmp/TestSetLogLevelFleetManaged1237235037/001/elastic-agent-8.16.0-SNAPSHOT-linux-x86_64/elastic-agent install --force --non-interactive --unprivileged --url https://5eac59fc5a6a4375a5a027ffaaab6551.fleet.us-west2.gcp.elastic-cloud.com:443 --enrollment-token aWQ1UXpKQUI0dlg5RXViVjlyelY6eHIwQTd1WUZTRy1lYUhFUUxNMnU4Zw==]
    fixture.go:657: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    log_level_test.go:210: error fetching agent status: error running command (output: {
            "info": {
                "id": "7db96287-d7d0-4643-81bc-ab6c5fe3c48c",
                "version": "8.16.0",
                "commit": "acfed1974a4dcb265783d5a177b94e97cd3a5720",
                "build_time": "2024-07-19 17:54:23 +0000 UTC",
                "snapshot": true,
                "pid": 30524,
                "unprivileged": true
            },
            "state": 5,
            "message": "Re-executing",
            "components": [],
            "FleetState": 6,
            "FleetMessage": "Not enrolled into Fleet"
        }
        ): exit status 1
    fixture.go:657: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    log_level_test.go:210: error fetching agent status: could not unmarshal agent status output: error: exit status 1, output: Error: failed to communicate with Elastic Agent daemon: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /opt/Elastic/Agent/elastic-agent.sock: connect: connection refused"
        For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.16/fleet-troubleshooting.html
        
        invalid character 'E' looking for beginning of value
    fixture.go:657: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    log_level_test.go:210: error fetching agent status: error running command (output: {
            "info": {
                "id": "4b52c472-655d-4f2f-b2dd-585b66e076e2",
                "version": "8.16.0",
                "commit": "acfed1974a4dcb265783d5a177b94e97cd3a5720",
                "build_time": "2024-07-19 17:54:23 +0000 UTC",
                "snapshot": true,
                "pid": 30524,
                "unprivileged": true
            },
            "state": 0,
            "message": "Waiting for initial configuration and composable variables",
            "components": [],
            "FleetState": 0,
            "FleetMessage": ""
        }
        ): exit status 1
    fixture.go:657: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    log_level_test.go:210: error fetching agent status: error running command (output: {
            "info": {
                "id": "4b52c472-655d-4f2f-b2dd-585b66e076e2",
                "version": "8.16.0",
                "commit": "acfed1974a4dcb265783d5a177b94e97cd3a5720",
                "build_time": "2024-07-19 17:54:23 +0000 UTC",
                "snapshot": true,
                "pid": 30524,
                "unprivileged": true
            },
            "state": 0,
            "message": "Waiting for initial configuration and composable variables",
            "components": [],
            "FleetState": 0,
            "FleetMessage": ""
        }
        ): exit status 1
    fixture.go:657: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    log_level_test.go:210: error fetching agent status: error running command (output: {
            "info": {
                "id": "4b52c472-655d-4f2f-b2dd-585b66e076e2",
                "version": "8.16.0",
                "commit": "acfed1974a4dcb265783d5a177b94e97cd3a5720",
                "build_time": "2024-07-19 17:54:23 +0000 UTC",
                "snapshot": true,
                "pid": 30524,
                "unprivileged": true
            },
            "state": 0,
            "message": "Waiting for initial configuration and composable variables",
            "components": [],
            "FleetState": 2,
            "FleetMessage": "Connected"
        }
        ): exit status 1
    fixture.go:657: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    log_level_test.go:219: agent status: {"info":{"id":"4b52c472-655d-4f2f-b2dd-585b66e076e2","version":"8.16.0","commit":"acfed1974a4dcb265783d5a177b94e97cd3a5720","build_time":"2024-07-19 17:54:23 +0000 UTC","snapshot":true,"pid":30524,"unprivileged":true},"state":2,"message":"Running","components":[],"FleetState":2,"FleetMessage":"Connected","upgrade_details":null}
    fixture.go:657: >> running binary with: [/opt/Elastic/Agent/elastic-agent inspect]
    fixture.go:657: >> running binary with: [/opt/Elastic/Agent/elastic-agent inspect]
    log_level_test.go:89: Setting policy log level to "error"
    log_level_test.go:115: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "info" policy log level: "error"
    log_level_test.go:115: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "info" policy log level: "error"
    log_level_test.go:115: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "info" policy log level: "error"
    log_level_test.go:115: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "info" policy log level: "error"
    log_level_test.go:115: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "info" policy log level: "error"
    log_level_test.go:115: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "info" policy log level: "error"
    log_level_test.go:115: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "info" policy log level: "error"
    log_level_test.go:115: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" policy log level: "error"
    log_level_test.go:123: Setting agent log level to "debug"
    log_level_test.go:248: Updating agent-specific log level to "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:146: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" agent log level: "debug"
    log_level_test.go:140: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/log_level_test.go:140
        	            				/home/ubuntu/agent/testing/integration/log_level_test.go:76
        	Error:      	Condition never satisfied
        	Test:       	TestSetLogLevelFleetManaged
        	Messages:   	agent never communicated agent-specific log level "debug" to Fleet
    log_level_test.go:151: Clearing agent log level, expecting log level to revert back to "error"
    log_level_test.go:248: Updating agent-specific log level to ""
    log_level_test.go:174: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" policy log level: "error"
    log_level_test.go:179: Clearing policy log level, expecting log level to revert back to "info"
    log_level_test.go:202: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" initial log level: "info"
    log_level_test.go:202: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" initial log level: "info"
    log_level_test.go:202: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" initial log level: "info"
    log_level_test.go:202: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "error" initial log level: "info"
    log_level_test.go:202: Fleet metadata log level for agent "4b52c472-655d-4f2f-b2dd-585b66e076e2": "info" initial log level: "info"
    fixture_install.go:265: [test TestSetLogLevelFleetManaged] Inside fixture cleanup function
    fixture_install.go:281: collecting diagnostics; test failed
    fixture.go:657: >> running binary with: [/opt/Elastic/Agent/elastic-agent diagnostics -f /home/ubuntu/agent/build/diagnostics/TestSetLogLevelFleetManaged-diagnostics-2024-07-19T18:47:37Z.zip]
    fixture.go:657: >> running binary with: [/opt/Elastic/Agent/elastic-agent uninstall --force]
    fixture.go:1034: Dumping running processes in /home/ubuntu/agent/build/diagnostics/TEST-TestSetLogLevelFleetManaged-linux-amd64-ProcessDump-cleanup.json
--- FAIL: TestSetLogLevelFleetManaged (87.81s)
@rdner rdner added Team:Elastic-Agent Label for the Agent team flaky-test Unstable or unreliable test cases. labels Jul 24, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Jul 26, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis
Contributor

another occurrence of it here

@pkoutsovasilis
Contributor

another occurrence of it here

@pazone
Contributor

pazone commented Feb 28, 2025

another occurrence here

@cmacknz
Member

cmacknz commented Mar 4, 2025

Looking at the test failure from Pavel above, the test failed with the following (using the JSON test logs, since those have timestamps we can correlate with the agent logs):

{"Time":"2025-02-28T14:22:47.630203766Z","Output":"    log_level_test.go:147: Fleet metadata log level for agent \"241e557a-1e55-4ffd-9aa2-e5f27b576b0a\": \"error\" agent log level: \"debug\"\n"}
{"Time":"2025-02-28T14:22:48.555136985Z","Output":"    log_level_test.go:141: \n"}
{"Time":"2025-02-28T14:22:48.555151261Z","Output":"        \tError Trace:\t/opt/buildkite-agent/builds/bk-agent-prod-aws-1740750815706106367/elastic/elastic-agent/testing/integration/log_level_test.go:141\n"}
{"Time":"2025-02-28T14:22:48.55515913Z","Output":"        \t            \t\t\t\t/opt/buildkite-agent/builds/bk-agent-prod-aws-1740750815706106367/elastic/elastic-agent/testing/integration/log_level_test.go:77\n"}
{"Time":"2025-02-28T14:22:48.555163257Z","Output":"        \tError:      \tCondition never satisfied\n"}
{"Time":"2025-02-28T14:22:48.555169238Z","Output":"        \tTest:       \tTestSetLogLevelFleetManaged\n"}
{"Time":"2025-02-28T14:22:48.555173398Z","Output":"        \tMessages:   \tagent never communicated agent-specific log level \"debug\" to Fleet\n"}

That is this block, where we send a SETTINGS action with log level set to Debug and then poll for the agent policy to reflect it in Fleet.

err = updateAgentLogLevel(ctx, t, info.KibanaClient, agentID, agentLogLevel)
require.NoError(t, err, "error updating agent log level")

// TODO re-enable inspect assertion after https://github.com/elastic/elastic-agent/issues/4870 is solved
// assert.Eventuallyf(t, func() bool {
// 	actualAgentLogLevel, err := getLogLevelFromInspectOutput(ctx, f)
// 	if err != nil {
// 		t.Logf("error getting log level from agent: %v", err)
// 		return false
// 	}
// 	t.Logf("Agent log level: %q, expected level: %q", actualAgentLogLevel, agentLogLevel)
// 	return actualAgentLogLevel == agentLogLevel
// }, 2*time.Minute, time.Second, "agent never received agent-specific log level %q", agentLogLevel)

// assert Fleet eventually receives the new log level from agent through checkin
assert.Eventuallyf(t, func() bool {
	fleetMetadataLogLevel, err := getLogLevelFromFleetMetadata(ctx, t, info.KibanaClient, agentID)
	if err != nil {
		t.Logf("error getting log level for agent %q from Fleet metadata: %v", agentID, err)
		return false
	}
	t.Logf("Fleet metadata log level for agent %q: %q agent log level: %q", agentID, fleetMetadataLogLevel, agentLogLevel)
	return fleetMetadataLogLevel == agentLogLevel
}, 30*time.Second, time.Second, "agent never communicated agent-specific log level %q to Fleet", agentLogLevel)

The first thing I noticed is that this has a 30s timeout, so if success depends on a checkin happening, just waiting 30s doesn't guarantee that. The checkin interval is 5m.

The test fails at 2025-02-28T14:22:48.55. Looking at the agent logs around this time, I see the agent does update to the debug log level as expected. I also have the following observations:

  1. The agent gets a SETTINGS action to change to the debug level at time 2025-02-28T14:22:20.045Z:
{"log.level":"debug","@timestamp":"2025-02-28T14:22:20.045Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/dispatcher.(*ActionDispatcher).Dispatch","file.name":"dispatcher/dispatcher.go","file.line":165},"message":"Successfully dispatched action: 'id: 9857f355-073e-4272-94af-5ec6fcd41e7c, type: SETTINGS, log_level: debug'","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
  2. The agent then gets another SETTINGS action to clear the log level at 2025-02-28T14:22:49.626Z. Where did this come from? The test doesn't send it, from what I can see.
{"log.level":"debug","@timestamp":"2025-02-28T14:22:49.626Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/actions/handlers.(*Settings).Handle","file.name":"handlers/handler_action_settings.go","file.line":44},"message":"handlerSettings: action 'id: cecf373b-d729-4381-abab-c3630498a5e0, type: SETTINGS, log_level: ' received","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
  3. The agent then actually checks in again at 2025-02-28T14:22:50, after the test failed at 2025-02-28T14:22:48.55, and shows the log level has been cleared by some mystery SETTINGS action.
{"log.level":"debug","@timestamp":"2025-02-28T14:22:50.652Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).Run","file.name":"fleet/fleet_gateway.go","file.line":152},"message":"FleetGateway calling Checkin API","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-02-28T14:22:50.652Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).doExecute","file.name":"fleet/fleet_gateway.go","file.line":182},"message":"Checking started","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-02-28T14:22:50.682Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/actions/handlers.(*Settings).handleLogLevel","file.name":"handlers/handler_action_settings.go","file.line":85},"message":"Settings action done, setting agent log level to policy default error","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

Looking at all of that, the two concrete things to do are:

  1. We either need to speed up the checkin interval in tests via configuration, or add an RPC to force a checkin (a purely hypothetical sketch of that idea follows this list). We should also start logging when checkins happen, even when not at the debug level.
  2. We need to understand where this second SETTINGS action, which the test didn't send, came from, as that is what clears the log level. @pchila any ideas, since you wrote most of the code here originally?
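
Purely for illustration, a hypothetical sketch of the force-a-checkin idea from item 1. No such RPC or helper exists in elastic-agent today; the name below is invented only to show how the test could stop racing the 5m checkin cycle.

// Hypothetical: forceCheckin would use a new (not yet existing) control
// protocol RPC to make the running agent check in with Fleet immediately,
// so the Fleet-metadata assertion would no longer depend on when the next
// 5m long-poll checkin happens to land.
err = forceCheckin(ctx, f) // invented helper, shown only to illustrate the proposal
require.NoError(t, err, "error forcing an immediate Fleet checkin")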

@pchila
Member

pchila commented Mar 5, 2025

Yesterday I analyzed the diagnostic dump from https://buildkite.com/elastic/elastic-agent/builds/18036#0195619b-5176-4126-bb4e-65061f96216c/375-1104 and discussed it with @pkoutsovasilis.
We can see that the failure is indeed caused by the timing of the agent's checkins to Fleet, as those can last up to 5 minutes. Looking at this log fragment

{"log.level":"debug","@timestamp":"2025-03-04T15:10:46.578Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/fleetapi/acker/lazy.(*Acker).Commit","file.name":"lazy/lazy_acker.go","file.line":83},"message":"lazy acker: ack batch: [id: 6fcd825e-ae0f-4d81-840f-872c609a1f16, type: SETTINGS, log_level: ]","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:46.578Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/fleetapi/acker/fleet.(*Acker).AckBatch","file.name":"fleet/fleet_acker.go","file.line":81},"message":"fleet acker: ackbatch, actions: []fleetapi.Action{(*fleetapi.ActionSettings)(0xc0018b3950)}","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:46.579Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/fleetapi/acker/fleet.(*Acker).AckBatch","file.name":"fleet/fleet_acker.go","file.line":99},"message":"fleet acker: ackbatch, events: []fleetapi.AckEvent{fleetapi.AckEvent{EventType:\"ACTION_RESULT\", SubType:\"ACKNOWLEDGED\", Timestamp:\"2025-03-04T15:10:46.579+00:00\", ActionID:\"6fcd825e-ae0f-4d81-840f-872c609a1f16\", AgentID:\"a1caa943-a790-4e85-bddf-55c4e39bab12\", Message:\"Action \\\"6fcd825e-ae0f-4d81-840f-872c609a1f16\\\" of type \\\"SETTINGS\\\" acknowledged.\", Payload:json.RawMessage(nil), Data:json.RawMessage(nil), ActionInputType:\"\", ActionData:json.RawMessage(nil), ActionResponse:map[string]interface {}(nil), StartedAt:\"\", CompletedAt:\"\", Error:\"\"}}","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:46.579Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/fleetapi/acker/fleet.(*Acker).AckBatch","file.name":"fleet/fleet_acker.go","file.line":110},"message":"1 actions with ids '6fcd825e-ae0f-4d81-840f-872c609a1f16' acknowledging","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:46.579Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/remote.(*Client).Send","file.name":"remote/client.go","file.line":172},"message":"Request method: POST, path: /api/fleet/agents/a1caa943-a790-4e85-bddf-55c4e39bab12/acks, reqID: 01JNGVB6SKZA4MRQCJQZ1KBZN7","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:46.579Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/remote.(*Client).Send","file.name":"remote/client.go","file.line":186},"message":"Creating new request to request URL https://8485e14a0ed85dfa290ba7f51561b40d.fleet.us-west2.gcp.elastic-cloud.com:443/api/fleet/agents/a1caa943-a790-4e85-bddf-55c4e39bab12/acks?","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:47.048Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).Run","file.name":"fleet/fleet_gateway.go","file.line":162},"message":"FleetGateway calling Checkin API","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:47.049Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).doExecute","file.name":"fleet/fleet_gateway.go","file.line":192},"message":"Checking started","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:47.267Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).execute","file.name":"fleet/fleet_gateway.go","file.line":345},"message":"using previously saved ack token: cadbefe9-6aed-4241-860f-9a5ef1233647","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:47.267Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).execute","file.name":"fleet/fleet_gateway.go","file.line":354},"message":"correcting agent loglevel from info to debug using coordinator state","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:47.267Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/remote.(*Client).Send","file.name":"remote/client.go","file.line":172},"message":"Request method: POST, path: /api/fleet/agents/a1caa943-a790-4e85-bddf-55c4e39bab12/checkin, reqID: 01JNGVB7F3QHBAEN9X2GH533Z0","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2025-03-04T15:10:47.267Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/remote.(*Client).Send","file.name":"remote/client.go","file.line":186},"message":"Creating new request to request URL https://8485e14a0ed85dfa290ba7f51561b40d.fleet.us-west2.gcp.elastic-cloud.com:443/api/fleet/agents/a1caa943-a790-4e85-bddf-55c4e39bab12/checkin?","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-03-04T15:10:47.396Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/actions/handlers.(*Settings).handleLogLevel","file.name":"handlers/handler_action_settings.go","file.line":85},"message":"Settings action done, setting agent log level to policy default error","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-03-04T15:11:18.196Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).checkAndLogUpdate","file.name":"coordinator/coordinator.go","file.line":1744},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{},"outputs":{}},"ecs.version":"1.6.0"}

we can see that:

  • the SETTINGS action with an empty log level is coming in at 2025-03-04T15:10:46.172Z ready to be dispatched
  • at 2025-03-04T15:10:47.049Z a new Fleet check-in is initiated (at this point the log level communicated by agent is debug since the SETTINGS action has not been handled yet as confirmed by the log {"log.level":"debug","@timestamp":"2025-03-04T15:10:47.267Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).execute","file.name":"fleet/fleet_gateway.go","file.line":354},"message":"correcting agent loglevel from info to debug using coordinator state","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"})
  • While the check-in is in progress, at 2025-03-04T15:10:47.396Z the agent handles the SETTINGS action, setting the log level back to the policy setting, as we can see in {"log.level":"info","@timestamp":"2025-03-04T15:10:47.396Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/actions/handlers.(*Settings).handleLogLevel","file.name":"handlers/handler_action_settings.go","file.line":85},"message":"Settings action done, setting agent log level to policy default error","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}. The Fleet long poll, however, can last up to 5 minutes, since we are not pushing any actions or policy updates for the agent.

So the test's success definitely depends on the timing of elastic-agent checkins and how those interleave with handling incoming actions. We cannot set a shorter checkin time because #2257 has not been implemented and #2408 has been closed (that is the parameter we would need to send to Fleet to change the default duration of the Fleet long poll).

I am going to open a PR to increase the timeouts for the test assertions to 6 minutes (5 minutes + a 20% margin); that should take care of the timing variations we may encounter.
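
For reference, a minimal sketch of what that change could look like, based on the assertion block quoted in the earlier comment; the exact shape and constants of the actual PR may differ.

// Sketch only: the same Fleet-metadata assertion as quoted above, but with the
// wait window widened to cover a full 5-minute Fleet long poll plus a 20%
// margin, so a checkin that lands late in the cycle still satisfies it.
const fleetReportingTimeout = 6 * time.Minute // 5m long poll + 20% margin

assert.Eventuallyf(t, func() bool {
	fleetMetadataLogLevel, err := getLogLevelFromFleetMetadata(ctx, t, info.KibanaClient, agentID)
	if err != nil {
		t.Logf("error getting log level for agent %q from Fleet metadata: %v", agentID, err)
		return false
	}
	t.Logf("Fleet metadata log level for agent %q: %q agent log level: %q", agentID, fleetMetadataLogLevel, agentLogLevel)
	return fleetMetadataLogLevel == agentLogLevel
}, fleetReportingTimeout, time.Second, "agent never communicated agent-specific log level %q to Fleet", agentLogLevel)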

@pchila
Member

pchila commented Mar 5, 2025

3. The agent then actually checks in again at 2025-02-28T14:22:50 after the test failed at 2025-02-28T14:22:48.55 and shows the log level has been cleared by some mystery SETTINGS action.

@cmacknz there is no mystery SETTINGS action that I can see here; it comes from

err = updateAgentLogLevel(ctx, t, info.KibanaClient, agentID, "")
which the test calls to clear the agent-specific setting and verify that the agent falls back to the policy log level.
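
For context, a rough, standalone sketch of the kind of request such a helper presumably makes. This is an assumption about the helper, not its actual implementation: it targets Fleet's agent-actions endpoint with a SETTINGS payload, and details such as the exact payload for clearing the level (empty string vs null) may differ in the real test code.

package fleetactions

import (
	"bytes"
	"fmt"
	"net/http"
)

// sendSettingsAction posts a SETTINGS action for a single agent to Kibana's
// Fleet API. Passing an empty level is how the test clears the agent-specific
// override so the policy log level applies again (the real API may expect
// null instead of "").
func sendSettingsAction(kibanaURL, apiKey, agentID, level string) error {
	body := fmt.Sprintf(`{"action":{"type":"SETTINGS","data":{"log_level":%q}}}`, level)
	url := fmt.Sprintf("%s/api/fleet/agents/%s/actions", kibanaURL, agentID)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("kbn-xsrf", "true") // required by Kibana for non-GET requests
	req.Header.Set("Authorization", "ApiKey "+apiKey)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("fleet returned status %s", resp.Status)
	}
	return nil
}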
