
[Flaky Test]: Kubernetes platform-related integration test failures #7060

Open
swiatekm opened this issue Feb 27, 2025 · 10 comments
Labels
flaky-test (Unstable or unreliable test cases), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

Comments

@swiatekm (Contributor)

We suspect that various K8s integration test failures that look like problems with Kubernetes itself are in fact caused by resource contention from running too many kind clusters on a single host. This issue exists to group these tests so that we can avoid the overhead of trying to debug them individually.

The short-term solution to this problem is using a larger instance for existing tests. The long-term solution is running these tests in parallel with finer-grained concurrency.
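For illustration, one way the long-term, finer-grained concurrency could look (a rough sketch with hypothetical names, versions, and limits, not the actual test harness): run each Kubernetes version as a parallel subtest with its own kind cluster, while a counting semaphore caps how many clusters exist on the host at once.

```go
package integration

import (
	"os/exec"
	"strings"
	"testing"
)

// clusterSlots caps how many kind clusters may be alive at once; the buffered
// channel acts as a counting semaphore. The limit of 2 is an assumption and
// would need tuning to the instance size.
var clusterSlots = make(chan struct{}, 2)

// TestAgentOnKind is a hypothetical sketch, not the real harness.
func TestAgentOnKind(t *testing.T) {
	versions := []string{"v1.29.0", "v1.30.0", "v1.31.0"} // illustrative versions
	for _, v := range versions {
		v := v
		t.Run(v, func(t *testing.T) {
			t.Parallel()

			clusterSlots <- struct{}{}           // acquire a slot
			t.Cleanup(func() { <-clusterSlots }) // release it at teardown

			name := "it-" + strings.ReplaceAll(v, ".", "-")
			out, err := exec.Command("kind", "create", "cluster",
				"--name", name, "--image", "kindest/node:"+v).CombinedOutput()
			if err != nil {
				t.Fatalf("creating kind cluster: %v\n%s", err, out)
			}
			t.Cleanup(func() {
				_ = exec.Command("kind", "delete", "cluster", "--name", name).Run()
			})

			// ... run the actual integration tests against this cluster ...
		})
	}
}
```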

@swiatekm added the flaky-test and Team:Elastic-Agent-Control-Plane labels on Feb 27, 2025
@elasticmachine (Contributor)

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@swiatekm (Contributor Author)

#7048 probably belongs here. The problem manifests as mysterious 403 responses from the API Server, but is most likely caused by the API Server not having enough resources.

@swiatekm (Contributor Author)

In #7050, we encountered the following:

=== Failed
=== FAIL: testing/integration TestOtelKubeStackHelm/helm_kube-stack_operator_standalone_agent_kubernetes_privileged (334.06s)
    otel_helm_test.go:122:
        	Error Trace:	/opt/buildkite-agent/builds/bk-agent-prod-gcp-1740576462608313681/elastic/elastic-agent-extended-testing/testing/integration/otel_helm_test.go:122
        	            				/opt/buildkite-agent/builds/bk-agent-prod-gcp-1740576462608313681/elastic/elastic-agent-extended-testing/testing/integration/otel_helm_test.go:96
        	Error:      	Condition never satisfied
        	Test:       	TestOtelKubeStackHelm/helm_kube-stack_operator_standalone_agent_kubernetes_privileged
        	Messages:   	at least 4 agent containers should be checked
=== FAIL: testing/integration TestOtelKubeStackHelm (342.70s)

in https://buildkite.com/elastic/elastic-agent-extended-testing/builds/7694#01954270-d4c5-4926-8442-4861f27bedea.
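For context, "Condition never satisfied" is the message testify's Eventually-style assertions produce when the polled condition never returns true before the timeout. A minimal sketch of the pattern (the helper and timings below are hypothetical, not the actual code in otel_helm_test.go):

```go
package integration

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// countCheckedAgentContainers is a hypothetical stand-in for the real check
// against the cluster; it exists only to make the sketch compile.
func countCheckedAgentContainers() int { return 0 }

func TestAgentContainersSketch(t *testing.T) {
	// Fails with "Condition never satisfied" plus the message below if the
	// condition never becomes true within the timeout.
	require.Eventually(t, func() bool {
		return countCheckedAgentContainers() >= 4
	}, 5*time.Minute, 10*time.Second, "at least 4 agent containers should be checked")
}
```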

@swiatekm (Contributor Author) commented Feb 27, 2025

In #7044, just a plain runner failure without any additional information:

>>> (kubernetes-amd64-1289-complete-wolfi-kubernetes) Failed to execute tests on instance: %!s(<nil>)

in https://buildkite.com/elastic/elastic-agent-extended-testing/builds/7684#01954204-430a-4656-864b-2496f3cc9ef8/56.

@pkoutsovasilis (Contributor)

Thank you for capturing all of this info @swiatekm, cc @dliappis @pazone

@dliappis (Contributor) commented Feb 27, 2025

> The short-term solution to this problem is using a larger instance for existing tests. The long-term solution is running these tests in parallel with finer-grained concurrency.

Thanks for bubbling this up @swiatekm! Re: the long-term solution, I guess this fits nicely with the desire/effort to split the k8s tests into separate steps (originally intended to improve speed, but obviously there's a flakiness issue too).

For the short-term solution: we looked at it yesterday with @pkoutsovasilis; the tests currently use a c2-standard-16, which is pretty beefy (16 vCPUs / 64 GiB of RAM). We can certainly go for a larger one, but do you feel we could benefit from more resources, or is the problem more OS-related, e.g. running out of file descriptors? I.e. do you know which resources are missing?
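If the suspicion is OS-level limits rather than CPU or memory, one cheap diagnostic would be to dump the limits that typically bite when several kind clusters share one host (open file descriptors and inotify limits). A rough Go sketch, assuming it runs on the Linux test runner; the CI does not currently do anything like this:

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

// Print the per-process file descriptor limit and a few kernel settings that
// commonly constrain multiple kind clusters on one host. Which limits matter
// here is an assumption, not a confirmed diagnosis.
func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err == nil {
		fmt.Printf("RLIMIT_NOFILE: soft=%d hard=%d\n", rl.Cur, rl.Max)
	}

	for _, path := range []string{
		"/proc/sys/fs/inotify/max_user_watches",
		"/proc/sys/fs/inotify/max_user_instances",
		"/proc/sys/fs/file-max",
		"/proc/sys/kernel/pid_max",
	} {
		if b, err := os.ReadFile(path); err == nil {
			fmt.Printf("%s = %s\n", path, strings.TrimSpace(string(b)))
		}
	}
}
```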

@pazone (Contributor) commented Feb 27, 2025

Agreed. By breaking the k8s tests out onto separate machines, we can adjust the machine size for each k8s version. I'd like to start the implementation.

@swiatekm (Contributor Author)

> For the short-term solution: we looked at it yesterday with @pkoutsovasilis; the tests currently use a c2-standard-16, which is pretty beefy (16 vCPUs / 64 GiB of RAM). We can certainly go for a larger one, but do you feel we could benefit from more resources, or is the problem more OS-related, e.g. running out of file descriptors? I.e. do you know which resources are missing?

I originally thought we were running more kind clusters than we are (5). That machine should definitely be enough for our current tests in terms of CPU and memory. Maybe there's contention for other OS resources, but that might be challenging to debug. @pkoutsovasilis suggested that, since we currently run the tests sequentially anyway, we could destroy each cluster after we're done running tests on it. Right now, they all persist until the cleanup step.

@dliappis (Contributor)

> I originally thought we were running more kind clusters than we are (5). That machine should definitely be enough for our current tests in terms of CPU and memory. Maybe there's contention for other OS resources, but that might be challenging to debug. @pkoutsovasilis suggested that, since we currently run the tests sequentially anyway, we could destroy each cluster after we're done running tests on it. Right now, they all persist until the cleanup step.

I'd favor this approach in the short term: since the tests run sequentially, let's also start the kind clusters sequentially, i.e. launch kind v1 -> run tests -> tear down kind v1, launch kind v2 -> run tests -> tear down kind v2, etc.
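A minimal sketch of that sequential lifecycle, assuming the runner shells out to kind and the test binary once per version (the versions and the go test invocation are illustrative, not the current mage targets):

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"strings"
)

// run executes a command and streams its output to the runner's stdout/stderr.
func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

// Launch kind for one version, run the tests, tear the cluster down, then move
// on to the next version, so only a single cluster exists at any time.
func main() {
	versions := []string{"v1.29.0", "v1.30.0", "v1.31.0"} // illustrative versions
	for _, v := range versions {
		name := "it-" + strings.ReplaceAll(v, ".", "-")

		if err := run("kind", "create", "cluster", "--name", name, "--image", "kindest/node:"+v); err != nil {
			log.Fatalf("creating cluster for %s: %v", v, err)
		}

		testErr := run("go", "test", "-tags", "integration", "./testing/integration/...")

		// Delete the cluster before starting the next one, even if tests failed.
		if err := run("kind", "delete", "cluster", "--name", name); err != nil {
			log.Printf("deleting cluster %s: %v", name, err)
		}
		if testErr != nil {
			log.Fatalf("tests failed on %s: %v", v, testErr)
		}
		log.Printf("tests passed on %s", v)
	}
}
```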

@swiatekm (Contributor Author) commented Mar 3, 2025

> In #7050, we encountered the following:
>
> === Failed
> === FAIL: testing/integration TestOtelKubeStackHelm/helm_kube-stack_operator_standalone_agent_kubernetes_privileged (334.06s)
>     otel_helm_test.go:122:
>         	Error Trace:	/opt/buildkite-agent/builds/bk-agent-prod-gcp-1740576462608313681/elastic/elastic-agent-extended-testing/testing/integration/otel_helm_test.go:122
>         	            				/opt/buildkite-agent/builds/bk-agent-prod-gcp-1740576462608313681/elastic/elastic-agent-extended-testing/testing/integration/otel_helm_test.go:96
>         	Error:      	Condition never satisfied
>         	Test:       	TestOtelKubeStackHelm/helm_kube-stack_operator_standalone_agent_kubernetes_privileged
>         	Messages:   	at least 4 agent containers should be checked
> === FAIL: testing/integration TestOtelKubeStackHelm (342.70s)
>
> in https://buildkite.com/elastic/elastic-agent-extended-testing/builds/7694#01954270-d4c5-4926-8442-4861f27bedea.

This turned out to be an issue with the PR changes, not a flaky test.
