
[Flaky Test]: Kubernetes platform-related integration test failures #7060

Open
swiatekm opened this issue Feb 27, 2025 · 10 comments
Labels
flaky-test (Unstable or unreliable test cases), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

Comments

@swiatekm (Contributor)

We suspect that various K8s integration test failures that look like problems with Kubernetes itself are in fact caused by resource contention from running too many kind clusters on a single host. This issue exists to group these tests so that we can avoid the overhead of trying to debug them individually.

The short-term solution to this problem is using a larger instance for existing tests. The long-term solution is running these tests in parallel with finer-grained concurrency.
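For illustration, one way the long-term, finer-grained concurrency could look (a rough sketch with hypothetical names, versions, and limits, not the actual test harness): run each Kubernetes version as a parallel subtest with its own kind cluster, while a counting semaphore caps how many clusters exist on the host at once.

```go
package integration

import (
	"os/exec"
	"strings"
	"testing"
)

// clusterSlots caps how many kind clusters may be alive at once; the buffered
// channel acts as a counting semaphore. The limit of 2 is an assumption and
// would need tuning to the instance size.
var clusterSlots = make(chan struct{}, 2)

// TestAgentOnKind is a hypothetical sketch, not the real harness.
func TestAgentOnKind(t *testing.T) {
	versions := []string{"v1.29.0", "v1.30.0", "v1.31.0"} // illustrative versions
	for _, v := range versions {
		v := v
		t.Run(v, func(t *testing.T) {
			t.Parallel()

			clusterSlots <- struct{}{}           // acquire a slot
			t.Cleanup(func() { <-clusterSlots }) // release it at teardown

			name := "it-" + strings.ReplaceAll(v, ".", "-")
			out, err := exec.Command("kind", "create", "cluster",
				"--name", name, "--image", "kindest/node:"+v).CombinedOutput()
			if err != nil {
				t.Fatalf("creating kind cluster: %v\n%s", err, out)
			}
			t.Cleanup(func() {
				_ = exec.Command("kind", "delete", "cluster", "--name", name).Run()
			})

			// ... run the actual integration tests against this cluster ...
		})
	}
}
```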

@swiatekm added the flaky-test and Team:Elastic-Agent-Control-Plane labels on Feb 27, 2025
@elasticmachine (Contributor)

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@swiatekm (Contributor Author)

#7048 probably belongs here. The problem manifests as mysterious 403 responses from the API Server, but is most likely caused by the API Server not having enough resources.

@swiatekm (Contributor Author)

In #7050, we encountered the following:

=== Failed
=== FAIL: testing/integration TestOtelKubeStackHelm/helm_kube-stack_operator_standalone_agent_kubernetes_privileged (334.06s)
    otel_helm_test.go:122:
        	Error Trace:	/opt/buildkite-agent/builds/bk-agent-prod-gcp-1740576462608313681/elastic/elastic-agent-extended-testing/testing/integration/otel_helm_test.go:122
        	            				/opt/buildkite-agent/builds/bk-agent-prod-gcp-1740576462608313681/elastic/elastic-agent-extended-testing/testing/integration/otel_helm_test.go:96
        	Error:      	Condition never satisfied
        	Test:       	TestOtelKubeStackHelm/helm_kube-stack_operator_standalone_agent_kubernetes_privileged
        	Messages:   	at least 4 agent containers should be checked
=== FAIL: testing/integration TestOtelKubeStackHelm (342.70s)

in https://buildkite.com/elastic/elastic-agent-extended-testing/builds/7694#01954270-d4c5-4926-8442-4861f27bedea.
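For context, "Condition never satisfied" is the message testify's Eventually-style assertions produce when the polled condition never returns true before the timeout. A minimal sketch of the pattern (the helper and timings below are hypothetical, not the actual code in otel_helm_test.go):

```go
package integration

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// countCheckedAgentContainers is a hypothetical stand-in for the real check
// against the cluster; it exists only to make the sketch compile.
func countCheckedAgentContainers() int { return 0 }

func TestAgentContainersSketch(t *testing.T) {
	// Fails with "Condition never satisfied" plus the message below if the
	// condition never becomes true within the timeout.
	require.Eventually(t, func() bool {
		return countCheckedAgentContainers() >= 4
	}, 5*time.Minute, 10*time.Second, "at least 4 agent containers should be checked")
}
```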

@swiatekm (Contributor Author) commented Feb 27, 2025

In #7044, just a plain runner failure without any additional information:

>>> (kubernetes-amd64-1289-complete-wolfi-kubernetes) Failed to execute tests on instance: %!s(<nil>)

in https://buildkite.com/elastic/elastic-agent-extended-testing/builds/7684#01954204-430a-4656-864b-2496f3cc9ef8/56.

@pkoutsovasilis (Contributor)

Thank you for capturing all of this info @swiatekm, cc @dliappis @pazone

@dliappis (Contributor) commented Feb 27, 2025

> The short-term solution to this problem is using a larger instance for existing tests. The long-term solution is running these tests in parallel with finer-grained concurrency.

Thanks for bubbling this up @swiatekm! Re: the long-term solution, I guess this fits nicely with the desire/effort to split the k8s tests into separate steps (originally intended to improve speed, but obviously there's a flakiness issue too).

For the short-term solution: we looked at it yesterday with @pkoutsovasilis; the tests currently use a c2-standard-16, which is pretty beefy (16 vCPUs / 64 GiB of RAM). We can certainly go for a larger one, but do you feel we could benefit from more resources, or is the problem more OS-related, e.g. running out of file descriptors? I.e. do you know which resources are missing?
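If the suspicion is OS-level limits rather than CPU or memory, one cheap diagnostic would be to dump the limits that typically bite when several kind clusters share one host (open file descriptors and inotify limits). A rough Go sketch, assuming it runs on the Linux test runner; the CI does not currently do anything like this:

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

// Print the per-process file descriptor limit and a few kernel settings that
// commonly constrain multiple kind clusters on one host. Which limits matter
// here is an assumption, not a confirmed diagnosis.
func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err == nil {
		fmt.Printf("RLIMIT_NOFILE: soft=%d hard=%d\n", rl.Cur, rl.Max)
	}

	for _, path := range []string{
		"/proc/sys/fs/inotify/max_user_watches",
		"/proc/sys/fs/inotify/max_user_instances",
		"/proc/sys/fs/file-max",
		"/proc/sys/kernel/pid_max",
	} {
		if b, err := os.ReadFile(path); err == nil {
			fmt.Printf("%s = %s\n", path, strings.TrimSpace(string(b)))
		}
	}
}
```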

@pazone (Contributor) commented Feb 27, 2025

Agreed. By breaking the k8s tests out onto separate machines, we can adjust the machine size for each k8s version. I'd like to start the implementation.

@swiatekm (Contributor Author)

> For the short-term solution: we looked at it yesterday with @pkoutsovasilis; the tests currently use a c2-standard-16, which is pretty beefy (16 vCPUs / 64 GiB of RAM). We can certainly go for a larger one, but do you feel we could benefit from more resources, or is the problem more OS-related, e.g. running out of file descriptors? I.e. do you know which resources are missing?

I originally thought we were running more kind clusters than we are (5). That machine should definitely be enough for our current tests in terms of CPU and memory. Maybe there's contention for other OS resources, but that might be challenging to debug. @pkoutsovasilis suggested that, since we currently run the tests sequentially anyway, we could destroy each cluster after we're done running tests on it. Right now, they all persist until the cleanup step.

@dliappis (Contributor)

> I originally thought we were running more kind clusters than we are (5). That machine should definitely be enough for our current tests in terms of CPU and memory. Maybe there's contention for other OS resources, but that might be challenging to debug. @pkoutsovasilis suggested that, since we currently run the tests sequentially anyway, we could destroy each cluster after we're done running tests on it. Right now, they all persist until the cleanup step.

I'd favor this approach in the short term: since the tests run sequentially, let's also start the kind clusters sequentially, i.e. launch kind v1 -> run tests -> tear down kind v1, launch kind v2 -> run tests -> tear down kind v2, etc.
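A minimal sketch of that sequential lifecycle, assuming the runner shells out to kind and the test binary once per version (the versions and the go test invocation are illustrative, not the current mage targets):

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"strings"
)

// run executes a command and streams its output to the runner's stdout/stderr.
func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

// Launch kind for one version, run the tests, tear the cluster down, then move
// on to the next version, so only a single cluster exists at any time.
func main() {
	versions := []string{"v1.29.0", "v1.30.0", "v1.31.0"} // illustrative versions
	for _, v := range versions {
		name := "it-" + strings.ReplaceAll(v, ".", "-")

		if err := run("kind", "create", "cluster", "--name", name, "--image", "kindest/node:"+v); err != nil {
			log.Fatalf("creating cluster for %s: %v", v, err)
		}

		testErr := run("go", "test", "-tags", "integration", "./testing/integration/...")

		// Delete the cluster before starting the next one, even if tests failed.
		if err := run("kind", "delete", "cluster", "--name", name); err != nil {
			log.Printf("deleting cluster %s: %v", name, err)
		}
		if testErr != nil {
			log.Fatalf("tests failed on %s: %v", v, testErr)
		}
		log.Printf("tests passed on %s", v)
	}
}
```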

@swiatekm (Contributor Author) commented Mar 3, 2025

> In #7050, we encountered the following:
>
> === Failed
> === FAIL: testing/integration TestOtelKubeStackHelm/helm_kube-stack_operator_standalone_agent_kubernetes_privileged (334.06s)
>     otel_helm_test.go:122:
>         	Error Trace:	/opt/buildkite-agent/builds/bk-agent-prod-gcp-1740576462608313681/elastic/elastic-agent-extended-testing/testing/integration/otel_helm_test.go:122
>         	            				/opt/buildkite-agent/builds/bk-agent-prod-gcp-1740576462608313681/elastic/elastic-agent-extended-testing/testing/integration/otel_helm_test.go:96
>         	Error:      	Condition never satisfied
>         	Test:       	TestOtelKubeStackHelm/helm_kube-stack_operator_standalone_agent_kubernetes_privileged
>         	Messages:   	at least 4 agent containers should be checked
> === FAIL: testing/integration TestOtelKubeStackHelm (342.70s)
>
> in https://buildkite.com/elastic/elastic-agent-extended-testing/builds/7694#01954270-d4c5-4926-8442-4861f27bedea.

This turned out to be an issue with the PR changes, not a flaky test.
