[Flaky Test]: Kubernetes platform-related integration test failures #7060
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
#7048 probably belongs here. The problem manifests as mysterious 403 responses from the API Server, but is most likely caused by the API Server not having enough resources.
In #7050, we encountered the following:
In #7044, just a plain runner failure without any additional information:
Thanks for bubbling this up @swiatekm! Re: the long-term solution, I guess this fits nicely with the desire/effort to split the k8s tests into separate steps (originally intended to improve speed, but obviously there's a flakiness issue too). For the short-term solution, we looked at it yesterday with @pkoutsovasilis; it's currently using c2-standard-16, which is pretty beefy: 16 vCPUs / 64 GiB RAM. We can certainly go for a larger one, but do you feel we could benefit from more resources, or is the problem more OS-related, e.g. running out of file descriptors? In other words, do you know which resources are missing?
Agreed here. By breaking the k8s tests out onto separate machines, we can adjust the machine size for each k8s version. I'd like to start the implementation.
I originally thought we were running more kind clusters than we are (5). That machine should definitely be enough for our current tests in terms of CPU and memory. Maybe there's contention for other OS resources, but that might be challenging to debug. @pkoutsovasilis suggested that, since we currently run the tests sequentially anyway, we could destroy each cluster after we're done running tests on it. Right now, they all persist until the cleanup step.
I'd favor this approach in the short term: since the tests run sequentially, let's also start the kind clusters sequentially, i.e. launch kind v1 -> run tests -> tear down kind v1, launch kind v2 -> run tests -> tear down kind v2, and so on.
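For illustration, here is a minimal sketch of that sequential lifecycle. This is not the project's actual test harness or CI configuration; the Kubernetes version list, cluster names, and `go test` invocation below are placeholders assumed for the example. The point is simply that only one kind cluster exists on the host at any given time.

```go
// Sketch: create one kind cluster per Kubernetes version, run tests against it,
// and delete it before starting the next one (placeholder commands throughout).
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
)

// run executes a command, streaming its output to the console.
func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Hypothetical version matrix; the real one lives in the CI configuration.
	versions := []string{"v1.29.0", "v1.30.0", "v1.31.0"}

	for _, v := range versions {
		cluster := "agent-it-" + v

		// Launch a single kind cluster for this Kubernetes version.
		if err := run("kind", "create", "cluster",
			"--name", cluster,
			"--image", "kindest/node:"+v); err != nil {
			log.Fatalf("creating cluster for %s: %v", v, err)
		}

		// Run the integration tests against it (placeholder invocation).
		testErr := run("go", "test", "-tags", "integration", "./testing/...")

		// Tear the cluster down before moving on, so kind clusters never pile up.
		if err := run("kind", "delete", "cluster", "--name", cluster); err != nil {
			log.Printf("deleting cluster %s: %v", cluster, err)
		}

		if testErr != nil {
			log.Fatalf("tests failed on %s: %v", v, testErr)
		}
		fmt.Println("finished", v)
	}
}
```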
This turned out to be an issue with the PR changes, not a flaky test.
We suspect that various K8s integration test failures that appear to be caused by problems with Kubernetes itself are in fact caused by resource contention from running too many kind clusters on a single host. This issue exists to group these failures, so we can avoid the overhead of trying to debug them individually.
The short-term solution to this problem is using a larger instance for existing tests. The long-term solution is running these tests in parallel with finer-grained concurrency.