Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(al2): add retries to cluster-active waiter #2096

Merged
merged 1 commit into from
Dec 18, 2024

Conversation

msarfaty
Copy link
Contributor

@msarfaty msarfaty commented Dec 18, 2024

Issue #, if available: None Active (closed: #803)

Description of changes:

this checks the return code for aws eks wait cluster-active in the bootstrap script.

we've noticed that if a lot of nodes are booting at the same time, this can get backed off via AWS:

Waiter ClusterActive failed: An error occurred (TooManyRequestsException): Too Many Requests
Exited with error on line 286

since errexit is set, this automatically fails the bootstrap script. With this change, I'm intending for this loop to check both returns from wait and describe cluster to avoid throttling/errors from both.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done

  • I created an AMI via packer in our account via make k8s=1.29
  • I used that AMI for a test node and the nodes joined successfully

testing a failure scenario would take a little more time and the code change is minimal, but I can try to replicate it if desired (probably easiest by creating an AMI where I set $rc to 286 manually on the first attempt between the wait / describe calls)

See this guide for recommended testing for PRs. Some tests may not apply. Completing tests and providing additional validation steps are not required, but it is recommended and may reduce review time and time to merge.

@cartermckinnon
Copy link
Member

cartermckinnon commented Dec 18, 2024

Change looks reasonable to me. Worth noting that you can avoid this category of issue if you pass your cluster details to bootstrap.sh, instead of relying on the eks:DescribeCluster fallback. We don't recommend using that in production. 👍

/ci

Copy link
Contributor

@cartermckinnon roger that! I've dispatched a workflow. 👍

Copy link
Contributor

@cartermckinnon the workflow that you requested has completed. 🎉

AMI variantBuildTest
1.24 / al2success ✅success ✅
1.25 / al2success ✅success ✅
1.26 / al2success ✅success ✅
1.27 / al2success ✅success ✅
1.28 / al2success ✅success ✅
1.29 / al2success ✅success ✅
1.30 / al2success ✅success ✅
1.31 / al2success ✅success ✅

@cartermckinnon cartermckinnon merged commit 3b667df into awslabs:main Dec 18, 2024
11 checks passed
@cartermckinnon cartermckinnon changed the title fix: check return code of aws eks wait call in bootstrap script fix(al2): add retries to cluster-active waiter Dec 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add retry logic on aws eks wait cluster-active to /etc/eks/bootstrap.sh
2 participants