fix(al2): add retries to cluster-active waiter #2096

msarfaty · 2024-12-18T20:35:06Z

Issue #, if available: None Active (closed: #803)

Description of changes:

this checks the return code for aws eks wait cluster-active in the bootstrap script.

we've noticed that if a lot of nodes are booting at the same time, this can get backed off via AWS:

Waiter ClusterActive failed: An error occurred (TooManyRequestsException): Too Many Requests
Exited with error on line 286

since errexit is set, this automatically fails the bootstrap script. With this change, I'm intending for this loop to check both returns from wait and describe cluster to avoid throttling/errors from both.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done

I created an AMI via packer in our account via make k8s=1.29
I used that AMI for a test node and the nodes joined successfully

testing a failure scenario would take a little more time and the code change is minimal, but I can try to replicate it if desired (probably easiest by creating an AMI where I set $rc to 286 manually on the first attempt between the wait / describe calls)

See this guide for recommended testing for PRs. Some tests may not apply. Completing tests and providing additional validation steps are not required, but it is recommended and may reduce review time and time to merge.

cartermckinnon · 2024-12-18T21:22:36Z

Change looks reasonable to me. Worth noting that you can avoid this category of issue if you pass your cluster details to bootstrap.sh, instead of relying on the eks:DescribeCluster fallback. We don't recommend using that in production. 👍

/ci

github-actions · 2024-12-18T21:23:06Z

@cartermckinnon roger that! I've dispatched a workflow. 👍

github-actions · 2024-12-18T22:19:56Z

@cartermckinnon the workflow that you requested has completed. 🎉

AMI variant	Build	Test
1.24 / al2	success ✅	success ✅
1.25 / al2	success ✅	success ✅
1.26 / al2	success ✅	success ✅
1.27 / al2	success ✅	success ✅
1.28 / al2	success ✅	success ✅
1.29 / al2	success ✅	success ✅
1.30 / al2	success ✅	success ✅
1.31 / al2	success ✅	success ✅

fix: check return code of aws eks wait call

9aedb83

cartermckinnon approved these changes Dec 18, 2024

View reviewed changes

cartermckinnon merged commit 3b667df into awslabs:main Dec 18, 2024
11 checks passed

cartermckinnon changed the title ~~fix: check return code of aws eks wait call in bootstrap script~~ fix(al2): add retries to cluster-active waiter Dec 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(al2): add retries to cluster-active waiter #2096

fix(al2): add retries to cluster-active waiter #2096

msarfaty commented Dec 18, 2024 •

edited

Loading

cartermckinnon commented Dec 18, 2024 •

edited

Loading

github-actions bot commented Dec 18, 2024

github-actions bot commented Dec 18, 2024

fix(al2): add retries to cluster-active waiter #2096

fix(al2): add retries to cluster-active waiter #2096

Conversation

msarfaty commented Dec 18, 2024 • edited Loading

cartermckinnon commented Dec 18, 2024 • edited Loading

github-actions bot commented Dec 18, 2024

github-actions bot commented Dec 18, 2024

msarfaty commented Dec 18, 2024 •

edited

Loading

cartermckinnon commented Dec 18, 2024 •

edited

Loading