Revert Fix #2 for Minimal All Reduce PR #18757
Merged
Conversation
jvegaTT approved these changes on Mar 7, 2025
Commits:
- interleaved all gather works with good PCC. Next step: add reduction kernel (dataflow + compute)
- wip reduction stuff.
- #0: added noc semaphore multicast in writer, seeing a hang on noc_sem_wait in reduction
- Add fix for reduction worker hang.
- Add reduction and output cb. Currently, the reduction kernel does a copy into the output cb (temporary). Next: add reduction compute kernel.
- All-reduce for FF1/FF3 works
- Add support for reshard. TODO: add support to drop padding from input tensor.
- Add support for unpadded shapes.
- Remove dprints.
- Fix bug in mcast bbox.
- Fix QKV output num cores.
- #0: multi-link support added for 3 all_reduce. Link=3 fails with kernel error "Cannot add semaphore on core (x=0,y=0). Max number of semaphores (8) reached." (./build_metal.sh --debug). Link=2 hangs at reduction signal wait only for second worker.
- #0: multi-link=3 works
- Add cleanup for multi-link.
- Rebase and fix/cleanup stuff.
- Clean up pytest and enable trace.
- Adding gsem fix for multi-iter.
- #0: added api to subtract corerangesets (see the sketch after this list)
- #0: updated choose_worker_cores function to omit reserved_cores
- #0: fix placement of link worker cores
- Add support for input shard not divisible by output shard.
- added all reduce into llama ccl perf test and added proper measurement of e2e trace perf
- Extend llama sharded all gather for LN
- updated perf target and packet size for best perf
- Add rebase fix.
- Add persistent intermediate tensor.
- Add test infra for loopback inputs.
- Fix hashing and bug in input split allocation.
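To illustrate the idea behind the "subtract corerangesets" and "choose_worker_cores ... omit reserved_cores" commits above, here is a minimal sketch of picking link-worker cores from a grid while excluding reserved cores. This is a hypothetical illustration only: the function names (`core_grid`, `subtract_core_sets`, `choose_worker_cores`) and the plain (x, y)-tuple representation are assumptions for readability, not the actual tt-metal CoreRangeSet API.

```python
# Hypothetical sketch (not the tt-metal API): select worker cores from a grid
# after subtracting cores reserved for other kernels/links.
from itertools import product


def core_grid(num_cols, num_rows):
    """All (x, y) core coordinates in a rectangular grid."""
    return {(x, y) for x, y in product(range(num_cols), range(num_rows))}


def subtract_core_sets(cores, reserved):
    """Set difference: cores that are not in the reserved set."""
    return cores - reserved


def choose_worker_cores(num_workers, grid, reserved):
    """Pick the first num_workers free cores in row-major order."""
    free = sorted(subtract_core_sets(grid, reserved), key=lambda c: (c[1], c[0]))
    if len(free) < num_workers:
        raise ValueError("not enough free cores for the requested workers")
    return free[:num_workers]


if __name__ == "__main__":
    grid = core_grid(8, 8)           # illustrative 8x8 grid
    reserved = {(0, 0), (1, 0)}      # e.g. cores already claimed by another link
    print(choose_worker_cores(4, grid, reserved))
```

The point of the subtraction step is that per-link worker placement can be computed against only the remaining free cores, which is what the "fix placement of link worker cores" commit addresses.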
Problem description
This is a duplicate of PR #18217, which was reverted due to clang-tidy failures.
The issue is now fixed.
Checklist