Revert Fix #2 for Minimal All Reduce PR #18757

Merged 20 commits from ar_persist into main on Mar 7, 2025

Conversation

@kpaigwar (Contributor) commented on Mar 7, 2025

Problem description

This is a duplicate of PR #18217, which was reverted due to clang-tidy failures.
The clang-tidy issue is now fixed.

Checklist

@kpaigwar changed the title from "Ar persist" to "Revert Fix #2 for Minimal All Reduce PR" on Mar 7, 2025
@tt-rkim marked this pull request as ready for review on March 7, 2025 at 21:02
avoraTT and others added 16 commits March 7, 2025 16:28
interleaved all gather works with good PCC. Next step: add reduction kernel (dataflow + compute)

wip reduction stuff.

#0: added noc semaphore multicast in writer, seeing a hang on noc_sem_wait in reduction
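
For reference, a minimal sketch of the sender/receiver handshake this commit describes, written with the tt-metal dataflow API calls as I understand them (get_semaphore, noc_semaphore_set, noc_semaphore_set_multicast, noc_semaphore_wait); the runtime-arg layout, mcast bounding box, and counts are assumptions, not the PR's actual kernels:

```cpp
// --- Writer kernel fragment (sender side), illustrative runtime-arg layout ---
uint32_t sem_id    = get_arg_val<uint32_t>(0);
uint32_t start_x   = get_arg_val<uint32_t>(1);  // mcast bbox start (NoC coords)
uint32_t start_y   = get_arg_val<uint32_t>(2);
uint32_t end_x     = get_arg_val<uint32_t>(3);  // mcast bbox end (NoC coords)
uint32_t end_y     = get_arg_val<uint32_t>(4);
uint32_t num_dests = get_arg_val<uint32_t>(5);  // number of reduction workers

uint32_t sem_l1_addr = get_semaphore(sem_id);
volatile tt_l1_ptr uint32_t* sem_ptr =
    reinterpret_cast<volatile tt_l1_ptr uint32_t*>(sem_l1_addr);
noc_semaphore_set(sem_ptr, 1);  // raise locally, then multicast the value out
uint64_t mcast_sem_addr =
    get_noc_multicast_addr(start_x, start_y, end_x, end_y, sem_l1_addr);
noc_semaphore_set_multicast(sem_l1_addr, mcast_sem_addr, num_dests);

// --- Reduction kernel fragment (receiver side) ---
// Blocks until the writer's multicast lands; a hang here usually means the
// mcast bounding box or num_dests did not actually cover this core.
noc_semaphore_wait(sem_ptr, 1);
noc_semaphore_set(sem_ptr, 0);  // reset for the next iteration
```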

Add fix for reduction worker hang.

Add reduction and output cb. Currently, the reduction kernel does a copy into the output cb (temporary). Next: add reduction compute kernel.
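
As a rough sketch of the temporary copy-through described here, using the standard tt-metal circular-buffer dataflow calls (cb_wait_front, cb_reserve_back, cb_push_back, cb_pop_front); the CB indices, tile size, and loop bounds are illustrative assumptions:

```cpp
constexpr uint32_t cb_reduction = 0;     // input CB index (illustrative)
constexpr uint32_t cb_output    = 16;    // output CB index (illustrative)
constexpr uint32_t tile_bytes   = 2048;  // e.g. one 32x32 bfloat16 tile

uint32_t num_tiles = get_arg_val<uint32_t>(0);
for (uint32_t t = 0; t < num_tiles; ++t) {
    cb_wait_front(cb_reduction, 1);   // wait for one input tile
    cb_reserve_back(cb_output, 1);    // reserve one slot in the output CB
    // Temporary: plain L1-to-L1 copy through the NoC. The follow-up compute
    // kernel replaces this with an actual reduction of the gathered partials.
    uint64_t src_noc_addr = get_noc_addr(get_read_ptr(cb_reduction));
    noc_async_read(src_noc_addr, get_write_ptr(cb_output), tile_bytes);
    noc_async_read_barrier();
    cb_push_back(cb_output, 1);
    cb_pop_front(cb_reduction, 1);
}
```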

All-reduce for FF1/FF3 works

Add support for reshard. TODO: add support to drop padding from input tensor.

Add support for unpadded shapes.

Remove dprints.

Fix bug in mcast bbox. Fix QKV output num cores.

#0: multi-link support (3 links) added for all_reduce. Link=3 fails with kernel error "Cannot add semaphore on core (x=0,y=0). Max number of semaphores (8) reached." (./build_metal.sh --debug). Link=2 hangs at the reduction signal wait, only for the second worker.
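
For context on the semaphore error: each worker core only has a small fixed pool of semaphore slots (8, per the message above), so stacking per-link semaphores on overlapping core ranges exhausts it quickly. A hedged host-side sketch of that allocation pattern; CreateSemaphore is the tt-metal host API as I understand it, while the header path, helper name, and counts are assumptions:

```cpp
#include <cstdint>
#include <vector>
#include <tt-metalium/host_api.hpp>  // header path is an assumption

// One handshake semaphore per link on the same worker core set. With several
// semaphores per link and 3 links, an overlapping core can run past the
// 8-slot budget and trigger "Max number of semaphores (8) reached".
std::vector<uint32_t> create_link_semaphores(
    tt::tt_metal::Program& program,
    const CoreRangeSet& worker_cores,
    uint32_t num_links) {
    std::vector<uint32_t> sems;
    for (uint32_t link = 0; link < num_links; ++link) {
        sems.push_back(tt::tt_metal::CreateSemaphore(program, worker_cores, /*initial_value=*/0));
    }
    return sems;
}
```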

#0: multi-link=3 works

Add cleanup for multi-link.

Rebase and fix/cleanup stuff.

Clean up pytest and enable trace.

Adding gsem fix for multi-iter.

#0: added api to subtract corerangesets

#0: updated choose_worker_cores function to omit reserved_cores
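
A hypothetical sketch of how these two commits fit together: subtract the reserved cores from the candidate grid, then pick worker cores from what remains. The subtraction method and choose_worker_cores signature are named after the commit messages and are assumptions, not the PR's actual code:

```cpp
#include <cstdint>
#include <vector>

// Assumes CoreCoord / CoreRange / CoreRangeSet from tt-metal, and that the new
// subtraction API behaves like set difference on core ranges (name assumed).
std::vector<CoreCoord> choose_worker_cores(
    const CoreRangeSet& candidate_grid,
    const CoreRangeSet& reserved_cores,
    uint32_t num_workers_needed) {
    CoreRangeSet available = candidate_grid.subtract(reserved_cores);

    std::vector<CoreCoord> workers;
    for (const CoreRange& range : available.ranges()) {
        for (uint32_t y = range.start_coord.y; y <= range.end_coord.y; ++y) {
            for (uint32_t x = range.start_coord.x; x <= range.end_coord.x; ++x) {
                if (workers.size() == num_workers_needed) return workers;
                workers.push_back(CoreCoord{x, y});
            }
        }
    }
    return workers;  // may come up short if the remaining grid is too small
}
```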

#0: fix placement of link worker cores

Add support for input shard not divisible by output shard.
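
A small illustration of the bookkeeping this requires: when the input shard size is not a multiple of the output shard size, the shard count comes from a ceiling division and the last shard is simply smaller. Purely illustrative, not the PR's actual helper:

```cpp
#include <cstdint>
#include <vector>

// Split `input_shard_elems` into output shards of at most `output_shard_elems`
// elements each. When the sizes don't divide evenly, the final shard is smaller.
std::vector<uint32_t> split_uneven_shard(uint32_t input_shard_elems,
                                         uint32_t output_shard_elems) {
    uint32_t num_output_shards =
        (input_shard_elems + output_shard_elems - 1) / output_shard_elems;  // ceil-div
    std::vector<uint32_t> sizes(num_output_shards, output_shard_elems);
    uint32_t remainder = input_shard_elems % output_shard_elems;
    if (remainder != 0) sizes.back() = remainder;  // last shard holds the leftover
    return sizes;
}
// e.g. split_uneven_shard(100, 32) -> {32, 32, 32, 4}
```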

added all reduce into llama ccl perf test and added proper measurement of e2e trace perf

Extend llama sharded all gather for LN

updated perf target and packet size for best perf

Add rebase fix.

Add persistent intermediate tensor.

Add test infra for loopback inputs. Fix hashing and bug in input split allocation.
@tt-rkim merged commit ebaa92d into main on Mar 7, 2025 (14 checks passed).
@tt-rkim deleted the ar_persist branch on March 7, 2025 at 21:54.