Revert Fix #2 for Minimal All Reduce PR #18757

Merged 20 commits from ar_persist into main on Mar 7, 2025

Conversation

@kpaigwar (Contributor) commented on Mar 7, 2025

Problem description

This is a duplicate of PR #18217, which was reverted due to clang-tidy failures.
The clang-tidy issue is now fixed.

Checklist

@kpaigwar changed the title from "Ar persist" to "Revert Fix #2 for Minimal All Reduce PR" on Mar 7, 2025
@tt-rkim marked this pull request as ready for review on March 7, 2025 at 21:02
avoraTT and others added 16 commits March 7, 2025 16:28
interleaved all gather works with good PCC. Next step: add reduction kernel (dataflow + compute)

wip reduction stuff.

#0: added noc semaphore multicast in writer, seeing a hang on noc_sem_wait in reduction
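
For reference, a minimal sketch of the sender/receiver handshake this commit describes, written with the tt-metal dataflow API calls as I understand them (get_semaphore, noc_semaphore_set, noc_semaphore_set_multicast, noc_semaphore_wait); the runtime-arg layout, mcast bounding box, and counts are assumptions, not the PR's actual kernels:

```cpp
// --- Writer kernel fragment (sender side), illustrative runtime-arg layout ---
uint32_t sem_id    = get_arg_val<uint32_t>(0);
uint32_t start_x   = get_arg_val<uint32_t>(1);  // mcast bbox start (NoC coords)
uint32_t start_y   = get_arg_val<uint32_t>(2);
uint32_t end_x     = get_arg_val<uint32_t>(3);  // mcast bbox end (NoC coords)
uint32_t end_y     = get_arg_val<uint32_t>(4);
uint32_t num_dests = get_arg_val<uint32_t>(5);  // number of reduction workers

uint32_t sem_l1_addr = get_semaphore(sem_id);
volatile tt_l1_ptr uint32_t* sem_ptr =
    reinterpret_cast<volatile tt_l1_ptr uint32_t*>(sem_l1_addr);
noc_semaphore_set(sem_ptr, 1);  // raise locally, then multicast the value out
uint64_t mcast_sem_addr =
    get_noc_multicast_addr(start_x, start_y, end_x, end_y, sem_l1_addr);
noc_semaphore_set_multicast(sem_l1_addr, mcast_sem_addr, num_dests);

// --- Reduction kernel fragment (receiver side) ---
// Blocks until the writer's multicast lands; a hang here usually means the
// mcast bounding box or num_dests did not actually cover this core.
noc_semaphore_wait(sem_ptr, 1);
noc_semaphore_set(sem_ptr, 0);  // reset for the next iteration
```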

Add fix for reduction worker hang.

Add reduction and output cb. Currently, the reduction kernel does a copy into the output cb (temporary). Next: add reduction compute kernel.
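
As a rough sketch of the temporary copy-through described here, using the standard tt-metal circular-buffer dataflow calls (cb_wait_front, cb_reserve_back, cb_push_back, cb_pop_front); the CB indices, tile size, and loop bounds are illustrative assumptions:

```cpp
constexpr uint32_t cb_reduction = 0;     // input CB index (illustrative)
constexpr uint32_t cb_output    = 16;    // output CB index (illustrative)
constexpr uint32_t tile_bytes   = 2048;  // e.g. one 32x32 bfloat16 tile

uint32_t num_tiles = get_arg_val<uint32_t>(0);
for (uint32_t t = 0; t < num_tiles; ++t) {
    cb_wait_front(cb_reduction, 1);   // wait for one input tile
    cb_reserve_back(cb_output, 1);    // reserve one slot in the output CB
    // Temporary: plain L1-to-L1 copy through the NoC. The follow-up compute
    // kernel replaces this with an actual reduction of the gathered partials.
    uint64_t src_noc_addr = get_noc_addr(get_read_ptr(cb_reduction));
    noc_async_read(src_noc_addr, get_write_ptr(cb_output), tile_bytes);
    noc_async_read_barrier();
    cb_push_back(cb_output, 1);
    cb_pop_front(cb_reduction, 1);
}
```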

All-reduce for FF1/FF3 works

Add support for reshard. TODO: add support to drop padding from input tensor.

Add support for unpadded shapes.

Remove dprints.

Fix bug in mcast bbox. Fix QKV output num cores.

#0: multi-link support (3 links) added for all_reduce. Link=3 fails with kernel error "Cannot add semaphore on core (x=0,y=0). Max number of semaphores (8) reached." (./build_metal.sh --debug). Link=2 hangs at the reduction signal wait, only for the second worker.
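
For context on the semaphore error: each worker core only has a small fixed pool of semaphore slots (8, per the message above), so stacking per-link semaphores on overlapping core ranges exhausts it quickly. A hedged host-side sketch of that allocation pattern; CreateSemaphore is the tt-metal host API as I understand it, while the header path, helper name, and counts are assumptions:

```cpp
#include <cstdint>
#include <vector>
#include <tt-metalium/host_api.hpp>  // header path is an assumption

// One handshake semaphore per link on the same worker core set. With several
// semaphores per link and 3 links, an overlapping core can run past the
// 8-slot budget and trigger "Max number of semaphores (8) reached".
std::vector<uint32_t> create_link_semaphores(
    tt::tt_metal::Program& program,
    const CoreRangeSet& worker_cores,
    uint32_t num_links) {
    std::vector<uint32_t> sems;
    for (uint32_t link = 0; link < num_links; ++link) {
        sems.push_back(tt::tt_metal::CreateSemaphore(program, worker_cores, /*initial_value=*/0));
    }
    return sems;
}
```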

#0: multi-link=3 works

Add cleanup for multi-link.

Rebase and fix/cleanup stuff.

Clean up pytest and enable trace.

Adding gsem fix for multi-iter.

#0: added api to subtract corerangesets

#0: updated choose_worker_cores function to omit reserved_cores
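
A hypothetical sketch of how these two commits fit together: subtract the reserved cores from the candidate grid, then pick worker cores from what remains. The subtraction method and choose_worker_cores signature are named after the commit messages and are assumptions, not the PR's actual code:

```cpp
#include <cstdint>
#include <vector>

// Assumes CoreCoord / CoreRange / CoreRangeSet from tt-metal, and that the new
// subtraction API behaves like set difference on core ranges (name assumed).
std::vector<CoreCoord> choose_worker_cores(
    const CoreRangeSet& candidate_grid,
    const CoreRangeSet& reserved_cores,
    uint32_t num_workers_needed) {
    CoreRangeSet available = candidate_grid.subtract(reserved_cores);

    std::vector<CoreCoord> workers;
    for (const CoreRange& range : available.ranges()) {
        for (uint32_t y = range.start_coord.y; y <= range.end_coord.y; ++y) {
            for (uint32_t x = range.start_coord.x; x <= range.end_coord.x; ++x) {
                if (workers.size() == num_workers_needed) return workers;
                workers.push_back(CoreCoord{x, y});
            }
        }
    }
    return workers;  // may come up short if the remaining grid is too small
}
```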

#0: fix placement of link worker cores

Add support for input shard not divisible by output shard.
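
A small illustration of the bookkeeping this requires: when the input shard size is not a multiple of the output shard size, the shard count comes from a ceiling division and the last shard is simply smaller. Purely illustrative, not the PR's actual helper:

```cpp
#include <cstdint>
#include <vector>

// Split `input_shard_elems` into output shards of at most `output_shard_elems`
// elements each. When the sizes don't divide evenly, the final shard is smaller.
std::vector<uint32_t> split_uneven_shard(uint32_t input_shard_elems,
                                         uint32_t output_shard_elems) {
    uint32_t num_output_shards =
        (input_shard_elems + output_shard_elems - 1) / output_shard_elems;  // ceil-div
    std::vector<uint32_t> sizes(num_output_shards, output_shard_elems);
    uint32_t remainder = input_shard_elems % output_shard_elems;
    if (remainder != 0) sizes.back() = remainder;  // last shard holds the leftover
    return sizes;
}
// e.g. split_uneven_shard(100, 32) -> {32, 32, 32, 4}
```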

added all reduce into llama ccl perf test and added proper measurement of e2e trace perf

Extend llama sharded all gather for LN

updated perf target and packet size for best perf

Add rebase fix.

Add persistent intermediate tensor.

Add test infra for loopback inputs. Fix hashing and bug in input split allocation.
@tt-rkim merged commit ebaa92d into main on Mar 7, 2025 (14 checks passed).
@tt-rkim deleted the ar_persist branch on March 7, 2025 at 21:54.