
feat(models): multibackend all_to_all wrapper #95

Open
wants to merge 5 commits into main
Conversation


@cathalobrien cathalobrien commented Jan 27, 2025

Small PR adding a fallback to support alltoall when using the Gloo backend of torch.distributed. This PR is needed to be able to run the transformer model on CPU. For the 99.9% of users running on GPUs with the NCCL backend, this change should not affect them.

Gloo does not offer an alltoall primitive, as shown here

This PR implements an all_to_all fallback for Gloo, using the 'Linear Shift' algorithm from Hoffman and Rünger, 2013. Because the torch.distributed syntax changed in torch 2.6, older versions of torch are not supported.
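The linear shift pattern can be sketched in plain Python (the function name here is illustrative, not the PR's actual API): at step s, rank r sends to rank (r + s) % p and receives from rank (r - s) % p, so p - 1 steps pair every rank with every other rank exactly once.

```python
def linear_shift_schedule(world_size: int, rank: int):
    """Peers contacted at each step of a linear-shift all-to-all.

    At step s, this rank sends to (rank + s) % world_size and
    receives from (rank - s) % world_size; after world_size - 1
    steps it has exchanged a chunk with every other rank.
    (Illustrative sketch only, not the PR's implementation.)
    """
    steps = []
    for s in range(1, world_size):
        send_to = (rank + s) % world_size
        recv_from = (rank - s) % world_size
        steps.append((send_to, recv_from))
    return steps

# Sanity check: over all steps, rank 0 of 4 contacts ranks 1, 2, 3.
print(linear_shift_schedule(4, 0))  # → [(1, 3), (2, 2), (3, 1)]
```

Pairing each send with the symmetric receive this way keeps every step deadlock-free, since each rank's send target and receive source are always distinct ranks (or the same peer in the middle step of an even-sized group).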

@cathalobrien cathalobrien changed the title multibackend alltoall wrapper feat(models): multibackend all_to_all wrapper Jan 27, 2025
@cathalobrien
Contributor Author

I wrote a script here to benchmark this Gloo alltoall and test its correctness against NCCL.

python alltoall_test.py
Running on all-to-all on CPU over 4 members
Input list total size is 64.0MB
------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                          Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                    c10d::send         0.14%       1.722ms         0.14%       1.722ms     573.918us             3
                     gloo:send         0.00%       0.000us             0       13.540s        4.513s             3
                   c10d::recv_         0.00%      36.513us         0.00%      36.513us      12.171us             3
                     gloo:recv         0.00%       0.000us             0       18.687s        6.229s             3
            cudaGetDeviceCount         0.12%       1.441ms         0.12%       1.441ms     720.274us             2
------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.230s

Running on all-to-all on GPU over 4 members
Input list total size is 64.0MB
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                        c10d::alltoall_         0.02%      71.892us        80.85%     272.351ms     272.351ms       0.000us         0.00%     129.453ms     129.453ms             1
                                     record_param_comms        49.68%     167.358ms        80.83%     272.284ms     136.142ms      64.727ms        99.45%     129.453ms      64.727ms             2
                                  cudaStreamIsCapturing         0.00%       3.088us         0.00%       3.088us       3.088us       0.000us         0.00%       0.000us       0.000us             1
                                        cudaEventRecord         0.00%      12.800us         0.00%      12.800us       2.133us       0.000us         0.00%       0.000us       0.000us             6
                                    cudaStreamWaitEvent         0.00%       9.823us         0.00%       9.823us       1.965us       0.000us         0.00%       0.000us       0.000us             5
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 336.872ms
Self CUDA time total: 65.085ms

Does CPU and GPU output match? True

CPU runtime looks about an order of magnitude slower at modest input sizes, but the output is correct.
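The correctness property being checked here (each rank ends up with the j-th chunk from every peer, i.e. the transpose of the per-rank input lists) can also be verified in-process with a small simulation. This is a sketch under that transpose assumption, not the benchmark script itself:

```python
def simulate_all_to_all(inputs):
    """Simulate all_to_all across p ranks in one process.

    inputs[r][j] is the chunk rank r sends to rank j; the result is
    the transpose, so outputs[r][j] is the chunk rank r received
    from rank j. (Illustrative sketch, not the PR's implementation.)
    """
    p = len(inputs)
    outputs = [[None] * p for _ in range(p)]
    for r in range(p):
        outputs[r][r] = inputs[r][r]  # own chunk: no communication needed
        for s in range(1, p):         # p - 1 linear-shift steps
            peer = (r + s) % p        # rank r sends its `peer`-th chunk
            outputs[peer][r] = inputs[r][peer]
    return outputs

# Each chunk is tagged with (sender, receiver) so mismatches are obvious.
inputs = [[f"r{r}->r{j}" for j in range(4)] for r in range(4)]
out = simulate_all_to_all(inputs)
print(out[2][0])  # → r0->r2 (the chunk rank 2 received from rank 0)
```

A check like this catches schedule bugs (wrong peer, wrong chunk index) without needing a multi-process launch, which is useful before running the distributed benchmark above.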

Status: Under Review