
feat(models): multibackend all_to_all wrapper #95

Open
wants to merge 5 commits into main
Conversation


@cathalobrien cathalobrien commented Jan 27, 2025

Small PR adding a fallback to support alltoall when using the Gloo backend of torch.distributed. This PR is needed to be able to run the transformer model on CPU. For the 99.9% of users running on GPUs with the NCCL backend, this change should not affect them.

Gloo does not offer an alltoall primitive, as shown here

This PR implements an all_to_all fallback for Gloo, using the 'Linear Shift' algorithm from Hoffman and Rünger, 2013. Because the torch.distributed syntax changed in torch 2.6, older versions of torch are not supported.
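The linear shift pattern can be sketched in plain Python (the function name here is illustrative, not the PR's actual API): at step s, rank r sends to rank (r + s) % p and receives from rank (r - s) % p, so p - 1 steps pair every rank with every other rank exactly once.

```python
def linear_shift_schedule(world_size: int, rank: int):
    """Peers contacted at each step of a linear-shift all-to-all.

    At step s, this rank sends to (rank + s) % world_size and
    receives from (rank - s) % world_size; after world_size - 1
    steps it has exchanged a chunk with every other rank.
    (Illustrative sketch only, not the PR's implementation.)
    """
    steps = []
    for s in range(1, world_size):
        send_to = (rank + s) % world_size
        recv_from = (rank - s) % world_size
        steps.append((send_to, recv_from))
    return steps

# Sanity check: over all steps, rank 0 of 4 contacts ranks 1, 2, 3.
print(linear_shift_schedule(4, 0))  # → [(1, 3), (2, 2), (3, 1)]
```

Pairing each send with the symmetric receive this way keeps every step deadlock-free, since each rank's send target and receive source are always distinct ranks (or the same peer in the middle step of an even-sized group).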

@cathalobrien cathalobrien changed the title multibackend alltoall wrapper feat(models): multibackend all_to_all wrapper Jan 27, 2025
@cathalobrien
Contributor Author

I wrote a script here to benchmark this Gloo alltoall and test its correctness against NCCL.

python alltoall_test.py
Running on all-to-all on CPU over 4 members
Input list total size is 64.0MB
------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                          Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                    c10d::send         0.14%       1.722ms         0.14%       1.722ms     573.918us             3
                     gloo:send         0.00%       0.000us             0       13.540s        4.513s             3
                   c10d::recv_         0.00%      36.513us         0.00%      36.513us      12.171us             3
                     gloo:recv         0.00%       0.000us             0       18.687s        6.229s             3
            cudaGetDeviceCount         0.12%       1.441ms         0.12%       1.441ms     720.274us             2
------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.230s

Running on all-to-all on GPU over 4 members
Input list total size is 64.0MB
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                        c10d::alltoall_         0.02%      71.892us        80.85%     272.351ms     272.351ms       0.000us         0.00%     129.453ms     129.453ms             1
                                     record_param_comms        49.68%     167.358ms        80.83%     272.284ms     136.142ms      64.727ms        99.45%     129.453ms      64.727ms             2
                                  cudaStreamIsCapturing         0.00%       3.088us         0.00%       3.088us       3.088us       0.000us         0.00%       0.000us       0.000us             1
                                        cudaEventRecord         0.00%      12.800us         0.00%      12.800us       2.133us       0.000us         0.00%       0.000us       0.000us             6
                                    cudaStreamWaitEvent         0.00%       9.823us         0.00%       9.823us       1.965us       0.000us         0.00%       0.000us       0.000us             5
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 336.872ms
Self CUDA time total: 65.085ms

Does CPU and GPU output match? True

CPU runtime looks about an order of magnitude slower at modest input sizes, but the output is correct.
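The correctness property being checked here (each rank ends up with the j-th chunk from every peer, i.e. the transpose of the per-rank input lists) can also be verified in-process with a small simulation. This is a sketch under that transpose assumption, not the benchmark script itself:

```python
def simulate_all_to_all(inputs):
    """Simulate all_to_all across p ranks in one process.

    inputs[r][j] is the chunk rank r sends to rank j; the result is
    the transpose, so outputs[r][j] is the chunk rank r received
    from rank j. (Illustrative sketch, not the PR's implementation.)
    """
    p = len(inputs)
    outputs = [[None] * p for _ in range(p)]
    for r in range(p):
        outputs[r][r] = inputs[r][r]  # own chunk: no communication needed
        for s in range(1, p):         # p - 1 linear-shift steps
            peer = (r + s) % p        # rank r sends its `peer`-th chunk
            outputs[peer][r] = inputs[r][peer]
    return outputs

# Each chunk is tagged with (sender, receiver) so mismatches are obvious.
inputs = [[f"r{r}->r{j}" for j in range(4)] for r in range(4)]
out = simulate_all_to_all(inputs)
print(out[2][0])  # → r0->r2 (the chunk rank 2 received from rank 0)
```

A check like this catches schedule bugs (wrong peer, wrong chunk index) without needing a multi-process launch, which is useful before running the distributed benchmark above.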

Status: Under Review