4090 with P2P, alltoall is too low #1603

Open
artetaout opened this issue Feb 12, 2025 · 2 comments

Comments

@artetaout

We enabled P2P on 4x 4090. With NCCL_P2P_LEVEL=SYS, allreduce improves over SHM: BUSBW goes from 18 to 22.

With alltoall, however, BUSBW drops from 18 to 2 compared to SHM.

Why? We don't have a PCI switch.
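
For readers comparing the BUSBW figures above: assuming they come from nccl-tests (the post does not say which tool was used), bus bandwidth is derived from algorithm bandwidth with a per-collective correction factor, `2(n-1)/n` for allreduce and `(n-1)/n` for alltoall. The sketch below only illustrates that conversion; the function name and the sample input are hypothetical, not values from this issue.

```python
def bus_bandwidth(alg_bw: float, n_ranks: int, collective: str) -> float:
    """Convert algorithm bandwidth to bus bandwidth.

    Uses the correction factors documented for nccl-tests
    (assumption: the numbers in this thread come from nccl-tests).
    """
    if n_ranks < 2:
        raise ValueError("need at least 2 ranks")
    factors = {
        "allreduce": 2 * (n_ranks - 1) / n_ranks,
        "alltoall": (n_ranks - 1) / n_ranks,
    }
    if collective not in factors:
        raise ValueError(f"unsupported collective: {collective}")
    return alg_bw * factors[collective]


if __name__ == "__main__":
    # Hypothetical algorithm bandwidth of 4.0 on 4 ranks, for illustration only.
    for name in ("allreduce", "alltoall"):
        print(name, bus_bandwidth(4.0, 4, name))
```

Because the factors differ, the same BUSBW value corresponds to different algorithm bandwidths for the two collectives, so the 18 -> 22 and 18 -> 2 changes are best compared within each collective rather than across them.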

@AddyLaddy
Collaborator

In general CPUs make very poor PCI switches and we often find that A2A performance is bad when P2P is used across CPUs.
Hence, we normally disable P2P and bounce the communication via Host (SHM) buffers instead.
AllReduce puts less stress on the CPU interconnect, and we believe that is why it doesn't exhibit the same slowdown.
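
A rough way to picture the difference in traffic pattern is to count how many peer-to-peer flows cross the CPU interconnect at once. The sketch below assumes a hypothetical layout (GPUs 0 and 1 on one socket, GPUs 2 and 3 on the other) and a ring allreduce; the actual topology in this issue and the algorithm NCCL picks are not known from the thread. It counts flows only, not bytes, and is not a performance model.

```python
from itertools import permutations

# Hypothetical GPU -> CPU socket mapping; not taken from the issue.
SOCKET_OF = {0: 0, 1: 0, 2: 1, 3: 1}


def ring_flows(ranks):
    """Directed neighbor-to-neighbor flows of a single ring."""
    return [(ranks[i], ranks[(i + 1) % len(ranks)]) for i in range(len(ranks))]


def alltoall_flows(ranks):
    """Directed flows when every rank sends to every other rank."""
    return list(permutations(ranks, 2))


def cross_socket(flows, socket_of):
    """Keep only the flows whose source and destination sit on different sockets."""
    return [(src, dst) for src, dst in flows if socket_of[src] != socket_of[dst]]


if __name__ == "__main__":
    ranks = [0, 1, 2, 3]
    print("ring allreduce:", len(cross_socket(ring_flows(ranks), SOCKET_OF)))
    print("alltoall:", len(cross_socket(alltoall_flows(ranks), SOCKET_OF)))
```

Under these assumptions the ring has 2 flows crossing sockets while alltoall has 8, which is consistent with the explanation above but does not by itself establish the cause of the slowdown.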

@kiskra-nvidia
Member

Also, activating P2P on 4090 is apparently not supported; see, e.g., https://forums.developer.nvidia.com/t/standard-nvidia-cuda-tests-fail-with-dual-rtx-4090-linux-box/233202/15
