4090 with P2P, alltoall is too low #1603

artetaout · 2025-02-12T07:27:37Z

We enabled P2P on a 4x 4090 machine. For allreduce, setting NCCL_P2P_LEVEL=SYS works well: compared with SHM, busbw goes from 18 to 22.

For alltoall, however, busbw drops from 18 with SHM to 2.

Why? We don't have a PCI switch.
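
For anyone trying to reproduce the comparison, here is a minimal sketch that builds the four benchmark command lines (allreduce and alltoall, each with P2P forced to SYS and with P2P disabled). It rests on assumptions not stated above: that the numbers come from the nccl-tests binaries `all_reduce_perf` and `alltoall_perf`, that they live in a hypothetical `./build` directory, and that the "SHM" baseline was obtained with `NCCL_P2P_DISABLE=1`. The message-size range is illustrative only. The script prints the commands and does not run them.

```python
import os
import shlex

# Assumed binary names from the nccl-tests repository; adjust build_dir
# to wherever the tests were built on your machine.
BENCHMARKS = ["all_reduce_perf", "alltoall_perf"]

# Transport settings to compare. Assumption: the SHM baseline is what
# NCCL falls back to on this machine when P2P is disabled.
CONFIGS = {
    "p2p_sys": {"NCCL_P2P_LEVEL": "SYS"},
    "shm": {"NCCL_P2P_DISABLE": "1"},
}


def build_runs(num_gpus=4, build_dir="./build"):
    """Return (label, env_overrides, argv) for each benchmark/config pair."""
    runs = []
    for bench in BENCHMARKS:
        for label, env in CONFIGS.items():
            argv = [
                os.path.join(build_dir, bench),
                "-b", "8",            # minimum message size in bytes
                "-e", "1G",           # maximum message size
                "-f", "2",            # multiply the size by 2 each step
                "-g", str(num_gpus),  # GPUs driven by this process
            ]
            runs.append((f"{bench}/{label}", env, argv))
    return runs


def format_run(env, argv):
    """Render one run as a shell command line for copy and paste."""
    prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in sorted(env.items()))
    return f"{prefix} {shlex.join(argv)}"


if __name__ == "__main__":
    for label, env, argv in build_runs():
        print(f"# {label}")
        print(format_run(env, argv))
```

Running each printed command in turn and comparing the busbw column at the largest message sizes should show the same pattern as reported here, if the assumptions hold.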

AddyLaddy · 2025-02-12T16:48:50Z

In general CPUs make very poor PCI switches and we often find that A2A performance is bad when P2P is used across CPUs.
Hence, we normally disable P2P and bounce the communication via Host (SHM) buffers instead.
AllReduce puts less stress on the CPU interconnect, and we believe that is why it doesn't exhibit the same slowdown.
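
A rough way to picture the difference is to count how many distinct GPU-to-GPU flows each pattern keeps active. The sketch below is an illustration under simplifying assumptions, not a description of NCCL internals: it assumes allreduce runs as a single ring in which each GPU sends only to its next neighbour, and that alltoall has every GPU sending to every other GPU at the same time. The function names are made up for this example.

```python
def ring_flows(n):
    """Directed flows in one step of a ring: GPU i sends to GPU i+1."""
    return {(i, (i + 1) % n) for i in range(n)}


def alltoall_flows(n):
    """Directed flows in an all-to-all: every GPU sends to every other GPU."""
    return {(src, dst) for src in range(n) for dst in range(n) if src != dst}


if __name__ == "__main__":
    n = 4  # GPU count from the original report
    print("ring flows:", len(ring_flows(n)))          # 4
    print("alltoall flows:", len(alltoall_flows(n)))  # 12
```

With 4 GPUs the ring has 4 concurrent flows while the all-to-all has 12, so under these assumptions more traffic competes for the CPU's PCI paths at once in the alltoall case.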

kiskra-nvidia · 2025-02-12T16:50:02Z

Also, activating P2P on 4090 is apparently not supported; see, e.g., https://forums.developer.nvidia.com/t/standard-nvidia-cuda-tests-fail-with-dual-rtx-4090-linux-box/233202/15
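
To see what the driver reports on a given machine, a minimal sketch using PyTorch's `torch.cuda.can_device_access_peer` is shown below. Using PyTorch here is an assumption made for convenience (it is not mentioned in this thread), and `peer_access_matrix` is a hypothetical helper name. The result only reflects what the driver advertises; a `True` entry does not by itself show that P2P transfers are supported or behave correctly on that hardware.

```python
def peer_access_matrix():
    """Return {(src, dst): bool} peer-access flags as reported via PyTorch.

    Returns an empty dict when PyTorch or CUDA is not available.
    """
    try:
        import torch
    except ImportError:
        return {}
    if not torch.cuda.is_available():
        return {}
    n = torch.cuda.device_count()
    return {
        (src, dst): torch.cuda.can_device_access_peer(src, dst)
        for src in range(n)
        for dst in range(n)
        if src != dst
    }


if __name__ == "__main__":
    matrix = peer_access_matrix()
    if not matrix:
        print("No peer pairs to report (no PyTorch, no CUDA, or a single GPU).")
    for (src, dst), ok in sorted(matrix.items()):
        print(f"GPU {src} -> GPU {dst}: peer access {'yes' if ok else 'no'}")
```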
