Commit
Fixed the bug in process group initialization
Summary: torch.distributed.new_group requires that all processes in the main group (i.e. all processes that are part of the distributed job) enter this function, even if they are not going to be members of the group. Additionally, groups should be created in the same order in all processes. https://pytorch.org/docs/stable/_modules/torch/distributed/distributed_c10d.html#new_group

The current implementation assumes that a process group id refers to the same set of ranks on every rank. However, that is not the case. For example, in llama4:

Rank 0
{\"pg_name\": \"0\", \"pg_desc\": \"default_pg\", \"backend_config\": \"cuda:nccl\", \"ranks\": [], \"group_size\": 16, \"group_count\": 5},
{\"pg_name\": \"1\", \"pg_desc\": \"DP\", \"backend_config\": \"cuda:nccl\", \"ranks\": [0, 8], \"group_size\": 2, \"group_count\": 5},
{\"pg_name\": \"2\", \"pg_desc\": \"MP\", \"backend_config\": \"cuda:nccl\", \"ranks\": [0, 1, 2, 3, 4, 5, 6, 7], \"group_size\": 8, \"group_count\": 5},
{\"pg_name\": \"3\", \"pg_desc\": \"TP\", \"backend_config\": \"cuda:nccl\", \"ranks\": [0, 1, 2, 3, 4, 5, 6, 7], \"group_size\": 8, \"group_count\": 5},
{\"pg_name\": \"4\", \"pg_desc\": \"PP\", \"backend_config\": \"cuda:nccl\", \"ranks\": [0], \"group_size\": 1, \"group_count\": 5}]"}

Rank 8
{\"pg_name\": \"0\", \"pg_desc\": \"default_pg\", \"backend_config\": \"cuda:nccl\", \"ranks\": [], \"group_size\": 16, \"group_count\": 5},
{\"pg_name\": \"1\", \"pg_desc\": \"DP\", \"backend_config\": \"cuda:nccl\", \"ranks\": [0, 8], \"group_size\": 2, \"group_count\": 5},
{\"pg_name\": \"2\", \"pg_desc\": \"MP\", \"backend_config\": \"cuda:nccl\", \"ranks\": [8, 9, 10, 11, 12, 13, 14, 15], \"group_size\": 8, \"group_count\": 5},
{\"pg_name\": \"3\", \"pg_desc\": \"TP\", \"backend_config\": \"cuda:nccl\", \"ranks\": [8, 9, 10, 11, 12, 13, 14, 15], \"group_size\": 8, \"group_count\": 5},
{\"pg_name\": \"4\", \"pg_desc\": \"PP\", \"backend_config\": \"cuda:nccl\", \"ranks\": [8], \"group_size\": 1, \"group_count\": 5}]"}
You can see that for pg_id = 2 (and likewise 3 and 4), the ranks differ between rank 0 and rank 8. This diff fixes the issue by using the group's rank list as the key. For every unique rank list, a new process group is created. The idea is that if two sorted rank lists are the same, they refer to the same process group. After the process groups are created, the process group ids in the ET file of the current rank are used to map each pg id to its process group. The pg ids from all other ranks are set to -1, since they are not used to run collectives.

Differential Revision: D64345603
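The keying scheme described above can be sketched roughly as follows. This is a minimal illustration and not the code in this diff: `build_process_groups`, `fake_create_group`, and the `trace` dictionary are hypothetical names, and the trace holds only the two ranks shown in the summary. In a real job, `create_group` would presumably wrap `torch.distributed.new_group(ranks=...)`, and every process would have to iterate over the same data so that groups are created in the same order everywhere. Instead of storing -1 for pg ids from other ranks, the sketch simply does not record them.

```python
from typing import Any, Callable, Dict, List, Tuple

RankKey = Tuple[int, ...]


def build_process_groups(
    per_rank_pgs: Dict[int, List[dict]],
    my_rank: int,
    create_group: Callable[[List[int]], Any],
) -> Tuple[Dict[RankKey, Any], Dict[int, Any]]:
    """Create one group per unique sorted rank list; map my_rank's pg ids to them."""
    groups_by_ranks: Dict[RankKey, Any] = {}
    pg_id_to_group: Dict[int, Any] = {}
    # Fixed iteration order, so every process creates groups in the same order.
    for rank in sorted(per_rank_pgs):
        for pg in per_rank_pgs[rank]:
            key = tuple(sorted(pg["ranks"]))
            if not key:
                # Empty list: the default group in the trace, assumed to exist already.
                continue
            if key not in groups_by_ranks:
                groups_by_ranks[key] = create_group(list(key))
            # Only the current rank's pg ids are needed to replay collectives.
            if rank == my_rank:
                pg_id_to_group[int(pg["pg_name"])] = groups_by_ranks[key]
    return groups_by_ranks, pg_id_to_group


def fake_create_group(ranks: List[int]) -> str:
    # Stand-in for a real constructor so the sketch runs without a distributed job.
    return "pg" + str(ranks)


# Hypothetical, trimmed-down version of the trace in the summary (ranks 0 and 8 only).
trace = {
    0: [
        {"pg_name": "0", "ranks": []},
        {"pg_name": "1", "ranks": [0, 8]},
        {"pg_name": "2", "ranks": list(range(0, 8))},
        {"pg_name": "3", "ranks": list(range(0, 8))},
        {"pg_name": "4", "ranks": [0]},
    ],
    8: [
        {"pg_name": "0", "ranks": []},
        {"pg_name": "1", "ranks": [0, 8]},
        {"pg_name": "2", "ranks": list(range(8, 16))},
        {"pg_name": "3", "ranks": list(range(8, 16))},
        {"pg_name": "4", "ranks": [8]},
    ],
}

groups, pg_map = build_process_groups(trace, my_rank=8, create_group=fake_create_group)
```

Because MP and TP have identical rank lists in this trace, they resolve to the same group under this scheme, which matches the rule that an identical sorted rank list means the same process group.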