Zgc/dipu support ascend hccl2 #651

zhaoguochun1995 · 2024-01-19T07:01:50Z

修复多卡通信时HcclAllreduce的bug:

在多卡训练时会报sdma错误，经和华为工程师一起联合调试，定位为hccl内部dma拷贝时内存访问越界。
后面华为工程师定位出来，我们在训练时另一个线程调用了setDevice接口，触发了他们的bug。
当前是否调用了setDevice的标志是个thread_local的标记，尽管已经在通信时根据rank 进行了设备的设置，但是在另一个线程里却没有更新这个标记，导致另一个线程会重新setDevice。
与此同时，setDevcie 和getDevice 调用非常频繁，有必要做下优化：当目前已经是要setDevice的设备时，可以不用set。get时可以用上次更新后的值。

fandaoyi · 2024-01-19T10:26:02Z

dipu/torch_dipu/csrc_dipu/vendor/ascend/deviceimpl.cpp

+  return static_cast<deviceId_t>(currentDeviceIndex);
+}
+
+// set current device given device according to id


加个注释吧，说明下原来的使用线程局部量的flag 导致的问题和原因。因为后面可能还是要支持多线程卡切换的（单进程多卡），后面的容易踩坑。

朝兴看下，我记得朝兴当初加这个线程局部量的 flag 是为了解决多线程 device 切换的啥问题？现在按照国春的说法，我们目前初始化华为的通信库方式，这种方式也会有问题。这其实又回到了朝兴修改前的状态。

确认下华为的 aclrtGet/setDevice 是否支持线程级别的bind。然后这里加注释说明下，这个和 nv的不一致。后面真要做单进程多卡会，埋坑。

备注：现在出现个坑，pytorch backward 执行时会初始化一组bwd线程，这组线程会和主线程共同执行对应 device 上的bwd op。主线程对应 bwd root node （loss）对应的device，每个子线程对应一个进程可见的 device。如果这块的逻辑不改的话，子线程必须要set_device(). 否则无法执行 backward。

线程和设备不一定是绑定的关系吧？一个线程在不同的时刻有可能使用不同的设备，一个设备也可能会被不同的线程使用。

这个还是得用DevcieGuard保证在线程中使用正确的设备？

用哪个设备计算，不应该取决于是在哪个线程，而是相关的tensor分配在哪个设备上吧

单进程单卡、单进程多卡相关的逻辑应该加在上层，不应该加在设备接口适配层。

lljbash · 2024-01-24T09:50:55Z

dipu/torch_dipu/csrc_dipu/vendor/ascend/deviceimpl.cpp

@@ -16,21 +17,29 @@ namespace devapis {
 //  Device class related
 // =====================
 using ascend_deviceId = int32_t;
-thread_local bool setDevFlag = false;
+
+std::atomic<int> currentDeviceIndex(-1);


lljbash · 2024-01-24T09:52:29Z

dipu/torch_dipu/csrc_dipu/vendor/ascend/deviceimpl.cpp

+  if (devId != currentDeviceIndex) {
+    currentDeviceIndex = devId;


按我的理解，这样写不是 thread_safe 的，得用 exchange

首先我不太建议这块用原子变量，这个调用比较频繁，而且这么做有bug。我打了个补丁， fix 了internlm 跑不过的问题，这个变量被改名为 mainThreadDeviceIdx 而且只会被主线程设置。我的补丁里仅使用 volatile 前缀，

lljbash · 2024-01-24T10:04:47Z

dipu/torch_dipu/csrc_dipu/vendor/ascend/deviceimpl.cpp

文件权限变了，注意一下

lljbash · 2024-01-26T06:49:31Z

dipu/torch_dipu/csrc_dipu/vendor/ascend/deviceimpl.cpp

 thread_local bool setDevFlag = false;

+// NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
+volatile deviceId_t mainThreadDeviceIdx = -1;


volatile object - an object whose type is volatile-qualified, or a subobject of a volatile object, or a mutable subobject of a const-volatile object. Every access (read or write operation, member function call, etc.) made through a glvalue expression of volatile-qualified type is treated as a visible side-effect for the purposes of optimization (that is, within a single thread of execution, volatile accesses cannot be optimized out or reordered with another visible side effect that is sequenced-before or sequenced-after the volatile access. This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution, see std::memory_order). Any attempt to access a volatile object through a glvalue of non-volatile type (e.g. through a reference or pointer to non-volatile type) results in undefined behavior.

这里用法有问题，我猜你需要的还是 atomic

zhaoguochun1995 · 2024-01-30T08:51:54Z

2024-1-19号升级驱动后该问题消失

zhaoguochun1995 added 9 commits January 16, 2024 10:39

fix ascend device init

a2663e1

fix lint

87bebd5

minor change

6ef6c73

minor change

6a738cd

add ascend communicate test

4f67034

move some defines from header to cpp

5f3acfa

minor change

a197660

add acl fun call track

ee475e5

fix acend runtime bug: set to error device in other thread

755edc9

zhaoguochun1995 requested review from mrdanielw, fandaoyi and lljbash as code owners January 19, 2024 07:01

zhaoguochun1995 added 2 commits January 19, 2024 15:10

merge main

650a9e5

minor change

d3b884f

yangbofun approved these changes Jan 19, 2024

View reviewed changes

fandaoyi reviewed Jan 19, 2024

View reviewed changes

lljbash reviewed Jan 24, 2024

View reviewed changes

fandaoyi added 2 commits January 25, 2024 11:26

Merge branch 'main' into zgc/dipu_support_ascend_hccl2

d3d75fd

patch incorect device id in backward thread err

9f01762

lljbash reviewed Jan 26, 2024

View reviewed changes

fix linjie1 commit

34c7ca2

zhaoguochun1995 closed this Jan 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Zgc/dipu support ascend hccl2 #651

Zgc/dipu support ascend hccl2 #651

zhaoguochun1995 commented Jan 19, 2024 •

edited

Loading

fandaoyi Jan 19, 2024

fandaoyi Jan 19, 2024

fandaoyi Jan 19, 2024

fandaoyi Jan 25, 2024 •

edited

Loading

zhaoguochun1995 Jan 29, 2024

zhaoguochun1995 Jan 29, 2024

zhaoguochun1995 Jan 29, 2024

zhaoguochun1995 Jan 29, 2024

lljbash Jan 24, 2024

lljbash Jan 24, 2024

fandaoyi Jan 25, 2024

lljbash Jan 24, 2024

lljbash Jan 26, 2024

zhaoguochun1995 commented Jan 30, 2024

		if (devId != currentDeviceIndex) {
		currentDeviceIndex = devId;

Zgc/dipu support ascend hccl2 #651

Zgc/dipu support ascend hccl2 #651

Conversation

zhaoguochun1995 commented Jan 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fandaoyi Jan 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zhaoguochun1995 commented Jan 30, 2024

zhaoguochun1995 commented Jan 19, 2024 •

edited

Loading

fandaoyi Jan 25, 2024 •

edited

Loading