Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zgc/dipu support ascend hccl2 #651

Conversation

zhaoguochun1995
Copy link
Collaborator

@zhaoguochun1995 zhaoguochun1995 commented Jan 19, 2024

修复多卡通信时HcclAllreduce的bug:
截屏2024-01-17 下午3 33 35
在多卡训练时会报sdma错误,经和华为工程师一起联合调试,定位为hccl内部dma拷贝时内存访问越界。
后面华为工程师定位出来,我们在训练时另一个线程调用了setDevice接口,触发了他们的bug。
当前是否调用了setDevice的标志是个thread_local的标记,尽管已经在通信时根据rank 进行了设备的设置,但是在另一个线程里却没有更新这个标记,导致另一个线程会重新setDevice。
与此同时,setDevcie 和getDevice 调用非常频繁,有必要做下优化: 当目前已经是要setDevice的设备时,可以不用set。get时可以用上次更新后的值。

return static_cast<deviceId_t>(currentDeviceIndex);
}

// set current device given device according to id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

加个注释吧, 说明下 原来的 使用线程局部量 的flag 导致的问题和原因。 因为后面可能还是要支持多线程 卡切换的 (单进程多卡),后面的容易踩坑。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • 朝兴看下, 我记得朝兴当初加这个 线程局部量的 flag 是为了解决多线程 device 切换的啥问题? 现在按照国春的说法,我们目前初始化华为的通信库方式, 这种方式也会有问题。 这其实 又回到了朝兴修改前的状态。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

确认下华为 的 aclrtGet/setDevice 是否支持线程级别的bind。 然后这里加注释说明下, 这个和 nv的不一致。后面真要做单进程多卡会,埋坑。

Copy link
Collaborator

@fandaoyi fandaoyi Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

备注: 现在出现个坑,pytorch backward 执行时会初始化一组bwd线程, 这组线程会和主线程共同执行对应 device 上的bwd op。主线程对应 bwd root node (loss) 对应的device, 每个子线程对应 一个进程可见的 device。 如果这块的逻辑不改的话, 子线程必须要set_device(). 否则无法执行 backward。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

线程和设备不一定是绑定的关系吧?一个线程在不同的时刻有可能使用不同的设备,一个设备也可能会被不同的线程使用。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个还是得用DevcieGuard保证在线程中使用正确的设备?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

用哪个设备计算,不应该取决于是在哪个线程,而是相关的tensor分配在哪个设备上吧

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

单进程单卡、单进程多卡相关的逻辑应该加在上层,不应该加在设备接口适配层。

@@ -16,21 +17,29 @@ namespace devapis {
// Device class related
// =====================
using ascend_deviceId = int32_t;
thread_local bool setDevFlag = false;

std::atomic<int> currentDeviceIndex(-1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

atomic_int

Comment on lines 37 to 38
if (devId != currentDeviceIndex) {
currentDeviceIndex = devId;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

按我的理解,这样写不是 thread_safe 的,得用 exchange

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

首先 我不太建议这块用原子变量, 这个调用比较频繁, 而且这么做有bug。 我打了个 补丁, fix 了internlm 跑不过的问题,这个变量被改名为 mainThreadDeviceIdx 而且只会被主线程设置。我的补丁里 仅 使用 volatile 前缀,

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

文件权限变了,注意一下

thread_local bool setDevFlag = false;

// NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
volatile deviceId_t mainThreadDeviceIdx = -1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

volatile object - an object whose type is volatile-qualified, or a subobject of a volatile object, or a mutable subobject of a const-volatile object. Every access (read or write operation, member function call, etc.) made through a glvalue expression of volatile-qualified type is treated as a visible side-effect for the purposes of optimization (that is, within a single thread of execution, volatile accesses cannot be optimized out or reordered with another visible side effect that is sequenced-before or sequenced-after the volatile access. This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution, see std::memory_order). Any attempt to access a volatile object through a glvalue of non-volatile type (e.g. through a reference or pointer to non-volatile type) results in undefined behavior.

这里用法有问题,我猜你需要的还是 atomic

@zhaoguochun1995
Copy link
Collaborator Author

2024-1-19号升级驱动后该问题消失

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants