-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(dipu,vendor,ascend): simple and thread-safe device management #746
fix(dipu,vendor,ascend): simple and thread-safe device management #746
Conversation
cb27397
to
7cbb754
Compare
11757e7
to
35ccc9b
Compare
// atomically set global device id if it is uninit | ||
// and anyway return the global device id | ||
AscendDeviceId initOnceAndGetGlobalDeviceId( | ||
AscendDeviceId device_id_if_uninit) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个 device_id_if_uninit
名字是啥意思?可以直接就叫 device_id
吗?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
就是 uninit 的时候会设置的
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
就是 uninit 的时候会设置的
感觉在参数名后面加 if 还是怪怪的。保持参数名 device_id
,把 if uninit 加到函数名里去,改成 initOnceIfUninitAndGetGlobalDeviceId
这样会不会更合适一点?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我现在改成了 AscendDeviceId tryInitAndAnywayGetProcessDevice(AscendDeviceId ascend_device_id)
。
DIPU_LOGW( | ||
"current_device() is called before setDevice(). Setting device to " | ||
"default device."); | ||
setDevice(kDeviceIdDefault); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
当 g_current_thread_device_id == kDeviceIdUninit
满足时,必须要手动调用一次 aclrtSetDevice
来为当前线程绑定硬件资源。但现在的写法会导致 aclrtSetDevice
可能不会被调用。考虑这种情况:在其他线程已经设置过了 global_device_id
的情况下,setDevice(kDeviceIdDefault)
不会调用 aclrtSetDevice
并报出一个 Trying to use multiple cards ...
的 warning。
怀疑上面这种没有处理好的情况就是导致跑模型跑着跑着会卡死的原因。
可以简单地改成下面几行来解决这个问题(这里删除了部分 warning 的原因:如果 g_current_thread_device_id
在调用 initOnceAndGetGlobalDeviceId
后被设置成了 0,无法判断是因为此处的 try init 成功了,还是因为此处 try init 失败了但在此之前其他线程已经把本进程的 global_device_id 设置成了 default device ):
DIPU_LOGW( | |
"current_device() is called before setDevice(). Setting device to " | |
"default device."); | |
setDevice(kDeviceIdDefault); | |
DIPU_LOGW("current_device() is called before setDevice()"); | |
g_current_thread_device_id = initOnceAndGetGlobalDeviceId(kDeviceIdDefault); | |
DIPU_CALLACLRT(::aclrtSetDevice(static_cast<AscendDeviceId>(g_current_thread_device_id))) |
如果觉得上面这种修改和 setDevice
里的逻辑有一定的重复的话,也可以考虑再抽象一个内部函数比如 setDeviceInternal
出来,实现下面的核心逻辑:
auto ascend_device_id = static_cast<AscendDeviceId>(device_id);
if (g_current_thread_device_id == kDeviceIdUninit) {
g_current_thread_device_id = initOnceAndGetGlobalDeviceId(ascend_device_id);
DIPU_CALLACLRT(::aclrtSetDevice(ascend_device_id))
}
然后在 current_device
和 setDevice
内都调用 setDeviceInternal
。在 setDevice
中最后再通过检查来报 warning:
if (ascend_device_id != g_current_thread_device_id) {
DIPU_LOGW(
"Trying to use multiple cards in the same process may cause unexpected "
"results in hccl communication, such as sdma memory copy failure");
}
无论哪一种改法,核心逻辑应当是当且仅当 g_current_thread_device_id == kDeviceIdUninit
时调用一次 aclrtSetDevice
来为当前线程绑定硬件资源。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
后面那个不等价了。我觉得给 setDevice 多处理一种情况可能比较合适。定义 kDeviceIdDefault = -2 之类的。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
后面那个不等价了。我觉得给 setDevice 多处理一种情况可能比较合适。
等价的。贴的代码里有一行我写错了,不应该是 DIPU_CALLACLRT(::aclrtSetDevice(ascend_device_id))
,而应该是 DIPU_CALLACLRT(::aclrtSetDevice(g_current_thread_device_id))
,这样就等价了。
定义 kDeviceIdDefault = -2 之类的。
增加 -2 之类的特殊值定义的话就又回到老路上去了,kDeviceIdDefault
的含义就没那么直观了,提升了理解难度,不利于后期维护。
if (device_id < 0) { | ||
DIPU_LOGW("Requested device id is invalid. Ignoring the request."); | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
根据 @fandaoyi 在群里的说法,“上层不会有传 -1 的 case, 传了报错就好, 就算处理也应该是上层处理, 这里不需要处理”。
if (device_id < 0) { | |
DIPU_LOGW("Requested device id is invalid. Ignoring the request."); | |
return; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
已删。
constexpr AscendDeviceId kDeviceIdUninit = -1; | ||
constexpr AscendDeviceId kDeviceIdDefault = 0; | ||
// Thread-level cache for process-level device id | ||
thread_local auto process_device_id_thread_cache = kDeviceIdUninit; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thread_local auto process_device_id_thread_cache = kDeviceIdUninit; | |
thread_local auto g_process_device_id_thread_cache = kDeviceIdUninit; |
这还是个全局
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// when and only when the thread-level cache is initialized. | ||
process_device_id_thread_cache = | ||
tryInitAndAnywayGetProcessDevice(ascend_device_id); | ||
DIPU_CALLACLRT(::aclrtSetDevice(process_device_id_thread_cache)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set 不同的会有什么后果,会不会导致崩溃什么的,要不要阻止?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
可能看错了,你检查下就好
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set 不同的会有什么后果,会不会导致崩溃什么的,要不要阻止?
根据 @zhaoguochun1995 在群里发的聊天记录截图,torch_npu 也是这样做的,无声无息地让第二次及以后的 set 不起效果,甚至连 warning 都没有……
所以这里的 if 检查如果失败了(已经被 set 过了),就什么都不做。我们比 torch_npu 多的就是如果这次 tryInitThreadDeviceCache() 的调用是通过显式调用 setDevice() 发起的,那么就报一个 warning。
这个问题后面 @jingguo-st 会进一步确认。
* fix mem leak
See also: #682