Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(dipu,vendor,ascend): simple and thread-safe device management #746

Merged
merged 7 commits into from
Apr 12, 2024

Conversation

lljbash
Copy link
Collaborator

@lljbash lljbash commented Mar 25, 2024

See also: #682


  1. 重构,在不改变功能(不支持单进程绑定多卡,强制要求单进程只能绑定单卡,且只能绑定一次)的前提下,更清晰地定义了编程模型,降低了理解难度,便于后期维护。
  2. 增强了并发时的线程安全。

@lljbash lljbash added the DIPU DIPU related label Mar 25, 2024
@lljbash lljbash force-pushed the llj_ascend_device_set branch from cb27397 to 7cbb754 Compare March 25, 2024 04:31
@lljbash lljbash force-pushed the llj_ascend_device_set branch from 11757e7 to 35ccc9b Compare March 25, 2024 04:56
// atomically set global device id if it is uninit
// and anyway return the global device id
AscendDeviceId initOnceAndGetGlobalDeviceId(
AscendDeviceId device_id_if_uninit) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个 device_id_if_uninit 名字是啥意思?可以直接就叫 device_id 吗?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

就是 uninit 的时候会设置的

Copy link
Collaborator

@jfxu-st jfxu-st Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

就是 uninit 的时候会设置的

感觉在参数名后面加 if 还是怪怪的。保持参数名 device_id,把 if uninit 加到函数名里去,改成 initOnceIfUninitAndGetGlobalDeviceId 这样会不会更合适一点?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我现在改成了 AscendDeviceId tryInitAndAnywayGetProcessDevice(AscendDeviceId ascend_device_id)

Comment on lines 50 to 53
DIPU_LOGW(
"current_device() is called before setDevice(). Setting device to "
"default device.");
setDevice(kDeviceIdDefault);
Copy link
Collaborator

@jfxu-st jfxu-st Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

g_current_thread_device_id == kDeviceIdUninit 满足时,必须要手动调用一次 aclrtSetDevice 来为当前线程绑定硬件资源。但现在的写法会导致 aclrtSetDevice 可能不会被调用。考虑这种情况:在其他线程已经设置过了 global_device_id 的情况下,setDevice(kDeviceIdDefault) 不会调用 aclrtSetDevice 并报出一个 Trying to use multiple cards ... 的 warning。

怀疑上面这种没有处理好的情况就是导致跑模型跑着跑着会卡死的原因。

可以简单地改成下面几行来解决这个问题(这里删除了部分 warning 的原因:如果 g_current_thread_device_id 在调用 initOnceAndGetGlobalDeviceId 后被设置成了 0,无法判断是因为此处的 try init 成功了,还是因为此处 try init 失败了但在此之前其他线程已经把本进程的 global_device_id 设置成了 default device ):

Suggested change
DIPU_LOGW(
"current_device() is called before setDevice(). Setting device to "
"default device.");
setDevice(kDeviceIdDefault);
DIPU_LOGW("current_device() is called before setDevice()");
g_current_thread_device_id = initOnceAndGetGlobalDeviceId(kDeviceIdDefault);
DIPU_CALLACLRT(::aclrtSetDevice(static_cast<AscendDeviceId>(g_current_thread_device_id)))

如果觉得上面这种修改和 setDevice 里的逻辑有一定的重复的话,也可以考虑再抽象一个内部函数比如 setDeviceInternal 出来,实现下面的核心逻辑:

  auto ascend_device_id = static_cast<AscendDeviceId>(device_id);
  if (g_current_thread_device_id == kDeviceIdUninit) {
    g_current_thread_device_id = initOnceAndGetGlobalDeviceId(ascend_device_id);
    DIPU_CALLACLRT(::aclrtSetDevice(ascend_device_id))
  }

然后在 current_devicesetDevice 内都调用 setDeviceInternal。在 setDevice 中最后再通过检查来报 warning:

  if (ascend_device_id != g_current_thread_device_id) {
    DIPU_LOGW(
        "Trying to use multiple cards in the same process may cause unexpected "
        "results in hccl communication, such as sdma memory copy failure");
  }

无论哪一种改法,核心逻辑应当是当且仅当 g_current_thread_device_id == kDeviceIdUninit 时调用一次 aclrtSetDevice 来为当前线程绑定硬件资源。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

后面那个不等价了。我觉得给 setDevice 多处理一种情况可能比较合适。定义 kDeviceIdDefault = -2 之类的。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

后面那个不等价了。我觉得给 setDevice 多处理一种情况可能比较合适。

等价的。贴的代码里有一行我写错了,不应该是 DIPU_CALLACLRT(::aclrtSetDevice(ascend_device_id)),而应该是 DIPU_CALLACLRT(::aclrtSetDevice(g_current_thread_device_id)),这样就等价了。

定义 kDeviceIdDefault = -2 之类的。

增加 -2 之类的特殊值定义的话就又回到老路上去了,kDeviceIdDefault 的含义就没那么直观了,提升了理解难度,不利于后期维护。

Comment on lines 60 to 63
if (device_id < 0) {
DIPU_LOGW("Requested device id is invalid. Ignoring the request.");
return;
}
Copy link
Collaborator

@jfxu-st jfxu-st Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

根据 @fandaoyi 在群里的说法,“上层不会有传 -1 的 case, 传了报错就好, 就算处理也应该是上层处理, 这里不需要处理”。

Suggested change
if (device_id < 0) {
DIPU_LOGW("Requested device id is invalid. Ignoring the request.");
return;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已删。

constexpr AscendDeviceId kDeviceIdUninit = -1;
constexpr AscendDeviceId kDeviceIdDefault = 0;
// Thread-level cache for process-level device id
thread_local auto process_device_id_thread_cache = kDeviceIdUninit;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
thread_local auto process_device_id_thread_cache = kDeviceIdUninit;
thread_local auto g_process_device_id_thread_cache = kDeviceIdUninit;

这还是个全局

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// when and only when the thread-level cache is initialized.
process_device_id_thread_cache =
tryInitAndAnywayGetProcessDevice(ascend_device_id);
DIPU_CALLACLRT(::aclrtSetDevice(process_device_id_thread_cache))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set 不同的会有什么后果,会不会导致崩溃什么的,要不要阻止?

Copy link
Collaborator Author

@lljbash lljbash Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

可能看错了,你检查下就好

Copy link
Collaborator

@jfxu-st jfxu-st Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set 不同的会有什么后果,会不会导致崩溃什么的,要不要阻止?

根据 @zhaoguochun1995 在群里发的聊天记录截图,torch_npu 也是这样做的,无声无息地让第二次及以后的 set 不起效果,甚至连 warning 都没有……

所以这里的 if 检查如果失败了(已经被 set 过了),就什么都不做。我们比 torch_npu 多的就是如果这次 tryInitThreadDeviceCache() 的调用是通过显式调用 setDevice() 发起的,那么就报一个 warning。

这个问题后面 @jingguo-st 会进一步确认。

@fandaoyi fandaoyi merged commit 2f9cb9f into DeepLink-org:main Apr 12, 2024
30 checks passed
xuq7410 pushed a commit to xuq7410/deeplink.framework that referenced this pull request May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DIPU DIPU related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants