try fix tgs agitate #751

Merged
merged 23 commits into from
Apr 10, 2024

Collaborator

@zhaoguochun1995 zhaoguochun1995 commented Mar 28, 2024

[Image: screenshot, 2024-04-10 5:32:02 PM]
[Image]

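  1. llama2 70B on 1024 cards: the 10-hour average TGS rose from 320 to 351.
  2. llama2 70B on 1024 cards: CPU utilization dropped from 11% to 3%.
  3. llama2 70B on 1024 cards: device utilization fluctuates less. Over 10 hours, utilization fell to 30% only 7 times; the rest of the time it held steady at about 50%.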
    https://aicarrier.feishu.cn/wiki/Eh4lwUgHkivsjVkPtupcDPptnYd

@@ -18,6 +18,15 @@ namespace dipu {
// NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
std::mutex DIPURawDeviceAllocator::mutex_;

size_t kMaxAsyncResourcePoolLength = [](){
Collaborator

const

Collaborator

@wiryls wiryls Apr 8, 2024

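I think we need a single entry point to control this. Environment-variable reads are currently scattered across the code, which makes them hard to maintain.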

Collaborator Author

Yes, agreed.

@zhaoguochun1995 zhaoguochun1995 force-pushed the zgc/dipu_fix_tgs_agitate2 branch from ff63f12 to ea2db53 Compare April 8, 2024 08:39
@@ -31,7 +29,11 @@ class AsyncResourcePoolImpl : public AsyncResourcePool<T> {
public:
void add(const T& t, std::deque<DIPUEvent>& events) override {
std::lock_guard<mutex_t> lk(mutex_);
list_.emplace_back(t, std::move(events));
if (events.size() > 0) {
Collaborator

Does an entry need to be added to the list when events is empty? I thought it could be ignored.

Collaborator Author

Yes, it does; otherwise memory would leak.

Collaborator

I suggest reworking this logic as a whole later: for tensors that are not waiting on any stream, call restore() directly on destruction.

Collaborator Author

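Not calling restore on destruction is actually deliberate. The main reasons:

  1. It speeds up tensor destruction.
  2. Restoring at destruction time is of little use; only at allocation time does as much memory as possible need to have been reclaimed.
  3. If memory were reclaimed immediately on destruction, reads and writes on the stream might not have finished yet. Deferring reclamation reduces the chance of contention.
  4. restore may involve work such as defragmentation. Moving that potential cost to allocation time turns part of the waiting into useful CPU work.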
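
The deferred-restore design described in this thread can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `FakeEvent` stands in for `DIPUEvent`, and `DeferredPool` and its `restore` signature are hypothetical, not the actual `AsyncResourcePoolImpl`. It shows two points from the discussion: `add` only enqueues (so destruction stays cheap), and a block with an empty event list is still enqueued so that it can be reclaimed later instead of leaking.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical stand-in for a device event. The flag is shared so that a
// caller can mark the event as completed after it has been enqueued.
struct FakeEvent {
  std::shared_ptr<bool> done = std::make_shared<bool>(false);
  bool query() const { return *done; }
};

class DeferredPool {
 public:
  // Called when a tensor is destroyed: enqueue only, never wait.
  // Blocks with no events are enqueued too; dropping them would leak them.
  void add(void* ptr, std::deque<FakeEvent> events) {
    std::lock_guard<std::mutex> lock(mutex_);
    pending_.emplace_back(ptr, std::move(events));
  }

  // Called at allocation time: move every block whose events have all
  // completed into `free_blocks` and return how many were reclaimed.
  std::size_t restore(std::vector<void*>& free_blocks) {
    std::lock_guard<std::mutex> lock(mutex_);
    std::size_t reclaimed = 0;
    for (auto it = pending_.begin(); it != pending_.end();) {
      const auto& events = it->second;
      const bool ready =
          std::all_of(events.begin(), events.end(),
                      [](const FakeEvent& e) { return e.query(); });
      if (ready) {
        free_blocks.push_back(it->first);
        it = pending_.erase(it);
        ++reclaimed;
      } else {
        ++it;
      }
    }
    return reclaimed;
  }

 private:
  std::mutex mutex_;
  std::deque<std::pair<void*, std::deque<FakeEvent>>> pending_;
};

}  // namespace sketch
```

In this sketch a block with no events is reclaimed on the first `restore` call, while a block with an unfinished event stays queued until a later call finds the event completed.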

namespace dipu {

template <typename T>
T get_env_or_default(const char* env_name, const T& defalut_value) {
Collaborator

typo

Suggested change
T get_env_or_default(const char* env_name, const T& defalut_value) {
T get_env_or_default(const char* env_name, const T& default_value) {

T get_env_or_default(const char* env_name, const T& defalut_value) {
const char* env = std::getenv(env_name);
if (env == nullptr) {
return defalut_value;
Collaborator

@jfxu-st jfxu-st Apr 9, 2024

ditto

Suggested change
return defalut_value;
return default_value;
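
For readers following the review, here is a minimal sketch of how a helper like `get_env_or_default` could be completed and combined with the single entry point requested earlier in the conversation. Only the function signature and the `nullptr` check appear in the diff; the stream-based parsing, the fallback on a malformed value, the `EnvConfig` struct, the variable name `SKETCH_MAX_POOL_LENGTH`, and the default of 64 are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>

namespace sketch {

// Reads the environment variable `env_name` and parses it as a T.
// Returns `default_value` when the variable is unset.
template <typename T>
T get_env_or_default(const char* env_name, const T& default_value) {
  const char* env = std::getenv(env_name);
  if (env == nullptr) {
    return default_value;
  }
  std::istringstream stream{std::string(env)};
  T value{};
  if (!(stream >> value)) {
    // Assumption: a malformed value also falls back to the default.
    return default_value;
  }
  return value;
}

// Hypothetical single entry point: every tunable is read once, in one place,
// instead of calling std::getenv at scattered call sites.
struct EnvConfig {
  std::size_t max_async_resource_pool_length;

  static const EnvConfig& instance() {
    static const EnvConfig config{
        get_env_or_default<std::size_t>("SKETCH_MAX_POOL_LENGTH", 64)};
    return config;
  }
};

}  // namespace sketch
```

Call sites would then read `sketch::EnvConfig::instance().max_async_resource_pool_length` rather than parsing the environment themselves, which keeps names and defaults in one place.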

@zhaoguochun1995 zhaoguochun1995 force-pushed the zgc/dipu_fix_tgs_agitate2 branch from 33bcd11 to 98dd050 Compare April 9, 2024 08:12
std::this_thread::yield();
continue;
auto now = std::chrono::steady_clock::now();
auto elasped = now - start;
Collaborator

typo

Suggested change
auto elasped = now - start;
auto elapsed = now - start;

continue;
auto now = std::chrono::steady_clock::now();
auto elasped = now - start;
if (elasped < maxWaitTime) {
Collaborator

ditto

Suggested change
if (elasped < maxWaitTime) {
if (elapsed < maxWaitTime) {

@zhaoguochun1995 zhaoguochun1995 force-pushed the zgc/dipu_fix_tgs_agitate2 branch 3 times, most recently from 13d7299 to 4029d2d Compare April 9, 2024 12:37
std::this_thread::yield();
continue;
}
return;
Collaborator

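So in this case, if the resources are still not ready after waiting for a while, empty_resource_pool is simply not carried out? If so, the function's interface needs some design, for example something like "try empty" that returns false if the release did not succeed and true if it did. Otherwise, anyone reading the code will assume that "empty" means everything was released, when in fact it was not.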

Collaborator Author

Yes, that would indeed work.

Collaborator Author

As it stands, this would also cause problems when memory runs short; it needs to be refined.
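
The "try empty" interface suggested in this thread could look roughly like the sketch below. It mirrors the yield-and-retry loop visible in the diff, but the function name `try_empty_resource_pool`, the `Pool` interface (`empty`, `ready`, `pop`), and the `FakePool` stand-in are hypothetical, and the timeout is measured from the start of the call as an assumption.

```cpp
#include <chrono>
#include <thread>

namespace sketch {

// Hypothetical stand-in for the resource pool: `size` resources are queued,
// and only the first `ready_count` of them have finished on the device.
struct FakePool {
  int size;
  int ready_count;
  bool empty() const { return size == 0; }
  bool ready() const { return ready_count > 0; }
  void pop() {
    --size;
    --ready_count;
  }
};

// Releases queued resources until the pool is empty or `max_wait_time` has
// elapsed. Returns true only if the pool was fully emptied, so the caller
// can tell a completed release from a timed-out one.
template <typename Pool>
bool try_empty_resource_pool(Pool& pool,
                             std::chrono::milliseconds max_wait_time) {
  const auto start = std::chrono::steady_clock::now();
  while (!pool.empty()) {
    if (pool.ready()) {
      pool.pop();  // release the oldest resource
      continue;
    }
    const auto elapsed = std::chrono::steady_clock::now() - start;
    if (elapsed < max_wait_time) {
      std::this_thread::yield();  // give the device time to finish
      continue;
    }
    return false;  // timed out with resources still queued
  }
  return true;
}

}  // namespace sketch
```

A caller that runs out of memory can check the return value and decide whether to wait longer or report the failure, which addresses the concern that a silent early return looks like a successful release.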

Collaborator

@fandaoyi fandaoyi Apr 10, 2024

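Two notes:
Following my discussion with Guochun just now, I am recording two points. Both are long-standing issues that can be fixed separately later; they do not need to be part of this PR.
@yangbofun @mrdanielw @caikun-pjlab @Wrench-Git @jfxu-st

  1. The current allocator has a bug when allocating tensors on non-default streams and cannot guarantee memory safety. Guochun mentioned a simple fix yesterday (record the stream for every allocation), but that looks like it would hurt current performance noticeably. Another approach is to force the non-default stream to synchronize with the default stream at allocation time. This touches less of the existing code, but it is also unfriendly to allocations on non-default streams.
  2. Some of the current record-stream call sites (Copy, DIPUGuardImpl) compare against the default stream and skip recording. This also breaks when combined with tensors allocated on a non-default stream: for a tensor allocated on a non-default stream and then moved to the default stream for computation, the record is skipped. I suggest removing this check from the upper layers and letting the allocator alone decide whether a record is really needed. The allocator must first be able to correctly track the current stream at tensor allocation time in order to do this efficiently; otherwise every tensor copied on the default stream would degrade to always being recorded, undoing @jfxu-st's earlier optimization.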
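
The second point can be made concrete with a small sketch in which the allocator remembers each block's allocation stream and decides by itself whether a record is needed. This is not the DIPU allocator; `StreamAwareAllocator`, `StreamId`, and every method name below are hypothetical and exist only to illustrate the idea of comparing against the allocation stream rather than the default stream.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

namespace sketch {

using StreamId = std::int64_t;  // hypothetical stream handle

class StreamAwareAllocator {
 public:
  // Remember which stream was current when the block was allocated.
  void on_allocate(void* ptr, StreamId current_stream) {
    blocks_[ptr].alloc_stream = current_stream;
  }

  // Upper layers call this unconditionally. The allocator records the
  // stream only when it differs from the block's allocation stream, and
  // returns whether a record was needed.
  bool record_stream(void* ptr, StreamId stream) {
    auto it = blocks_.find(ptr);
    if (it == blocks_.end()) {
      return false;  // unknown block: nothing to track
    }
    if (stream == it->second.alloc_stream) {
      return false;  // same stream as the allocation: ordering already holds
    }
    it->second.extra_streams.insert(stream);
    return true;
  }

  // Number of distinct extra streams that must finish before reuse.
  std::size_t recorded_stream_count(void* ptr) const {
    auto it = blocks_.find(ptr);
    return it == blocks_.end() ? 0 : it->second.extra_streams.size();
  }

 private:
  struct Block {
    StreamId alloc_stream = 0;
    std::unordered_set<StreamId> extra_streams;
  };
  std::unordered_map<void*, Block> blocks_;
};

}  // namespace sketch
```

With this shape, a tensor allocated on stream 1 and later used on default stream 0 is recorded, while a tensor allocated and copied on the default stream is not, so the earlier optimization for default-stream copies would be preserved under these assumptions.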

@zhaoguochun1995 zhaoguochun1995 force-pushed the zgc/dipu_fix_tgs_agitate2 branch from 4029d2d to 3a04678 Compare April 10, 2024 02:50
@zhaoguochun1995 zhaoguochun1995 force-pushed the zgc/dipu_fix_tgs_agitate2 branch from 53c7755 to a5edb80 Compare April 10, 2024 02:58
@mrdanielw mrdanielw merged commit f01a8c1 into DeepLink-org:main Apr 10, 2024
30 checks passed
@wiryls wiryls deleted the zgc/dipu_fix_tgs_agitate2 branch April 11, 2024 10:00
xuq7410 pushed a commit to xuq7410/deeplink.framework that referenced this pull request May 23, 2024
* add rms norm op
* take functions_ext into the adaptor

7 participants