try fix tgs agitate #751
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
// Copyright (c) 2023, DeepLink. | ||
Check notice on line 1 in dipu/torch_dipu/csrc_dipu/runtime/core/allocator/DIPUCachingAllocator.cpp
|
||
|
||
#include "DIPUCachingAllocator.h" | ||
|
||
|
@@ -18,6 +18,15 @@ | |
// NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables) | ||
std::mutex DIPURawDeviceAllocator::mutex_; | ||
|
||
size_t kMaxAsyncResourcePoolLength = [](){ | ||
Review comment: const
Review comment: It feels like there should be a single entry point that controls this. Environment variables are currently read in too many scattered places, which makes the code hard to maintain.
Reply: Yes, agreed.
||
size_t maxAsyncResourcePoolLength = 8; | ||
const char* env = std::getenv("DIPU_MAX_ASYNC_RESOURCE_POOL_LENGTH"); | ||
wiryls marked this conversation as resolved.
|
||
if (env != nullptr) { | ||
maxAsyncResourcePoolLength = std::atoi(env); | ||
} | ||
return maxAsyncResourcePoolLength; | ||
}(); | ||
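The two review remarks on this initializer (make the value `const`, and read environment variables through one entry point) could be combined roughly as in the sketch below. The helper names `parseSizeOr` and `envSizeOr` are hypothetical and do not exist in DIPU; only `DIPU_MAX_ASYNC_RESOURCE_POOL_LENGTH` and the default of 8 come from the diff. The sketch also swaps `std::atoi` for `std::strtoll`, on the assumption that malformed or negative input should fall back to the default rather than silently become 0 or wrap to a very large `size_t`.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper (not part of DIPU): parse a non-negative decimal
// integer. Returns `fallback` if the text is null, empty, malformed,
// negative, or out of range for long long.
inline std::size_t parseSizeOr(const char* text, std::size_t fallback) {
  if (text == nullptr || *text == '\0') {
    return fallback;
  }
  errno = 0;
  char* end = nullptr;
  const long long value = std::strtoll(text, &end, 10);
  // Reject overflow, no digits consumed, trailing garbage, and negatives.
  if (errno != 0 || end == text || *end != '\0' || value < 0) {
    return fallback;
  }
  return static_cast<std::size_t>(value);
}

// Hypothetical single entry point for size-valued environment variables.
inline std::size_t envSizeOr(const char* name, std::size_t fallback) {
  return parseSizeOr(std::getenv(name), fallback);
}

// Same variable name and default as in the diff, now const.
const std::size_t kMaxAsyncResourcePoolLength =
    envSizeOr("DIPU_MAX_ASYNC_RESOURCE_POOL_LENGTH", 8);
```

Routing every lookup through one function like `envSizeOr` would also give a single place to add logging or documentation of the supported variables, which seems to be what the reviewer is asking for.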
|
||
namespace { | ||
|
||
// NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables) | ||
|
Review comment: Does this need to be added to the list when events is empty? I thought it could be ignored.
Reply: Yes, it does; otherwise the memory would leak.
Reply: I suggest reworking this logic as a whole in a follow-up: for a tensor that is not being waited on by any stream, call restore() directly in the destructor.
Reply: Not calling restore on destruction is actually deliberate. The main reasons are:
1. It makes tensor destruction faster.
2. Restoring at destruction time gains little; only when memory is requested does as much memory as possible need to have been reclaimed already.
3. If the memory were reclaimed immediately on destruction, reads and writes on the stream might not have finished yet; deferring reduces the chance of contention.
4. restore may involve work such as defragmentation. Moving that potential cost to allocation time lets part of the waiting turn into useful CPU work.
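The deferred-reclaim design discussed in this thread can be illustrated with a toy model. This is a sketch under stated assumptions, not the DIPU allocator: the class `DeferredPool`, its methods, and the use of a `std::function<bool()>` in place of device events are all hypothetical. Releasing a block only queues it, including blocks with no pending events (dropping them would leak, as noted in the thread), and the queue is drained when memory is next requested.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Toy model only; every name here is hypothetical and not DIPU code.
class DeferredPool {
 public:
  // Stands in for "all events recorded on the streams have completed".
  using ReadyFn = std::function<bool()>;

  // Destructor path: constant-time enqueue, no event queries.
  // Blocks with no events are queued too (their ReadyFn returns true).
  void release(std::size_t block, ReadyFn ready) {
    pending_.emplace_back(block, std::move(ready));
  }

  // Allocation path: reclaim whatever has finished, then hand out a block.
  bool allocate(std::size_t& block) {
    restore();
    if (free_.empty()) {
      return false;
    }
    block = free_.back();
    free_.pop_back();
    return true;
  }

  std::size_t pendingCount() const { return pending_.size(); }
  std::size_t freeCount() const { return free_.size(); }

 private:
  // Move finished blocks from the pending queue to the free list,
  // stopping at the first block that is still in use (FIFO order).
  void restore() {
    while (!pending_.empty() && pending_.front().second()) {
      free_.push_back(pending_.front().first);
      pending_.pop_front();
    }
  }

  std::deque<std::pair<std::size_t, ReadyFn>> pending_;
  std::vector<std::size_t> free_;
};
```

In this model the cost of checking completion is paid only inside `allocate`, which mirrors reasons 1 and 4 above; a real allocator would additionally bound the pending queue, which is presumably the role of `kMaxAsyncResourcePoolLength`.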