try fix tgs agitate #751
…syncResourcePoolLength
@@ -18,6 +18,15 @@ namespace dipu {
// NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
std::mutex DIPURawDeviceAllocator::mutex_;

size_t kMaxAsyncResourcePoolLength = [](){
const
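I think we need a single, unified entry point to control this. Environment variables are currently read in too many scattered places, which makes them hard to maintain.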
Yes, agreed.
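One way to act on both comments in this thread (make the value `const`, and read environment variables through a single entry point) might look like the sketch below. The namespace `env_config`, the functions `read_size` and `max_async_resource_pool_length`, the variable name `DIPU_MAX_ASYNC_RESOURCE_POOL_LENGTH`, and the default of 96 are all hypothetical placeholders; none of them are taken from this PR.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>

namespace env_config {  // hypothetical namespace, not from the PR

// Reads a non-negative integer from the environment. Falls back to
// `fallback` when the variable is unset or is not made up of digits only.
// Overflow is not handled in this sketch.
inline std::size_t read_size(const char* name, std::size_t fallback) {
  assert(name != nullptr);
  const char* raw = std::getenv(name);
  if (raw == nullptr || !std::isdigit(static_cast<unsigned char>(*raw))) {
    return fallback;
  }
  char* end = nullptr;
  const unsigned long long parsed = std::strtoull(raw, &end, 10);
  return (*end == '\0') ? static_cast<std::size_t>(parsed) : fallback;
}

// Single place where this setting is read. The value is computed once, on
// first use, and is immutable afterwards. The variable name and the default
// are placeholders.
inline std::size_t max_async_resource_pool_length() {
  static const std::size_t value =
      read_size("DIPU_MAX_ASYNC_RESOURCE_POOL_LENGTH", 96);
  return value;
}

}  // namespace env_config
```

Call sites would then use `env_config::max_async_resource_pool_length()` instead of calling `std::getenv` themselves, so every setting is listed in one file.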
ff63f12
to
ea2db53
dipu/torch_dipu/csrc_dipu/runtime/core/allocator/DIPUBFCachingAllocator.cpp
Outdated
dipu/torch_dipu/csrc_dipu/runtime/core/allocator/DIPUCachingAllocator.cpp
Outdated
@@ -31,7 +29,11 @@ class AsyncResourcePoolImpl : public AsyncResourcePool<T> {
 public:
  void add(const T& t, std::deque<DIPUEvent>& events) override {
    std::lock_guard<mutex_t> lk(mutex_);
    list_.emplace_back(t, std::move(events));
    if (events.size() > 0) {
Does the entry need to be added to the list when events is empty? I thought it could be ignored.
Yes, it is needed; otherwise memory would leak.
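I suggest reworking this logic as a whole in a follow-up: for tensors that are not being waited on by any stream, call restore() directly in the destructor.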
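Not restoring in the destructor is actually intentional. The main goals are:
1. Speed up tensor destruction.
2. Restoring at destruction time is of little use; only at allocation time do we need as much memory as possible to have been reclaimed already.
3. If memory were reclaimed immediately at destruction, the stream might not have finished reading or writing it yet; deferring reduces the chance of contention.
4. restore may involve defragmentation and similar work. Moving that potential cost to allocation time lets part of the wait become useful CPU work.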
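The deferred-reclaim design described in this thread can be illustrated with a minimal sketch. `DeferredPool`, `FakeEvent`, and `reclaim_ready` are hypothetical names, and a `std::function<bool()>` stands in for a device event query; this is not the `AsyncResourcePoolImpl` from the PR. It also shows why an entry with no events still has to be queued: the queue is the only path by which a resource gets back to the allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device event: returns true once the stream
// has passed the recorded point.
using FakeEvent = std::function<bool()>;

template <typename T>
class DeferredPool {
 public:
  // Destruction path: enqueue only, never wait. Entries with no events are
  // queued too, otherwise they would never be handed back (a leak).
  void add(T resource, std::vector<FakeEvent> events) {
    std::lock_guard<std::mutex> lk(mutex_);
    pending_.emplace_back(std::move(resource), std::move(events));
  }

  // Allocation path: pop resources from the front of the queue for as long
  // as all events of the front entry have completed.
  std::vector<T> reclaim_ready() {
    std::lock_guard<std::mutex> lk(mutex_);
    std::vector<T> reclaimed;
    while (!pending_.empty() && all_done(pending_.front().second)) {
      reclaimed.push_back(std::move(pending_.front().first));
      pending_.pop_front();
    }
    return reclaimed;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> lk(mutex_);
    return pending_.size();
  }

 private:
  static bool all_done(const std::vector<FakeEvent>& events) {
    for (const auto& event : events) {
      assert(static_cast<bool>(event));  // an empty std::function would throw
      if (!event()) return false;
    }
    return true;  // vacuously true when there are no events
  }

  mutable std::mutex mutex_;
  std::deque<std::pair<T, std::vector<FakeEvent>>> pending_;
};
```

In this sketch `add` is constant-time work under the lock, and any heavier work (here just the event queries) happens in `reclaim_ready`, which the allocator would call when it actually needs memory.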
namespace dipu {

template <typename T>
T get_env_or_default(const char* env_name, const T& defalut_value) {
typo
- T get_env_or_default(const char* env_name, const T& defalut_value) {
+ T get_env_or_default(const char* env_name, const T& default_value) {
T get_env_or_default(const char* env_name, const T& defalut_value) {
  const char* env = std::getenv(env_name);
  if (env == nullptr) {
    return defalut_value;
ditto
- return defalut_value;
+ return default_value;
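For reference, a complete version of the helper with the spelling fixed could look like the sketch below. Only the signature, the `std::getenv` call, and the `nullptr` check come from the diff; parsing with `operator>>` and falling back to the default on a parse failure are assumptions, since the rest of the function is not shown in this review.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>

template <typename T>
T get_env_or_default(const char* env_name, const T& default_value) {
  assert(env_name != nullptr);  // std::getenv requires a valid name
  const char* env = std::getenv(env_name);
  if (env == nullptr) {
    return default_value;
  }
  // Assumption: parse the text with operator>>. The PR's actual parsing
  // logic is not visible in the quoted hunk.
  std::istringstream stream{std::string(env)};
  T value{};
  if (!(stream >> value)) {
    return default_value;  // unparsable text falls back to the default
  }
  return value;
}
```

A call such as `get_env_or_default<int>("SOME_VARIABLE", 5)` (the variable name is a placeholder) would then return 5 when the variable is unset or is not an integer.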
33bcd11
to
98dd050
std::this_thread::yield();
continue;
auto now = std::chrono::steady_clock::now();
auto elasped = now - start;
typo
- auto elasped = now - start;
+ auto elapsed = now - start;
continue;
auto now = std::chrono::steady_clock::now();
auto elasped = now - start;
if (elasped < maxWaitTime) {
ditto
- if (elasped < maxWaitTime) {
+ if (elapsed < maxWaitTime) {
dipu/torch_dipu/csrc_dipu/runtime/core/allocator/DIPUBFCachingAllocator.cpp
Outdated
13d7299
to
4029d2d
std::this_thread::yield();
continue;
}
return;
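So in this case, if the resources are still not ready after waiting for a while, empty_resource_pool simply does not happen? If so, this function needs some redesign, for example a "try empty" variant that returns false when the release did not succeed and true when it did. Otherwise, anyone reading the code will assume that "empty" means the resources were released, when in fact nothing was done.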
Yes, that would indeed work.
With this change there will also be a problem when memory runs short; this needs to be improved.
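The "try empty" idea suggested in this thread could be sketched as follows. `PendingResource` and `try_empty_resource_pool` are hypothetical names, and a `std::function<bool()>` stands in for the device event query; the PR's real pool types are not used here.

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <functional>
#include <thread>

// Hypothetical pool entry: `ready` reports whether the device has finished
// using the resource.
struct PendingResource {
  std::function<bool()> ready;
};

// Tries to drain `pool` within `max_wait`. Returns true only if the pool
// ended up empty, false if the deadline passed with entries still pending.
inline bool try_empty_resource_pool(std::deque<PendingResource>& pool,
                                    std::chrono::milliseconds max_wait) {
  const auto start = std::chrono::steady_clock::now();
  while (!pool.empty()) {
    assert(static_cast<bool>(pool.front().ready));
    if (pool.front().ready()) {
      pool.pop_front();  // a real allocator would restore the memory here
      continue;
    }
    const auto elapsed = std::chrono::steady_clock::now() - start;
    if (elapsed >= max_wait) {
      return false;  // timed out, the caller must decide what to do next
    }
    std::this_thread::yield();
  }
  return true;
}
```

With a boolean result, a caller that is out of memory can tell that the pool was not drained and fall back to a blocking wait instead of assuming the memory was released, which is the concern raised in the last comment above.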
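Two notes:
Following the discussion with Guochun (国春) just now, I am noting two points (both are long-standing issues that can be fixed separately later; they do not need to be part of this PR).
@yangbofun @mrdanielw @caikun-pjlab @Wrench-Git @jfxu-st
- The current allocator has a bug when allocating tensors on a non-default stream and cannot guarantee memory safety. Guochun mentioned a simple fix yesterday (record the stream on every allocation), but that seems to cost too much performance. Another approach is to force the non-default stream to synchronize with the default stream at allocation time; it touches less of the existing code, but it is also unfriendly to allocations on non-default streams.
- Some current record-stream call sites (Copy, DIPUGuardImpl) compare against the default stream and skip the record. This also breaks when combined with tensors allocated on a non-default stream: the record for a tensor that is allocated on a non-default stream and then moved to the default stream for computation is ignored. I suggest removing the check in the upper layers and letting the allocator alone decide whether a record is really needed. To do this efficiently, the allocator must first correctly record the stream that was current when the tensor was allocated; otherwise tensors copied on the default stream would degrade to being recorded every time, which would undo @jfxu-st's earlier optimization.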
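The second note could be illustrated with the sketch below, where the allocator remembers the stream that was current at allocation time and filters record requests itself. `StreamAwareAllocator`, `StreamId`, `on_allocate`, and `pending_streams` are hypothetical names, and an `int` stands in for a device stream; this is not the DIPU allocator interface.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>

using StreamId = int;  // hypothetical stand-in for a device stream handle

class StreamAwareAllocator {
 public:
  // Remember which stream was current when the block was handed out.
  void on_allocate(const void* ptr, StreamId current_stream) {
    assert(ptr != nullptr);
    blocks_[ptr] = Block{current_stream, {}};
  }

  // Callers report every use unconditionally; the allocator filters.
  // Returns true if the use had to be recorded, false if it was skipped
  // (unknown block, or same stream as the allocation).
  bool record_stream(const void* ptr, StreamId stream) {
    auto it = blocks_.find(ptr);
    if (it == blocks_.end()) return false;
    if (stream == it->second.alloc_stream) return false;
    it->second.extra_streams.insert(stream);
    return true;
  }

  // Number of additional streams that must finish before the block is reused.
  std::size_t pending_streams(const void* ptr) const {
    auto it = blocks_.find(ptr);
    return it == blocks_.end() ? 0 : it->second.extra_streams.size();
  }

 private:
  struct Block {
    StreamId alloc_stream;
    std::unordered_set<StreamId> extra_streams;
  };
  std::unordered_map<const void*, Block> blocks_;
};
```

Under these assumptions, a tensor allocated and used on the default stream costs no record, while a tensor allocated on another stream and then used on the default stream is recorded, because the comparison is against the allocation stream and not against the default stream.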
4029d2d
to
3a04678
53c7755
to
a5edb80
* add rms norm op * take functions_ext into the adaptor
https://aicarrier.feishu.cn/wiki/Eh4lwUgHkivsjVkPtupcDPptnYd