Fix CUDA OOM Error by Disabling NVTX During DeepSpeed Initialization #1

cosineai · 2024-11-12T23:51:02Z

This pull request addresses a CUDA out-of-memory (OOM) error encountered while initializing a customized model with DeepSpeed. The error was attributed to NVTX being enabled, which can increase memory usage. The fix temporarily disables NVTX during the compilation step in deepspeed/runtime/engine.py: the NVTX state is saved before it is disabled and restored once compilation finishes, so the original setting is preserved. This reduces memory overhead and avoids the OOM error, allowing model initialization with DeepSpeed to complete.
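
The save, disable, and restore pattern described above can be sketched as a small context manager. This is a minimal illustration under stated assumptions, not the actual change to deepspeed/runtime/engine.py: the `nvtx_settings` object, its `enable_nvtx` attribute, and the `compile_model` placeholder are hypothetical stand-ins, since the diff itself is not shown here.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Hypothetical stand-in for wherever the real NVTX toggle lives; the
# attribute name `enable_nvtx` is an assumption for illustration only.
nvtx_settings = SimpleNamespace(enable_nvtx=True)


@contextmanager
def nvtx_disabled(settings):
    """Turn the flag off for the duration of the block, then restore it."""
    previous = settings.enable_nvtx  # save the state seen on entry
    settings.enable_nvtx = False
    try:
        yield
    finally:
        # Restore the saved value even if the body raises.
        settings.enable_nvtx = previous


def compile_model(settings):
    """Placeholder for the compilation step; returns the flag it observed."""
    return settings.enable_nvtx


with nvtx_disabled(nvtx_settings):
    seen_during_compile = compile_model(nvtx_settings)
```

Restoring in a `finally` clause means a failed compilation does not leave NVTX permanently disabled, and restoring the saved value (rather than hard-coding `True`) keeps NVTX off for callers who had already disabled it.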

Created by Genie. You can follow its reasoning on Cosine

Co-authored-by: Genie <genie@cosine.sh>

fix: preserve NVTX state during compilation

c9b1b01
