"Process PID not found" during "Running memory tracking" #154

Closed
andxalex opened this issue Mar 11, 2024 · 15 comments

@andxalex

Hello,

First, thanks for this amazing resource! It's incredibly helpful!

I was running a simple script similar to the one in the examples folder. The script executes correctly, but I'm unable to collect VRAM information during the memory tracking stage:

[screenshot: logs showing "Process PID not found" during memory tracking]

I'm running inside a Docker container, using the "cuda.Dockerfile" image from this repo.

@IlyasMoutawwakil

There seems to be a process executing on the GPU device whose info is not accessible.
I can see this happening when your Docker container doesn't have access to the host's PIDs.
You might consider adding --pid=host when running your container.

@franchukpetro

@IlyasMoutawwakil in my case I'm running the code in a Docker container from a cloud provider. Since I'm not able to specify docker run parameters, is there any other way to fix this issue?

@IlyasMoutawwakil

Yes, I believe there's something wrong with the way VRAM memory is measured; I've now started seeing the same errors in my logs. I will check it out.

@franchukpetro

@IlyasMoutawwakil this is what the cloud support answered me:

"But all the code should be running in the same container, so I don't see why you need access to the host process space."

They don't have an option to provide --pid=host when running the container, so that doesn't seem to be a solution in my case.

@IlyasMoutawwakil

So this issue is related to the NVIDIA drivers, see NVIDIA/nvidia-docker#179 (comment). Basically, what's happening is that when you run a Docker container without --pid=host, whatever code you run on the GPU within the container will show up (in the best case) in nvidia-smi under the PID of the Docker container (or not at all), making it impossible to see the memory usage of specific processes within the container (to the GPU they're basically all the same process).
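
For illustration, here is a minimal pynvml sketch (not optimum-benchmark's actual tracking code, just an assumption about the kind of per-process lookup involved) showing why this breaks: the PIDs NVML reports belong to the host's PID namespace, so they never match the PID the benchmark sees inside the container.

import os
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetComputeRunningProcesses

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)

# Each entry has a .pid and .usedGpuMemory (in bytes). Inside an isolated PID namespace,
# none of these PIDs will equal os.getpid(), so per-process VRAM tracking comes up empty.
for proc in nvmlDeviceGetComputeRunningProcesses(handle):
    print(proc.pid, proc.usedGpuMemory, proc.pid == os.getpid())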

@IlyasMoutawwakil commented Mar 12, 2024

I can think of the following as a solution: introducing an environment variable TRACK_GLOBAL_VRAM which, when set to "1", skips the process-specific memory check and measures the global device memory usage instead:

from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
nvmlInit()
device_handle = nvmlDeviceGetHandleByIndex(device_id)  # device_id: CUDA device index
global_used_vram = nvmlDeviceGetMemoryInfo(device_handle).used  # bytes used on the whole device

This makes it possible to get the memory measurement from pynvml/nvidia-smi, but it's only viable when you know that no one else is using your GPU device at the same time (shared resources).
@franchukpetro would you be interested in this feature?
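
To make the idea concrete, here is a hypothetical sketch of how such a switch could behave (illustrative only, not the code that ended up in the PR; the exact gating and the per-process fallback are assumptions):

import os
from pynvml import (
    nvmlInit,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlDeviceGetComputeRunningProcesses,
)

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)

if os.environ.get("TRACK_GLOBAL_VRAM", "0") == "1":
    # Global: used memory of the whole device, regardless of which process allocated it.
    used_bytes = nvmlDeviceGetMemoryInfo(handle).used
else:
    # Per-process: only works when NVML's PIDs are visible from this PID namespace.
    procs = nvmlDeviceGetComputeRunningProcesses(handle)
    used_bytes = sum((p.usedGpuMemory or 0) for p in procs if p.pid == os.getpid())

print(f"used VRAM: {used_bytes / 1e6:.2f} MB")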

@IlyasMoutawwakil commented Mar 12, 2024

Just added it as part of #156, and the results look like this on my machine (without --pid=host):

TRACK_GLOBAL_VRAM=1 optimum-benchmark --config-dir examples/ --config-name pytorch_timm
....
[PROC-0][2024-03-12 16:49:06,508][memory][INFO] -       + Tracking VRAM memory of CUDA devices: [0]
[PROC-0][2024-03-12 16:49:06,508][memory][INFO] -       + Tracking Allocated/Reserved memory of 1 Pytorch CUDA devices
[PROC-0][2024-03-12 16:49:06,508][inference][INFO] -    + Running memory tracking
tracking global VRAM usage. This will track the memory usage of all processes using the device(s).
[PROC-0][2024-03-12 16:49:15,607][memory][INFO] -               + forward max RAM memory: 857.153536 (MB)
[PROC-0][2024-03-12 16:49:15,607][memory][INFO] -               + forward max VRAM memory: 75356.962816 (MB)
[PROC-0][2024-03-12 16:49:15,607][memory][INFO] -               + forward max reserved memory: 195.035136 (MB)
[PROC-0][2024-03-12 16:49:15,607][memory][INFO] -               + forward max allocated memory: 168.565248 (MB)

Notice the big VRAM usage: someone in another Docker container is using the GPU at the same time.

@franchukpetro

I'm using vast.ai instances, and sadly there is no option to provide additional container arguments. So your workaround may be the only option for me; let me try it today.

Do I understand correctly that I need to check out this commit? Will it be available in a stable version in the future, or do you plan to keep it as a separate version/commit for cases like mine?

@IlyasMoutawwakil

It will be merged today as part of the CI workflows migration. It's necessary for that PR, as the new runners have the same PID namespace issue.

@franchukpetro

@IlyasMoutawwakil which environment variable should I set to enable that?

I'm trying TRACK_GLOBAL_VRAM="1", but the logs also show this:

[screenshot of logs]

So I also set GLOBAL_VRAM_USAGE='1', but that hasn't helped either.

@franchukpetro

UPD:

Setting those two variables in the .bashrc file wasn't helping, so I forced them to be set within the Python script for benchmarking:

import os
os.environ["TRACK_GLOBAL_VRAM"] = "1"
os.environ["GLOBAL_VRAM_USAGE"] = "1"

And this finally resolved the issue with VRAM monitoring 🥳 :

[screenshot: VRAM metrics now reported]

@IlyasMoutawwakil commented Mar 12, 2024

Update: I finally decided to use PROCESS_SPECIFIC_VRAM=0 because it's more explicit than global/system.

For env vars, use export or set them before the optimum-benchmark command, like in the example:
PROCESS_SPECIFIC_VRAM=0 optimum-benchmark --config-dir examples/ --config-name pytorch_timm
or in the config file using env_set: https://github.com/huggingface/optimum-benchmark/blob/main/examples/openvino_diffusion.yaml#L33

@franchukpetro

Hi @IlyasMoutawwakil,

I came back to my experiments with optimum-benchmark and faced the issue of monitoring VRAM consumption from a Docker image again:

[screenshot: VRAM reported as zero]

I tried both workarounds which worked for me previously:

import os
os.environ["TRACK_GLOBAL_VRAM"] = "1"
os.environ["GLOBAL_VRAM_USAGE"] = "1"

And the new env variable which, as far as I understand, should be used now:

import os
os.environ["PECESS_SPECIFIC_VRAM"] = "0"

I tried both variants in the Python script (as shown in the code snippets above) and as env vars in .bashrc, but none of these worked; VRAM consumption is still reported to be zero.

How can I enable VRAM tracking from a Python script? Have you changed the env var for that, or has something broken since then?

@IlyasMoutawwakil

Sorry, it seems I replied with a typo: it's PROCESS_SPECIFIC_VRAM=0, not PECESS_SPECIFIC_VRAM=0.

@franchukpetro

Thanks, that works!

[screenshot: VRAM metrics reported correctly]
