"Process PID not found" during "Running memory tracking" #154

Closed
andxalex opened this issue Mar 11, 2024 · 15 comments

@andxalex

Hello,

First, thanks for this amazing resource! It's incredibly helpful!

I was running a simple script similar to the one in the examples folder. The script executes correctly, but I'm unable to collect VRAM information during the memory tracking stage:

[screenshot: logs showing "Process PID not found" during memory tracking]

I'm running inside a Docker container, using the "cuda.Dockerfile" image from this repo.

@IlyasMoutawwakil

There seems to be a process executing on the GPU device whose info is not accessible.
I can see this happening when your Docker container doesn't have access to the host's PIDs.
You might consider adding --pid=host when running your container.

@franchukpetro

@IlyasMoutawwakil in my case I'm running the code in a Docker container from a cloud provider. Since I'm not able to specify docker run parameters, is there any other way to fix this issue?

@IlyasMoutawwakil

Yes, I believe there's something wrong with the way VRAM memory is measured; I've now started seeing the same errors in my logs. I will check it out.

@franchukpetro

@IlyasMoutawwakil this is what the cloud support answered me:

"But all the code should be running in the same container, so I don't see why you need access to the host process space."

They don't have an option to provide --pid=host when running the container, so that doesn't seem to be a solution in my case.

@IlyasMoutawwakil

So this issue is related to the NVIDIA drivers, see NVIDIA/nvidia-docker#179 (comment). Basically, what's happening is that when you run a Docker container without --pid=host, whatever code you run on the GPU within the container will show up (in the best case) in nvidia-smi under the PID of the Docker container (or not at all), making it impossible to see the memory usage of specific processes within the container (to the GPU they're basically all the same process).
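
For illustration, here is a minimal pynvml sketch (not optimum-benchmark's actual tracking code, just an assumption about the kind of per-process lookup involved) showing why this breaks: the PIDs NVML reports belong to the host's PID namespace, so they never match the PID the benchmark sees inside the container.

import os
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetComputeRunningProcesses

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)

# Each entry has a .pid and .usedGpuMemory (in bytes). Inside an isolated PID namespace,
# none of these PIDs will equal os.getpid(), so per-process VRAM tracking comes up empty.
for proc in nvmlDeviceGetComputeRunningProcesses(handle):
    print(proc.pid, proc.usedGpuMemory, proc.pid == os.getpid())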

@IlyasMoutawwakil commented Mar 12, 2024

I can think of the following as a solution: introducing an environment variable TRACK_GLOBAL_VRAM which, when set to "1", skips the process-specific memory check and measures the global device memory usage instead:

from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
nvmlInit()
device_handle = nvmlDeviceGetHandleByIndex(device_id)  # device_id: CUDA device index
global_used_vram = nvmlDeviceGetMemoryInfo(device_handle).used  # bytes used on the whole device

This makes it possible to get the memory measurement from pynvml/nvidia-smi, but it's only viable when you know that no one else is using your GPU device at the same time (shared resources).
@franchukpetro would you be interested in this feature?
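
To make the idea concrete, here is a hypothetical sketch of how such a switch could behave (illustrative only, not the code that ended up in the PR; the exact gating and the per-process fallback are assumptions):

import os
from pynvml import (
    nvmlInit,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlDeviceGetComputeRunningProcesses,
)

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)

if os.environ.get("TRACK_GLOBAL_VRAM", "0") == "1":
    # Global: used memory of the whole device, regardless of which process allocated it.
    used_bytes = nvmlDeviceGetMemoryInfo(handle).used
else:
    # Per-process: only works when NVML's PIDs are visible from this PID namespace.
    procs = nvmlDeviceGetComputeRunningProcesses(handle)
    used_bytes = sum((p.usedGpuMemory or 0) for p in procs if p.pid == os.getpid())

print(f"used VRAM: {used_bytes / 1e6:.2f} MB")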

@IlyasMoutawwakil commented Mar 12, 2024

Just added it as part of #156, and the results look like this on my machine (without --pid=host):

TRACK_GLOBAL_VRAM=1 optimum-benchmark --config-dir examples/ --config-name pytorch_timm
....
[PROC-0][2024-03-12 16:49:06,508][memory][INFO] -       + Tracking VRAM memory of CUDA devices: [0]
[PROC-0][2024-03-12 16:49:06,508][memory][INFO] -       + Tracking Allocated/Reserved memory of 1 Pytorch CUDA devices
[PROC-0][2024-03-12 16:49:06,508][inference][INFO] -    + Running memory tracking
tracking global VRAM usage. This will track the memory usage of all processes using the device(s).
[PROC-0][2024-03-12 16:49:15,607][memory][INFO] -               + forward max RAM memory: 857.153536 (MB)
[PROC-0][2024-03-12 16:49:15,607][memory][INFO] -               + forward max VRAM memory: 75356.962816 (MB)
[PROC-0][2024-03-12 16:49:15,607][memory][INFO] -               + forward max reserved memory: 195.035136 (MB)
[PROC-0][2024-03-12 16:49:15,607][memory][INFO] -               + forward max allocated memory: 168.565248 (MB)

Notice the big VRAM usage: someone in another Docker container is using the GPU at the same time.

@franchukpetro

I'm using vast.ai instances, and sadly there is no option to provide additional container arguments. So your workaround may be the only option for me; let me try it today.

Do I understand correctly that I need to check out this commit? Will it be available in a stable version in the future, or do you plan to keep it as a separate version/commit for cases like mine?

@IlyasMoutawwakil

It will be merged today as part of the CI workflows migration. It's necessary for that PR, as the new runners have the same PID namespace issue.

@franchukpetro

@IlyasMoutawwakil which environment variable should I set to enable that?

I'm trying TRACK_GLOBAL_VRAM="1", but the logs also show this:

[screenshot of logs]

So I also set GLOBAL_VRAM_USAGE='1', but that hasn't helped either.

@franchukpetro

UPD:

Setting those two variables in the .bashrc file wasn't helping, so I forced them to be set within the Python script for benchmarking:

import os
os.environ["TRACK_GLOBAL_VRAM"] = "1"
os.environ["GLOBAL_VRAM_USAGE"] = "1"

And this finally resolved the issue with VRAM monitoring 🥳 :

[screenshot: VRAM metrics now reported]

@IlyasMoutawwakil commented Mar 12, 2024

Update: I finally decided to use PROCESS_SPECIFIC_VRAM=0 because it's more explicit than global/system.

For env vars, use export or set them before the optimum-benchmark command, like in the example:
PROCESS_SPECIFIC_VRAM=0 optimum-benchmark --config-dir examples/ --config-name pytorch_timm
or in the config file using env_set: https://github.com/huggingface/optimum-benchmark/blob/main/examples/openvino_diffusion.yaml#L33

@franchukpetro

Hi @IlyasMoutawwakil,

I came back to my experiments with optimum-benchmark and faced the issue of monitoring VRAM consumption from a Docker image again:

[screenshot: VRAM reported as zero]

I tried both workarounds which worked for me previously:

import os
os.environ["TRACK_GLOBAL_VRAM"] = "1"
os.environ["GLOBAL_VRAM_USAGE"] = "1"

And the new env variable which, as far as I understand, should be used now:

import os
os.environ["PECESS_SPECIFIC_VRAM"] = "0"

I tried both variants in the Python script (as shown in the code snippets above) and as env vars in .bashrc, but none of these worked; VRAM consumption is still reported to be zero.

How can I enable VRAM tracking from a Python script? Have you changed the env var for that, or has something broken since then?

@IlyasMoutawwakil

Sorry, it seems I replied with a typo: it's PROCESS_SPECIFIC_VRAM=0, not PECESS_SPECIFIC_VRAM=0.

@franchukpetro

Thanks, that works!

[screenshot: VRAM metrics reported correctly]
