"Process PID not found" during "Running memory tracking" #154
Hello,

First, thanks for this amazing resource! It's incredibly helpful!

I was running a simple script similar to the one in the examples folder. The script executes correctly, but I'm unable to collect VRAM information during the memory tracking stage.

I'm running through a docker container, using the "cuda.Dockerfile" image in the repo here.

Comments
There seems to be a process executing on the GPU device whose info is not accessible.
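For context, the error comes from trying to resolve a GPU-using PID to a live process object. A hypothetical reproduction of that failure mode with psutil (the PID value is made up; this is not the library's actual code):

```python
import psutil

host_pid = 12345  # hypothetical: a host PID reported by NVML for a GPU process

try:
    proc = psutil.Process(host_pid)  # raises if the PID isn't visible here
    print(proc.name())
except psutil.NoSuchProcess:
    print(f"Process PID not found: {host_pid}")
```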
@IlyasMoutawwakil in my case I'm running the code in a docker container from a cloud provider. Since I'm not able to specify docker run parameters, is there any other way to fix this issue?
Yes, I believe there's something wrong with the way VRAM memory is measured now; I started seeing the same errors in my logs. I will check it out.
@IlyasMoutawwakil this is what cloud support answered me: "But all the code should be running in the same container, so I don't see why you need access to the host process space." They don't have an option to provide --pid=host when running the container, so that doesn't seem to be a solution for my case.
So this issue is related to nvidia drivers, see NVIDIA/nvidia-docker#179 (comment). Basically, what's happening is that when you run a docker container without --pid=host, the container gets its own PID namespace: NVML reports the host PIDs of the processes using the GPU, and those PIDs can't be resolved inside the container, hence the "Process PID not found" error.
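To make that concrete, per-process VRAM tracking has to match NVML-reported PIDs against the benchmarked processes. A rough sketch of that approach with pynvml (the tracked PID set is illustrative; this is not optimum-benchmark's actual implementation):

```python
from pynvml import (
    nvmlInit,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetComputeRunningProcesses,
)

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)

tracked_pids = {42}  # illustrative: PIDs of the benchmark's process tree

# NVML returns *host* PIDs; in an isolated PID namespace (docker without
# --pid=host) these never match the container's own PIDs, so the sum
# stays at zero and PID lookups fail.
used = sum(
    p.usedGpuMemory or 0
    for p in nvmlDeviceGetComputeRunningProcesses(handle)
    if p.pid in tracked_pids
)
print(f"per-process VRAM: {used / 1e6:.1f} MB")
```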
I can think of the following as a solution: introducing an environment variable that switches VRAM tracking from per-process to device-level measurement:

```python
device_handle = nvmlDeviceGetHandleByIndex(device_id)
global_used_vram = nvmlDeviceGetMemoryInfo(device_handle).used
```

This makes it possible to get the memory measurement per device instead of per process.
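For anyone who wants to try the device-level reading on its own, here is a self-contained sketch (assuming `pynvml` is installed; this is not the library's internal implementation):

```python
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
)

def device_used_vram(device_id: int = 0) -> int:
    """Used VRAM in bytes for the whole device, across all processes."""
    nvmlInit()
    try:
        handle = nvmlDeviceGetHandleByIndex(device_id)
        return nvmlDeviceGetMemoryInfo(handle).used
    finally:
        nvmlShutdown()

print(f"device 0 used VRAM: {device_used_vram(0) / 1e6:.1f} MB")
```

The trade-off is that this counts memory used by every process on the device, not just the benchmarked one, as the results below show.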
Just added it as part of #156, and the results look like this on my machine (without --pid=host):

```
TRACK_GLOBAL_VRAM=1 optimum-benchmark --config-dir examples/ --config-name pytorch_timm
....
[PROC-0][2024-03-12 16:49:06,508][memory][INFO] - + Tracking VRAM memory of CUDA devices: [0]
[PROC-0][2024-03-12 16:49:06,508][memory][INFO] - + Tracking Allocated/Reserved memory of 1 Pytorch CUDA devices
[PROC-0][2024-03-12 16:49:06,508][inference][INFO] - + Running memory tracking
tracking global VRAM usage. This will track the memory usage of all processes using the device(s).
[PROC-0][2024-03-12 16:49:15,607][memory][INFO] - + forward max RAM memory: 857.153536 (MB)
[PROC-0][2024-03-12 16:49:15,607][memory][INFO] - + forward max VRAM memory: 75356.962816 (MB)
[PROC-0][2024-03-12 16:49:15,607][memory][INFO] - + forward max reserved memory: 195.035136 (MB)
[PROC-0][2024-03-12 16:49:15,607][memory][INFO] - + forward max allocated memory: 168.565248 (MB)
```

Notice the big VRAM usage: someone in another docker container is using the GPU at the same time.
I'm using vast.ai instances, and sadly there is no option to provide additional container arguments. So your workaround may be the only option for me; let me try it today. Do I understand correctly that I need to check out this commit? Will it be available in a stable version in the future, or do you plan to keep it as a separate version/commit for cases like mine?
It will be merged today as part of the CI workflows migration. It's necessary for that PR, as the new runners have the same PID namespace issue.
@IlyasMoutawwakil which environment variable should I set to enable that? I'm trying … So I also set …
Update: finally I decided to use … For env vars, use …
I came back to my experiments with optimum-benchmark and faced the issue of monitoring VRAM consumption from a docker image again. I tried both workarounds which worked for me previously:

```python
import os
os.environ["TRACK_GLOBAL_VRAM"] = "1"
os.environ["GLOBAL_VRAM_USAGE"] = "1"
```

and the new env variable which, as far as I understand, should be used now:

```python
import os
os.environ["PECESS_SPECIFIC_VRAM"] = "0"
```

I tried both variants in a python script (as shown in the code snippets above) and as env vars in .bashrc, but none of these worked: VRAM consumption is still reported as zero. How can I enable VRAM tracking from a python script? Have you changed the env var for that, or has something broken since then?
Sorry, it seems I replied with a typo; it's `PROCESS_SPECIFIC_VRAM`.
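In case it helps others: an environment variable like this is only effective if it is set before the library reads it, so set it before importing or launching the benchmark. A minimal sketch (the variable name is taken from the correction above; this is not official documentation):

```python
import os

# Set the flag before optimum-benchmark initializes its memory tracker,
# otherwise the value may never be picked up.
os.environ["PROCESS_SPECIFIC_VRAM"] = "0"  # assumed: "0" = whole-device VRAM

# ... import and run optimum-benchmark after this point ...
```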