-
I should probably update that default driver version, but yes, it should auto-detect the host's driver version.

Try running Resolve directly from the command line inside the container (just type /opt/resolve/bin/resolve). If you install strace or ptrace inside the container, you can probably get some specific details about what is making it crash -- an AI chatbot can probably interpret it for you. See here for more on how I set it up when troubleshooting another problem. Basically I switched to something like:

```
./resolve.sh /bin/bash
[bunch of stuff]
sudo dnf install strace -y
[bunch of stuff]
strace /opt/resolve/bin/resolve
```

Collect the last few lines where it crashes and see if an AI can interpret it to tell you the specific issue, or post it here (just don't post any private information!).

For anyone googling this in the future, here's what a mismatched driver looks like. Something like this:

```
$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 565.77
```

Last thought -- do you see anything in the logs that are generated? You can look for them inside the container.
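A quick way to compare the two, assuming nvidia-smi is installed on the host (the query flags are standard nvidia-smi options):

```
# driver version as the host reports it
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# what the container sees
./resolve.sh nvidia-smi
```

The container should report the same driver version as the host; if it doesn't, you'll see the NVML mismatch above.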
-
First off -- way to go, you're on your way to solving whatever this issue is, and hopefully we'll all benefit from what you learn (and I think we have already!).

But first, you're right -- the place to put that line is right around here; it tells the container to allow strace to work. strace is a program that will show you the lower-level operation of a program as it runs, and hopefully whatever is wrong will show up there.

The sudo error is saying that it's expecting you to enter a password, which is unexpected, because as I recall this line says "don't worry about a password, just make sudo work" (the NOPASSWD setting).
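For context, both pieces usually look roughly like this (the image and user names here are placeholders; the real script and Dockerfile may do it differently):

```
# strace needs the ptrace capability, which containers typically drop by default
podman run --cap-add=SYS_PTRACE --rm -it some-resolve-image /bin/bash

# and a passwordless sudo rule is usually a sudoers drop-in like this
echo 'someuser ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/someuser
```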
I am seeing reports that this PAM error is a new phenomenon and may be related to apparmor on the host, which can be (temporarily) disabled. But I don't know if turning off security on the host is a good idea for the long term. So let's see what else might have happened...

THEORY 1: We can log in as root, without any of the volumes or anything being mounted, with a simple interactive run of the image -- and that doesn't explain it. So Theory 1 is a fail.

THEORY 2: This may be related to the policy stuff shipped in the container. Why are those there? I don't know. Let's see what happens if we remove those lines... AAAAND it doesn't seem to matter; I still see the same error.

Theory 2 also fails.

THEORY 3: I am checking the permissions on /etc/shadow inside the container.
What I see doesn't look normal: it apparently restricts PAM's ability to read the file, and at least it's different from the setting in Ubuntu. So let me try this:
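Concretely, the check and the change look like this (the chmod 600 is the line referenced in the wrap-up below; the exact mode reported may vary by image):

```
# inside the container, as root
ls -l /etc/shadow       # on the Rocky image this shows no read bits at all
chmod 600 /etc/shadow   # make it readable by its owner (root), as PAM expects
ls -l /etc/shadow       # should now show -rw------- root root
```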
Now, all of a sudden, with that change in place, the answer is: sudo works. Theory 3 seems to be correct!

WRAPUP: So the question is -- why is this set differently on CentOS than on Ubuntu, and is there another way (such as changing some SELinux access policy) that's the correct way to do it? But in the meantime, adding a chmod 600 /etc/shadow line to the Dockerfile and rebuilding the container should fix the issue.

Because this is all running in a container, the container is running as a normal user, this setting is similar to that of other Linux distros, AND it's not even like the root user in the container has any password set to begin with, I am less concerned about the security implications in this case. But in the future I'd like to have a better understanding of when root's access (and I assume something in PAM was running as suid) broke upstream somewhere... and why.

Again, thanks for reporting this -- see, we're all learning stuff! Let me know if adding that one line to the Dockerfile, giving PAM access to the /etc/shadow file, fixed the issue for you!
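In Dockerfile terms, the one line I mean is something like this (exactly where it goes in the file is up to you, as long as it runs after the base image is pulled in):

```
# restore owner read/write on the shadow file so PAM (and therefore sudo) works
RUN chmod 600 /etc/shadow
```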
-
Congrats! Re: the mounts not working all of a sudden -- I just want to make sure about the exact command you're running. I think maybe it would be worth it to add the chmod 600 to the Dockerfile to restore root in the container, and maybe even have it look for /dev/nvidia# so it knows what number to use (see the sketch below). Last thing -- I don't think people are actually having a license issue in the container. When that was written, it wasn't clear how the identity of the computer was determined, but I read somewhere that it's not based on the container ID at all but rather some hardware thing like the ethernet MAC address. Let me know if you figure out the mount thing -- that should be the easiest problem to fix, I suspect!
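The device detection I have in mind is just a glob over whatever nodes the host exposes -- a sketch, not what the script currently does:

```
# enumerate the /dev/nvidiaN device nodes that actually exist on the host
for dev in /dev/nvidia[0-9]*; do
  [ -e "$dev" ] || continue
  echo "found GPU device node: $dev"
done
```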
-
Thanks for making the scripts, everything works... except Resolve won't run past its setup options.
DR 19.1.3 on Mint 22.1 with nvidia 550.144.03
I'm trying to troubleshoot, and the only thing that jumps out is that the build script reports a different driver than the one being installed (550.144.03). I checked a log of the output, and it was installing the correct driver that the host is using, so I'm unsure what this discrepancy means, or whether it's relevant.
Running ./resolve.sh nvidia-smi works. The build starts like this:
```
STEP 1/18: FROM docker.io/rockylinux:8.6
STEP 2/18: ARG ARCH=x86_64
STEP 3/18: ARG NVIDIA_VERSION=525.105
STEP 4/18: ARG NO_PIPEWIRE=0
```
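In case it's useful, the host driver can be checked directly, and -- assuming a plain podman build, which may not be how the repo's build script actually invokes it -- the ARG shown above could be overridden like this:

```
# what the host driver reports
HOST_DRIVER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
echo "$HOST_DRIVER"   # 550.144.03 in my case

# hypothetical manual build with the ARG overridden
podman build --build-arg NVIDIA_VERSION="$HOST_DRIVER" -t resolve-container .
```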
On running it, you get a blip of the DR splash screen on the host machine and it immediately quits. It will launch the DR setup initially, but that's it.
thanks!