-
I should probably update that default driver version, but yes, it should auto-detect the host's driver version.

Try running Resolve directly from the command line inside the container (just type /opt/resolve/bin/resolve). If you install strace or ptrace inside the container, you can probably get some specific details about what is making it crash -- an AI chatbot can probably interpret it for you. See here for more on how I set it up when troubleshooting another problem. Basically I switched to something like:

```
./resolve.sh /bin/bash
[bunch of stuff]
sudo dnf install strace -y
[bunch of stuff]
strace /opt/resolve/bin/resolve
```

Collect the last few lines where it crashes and see if an AI can interpret it to tell you the specific issue, or post it here (just don't post any private information!).

For anyone googling this in the future, here's what a mismatched driver looks like. Something like this:

```
$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 565.77
```

Last thought -- do you see anything in the logs that are generated? You can look for them inside the container.
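A quick way to compare the two, assuming nvidia-smi is installed on the host (the query flags are standard nvidia-smi options):

```
# driver version as the host reports it
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# what the container sees
./resolve.sh nvidia-smi
```

The container should report the same driver version as the host; if it doesn't, you'll see the NVML mismatch above.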
-
First off -- way to go, you're on your way to solving whatever this issue is, and hopefully we'll all benefit from what you learn (and I think we have already!).

But first, you're right -- the place to put that line is right around here; it tells the container to allow strace to work. strace is a program that will show you the lower-level operation of a program as it runs, and hopefully whatever is wrong will show up there.

The sudo error is saying that it's expecting you to enter a password, which is unexpected, because as I recall this line says "don't worry about a password, just make sudo work" (the NOPASSWD setting).
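For context, both pieces usually look roughly like this (the image and user names here are placeholders; the real script and Dockerfile may do it differently):

```
# strace needs the ptrace capability, which containers typically drop by default
podman run --cap-add=SYS_PTRACE --rm -it some-resolve-image /bin/bash

# and a passwordless sudo rule is usually a sudoers drop-in like this
echo 'someuser ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/someuser
```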
I am seeing reports that this PAM error is a new phenomenon and may be related to apparmor on the host, which can be (temporarily) disabled. But I don't know if turning off security on the host is a good idea for the long term. So let's see what else might have happened...

THEORY 1: We can log in as root, without any of the volumes or anything being mounted, with a simple interactive run of the image -- and that doesn't explain it. So Theory 1 is a fail.

THEORY 2: This may be related to the policy stuff shipped in the container. Why are those there? I don't know. Let's see what happens if we remove those lines... AAAAND it doesn't seem to matter; I still see the same error.

Theory 2 also fails.

THEORY 3: I am checking the permissions on /etc/shadow inside the container.
What I see doesn't look normal: it apparently restricts PAM's ability to read the file, and at least it's different from the setting in Ubuntu. So let me try this:
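Concretely, the check and the change look like this (the chmod 600 is the line referenced in the wrap-up below; the exact mode reported may vary by image):

```
# inside the container, as root
ls -l /etc/shadow       # on the Rocky image this shows no read bits at all
chmod 600 /etc/shadow   # make it readable by its owner (root), as PAM expects
ls -l /etc/shadow       # should now show -rw------- root root
```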
Now, all of a sudden, with that change in place, the answer is: sudo works. Theory 3 seems to be correct!

WRAPUP: So the question is -- why is this set differently on CentOS than on Ubuntu, and is there another way (such as changing some SELinux access policy) that's the correct way to do it? But in the meantime, adding a chmod 600 /etc/shadow line to the Dockerfile and rebuilding the container should fix the issue.

Because this is all running in a container, the container is running as a normal user, this setting is similar to that of other Linux distros, AND it's not even like the root user in the container has any password set to begin with, I am less concerned about the security implications in this case. But in the future I'd like to have a better understanding of when root's access (and I assume something in PAM was running as suid) broke upstream somewhere... and why.

Again, thanks for reporting this -- see, we're all learning stuff! Let me know if adding that one line to the Dockerfile, giving PAM access to the /etc/shadow file, fixed the issue for you!
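In Dockerfile terms, the one line I mean is something like this (exactly where it goes in the file is up to you, as long as it runs after the base image is pulled in):

```
# restore owner read/write on the shadow file so PAM (and therefore sudo) works
RUN chmod 600 /etc/shadow
```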
-
Congrats! Re: the mounts not working all of a sudden -- I just want to make sure about the exact command you're running. I think maybe it would be worth it to add the chmod 600 to the Dockerfile to restore root in the container, and maybe even have it look for /dev/nvidia# so it knows what number to use (see the sketch below). Last thing -- I don't think people are actually having a license issue in the container. When that was written, it wasn't clear how the identity of the computer was determined, but I read somewhere that it's not based on the container ID at all but rather some hardware thing like the ethernet MAC address. Let me know if you figure out the mount thing -- that should be the easiest problem to fix, I suspect!
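The device detection I have in mind is just a glob over whatever nodes the host exposes -- a sketch, not what the script currently does:

```
# enumerate the /dev/nvidiaN device nodes that actually exist on the host
for dev in /dev/nvidia[0-9]*; do
  [ -e "$dev" ] || continue
  echo "found GPU device node: $dev"
done
```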
-
Thanks for making the scripts, everything works... except Resolve won't run past its setup options.
DR 19.1.3 on Mint 22.1 with nvidia 550.144.03
I'm trying to troubleshoot, and the only thing that jumps out is that the build script reports a different driver than the one being installed (550.144.03). I checked a log of the output, and it was installing the correct driver that the host is using, so I'm unsure what this discrepancy means, or whether it's relevant.
Running ./resolve.sh nvidia-smi works. The build starts like this:
```
STEP 1/18: FROM docker.io/rockylinux:8.6
STEP 2/18: ARG ARCH=x86_64
STEP 3/18: ARG NVIDIA_VERSION=525.105
STEP 4/18: ARG NO_PIPEWIRE=0
```
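In case it's useful, the host driver can be checked directly, and -- assuming a plain podman build, which may not be how the repo's build script actually invokes it -- the ARG shown above could be overridden like this:

```
# what the host driver reports
HOST_DRIVER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
echo "$HOST_DRIVER"   # 550.144.03 in my case

# hypothetical manual build with the ARG overridden
podman build --build-arg NVIDIA_VERSION="$HOST_DRIVER" -t resolve-container .
```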
On running it, you get a blip of the DR splash screen on the host machine and it immediately quits. It will launch the DR setup initially, but that's it.
thanks!