Podman commands and APIs hang after a container remove #26041

Open
DominicLavery opened this issue May 1, 2025 · 15 comments

Labels
kind/bug Categorizes issue or PR as related to a bug. triaged Issue has been triaged

@DominicLavery

Issue Description

I have a scenario where all podman commands and usages of the API via the socket hang for around 10 minutes after a particular container removal.

The system Podman runs in executes arbitrary customer-supplied images and code. We see about 65k distinct images a week and handle about 73 million container creations and removals. Most of these work fine, which I think is some evidence that our general usage of the APIs is sound. We mostly use the compat APIs to manage these containers.

When we call DELETE /v1.41/containers/{container_id}?force=1 for one particular image and code combination, the call takes around 10 minutes to return. During this time other API calls and commands also hang (including at least podman info, podman image list, podman ps, podman pull alpine, and their equivalent API calls). Simply running the image within the system does not seem to be enough to cause the issue; running it without the customer's code does not reproduce it. While other container removes can also take a long time (say around 7 minutes), only this one additionally causes other commands to hang.
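
For reference, the removal is issued over the Podman socket via the compat API; a manual equivalent would look roughly like this (a sketch: <container_id> is a placeholder and the socket path is taken from the podman info output below):

    # Rough manual equivalent of the force-remove call (placeholder container ID):
    curl --unix-socket /run/podman/podman.sock \
      -X DELETE "http://d/v1.41/containers/<container_id>?force=1"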

Are long container removes like this expected? How can I go about debugging the hang?

Steps to reproduce the issue

  1. Container remove on the specific container

Describe the results you received

API and command calls hang
Container remove takes 10 minutes

Describe the results you expected

API and command calls not to hang
Container remove takes less than 10 minutes

podman info output

host:
  arch: amd64
  buildahVersion: 1.39.4
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: Unknown
    path: /usr/bin/conmon
    version: 'conmon version 2.1.13, commit: 82de887596ed8ee6d9b2ee85e4f167f307bb569b'
  cpuUtilization:
    idlePercent: 97.63
    systemPercent: 0.49
    userPercent: 1.88
  cpus: 36
  databaseBackend: sqlite
  distribution:
    codename: jammy
    distribution: ubuntu
    version: "22.04"
  eventLogger: journald
  freeLocks: 2045
  hostname: ip-172-29-4-49
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.8.0-1024-aws
  linkmode: dynamic
  logDriver: journald
  memFree: 60246339584
  memTotal: 73720172544
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: 'aardvark-dns: /usr/lib/podman/aardvark-dns'
      path: /usr/lib/podman/aardvark-dns
      version: aardvark-dns 1.14.0
    package: 'netavark: /usr/lib/podman/netavark'
    path: /usr/lib/podman/netavark
    version: netavark 1.14.1
  ociRuntime:
    name: runc
    package: 'runc: /usr/sbin/runc'
    path: /usr/sbin/runc
    version: |-
      runc version 1.2.6
      commit: v1.2.6-0-ge89a2992
      spec: 1.2.0
      go: go1.24.2
      libseccomp: 2.5.3
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: true
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: ""
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 0
  swapTotal: 0
  uptime: 44h 58m 28.00s (Approximately 1.83 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 2
    paused: 0
    running: 1
    stopped: 1
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/lib/docker
  graphRootAllocated: 899560546304
  graphRootUsed: 37004599296
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "true"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/lib/docker/tmp
  imageStore:
    number: 24
  runRoot: /var/lib/docker
  transientStore: false
  volumePath: /var/lib/docker/volumes
version:
  APIVersion: 5.4.2
  Built: 1744035741
  BuiltTime: Mon Apr  7 14:22:21 2025
  GitCommit: be85287fcf4590961614ee37be65eeb315e5d9ff
  GoVersion: go1.24.2
  Os: linux
  OsArch: linux/amd64
  Version: 5.4.2

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

Yes

Additional environment details

Running in AWS instances.
Running 5.4.2 built from source on Ubuntu 22.04
Invoking the API via the docker go client and some direct calls to the libpod api
The graphroot /var/lib/docker is mounted on ssds

Additional information

I'll provide some logs in the comments.

@DominicLavery DominicLavery added the kind/bug Categorizes issue or PR as related to a bug. label May 1, 2025
@DominicLavery
Author

DominicLavery commented May 1, 2025

Beginning of the removal of the container:

Apr 29 14:32:33 ip-172-29-12-235 podman[1470]: @ - - [29/Apr/2025:14:32:25 +0000] "GET /v1.41/containers/4e029f7fe15d056dc8a82d7e6cab3429a3800a56fefcb5a163142af5c929ab65/json HTTP/1.1" 200 8711 "" "docker-agent (07e94d7e-7027-48b0-9004-dca88ec75749-0-build)"
Apr 29 14:32:33 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:32:33Z" level=debug msg="IdleTracker:idle 3m+0h/785t connection(s)" X-Reference-Id=0xc000d12000
Apr 29 14:32:33 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:32:33Z" level=debug msg="IdleTracker:active 3m+0h/785t connection(s)" X-Reference-Id=0xc000d12000
Apr 29 14:32:33 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:32:33Z" level=debug msg="Starting parallel job on container 4e029f7fe15d056dc8a82d7e6cab3429a3800a56fefcb5a163142af5c929ab65"
Apr 29 14:32:33 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:32:33Z" level=debug msg="Removing container 4e029f7fe15d056dc8a82d7e6cab3429a3800a56fefcb5a163142af5c929ab65"
Apr 29 14:32:33 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:32:33Z" level=debug msg="Cleaning up container 4e029f7fe15d056dc8a82d7e6cab3429a3800a56fefcb5a163142af5c929ab65"
Apr 29 14:32:33 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:32:33Z" level=debug msg="Network is already cleaned up, skipping..."
Apr 29 14:32:33 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:32:33Z" level=debug msg="Container 4e029f7fe15d056dc8a82d7e6cab3429a3800a56fefcb5a163142af5c929ab65 storage is already unmounted, skipping..."
Apr 29 14:32:33 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:32:33Z" level=debug msg="Removing all exec sessions for container 4e029f7fe15d056dc8a82d7e6cab3429a3800a56fefcb5a163142af5c929ab65"
Apr 29 14:32:33 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:32:33Z" level=debug msg="Container 4e029f7fe15d056dc8a82d7e6cab3429a3800a56fefcb5a163142af5c929ab65 storage is already unmounted, skipping..."

10 minutes later:

Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: 2025-04-29 14:42:09.813531447 +0000 UTC m=+4921.014172601 container remove 4e029f7fe15d056dc8a82d7e6cab3429a3800a56fefcb5a163142af5c929ab65 (image=(customer image), name=romantic_gauss, build_uri=(internal tracking reference), org.opencontainers.image.ref.name=ubuntu, org.opencontainers.image.version=20.04, task_id=(internal tracking reference))
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:42:09Z" level=debug msg="Multiple results for container 4e029f7fe15d056dc8a82d7e6cab3429a3800a56fefcb5a163142af5c929ab65 - attempted multiple removals?"
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: @ - - [29/Apr/2025:14:32:33 +0000] "DELETE /v1.41/containers/4e029f7fe15d056dc8a82d7e6cab3429a3800a56fefcb5a163142af5c929ab65?force=1&v=1 HTTP/1.1" 204 0 "" "docker-agent (07e94d7e-7027-48b0-9004-dca88ec75749-0-build)"
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:42:09Z" level=debug msg="exporting opaque data as blob \"sha256:3c69a9ca2c95e52830d24b1ba41368bc7aba41f7144cdd0f02318ad6d596df19\""
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:42:09Z" level=debug msg="IdleTracker:closed 11m+0h/825t connection(s)" X-Reference-Id=0xc000d12000
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:42:09Z" level=debug msg="Normalized platform linux/amd64 to {amd64 linux  [] }"
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:42:09Z" level=debug msg="Attempting to pull candidate docker.io/library/busybox:latest for docker.io/library/busybox:latest"
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:42:09Z" level=debug msg="parsed reference into \"[overlay@/var/lib/docker+/var/lib/docker]docker.io/library/busybox:latest\""
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: Trying to pull docker.io/library/busybox:latest...
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:42:09Z" level=debug msg="Copying source image //busybox:latest to destination image [overlay@/var/lib/docker+/var/lib/docker]docker.io/library/busybox:latest"
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:42:09Z" level=debug msg="Using registries.d directory /etc/containers/registries.d"
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:42:09Z" level=debug msg="Trying to access \"docker.io/library/busybox:latest\""
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:42:09Z" level=debug msg="Returning credentials for docker.io/library/busybox from DockerAuthConfig"
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:42:09Z" level=debug msg=" No signature storage configuration found for docker.io/library/busybox:latest, using built-in default file:///var/lib/containers/sigstore"
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:42:09Z" level=debug msg="Looking for TLS certificates and private keys in /etc/docker/certs.d/docker.io"
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:42:09Z" level=debug msg="GET https://registry-1.docker.io/v2/"
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: time="2025-04-29T14:42:09Z" level=info msg="Request Failed(Internal Server Error): no container with ID 4e029f7fe15d056dc8a82d7e6cab3429a3800a56fefcb5a163142af5c929ab65 found in database: no such container"
Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: @ - - [29/Apr/2025:14:38:54 +0000] "POST /v1.41/networks/prune?filters=%7B%22label%21%22%3A%7B%22keep%22%3Atrue%7D%2C%22until%22%3A%7B%2230m0s%22%3Atrue%7D%7D HTTP/1.1" 500 180 "" "docker-gc 2.0.2308-e5b77c0"

The problematic call is:

Apr 29 14:42:09 ip-172-29-12-235 podman[1470]: @ - - [29/Apr/2025:14:32:33 +0000] "DELETE /v1.41/containers/4e029f7fe15d056dc8a82d7e6cab3429a3800a56fefcb5a163142af5c929ab65?force=1&v=1 HTTP/1.1" 204 0 "" "docker-agent (07e94d7e-7027-48b0-9004-dca88ec75749-0-build)"

@DominicLavery
Author

This issue seems similar to #24487 but I think it might be a different issue as:

  1. The linked issue is rootless and this is privileged
  2. The observed logs and scenario around container removal seem different

@baude
Member

baude commented May 1, 2025

Using the service will make stracing this difficult, especially if you have a ton of things going on at once, but perhaps an strace might reveal something. The fact that other commands are hanging indicates a possible locking problem.

@Luap99 @mheon do you think stracing additional podman commands while it is hung would reveal which lock is being held? Or, given that a container lock was taken but the rm is just taking a really long time to resolve, is this lock really a red herring?
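
For example, something along these lines (a rough sketch, not a definitive recipe):

    # Trace one of the client commands that hangs while the removal is in flight,
    # following child processes and timestamping each syscall.
    strace -f -tt -o /tmp/podman-info.strace podman info

    # Or attach to an already-running, hung podman process by PID:
    strace -f -tt -p "$(pgrep -of 'podman info')"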

@DominicLavery
Author

Ah! I hadn't thought of doing it from one of the stuck commands. Here is where it had got to when it got stuck:

strace.log

@mheon
Member

mheon commented May 1, 2025

You can use the podman system locks command (unfortunately it has no remote counterpart, so you have to be in the VM). I built it to debug situations like this.

@mheon mheon added the triaged Issue has been triaged label May 1, 2025
@DominicLavery
Author

podman system locks similarly hung.

Here are the results:

root@ip-172-29-10-165:~# podman system locks

No lock conflicts have been detected.

Lock 1 is presently being held

@DominicLavery
Author

Using the service will make stracing this difficult, especially if you have a ton of things going on at once, but perhaps an strace might reveal something. The fact that other commands are hanging indicates a possible locking problem.

@Luap99 @mheon do you think stracing additional podman commands while it is hung would reveal which lock is being held? Or, given that a container lock was taken but the rm is just taking a really long time to resolve, is this lock really a red herring?

There shouldn't be too much going on at once. There is a limit on the number of jobs on a node, and we have isolated this customer's work to its own set of nodes. Would stracing the podman system service process be helpful?

@baude
Member

baude commented May 1, 2025

I don't think it will hurt; more information is usually better. The best, of course, would be a reproducer that we could run and diagnose.

@mheon
Member

mheon commented May 1, 2025

The system locks command hanging is very unusual; it should not be taking any locks, just querying their status. Could it be a c/storage lock hanging?

@Luap99
Member

Luap99 commented May 2, 2025

Well, if other podman image list and podman pull commands are hanging, then it should be safe to assume that the hang is not on a container lock.

Also, this is not a deadlock; if it just takes a long time, it means we hold locks for a long time. The easiest way to find out which lock the command hangs on would be to run podman images in parallel and, if it hangs, send it SIGABRT to trigger a stack trace from the Go runtime. That way we know the exact line where the lock is taken, and therefore which lock is being held for a while.
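
Something like this, for example (a sketch; it assumes the blocked goroutine shows up in the SIGABRT traceback):

    # Run a client command while the slow removal is in progress; if it is still
    # stuck after a while, SIGABRT makes the Go runtime print goroutine stacks.
    podman images &
    pid=$!
    sleep 30
    kill -0 "$pid" 2>/dev/null && kill -ABRT "$pid"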

Each command does a storage init, which takes the central c/storage lock, so I guess it is waiting for that one; currently c/storage locking is not very smart. My best guess is that you are trying to remove a container with a lot of files, that just takes a while, and the lock is held during removal (see containers/storage#2314). That is something we want to fix for sure, but doing things properly takes time.
So for now I would assume this is expected.

strace on the service can be done, but if you actually run that many commands in parallel it would be almost impossible to make sense of the output, so I doubt it will be helpful.

@DominicLavery
Author

DominicLavery commented May 2, 2025

Ah thank you all very much for the insights.

It does seem likely to be related to the number of files created/modified within the container. Running

          for i in {0..4000000}
          do
              echo hello > "File${i}.txt"
              echo $i
          done

within the container results in a remove taking about 2 minutes and other commands hanging.
I tried a SIGABRT using this to recreate the issue and attached the results. I'll try to get one with the customer's job to confirm they are the same.

sigabrt.log

It does look to me like it is waiting for a c/storage lock.

@Luap99
Member

Luap99 commented May 2, 2025

goroutine 1 gp=0xc000002380 m=0 mp=0x2fdd040 [syscall]:
syscall.Syscall(0x48, 0x6, 0x7, 0xc0006184b0)
	/usr/local/go/src/syscall/syscall_linux.go:73 +0x25 fp=0xc000618448 sp=0xc0006183e8 pc=0x4ec1a5
syscall.Syscall(0x48, 0x6, 0x7, 0xc0006184b0)
	<autogenerated>:1 +0x2e fp=0xc000618478 sp=0xc000618448 pc=0x4ecf4e
golang.org/x/sys/unix.FcntlFlock(...)
	/tmp/podman/vendor/golang.org/x/sys/unix/fcntl.go:31
github.com/containers/storage/pkg/lockfile.lockHandle(0x6, 0x2a?, 0x0)
	/tmp/podman/vendor/github.com/containers/storage/pkg/lockfile/lockfile_unix.go:92 +0x8f fp=0xc0006184f0 sp=0xc000618478 pc=0x80e60f
github.com/containers/storage/pkg/lockfile.(*LockFile).lock(0xc0000e2420, 0x1)
	/tmp/podman/vendor/github.com/containers/storage/pkg/lockfile/lockfile.go:416 +0xf1 fp=0xc000618538 sp=0xc0006184f0 pc=0x80dd51
github.com/containers/storage/pkg/lockfile.(*LockFile).Lock(...)
	/tmp/podman/vendor/github.com/containers/storage/pkg/lockfile/lockfile.go:132
github.com/containers/storage.(*imageStore).startWritingWithReload(0xc00047a000, 0x0)
	/tmp/podman/vendor/github.com/containers/storage/images.go:217 +0x47 fp=0xc000618590 sp=0xc000618538 pc=0xe5db67
github.com/containers/storage.newImageStore({0xc00011a180, 0x1e})
	/tmp/podman/vendor/github.com/containers/storage/images.go:609 +0x1cc fp=0xc0006186b8 sp=0xc000618590 pc=0xe5ff6c
github.com/containers/storage.(*store).load(0xc000467520)

Yeah, that is the c/storage store lock, as I suspected.

@DominicLavery
Author

Thanks for confirming.

As a workaround: we know at the point of container removal that the container is stopped; would it be safe for our system to remove files from the disk before starting the delete? That way they would be removed without a lock being held.

Files changed as a part of the running container would be in the UpperDir, is that correct?
e.g.

"GraphDriver": {
               "Name": "overlay",
               "Data": {
                    "LowerDir": "/var/lib/docker/overlay/3e6906ee423fb328a37411e519430aa0a35ac3bea14e0ace43269a77eecc52f4/diff",
                    "MergedDir": "/var/lib/docker/overlay/fa05584161f73347b35ba160306e2209729cb731c3034b89989fc49b87358e2a/merged",
                    "UpperDir": "/var/lib/docker/overlay/fa05584161f73347b35ba160306e2209729cb731c3034b89989fc49b87358e2a/diff",
                    "WorkDir": "/var/lib/docker/overlay/fa05584161f73347b35ba160306e2209729cb731c3034b89989fc49b87358e2a/work"
               }
          },**

here "/var/lib/docker/overlay/fa05584161f73347b35ba160306e2209729cb731c3034b89989fc49b87358e2a/diff" in this case

@Luap99
Member

Luap99 commented May 2, 2025

Files changed as a part of the running container would be in the UpperDir, is that correct?

They should be, yes. That might help you work around the issue, but I don't think messing with internal details is something any maintainer here would recommend or encourage, so do it at your own risk.

The good news is we are discussing ways to reduce the lock time for removals so I think it is likely that this will get fixed in the near future.

@DominicLavery
Author

That makes sense and seems fair.

The good news is we are discussing ways to reduce the lock time for removals so I think it is likely that this will get fixed in the near future

That's great! Is this issue the best place to get updates for this or should it be closed as a dupe of another?

Thanks again for the help.
