Podman commands and APIs hang after a container remove #26041
Comments
Beginning of the removal of the container:
10 minutes later
The problematic call is
This issue seems similar to #24487 but I think it might be a different issue as:
Using the service will make stracing this difficult, especially if you have a ton of things going on at once, but perhaps an strace might reveal something. The fact that other commands are hanging indicates a possible locking problem. @Luap99 @mheon do you think stracing additional podman commands while it is hung would reveal what lock is being held? Or, given that a container lock was taken but is taking a really long time to resolve (rm), is this lock really a red herring?
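(A minimal sketch of what that could look like, assuming strace is available on the node; the pgrep pattern is only illustrative:)

    # Attach to an already-hung podman command; follow forks and timestamp syscalls
    strace -f -tt -o /tmp/podman-hung.strace -p "$(pgrep -f 'podman image list' | head -n1)"

    # Or start a fresh command under strace while the removal is in progress
    strace -f -tt -o /tmp/podman-list.strace podman image list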
Ah! I hadn't thought of doing it from one of the stuck commands. Here is where it was up to when it got stuck:
You can use the
Here are the results:
There shouldn't be too much going on at once. There is a limit on the number of jobs on a node and we have isolated this customer's work to its own set of nodes. Would stracing the
I don't think it will hurt. More information is usually better. The best, of course, would be a reproducer that we could run and diagnose.
The
Well, if other podman image list and podman pull commands are hanging, then it should be safe to assume that the hang is not on a container lock. Also, this is not a deadlock; if it just takes a long time, it means we hold locks for a long time. The easiest way to find out which lock the command hangs on would be to run

Each command does a storage init which takes the central c/storage lock, so I guess it is waiting for that one; currently c/storage locking is not very smart. So my best guess is you are trying to remove a container with a lot of files, and that just takes a while, and during removal the lock is being held (see containers/storage#2314). That is something we want to fix for sure, but doing things properly takes time.

strace on the service can be done, but if you actually run so many commands in parallel it would be almost impossible to figure out the output, so I doubt that will be helpful.
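(As a rough illustration of the two approaches, assuming the hang can be reproduced with a throwaway CLI command; sending SIGQUIT to a Go binary normally makes it dump goroutine stacks and exit, so it should not be sent to the API service itself:)

    # Run one of the hanging commands with debug logging to see where it stops
    podman --log-level=debug image list

    # Or, for a hung throwaway command, ask the Go runtime for goroutine stacks
    # (the process exits after dumping, so do not do this to the API service)
    kill -QUIT "$(pgrep -f 'podman image list' | head -n1)"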
Ah thank you all very much for the insights. It does seem likely to be related to the number of files created/modified within the container. Running
within the container results in a remove taking about 2 minutes and commands hanging. It does look to me like it is waiting for a c/storage lock.
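(A reproducer along those lines might look roughly like this; the image, path, and file count are guesses rather than the customer's actual workload:)

    # Create a container whose writable layer ends up with a very large number of files
    podman run --name many-files alpine sh -c \
        'mkdir -p /data && i=0; while [ $i -lt 500000 ]; do echo x > /data/f$i; i=$((i+1)); done'

    # Time the removal; while it runs, try e.g. "podman image list" from another shell
    time podman rm -f many-files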
Yeah, that is the c/storage store lock, as I suspected.
Thanks for confirming. As a workaround: we know at the container removal point that the container is stopped, so would it be safe for our system to remove files from the disk before starting the delete? That way they are removed without a lock being held. Files changed as part of the running container would be in the
here.
They should be, yes. That might help you work around it, yes, but I don't think messing with internal details is what any maintainer here would recommend or encourage, so do it at your own risk. The good news is we are discussing ways to reduce the lock time for removals, so I think it is likely that this will get fixed in the near future.
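(For anyone who does accept that risk, a rough sketch of the idea, assuming the overlay driver and that the container's writable upper directory is exposed via podman inspect; verify the path carefully before deleting anything:)

    ctr=my-container   # hypothetical container name

    # Locate the container's writable (upper) overlay directory
    upper=$(podman inspect --format '{{ .GraphDriver.Data.UpperDir }}' "$ctr")

    # Sanity-check the path, then clear its contents before the real removal
    echo "Upper dir: $upper"
    [ -n "$upper" ] && [ -d "$upper" ] && find "$upper" -mindepth 1 -delete

    podman rm -f "$ctr"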
That makes sense and seems fair.
That's great! Is this issue the best place to get updates for this, or should it be closed as a dupe of another? Thanks again for the help.
Issue Description
I have a scenario where all podman commands and usages of the API via the socket hang for around 10 minutes after a particular container removal.
The system that podman runs in executes arbitrary customer-supplied images and code. We have about 65k distinct images a week and handle about 73 million container creations and removals. Most of these work fine, which I think is some evidence that our general usage of the APIs is fine. We mostly use the compat APIs to manage these containers.
When we call DELETE /v1.41/containers/{container_id}?force=1 on one particular image & code combo, the call takes around 10 minutes to return. During this time other API calls and commands also hang, including at least podman info, podman image list, podman ps, podman pull alpine, and their equivalent API calls. It seems to require more than simply running the image within the system to cause the issue, as running it without the customer's code does not recreate the issue. Whilst other container removes might take a long time (say around 7 minutes), only this one seems to additionally cause other commands to hang.

Are long container removes like this expected? How can I go about debugging the hang?
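(For reference, the removal call above can also be issued by hand against the socket, e.g. as below; the socket path and container name are assumptions about the setup:)

    # Force-remove a container via the Docker-compatible API on the podman socket
    curl --unix-socket /run/podman/podman.sock \
        -X DELETE "http://d/v1.41/containers/my-container?force=1"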
Steps to reproduce the issue
Describe the results you received
API and command calls hang
Container remove takes 10 minutes
Describe the results you expected
API and command calls not to hang
Container remove takes less than 10 minutes
podman info output
Podman in a container
No
Privileged Or Rootless
Privileged
Upstream Latest Release
Yes
Additional environment details
Running in AWS instances.
Running Podman 5.4.2 built from source on Ubuntu 22.04
Invoking the API via the Docker Go client and some direct calls to the libpod API
The graphroot /var/lib/docker is mounted on SSDs.
Additional information
I'll provide some logs in the comments.