Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(shim): change the current dir back to bundle path #577

Closed
wants to merge 1 commit into from

Conversation

Mossaka
Copy link
Member

@Mossaka Mossaka commented Apr 30, 2024

This commit changes the current dir for the shim process back to the bundle path (e.g. /run/containerd/io.containerd.runtime.v2.task/<ns>/<id>) after shim::wait() finishes.

The motivation for this change is that the create request handled by youki's libcontainer will change the current dir to container path (e.g. /run/containerd/wasmtime/<ns>/<id>), this path does not contain the address file for getting the shim socket address that needs to be deleted by the ttrpc server closes.

So in order to properly delete the socket path, the current directory used by the shim crate has to be the bundle path.

Perf test regarding the startup of the process BEFORE the change

  • Mean: 1704 milliseconds
  • Median: 1693 milliseconds
  • Mode: 1831 milliseconds
  • Standard Deviation: 48.64 milliseconds
  • Minimum Time: 1653 milliseconds
  • Maximum Time: 1831 milliseconds

As reported by @rperezvaz at https://cloud-native.slack.com/archives/C04LTPB6Z0V/p1705059273058979

Perf test regarding the startup of the process AFTER the change

  • Mean: 447 milliseconds
  • Median: 447 milliseconds
  • Mode: 466 milliseconds
  • Standard Deviation: 10.63 milliseconds
  • Minimum Time: 434 milliseconds
  • Maximum Time: 466 milliseconds

Explanation

The primary reason for the huge start-up time reduction is due to that the shim socket has not been properly cleaned up after the shim get closes and so every time the shim starts, it hangs for 1 whole second to try to reconnect with an existing socket address. See more containerd/rust-extensions#263 and spinframework/containerd-shim-spin#41 (comment))

bpftrace (unchanged)

PID 3543236 (containerd): Changing directory to /run/containerd/io.containerd.runtime.v2.task/default/testwasm
Process started: /usr/local/bin/containerd-shim-wasmtime-v1 PID: 3543236
PID 3543245 (containerd-shim): Changing directory to /run/containerd/io.containerd.runtime.v2.task/default/testwasm
Process started: /usr/local/bin/containerd-shim-wasmtime-v1 PID: 3543245
PID 3543245 (client_handler): Changing directory to /run/containerd/wasmtime/default/testwasm
PID 3543245 (client_handler): Changing directory to /run/containerd/wasmtime/default/testwasm
PID 3543245 (client_handler): Changing directory to /run/containerd/wasmtime/default/testwasm
PID 3543271 (youki:[2:INIT]): File unlink in target directory: /run/containerd/io.containerd.runtime.v2.task/default/testwasm/..
PID 3543245 (client_handler): Changing directory to /run/containerd/wasmtime/default/testwasm
PID 3543245 (client_handler): Changing directory to /run/containerd/wasmtime/default/testwasm
PID 3543245 (client_handler): Changing directory to /run/containerd/wasmtime/default/testwasm
PID 3543245 (client_handler): File unlinkat in target directory: /run/containerd/wasmtime/default/testwasm
Process started: /usr/local/bin/containerd-shim-wasmtime-v1 PID: 3543363
PID 569984 (containerd): File unlinkat in target directory: /run/containerd/io.containerd.runtime.v2.task/default/testwasm/..
PID 569984 (containerd): File unlinkat in target directory: /run/containerd/io.containerd.runtime.v2.task/default/testwasm/..
PID 569984 (containerd): File unlinkat in target directory: /run/containerd/io.containerd.runtime.v2.task/default/.testwasm..
PID 569984 (containerd): File unlinkat in target directory: /run/containerd/io.containerd.runtime.v2.task/default/.testwasm..
PID 569984 (containerd): File unlinkat in target directory: .testwasm
PID 569984 (containerd): File unlinkat in target directory: .testwasm
PID 569984 (containerd): File unlinkat in target directory: /run/containerd/io.containerd.runtime.v2.task/default/testwasm/..
PID 569984 (containerd): File unlinkat in target directory: /run/containerd/io.containerd.runtime.v2.task/default/testwasm/..
PID 3543224 (ctr): File unlinkat in target directory: testwasm-stderr
PID 3543224 (ctr): File unlinkat in target directory: testwasm-stdout
PID 3543224 (ctr): File unlinkat in target directory: testwasm-stdin

Notice how libcontainer handler changes the process's current directory to /run/containerd/wasmtime/default/testwasm, and at the end of delete request, libcontainer cleans up this path, so after the shim::wait(), the address path is not longer able to be found by using the current directory because it no longer exists!

This commit changes the current dir for the shim process back
to the bundle path (e.g. `/run/containerd/io.containerd.runtime.v2.task/<ns>/<id>`)
after `shim::wait()` finishes.

The motivation for this change is that the `create` request handled by
youki's libcontainer will change the current dir to container path (e.g. `/run/containerd/wasmtime/<ns>/<id>`),
this path does not contain the `address` file for getting the shim socket address that needs
to be deleted by the ttrpc server closes.

So in order to properly delete the socket path, the current directory used by
the shim crate has to be the bundle path.

Signed-off-by: jiaxiao zhou <jiazho@microsoft.com>
@cpuguy83
Copy link
Member

This does not seem like the correct fix.
The bit that executes youki should probably unshare CLONE_FS or similar when creating the thread with clone.
This way changes to the directory do not affect the shim process.

@cpuguy83
Copy link
Member

Looks like that code is happening in container.start()

@Mossaka
Copy link
Member Author

Mossaka commented May 1, 2024

This does not seem like the correct fix. The bit that executes youki should probably unshare CLONE_FS or similar when creating the thread with clone. This way changes to the directory do not affect the shim process.

The problem is that the shim process is calling youki's APIs directly (e.g. this) and youki calls the chdir syscall that changes the shim process's current directory to container path (e.g. here).

@cpuguy83
Copy link
Member

cpuguy83 commented May 1, 2024

@utam0k Its not clear to me from the code why youki is changing directories here. Can you chime in?
At least in build it doesn't look like its doing anything that would be directory dependent.

@Mossaka
Copy link
Member Author

Mossaka commented May 1, 2024

There is another place where youki changes the directory to container_path: this,

@utam0k
Copy link
Member

utam0k commented May 1, 2024

@utam0k Its not clear to me from the code why youki is changing directories here. Can you chime in? At least in build it doesn't look like its doing anything that would be directory dependent.

I believe this chdir is not necessary but I'm confirming it now...
youki-dev/youki#2772

@utam0k
Copy link
Member

utam0k commented May 1, 2024

At least, it is unnecessary if you don't use pseudo terminals.

@Mossaka Mossaka marked this pull request as draft May 2, 2024 19:14
@Mossaka
Copy link
Member Author

Mossaka commented May 5, 2024

Closing as youki-dev/youki#2780 will fix the issue

@Mossaka Mossaka closed this May 5, 2024
@Mossaka
Copy link
Member Author

Mossaka commented May 16, 2024

The upstream fix is in: youki-dev/youki#2780

@Mossaka Mossaka deleted the fix-crash branch May 27, 2024 23:55
@Mossaka
Copy link
Member Author

Mossaka commented May 27, 2024

Merged in upstream fix by this PR #601

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants