Race condition when a client disconnects? #208

Closed
palainp opened this issue Dec 19, 2024 · 7 comments

@palainp
Member

palainp commented Dec 19, 2024

Dear developers,

On the Qubes forum, a user has a special setup: qubes-mirage-firewall is used with OpenBSD (the unikernel is configured manually with no netvm, as described in README.md), and the latest release prevents that setup from running correctly.

To my understanding, the netvm client connects twice to the unikernel (not sure if that is mandatory, but it worked earlier). The relevant log lines are below (I anonymized the extract manually; XX=YY+1):

...
[INFO] [dispatcher] add client vif {domid=XX;device_id=0} with IP A.B.C.D
[INFO] [dispatcher] Client XX (IP: A.B.C.D) ready
[INFO] [ethernet] Connected Ethernet interface fe:ff:ff:ff:ff:ff
[INFO] [dispatcher] New firewall rules for A.B.C.D
0   any   accept
[INFO] [dispatcher] add client vif {domid=YY;device_id=0} with IP A.B.C.D
[INFO] [qubes.db] got rm "/qubes-firewall/A.B.C.D/"
[INFO] [qubes.db] got update: "/qubes-firewall/A.B.C.D/policy" = "drop"
[INFO] [qubes.db] got update: "/qubes-firewall/A.B.C.D/0000" = "action=accept"
[INFO] [qubes.db] got update: "/qubes-firewall/A.B.C.D" = ""
[INFO] [dispatcher] Rules did not change for A.B.C.D
[INFO] [net-xen backend] Frontend asked to close network device dom:XX/vif:0
[INFO] [dispatcher] Client YY (IP: A.B.C.D) ready
[INFO] [ethernet] Connected Ethernet interface fe:ff:ff:ff:ff:ff
[INFO] [dispatcher] New firewall rules for A.B.C.D
0   any   accept
[domYY:A.B.C.D] [client_eth] Waiting for old client domXX:A.B.C.D to go away before accepting new on
...

So client XX connects and shuts down (again, maybe an artifact of the way OpenBSD connects as netvm), and client YY cannot connect after client XX has shut down.

The behaviour with the previous release is:

...
[INFO] [dispatcher] add client vif {domid=XX;device_id=0} with IP A.B.C.D
[INFO] [dispatcher] Client XX (IP: A.B.C.D) ready
[INFO] [ethernet] Connected Ethernet interface fe:ff:ff:ff:ff:ff
[INFO] [dispatcher] New firewall rules for A.B.C.D
0   any   accept
[INFO] [dispatcher] add client vif {domid=YY;device_id=0} with IP A.B.C.D
[INFO] [qubes.db] got rm "/qubes-firewall/A.B.C.D/"
[INFO] [dispatcher] QubesDB has changed but not the situation of our netvm!
[INFO] [dispatcher] Waiting for netvm changes to "/qubes-gateway"...
[INFO] [qubes.db] got update: "/qubes-firewall/A.B.C.D/policy" = "drop"
[INFO] [dispatcher] QubesDB has changed but not the situation of our netvm!
[INFO] [dispatcher] Waiting for netvm changes to "/qubes-gateway"...
[INFO] [qubes.db] got update: "/qubes-firewall/A.B.C.D/0000" = "action=accept"
[INFO] [qubes.db] got update: "/qubes-firewall/A.B.C.D" = ""
[INFO] [dispatcher] QubesDB has changed but not the situation of our netvm!
[INFO] [dispatcher] Waiting for netvm changes to "/qubes-gateway"...
[INFO] [dispatcher] Rules did not change for A.B.C.D
[INFO] [net-xen backend] Frontend asked to close network device dom:XX/vif:0
[INFO] [dispatcher] client {domid=XX;device_id=0} has gone
[INFO] [dispatcher] Client YY (IP: A.B.C.D) ready
[INFO] [ethernet] Connected Ethernet interface fe:ff:ff:ff:ff:ff
[INFO] [dispatcher] New firewall rules for A.B.C.D
0   any   accept
[client_eth] who-has A.B.C.D? ignoring request for client's own IP
...

Here, after "Frontend asked to close network device dom:XX/vif:0", we can see on the following line that the client has correctly gone, and client YY connects correctly.

There are some paths I can follow:

  • there are some additional QubesDB messages in the meantime, but I'm not sure how they are related
  • we have some changes in the cleanup callbacks in the last release; maybe there is now a side effect in that scenario

If you have other ideas, I'll be glad to explore in depth how this issue could be fixed :)

@hannesm
Member

hannesm commented Dec 19, 2024

This only happens in the no-netvm scenario (OpenBSD as netvm, which doesn't provide a netback)?

What is the actual issue? A hang in the firewall? A hang in domid=XX?

It seems that Dispatcher.wait_clients (which is responsible for the "client .. has gone" message) doesn't work as expected (anymore) -- can we get some further debug information in there, such as printing the new_set (and its size)? On first read, the VifMap uses the module ClientVif, which relies on the polymorphic compare -- that may be an issue -- but then that code has been around since 2016 and so far hasn't caused trouble.
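
For what it's worth, the pattern in question looks roughly like this (a minimal sketch with hypothetical names mirroring the discussion, not the actual firewall code):

(* A Map keyed by a record whose compare is the polymorphic Stdlib.compare. *)
type client_vif = { domid : int; device_id : int }

module ClientVif = struct
  type t = client_vif
  (* Polymorphic compare is safe for a record of immediate values like
     this one; it would only misbehave if the key grew a functional,
     lazy, or abstract field. *)
  let compare (a : t) (b : t) = Stdlib.compare a b
end

module VifMap = Map.Make (ClientVif)

let () =
  let m = VifMap.singleton { domid = 42; device_id = 0 } "A.B.C.D" in
  (* Removal relies on compare finding an equal key again: *)
  let m = VifMap.remove { domid = 42; device_id = 0 } m in
  assert (VifMap.is_empty m)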

@palainp
Member Author

palainp commented Dec 20, 2024

Thanks @hannesm for your reply.
This happens with the "BSD sys-net" setup, but I can't reproduce it with the "no netvm" configuration alone. I currently can't use OpenBSD as sys-net (OpenBSD lacks the drivers for my wifi card :( ), so I can't test that myself, but when I set up the unikernel with manual network configuration and connect two AppVMs to it, it works as intended.

I don't really understand the client YY + client XX (same IP address) part, and the issue may come from that (so far my guess is that it's an artifact of the OpenBSD frontend driver, but I'm really unsure about that); however, that was already the case with the previous release and there was no issue.

The impact from the user's POV is that sys-net can't connect to the firewall, as it's stuck at the "go away before accepting new on" line. Since we don't get the "client {domid=XX;device_id=0} has gone" message right after "[net-xen backend] Frontend asked to close network device dom:XX/vif:0", I think we are missing something that needs to trigger the cleanup task.

@palainp
Member Author

palainp commented Dec 23, 2024

Update: I got a laptop that allows me to use OpenBSD (but I still can't NAT the traffic properly), so now I'm also seeing the failure message. The OpenBSD AppVM netif disconnection happens during boot.
The user on the Qubes forum confirmed that this commit fixed the situation [1], but it basically undoes all of @dinosaure's changes :( However, with an additional commit, the unikernel works fine locally on my laptop [2].

Based on this combination of the two commits, the main difference is as follows:

-    Lwt.async (fun () -> Lwt.pick [ qubesdb_updater; listener ]);
-    Lwt.return_unit
+    Lwt.pick [ qubesdb_updater; listener ]

and an update in add_client to move Lwt.async (fun () -> ... back there instead of in add_vif.
The commit's comment is:

We currently try to spawn 2 fibers [qubes_updated] and [listener] per client
and we already finalise them correctly if the client is disconnected. However,
the Lwt.async is localized into add_client instead of where we attach the
finalisers for these tasks. The first objective of this patch is to be sure that
the Lwt.async is near where we registered cancellation of these tasks.

Could there be some sort of side effect in this situation?
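
For reference, the semantic difference the diff introduces can be sketched like this (same shape as the diff, but with placeholder promises, not the actual dispatcher code):

(* Detached: the caller returns immediately; a failure or cancellation of
   the picked fibers goes to Lwt.async_exception_hook, not to the caller. *)
let with_async qubesdb_updater listener =
  Lwt.async (fun () -> Lwt.pick [ qubesdb_updater; listener ]);
  Lwt.return_unit

(* Inline: the caller only resolves once one fiber resolves (Lwt.pick
   cancels the other), and failures propagate to the caller's catch. *)
let without_async qubesdb_updater listener =
  Lwt.pick [ qubesdb_updater; listener ]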

@palainp
Member Author

palainp commented Dec 24, 2024

I first thought the issue might be the inversion of Lwt.async and Lwt.catch (the working code has Lwt.async (fun () -> Lwt.catch ... while the failing code has Lwt.catch (fun () -> Lwt.async ...), and indeed I managed to get something to work by swapping the two in add_client.

But all my attempts to place Lwt.async in add_vif have failed so far :(
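
For the record, the Lwt behaviour behind that observation can be shown with a tiny program (run and cleanup are hypothetical stand-ins, not the add_client code): an exception raised inside a detached fiber never reaches an enclosing Lwt.catch, only Lwt.async_exception_hook.

let run () = Lwt.fail (Failure "client went away")
let cleanup _exn = print_endline "cleanup ran"; Lwt.return_unit

(* Working shape: the catch wraps the whole detached fiber, so a failure
   inside it reaches [cleanup]. *)
let variant_a () = Lwt.async (fun () -> Lwt.catch run cleanup)

(* Failing shape: the fiber is detached inside the catch, so its failure
   bypasses [cleanup] and lands in Lwt.async_exception_hook. *)
let variant_b () =
  Lwt.catch (fun () -> Lwt.async run; Lwt.return_unit) cleanup

let () =
  Lwt.async_exception_hook :=
    (fun _ -> print_endline "hook ran, cleanup skipped");
  variant_a ();                  (* prints "cleanup ran" *)
  Lwt_main.run (variant_b ())    (* prints "hook ran, cleanup skipped" *)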

@palainp
Member Author

palainp commented Dec 31, 2024

To my understanding, the boot process with OpenBSD adds two vifs (I don't know why) and at some point one of them is removed.

With the current code, we miss that event (it is reported by xen-net-backend, but we only remove clients in wait_clients; I don't know why Dao.watch_clients misses the vif disappearance), so we never remove the OpenBSD IP and therefore cannot complete the configuration of the remaining vif (hence the "Waiting for old client ..." message).
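
To illustrate the kind of bookkeeping I mean (hypothetical names and types, not the real wait_clients code): any vif present in the previous snapshot but absent from the new one must trigger the "has gone" cleanup.

module IntSet = Set.Make (Int)

let report_gone ~before ~now =
  IntSet.iter
    (fun domid -> Printf.printf "client {domid=%d} has gone\n" domid)
    (IntSet.diff before now)

let () =
  let before = IntSet.of_list [ 41; 42 ] in  (* both vifs connected *)
  let now = IntSet.singleton 42 in           (* one vif was removed *)
  report_gone ~before ~now                   (* client {domid=41} has gone *)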

So the ending lines of add_vif look like the sensitive spot :)

@palainp
Member Author

palainp commented Jan 4, 2025

Surprisingly, the client setup works fine if I keep the add_client call in add_vif wrapped in Lwt.async. So I adapted the code a bit in 85de608 and 812b998.
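
The rough shape of the fix, as I understand it (a sketch, not the literal content of 85de608/812b998; add_client here is a stand-in stub):

let add_client (domid : int) : unit Lwt.t =
  Printf.printf "client {domid=%d} ready\n" domid;
  Lwt.return_unit

(* Detach the per-client loop so the backend callback returns promptly
   and a dying client cannot wedge the dispatcher; the catch stays
   inside the detached fiber so cleanup still runs on failure. *)
let add_vif domid =
  Lwt.async (fun () ->
      Lwt.catch
        (fun () -> add_client domid)
        (fun exn ->
           Printf.printf "client {domid=%d} failed: %s\n" domid
             (Printexc.to_string exn);
           Lwt.return_unit));
  Lwt.return_unit

let () = Lwt_main.run (add_vif 42)
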
Feel free to review the coding style :) Once accepted/updated, I will cut a release :)
Best

@palainp
Member Author

palainp commented Jan 4, 2025

Fixed in #209

palainp closed this as completed Jan 4, 2025