On 13/02/2025 20:40, Antonio Quartulli wrote:
On 13/02/2025 16:46, Sabrina Dubroca wrote:
2025-02-13, 12:46:34 +0100, Antonio Quartulli wrote:
On 13/02/2025 00:34, Sabrina Dubroca wrote:
Hello,
2025-02-11, 01:39:53 +0100, Antonio Quartulli wrote:
All minor and major reported problems have been finally addressed. Big thanks to Sabrina, who took the time to guide me through converting the peer socket to an RCU pointer.
Something is off (not sure if it's new to this version): if I use test-tcp.sh to setup a set of interfaces and peers (I stop the test just after setup to keep the environment alive), then remove all netns with "ip -all netns delete", I expect all devices to go away, but they don't. With debug messages enabled I'm seeing some activity from the module ("tun0: sending keepalive to peer 3" and so on), and ovpn_net_uninit/ovpn_priv_free never got called.
I can reproduce it. If later I rmmod ovpn I then get all the "Deleting peer" messages. So instances are not being purged on netns exit.
Will dive into it.
I think the socket holds a ref on the netns, so it's not getting destroyed, simply "removed" from iproute's point of view. And the socket isn't going away as long as it's used by a peer.
After a deep dive I was getting to the same conclusion: cleanup_net() is never invoked because we lack an invocation of put_net(). sk_alloc() invokes get_net(), so this is the ref that is not being released.
If I delete the peer(s) for the ovpn device and then the netns it was in, the netns is fully removed, and the ovpn device is gone. Also no issue if I delete the ovpn device before its netns, then everything is destroyed as expected.
I'm not sure that can be solved, as least under the current refcount scheme.
I went back to v12 of the patchset (which is pre-refcount-restructuring) and indeed the problem there doesn't exist.
However, it's unclear to me how in v12 the socket release was happening upon netns delete. Who was triggering that? ovpn still needed to call sockfd_put() in order to let it go.
Will investigate some more and think about a solution.
I may have done something wrong, but today I tried again and I am reproducing this issue also on v8 + 6.11.
I am indeed wondering how it could have ever worked: if we don't delete the peer, we don't detach the socket, hence the following chain:
sockfd_put() -> sk_destructor() -> put_net()
does not happen.
Shouldn't we be notified somehow that the netns is going away?
For example in wireguard/device.c the socket is released in pernet_operations.pre_exit().
But pre_exit() is invoked in cleanup_net(), which is invoked ONLY if the net refcount has reached 0...but how can it be zero before the sockets have been released?
I must be missing something, because this seems to be a reference loop.
I'll keep digging..
Regards,