Jakub Kicinski <kuba@kernel.org> wrote:

> On Mon, 8 Sep 2025 13:47:24 -0700 Calvin Owens wrote:
>> I wonder if there might be a demon lurking in bonding+netpoll that this was papering over? Not a reason not to fix the leaks IMO, I'm just curious, I don't want to spend time on it if you already did :)
>
> +1, I also feel like it'd be good to have some bonding tests in place when we're removing a hack added specifically for bonding.
I'll preface this by saying that I'm not super familiar with the innards of netpoll.
That said, I looked at commit efa95b01da18 ("netpoll: fix use after free") and the relevant upstream discussion, and I'm not sure the assertion that "After a bonding master reclaims the netpoll info struct, slaves could still hold a pointer to the reclaimed data" is correct.
First, I'm not sure the efa9 patch's reference count math is correct (more on that below).
Second, I'm a bit unsure what's going on with the struct netpoll *np parameter of __netpoll_setup() for the second and subsequent netpoll instances (i.e., the second and later calls for the same device), as the function will unconditionally do

	npinfo->netpoll = np;

which it seems like would overwrite the "np" supplied by any prior calls to __netpoll_setup(). In bonding, slave_enable_netpoll() stashes the "np" it allocates as slave->np, and slave_disable_netpoll() relies on __netpoll_free() to free it, so I don't think it's lost, but it seems like netpoll internally only tracks one of these at a time, regardless of the reference count.
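To make the overwrite concern concrete, here is a rough user-space model of it. This is only a sketch of my reading of the logic, not kernel code: every toy_* name is made up for illustration, and locking, RCU and the driver callbacks are all left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins (hypothetical); not the kernel's definitions. */
struct toy_netpoll {
	const char *name;
};

struct toy_npinfo {
	int refcnt;
	struct toy_netpoll *netpoll;	/* a single slot, however many users */
};

struct toy_dev {
	struct toy_npinfo *npinfo;
};

/*
 * Model of the setup path as described above: the first call allocates
 * the per-device info, later calls only bump the count, and every call
 * stores its np in the one slot.
 */
static int toy_netpoll_setup(struct toy_dev *dev, struct toy_netpoll *np)
{
	struct toy_npinfo *npinfo;

	assert(dev && np);

	npinfo = dev->npinfo;
	if (!npinfo) {
		npinfo = calloc(1, sizeof(*npinfo));
		if (!npinfo)
			return -1;
		npinfo->refcnt = 1;
		dev->npinfo = npinfo;
	} else {
		npinfo->refcnt++;
	}

	npinfo->netpoll = np;	/* unconditional: replaces any earlier np */
	return 0;
}
```

In this model, two setup calls on the same device leave refcnt at 2 while the slot points only at the second np, which is the "tracks one at a time, regardless of the reference count" behavior I'm asking about.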
On the reference counting, the upstream example from the prior discussion includes:

	mkdir /sys/kernel/config/netconsole/blah
	echo 0 > /sys/kernel/config/netconsole/blah/enabled
	echo bond0 > /sys/kernel/config/netconsole/blah/dev_name
	echo 192.168.56.42 > /sys/kernel/config/netconsole/blah/remote_ip
	echo 1 > /sys/kernel/config/netconsole/blah/enabled
	# npinfo refcnt ->1
	ifenslave bond0 eth1
	# npinfo refcnt ->2
	ifenslave bond0 eth0
	# (this should be optional, preventing ndo_cleanup_nepoll below)
	# npinfo refcnt ->3

I'm suspicious of the refcnt values here; both then and now, the npinfo for each of the relevant interfaces is a separate per-interface allocation in __netpoll_setup, so I'm not sure what exactly is supposed to be getting a refcnt of 3.
If there are two netpoll instances using the slave in question (either directly or via the bond itself), then clearing the np->dev->npinfo pointer looks like the wrong thing to do until the last reference is released.
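As a sketch of what I mean, here is a toy user-space model of per-interface counting. It assumes, as I read the code, that each interface gets its own npinfo and that each setup takes one reference on that interface's npinfo only. The toy_* names are hypothetical and none of this is the actual kernel implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins (hypothetical); not the kernel's definitions. */
struct toy_npinfo {
	int refcnt;
};

struct toy_dev {
	const char *name;
	struct toy_npinfo *npinfo;	/* separate allocation per interface */
};

/* Take a reference, allocating the per-device npinfo on first use. */
static int toy_npinfo_get(struct toy_dev *dev)
{
	if (!dev->npinfo) {
		dev->npinfo = calloc(1, sizeof(*dev->npinfo));
		if (!dev->npinfo)
			return -1;
	}
	dev->npinfo->refcnt++;
	return 0;
}

/* Drop a reference; clear dev->npinfo only when the last one goes away. */
static void toy_npinfo_put(struct toy_dev *dev)
{
	if (!dev->npinfo)
		return;
	assert(dev->npinfo->refcnt > 0);
	if (--dev->npinfo->refcnt == 0) {
		free(dev->npinfo);
		dev->npinfo = NULL;
	}
}

/*
 * One netconsole target on bond0, then eth1 and eth0 enslaved: in this
 * model each device takes one reference on its OWN npinfo. Returns the
 * largest refcnt seen on any device, or -1 on allocation failure.
 */
static int toy_max_refcnt_after_enslave(void)
{
	struct toy_dev devs[] = {
		{ "bond0", NULL }, { "eth1", NULL }, { "eth0", NULL },
	};
	int i, max = 0, err = 0;

	for (i = 0; i < 3; i++)
		err |= toy_npinfo_get(&devs[i]);
	for (i = 0; i < 3; i++)
		if (devs[i].npinfo && devs[i].npinfo->refcnt > max)
			max = devs[i].npinfo->refcnt;
	for (i = 0; i < 3; i++)
		toy_npinfo_put(&devs[i]);

	return err ? -1 : max;
}
```

Under those assumptions no single counter goes above 1 for the sequence quoted above, and with two users on one interface the first toy_npinfo_put() leaves dev->npinfo in place for the remaining user.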
-J
---
	-Jay Vosburgh, jv@jvosburgh.net