On Tue, May 03, 2022 at 11:24:01AM -0300, Thadeu Lima de Souza Cascardo wrote:
On Tue, May 03, 2022 at 03:49:15PM +0200, Greg KH wrote:
On Mon, May 02, 2022 at 05:49:24PM -0300, Thadeu Lima de Souza Cascardo wrote:
When the rtnl_lock is dropped to look up a module, the device may be removed, releasing the qdisc and class memory. Right after the attempt to load the module, cl_ops->put is called, leading to a potential use-after-free.
Though commit e368fdb61d8e ("net: sched: use Qdisc rcu API instead of relying on rtnl lock") fixes this, it involves a lot of refactoring of the net/sched/ code, complicating its backport.
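For readers following the thread, the ordering problem described in the patch description above can be modelled in user space. This is only a sketch under stated assumptions: every model_* name is hypothetical and none of it is kernel code, the concurrent device removal is made deterministic by running it inside the "lock dropped" window, and a flag stands in for the actual free so the model never touches freed memory. The put-before-unlock ordering is shown as one possible way to avoid the problem and is not claimed to be what the 4.9 patch does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a qdisc class; NOT the kernel structure. */
struct model_class {
    int refcnt;
    bool freed; /* set instead of calling free(), so the model stays well-defined */
};

/* Hypothetical stand-in for a device that owns one class. */
struct model_dev {
    bool rtnl_held; /* models "we hold the rtnl_lock" */
    struct model_class cl;
};

/* Models device removal, which can only proceed while we do not hold the lock. */
static void model_remove_device(struct model_dev *dev)
{
    if (!dev->rtnl_held)
        dev->cl.freed = true;
}

/* Models the module lookup: drop the lock, let a removal run, retake the lock. */
static void model_request_module(struct model_dev *dev)
{
    dev->rtnl_held = false;
    model_remove_device(dev); /* the racing removal, made deterministic */
    dev->rtnl_held = true;
}

/* Models the class put. Returns true if it would have touched freed memory. */
static bool model_put(struct model_dev *dev)
{
    assert(dev->rtnl_held); /* in this model, put is only called under the lock */
    if (dev->cl.freed)
        return true;
    dev->cl.refcnt--;
    return false;
}

/* Ordering from the report: the put happens after the lock was dropped. */
bool model_put_after_unlock(void)
{
    struct model_dev dev = { .rtnl_held = true, .cl = { .refcnt = 1, .freed = false } };

    model_request_module(&dev);
    return model_put(&dev);
}

/* Assumed alternative ordering: the put happens before the lock is dropped. */
bool model_put_before_unlock(void)
{
    struct model_dev dev = { .rtnl_held = true, .cl = { .refcnt = 1, .freed = false } };
    bool uaf = model_put(&dev);

    model_request_module(&dev);
    return uaf;
}
```

In this model, model_put_after_unlock() reports that the put would have hit freed memory, while model_put_before_unlock() does not, because the class is no longer referenced once the lock is released.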
What about 4.14.y? We cannot take a commit for 4.9.y when the same issue is still broken in 4.14.y and yet fixed in 4.19.y, right? Anyone who updates from 4.9 to 4.14 will have a regression.
thanks,
greg k-h
4.14.y does not call cl_ops->put (the get/put operations and the class refcount have been done away with on 4.14.y). However, on the error path after the lock has been dropped, tcf_chain_put is called. That does not touch the qdisc, only the chain and block objects, and as far as I was able to investigate, those cannot be released by such a race.
So what changed between 4.9 and 4.14 that makes this out-of-tree change necessary to fix the issue in 4.9? Shouldn't we backport that change instead of this custom one?
thanks,
greg k-h