On Tue, May 03, 2022 at 04:27:20PM +0200, Greg KH wrote:
On Tue, May 03, 2022 at 11:24:01AM -0300, Thadeu Lima de Souza Cascardo wrote:
On Tue, May 03, 2022 at 03:49:15PM +0200, Greg KH wrote:
On Mon, May 02, 2022 at 05:49:24PM -0300, Thadeu Lima de Souza Cascardo wrote:
When the rtnl_lock is dropped to look up a module, the device may be removed, releasing the qdisc and class memory. Right after the attempt to load the module, cl_ops->put is called, leading to a potential use-after-free.
Though commit e368fdb61d8e ("net: sched: use Qdisc rcu API instead of relying on rtnl lock") fixes this, it involves a lot of refactoring of the net/sched/ code, complicating its backport.
What about 4.14.y? We cannot take a commit for 4.9.y with it also being broken in 4.14.y, and yet fixed in 4.19.y, right? Anyone who updates from 4.9 to 4.14 will have a regression.
thanks,
greg k-h
4.14.y does not call cl_ops->put (the get/put and class refcount have been done away with on 4.14.y). However, on the error path after the lock has been dropped, tcf_chain_put is called. It does not touch the qdisc, though, only the chain and block objects, which, as far as I was able to investigate, cannot be released in a race.
So what changed between 4.9 and 4.14 that requires this out-of-tree change to 4.9 for the issue? Shouldn't we backport that change instead of this custom one?
thanks,
greg k-h
143976ce992f ("net_sched: remove tc class reference counting") removed the call to cops->put, as that reference counting was dropped and the get call was replaced by find.
Backporting it is an alternative fix, but there is a greater chance of breaking something else, as it is not a trivial cherry-pick.
Cascardo.
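
For readers following the thread, here is a minimal userspace sketch of the hazard described above: a pointer obtained under a lock is used again after the lock was dropped and retaken. It is not the patch under discussion and not kernel code. Every name in it (toy_class, toy_find, toy_unlocked_window and so on) is hypothetical, and the "removal" only clears a flag in a static table, so the stale access can be observed without touching freed memory.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX 4

/* Hypothetical stand-in for a qdisc class; not the kernel structure. */
struct toy_class {
	int id;
	bool live;	/* false once the owning "device" has been removed */
};

/* Static table playing the role of the qdisc's class list. */
static struct toy_class toy_table[TOY_MAX];

/* Bring every toy class back to life with ids 1..TOY_MAX. */
static void toy_reset(void)
{
	for (int i = 0; i < TOY_MAX; i++) {
		toy_table[i].id = i + 1;
		toy_table[i].live = true;
	}
}

/* Look a class up by id; returns NULL if it no longer exists. */
static struct toy_class *toy_find(int id)
{
	for (int i = 0; i < TOY_MAX; i++)
		if (toy_table[i].live && toy_table[i].id == id)
			return &toy_table[i];
	return NULL;
}

/*
 * Models the window in which the lock is dropped (for the module
 * lookup in the thread). If device_removed is true, a concurrent
 * removal is assumed to win the race and tear everything down.
 */
static void toy_unlocked_window(bool device_removed)
{
	if (!device_removed)
		return;
	for (int i = 0; i < TOY_MAX; i++)
		toy_table[i].live = false;
}

/*
 * Hazardous pattern: keep using the pointer obtained before the
 * window. Returns true if the object was dead when it was touched,
 * which in a real kernel would be the use-after-free.
 */
static bool toy_stale_use_is_dead(int id, bool device_removed)
{
	struct toy_class *cl = toy_find(id);

	if (!cl)
		return false;
	toy_unlocked_window(device_removed);
	return !cl->live;	/* stale pointer dereferenced here */
}

/*
 * One possible safer pattern: look the object up again after the
 * window and bail out if it is gone, so a dead object is never used.
 */
static bool toy_relookup_use_is_dead(int id, bool device_removed)
{
	struct toy_class *cl = toy_find(id);

	if (!cl)
		return false;
	toy_unlocked_window(device_removed);
	cl = toy_find(id);
	if (!cl)
		return false;	/* object gone: nothing is touched */
	return !cl->live;
}
```

After toy_reset(), toy_stale_use_is_dead(1, true) reports that a dead object was touched, while toy_relookup_use_is_dead(1, true) does not. Dropping the reference counting altogether, as 143976ce992f did, is a different way to remove the late put.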