On Fri, Sep 17, 2021 at 10:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
When sysfs attributes use a lock that is also taken on module removal, we can race into a deadlock. This happens when, for instance, a sysfs file of a driver is being used while module removal is triggered at the same time. The module removal path takes a lock, and then the sysfs file operation waits on that same lock. While holding the lock, module removal tries to remove the sysfs entries, but they cannot be removed yet, as one of them is still waiting for the lock. That wait can never complete, since the lock is already held; likewise module removal cannot complete, and so we deadlock.
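To make the shape of the bug concrete, here is a minimal sketch of the pattern in a hypothetical driver; the names (foo_mutex, foo_disk, whatever_store) are made up for illustration and mirror the CPU A / CPU B sketch further below:

#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/device.h>
#include <linux/genhd.h>

static DEFINE_MUTEX(foo_mutex);
static struct gendisk *foo_disk;	/* hypothetical disk */

/* sysfs attribute: takes the same lock as the module exit path */
static ssize_t whatever_store(struct device *dev,
			      struct device_attribute *attr,
			      const char *buf, size_t count)
{
	mutex_lock(&foo_mutex);		/* blocks if foo_exit() holds it */
	/* ... act on the device ... */
	mutex_unlock(&foo_mutex);
	return count;
}
static DEVICE_ATTR_WO(whatever);

static void __exit foo_exit(void)
{
	mutex_lock(&foo_mutex);
	/*
	 * del_gendisk() -> device_del() waits for whatever_store() to
	 * finish, but whatever_store() is waiting on foo_mutex: deadlock.
	 */
	del_gendisk(foo_disk);
	mutex_unlock(&foo_mutex);
}
module_exit(foo_exit);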
This can now be easily reproduced with our sysfs selftest as follows:
./tools/testing/selftests/sysfs/sysfs.sh -t 0027
To fix this we extend struct kernfs_node with a module reference and call try_module_get() after kernfs_get_active(); kernfs_get_active() protects the integrity and the existence of the kernfs node for the duration of the operation.
So long as the kernfs node is protected with kernfs_get_active(), we know we can rely on its contents. And, as documented in the previous patch, we now also know that once kernfs_get_active() is called the module is guaranteed to exist and cannot be removed.
If try_module_get() fails, we fail the operation on the kernfs node.
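A rough sketch of the idea follows; the field name kn->module and the surrounding error handling are my assumptions for illustration, not necessarily the exact code in the patch:

	struct kernfs_node {
		/* ... existing fields ... */
		struct module	*module;	/* assumed: owner, recorded at node creation */
	};

	/* in a kernfs file operation, after taking the active reference */
	if (!kernfs_get_active(kn))
		return -ENODEV;
	if (!try_module_get(kn->module)) {
		/* module is on its way out: fail rather than block removal */
		kernfs_put_active(kn);
		return -ENODEV;
	}

	/* ... perform the sysfs operation safely ... */

	module_put(kn->module);
	kernfs_put_active(kn);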
We use a try method because with a full lock our sysfs attributes could busy us out of module removal: userspace could then deny module removal outright, a silly form of "DoS" against module removal. A try lock on the module ensures module removal gets priority, and interacting with sysfs attributes comes second. With a full lock, for instance, you could not remove a module so long as someone keeps poking at its sysfs files.
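For contrast, here is a hypothetical userspace loop (the path is made up for illustration) that, under a full blocking module reference, could keep the module pinned and make rmmod hang for as long as it runs; with the try semantics it simply starts failing once removal begins:

	/* hammer_sysfs.c: hypothetical userspace illustration */
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		for (;;) {
			/* any attribute of the target module would do */
			int fd = open("/sys/block/zram0/whatever", O_WRONLY);
			if (fd < 0)
				continue;	/* fails once removal starts */
			write(fd, "1", 1);	/* would pin a full module ref */
			close(fd);
		}
	}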
Races between removal of the sysfs files themselves and the module are not possible: the sysfs files are created by the module itself, and while a sysfs file is in use kernfs prevents removal of that file. So if module removal is actually happening, the removal has to wait until the sysfs file operation is complete.
This deadlock was first reported with the zram driver, however the live patching folks have acknowledged they have observed this as well with live patching, when a live patch is removed. I was then able to reproduce it easily by creating a dedicated selftest.
A sketch of how this can happen follows:
CPU A                            CPU B
whatever_store()                 module_unload
mutex_lock(foo)                  mutex_lock(foo)
                                 del_gendisk(zram->disk);
                                   device_del()
                                     device_remove_groups()
This flow seems possible to trigger with:
echo $dev > /sys/bus/$bus/drivers/$driver/unbind
I am missing why module pinning is part of the solution when it is the device_del() path that is racing. Module removal is just a coarser-grained way to trigger unbind => device_del(). Isn't the above a bug in the driver, not missing synchronization in kernfs? Forgive me if the unbind question was asked and answered elsewhere; this is my first time taking a look at this series.