On Tue, Sep 21, 2021 at 1:24 AM David Laight <David.Laight(a)aculab.com> wrote:
>
> From: Luis Chamberlain
> > Sent: 17 September 2021 20:47
> >
> > When sysfs attributes use a lock also used on module removal we can
> > race to deadlock. This happens when for instance a sysfs file on
> > a driver is used, then at the same time we have module removal call
> > trigger. The module removal call code holds a lock, and then the sysfs
> > file entry waits for the same lock. While holding the lock the module
> > removal tries to remove the sysfs entries, but these cannot be removed
> > yet as one is waiting for a lock. This won't complete as the lock is
> > already held. Likewise module removal cannot complete, and so we deadlock.
>
> Isn't the real problem the race between a sysfs file action and the
> removal of the sysfs node?
Nope, that is taken care of by kernfs.
> This isn't really related to module unload - except that may
> well remove some sysfs nodes.
Nope, the issue is a deadlock that can happen due to a shared lock on
module removal and a driver sysfs operation.
> This is the same problem as removing any other kind of driver callback.
> There are three basic solutions:
> 1) Use a global lock - not usually useful.
> 2) Have the remove call sleep until any callbacks are complete.
> 3) Have the remove just request removal and have a final
> callback (from a different context).
Kernfs already does a sort of combination of 1) and 2) but 1) is using
atomic reference counts.
> If the remove can sleep (as in 2) then there is a requirement
> on the driver code to not hold any locks across the 'remove'
> that can be acquired during the callbacks.
And this is the part that kernfs has no control over since the removal
and sysfs operation are implementation specific.
> Now, for sysfs, you probably only want to sleep the remove code
> while a read/write is in progress - not just because the node
> is open.
> That probably requires marking an open node 'invalid' and
> deferring delete to close.
This is already done by kernfs.
> None of this requires a reference count on the module.
You are missing the point to the other aspect of the try_module_get(),
it lets you also check if module exit has been entered. By using
try_module_get() you let the module exit trump proceeding with an
operation, therefore also preventing any potential use of a shared
lock on module exit and the driver specific sysfs operation.
Luis
Makefile uses TEST_PROGS instead of TEST_GEN_PROGS to define
executables. TEST_PROGS is for shell scripts that need to be
installed and run by the common lib.mk framework. The common
framework doesn't touch TEST_PROGS when it does build and clean.
As a result "make kselftest-clean" and "make clean" fail to remove
executables. Run and install work because the common framework runs
and installs TEST_PROGS. Build works because the Makefile defines
"all" rule which is unnecessary if TEST_GEN_PROGS is used.
Use TEST_GEN_PROGS so the common framework can handle build/run/
install/clean properly.
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
tools/testing/selftests/net/af_unix/Makefile | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/tools/testing/selftests/net/af_unix/Makefile b/tools/testing/selftests/net/af_unix/Makefile
index cfc7f4f97fd1..df341648f818 100644
--- a/tools/testing/selftests/net/af_unix/Makefile
+++ b/tools/testing/selftests/net/af_unix/Makefile
@@ -1,5 +1,2 @@
-##TEST_GEN_FILES := test_unix_oob
-TEST_PROGS := test_unix_oob
+TEST_GEN_PROGS := test_unix_oob
include ../../lib.mk
-
-all: $(TEST_PROGS)
--
2.30.2
The ATTRIBUTE_GROUPS is typically used to avoid boiler plate
code which is used in many drivers. Embracing ATTRIBUTE_GROUPS was
long due on the zram driver, however a recent fix for sysfs allows
users of ATTRIBUTE_GROUPS to also associate a module to the group
attribute.
In zram's case this also means it allows us to fix a race which triggers
a deadlock on the zram driver. This deadlock happens when a sysfs attribute
use a lock also used on module removal. This happens when for instance a
sysfs file on a driver is used, then at the same time we have module
removal call trigger. The module removal call code holds a lock, and then
the sysfs file entry waits for the same lock. While holding the lock the
module removal tries to remove the sysfs entries, but these cannot be
removed yet as one is waiting for a lock. This won't complete as the lock
is already held. Likewise module removal cannot complete, and so we
deadlock.
Sysfs fixes this when the group attributes have a module associated to
it, sysfs will *try* to get a refcount to the module when a shared
lock is used, prior to mucking with a sysfs attribute. If this fails we
just give up right away.
This deadlock was first reported with the zram driver, a sketch of how
this can happen follows:
CPU A CPU B
whatever_store()
module_unload
mutex_lock(foo)
mutex_lock(foo)
del_gendisk(zram->disk);
device_del()
device_remove_groups()
In this situation whatever_store() is waiting for the mutex foo to
become unlocked, but that won't happen until module removal is complete.
But module removal won't complete until the sysfs file being poked
completes which is waiting for a lock already held.
This issue can be reproduced easily on the zram driver as follows:
Loop 1 on one terminal:
while true;
do modprobe zram;
modprobe -r zram;
done
Loop 2 on a second terminal:
while true; do
echo 1024 > /sys/block/zram0/disksize;
echo 1 > /sys/block/zram0/reset;
done
Without this patch we end up in a deadlock, and the following
stack trace is produced which hints to us what the issue was:
INFO: task bash:888 blocked for more than 120 seconds.
Tainted: G E 5.12.0-rc1-next-20210304+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash state:D stack: 0 pid: 888 ppid: 887 flags:<etc>
Call Trace:
__schedule+0x2e4/0x900
schedule+0x46/0xb0
schedule_preempt_disabled+0xa/0x10
__mutex_lock.constprop.0+0x2c3/0x490
? _kstrtoull+0x35/0xd0
reset_store+0x6c/0x160 [zram]
kernfs_fop_write_iter+0x124/0x1b0
new_sync_write+0x11c/0x1b0
vfs_write+0x1c2/0x260
ksys_write+0x5f/0xe0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f34f2c3df33
RSP: 002b:00007ffe751df6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f34f2c3df33
RDX: 0000000000000002 RSI: 0000561ccb06ec10 RDI: 0000000000000001
RBP: 0000561ccb06ec10 R08: 000000000000000a R09: 0000000000000001
R10: 0000561ccb157590 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f34f2d0e6a0 R14: 0000000000000002 R15: 00007f34f2d0e8a0
INFO: task modprobe:1104 can't die for more than 120 seconds.
task:modprobe state:D stack: 0 pid: 1104 ppid: 916 flags:<etc>
Call Trace:
__schedule+0x2e4/0x900
schedule+0x46/0xb0
__kernfs_remove.part.0+0x228/0x2b0
? finish_wait+0x80/0x80
kernfs_remove_by_name_ns+0x50/0x90
remove_files+0x2b/0x60
sysfs_remove_group+0x38/0x80
sysfs_remove_groups+0x29/0x40
device_remove_attrs+0x4a/0x80
device_del+0x183/0x3e0
? mutex_lock+0xe/0x30
del_gendisk+0x27a/0x2d0
zram_remove+0x8a/0xb0 [zram]
? hot_remove_store+0xf0/0xf0 [zram]
zram_remove_cb+0xd/0x10 [zram]
idr_for_each+0x5e/0xd0
destroy_devices+0x39/0x6f [zram]
__do_sys_delete_module+0x190/0x2a0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f32adf727d7
RSP: 002b:00007ffc08bb38a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 000055eea23cbb10 RCX: 00007f32adf727d7
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055eea23cbb78
RBP: 000055eea23cbb10 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f32adfe5ac0 R11: 0000000000000206 R12: 000055eea23cbb78
R13: 0000000000000000 R14: 0000000000000000 R15: 000055eea23cbc20
[0] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com
Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org>
---
drivers/block/zram/zram_drv.c | 11 ++---------
1 file changed, 2 insertions(+), 9 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b26abcb955cc..60a55ae8cd91 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1902,14 +1902,7 @@ static struct attribute *zram_disk_attrs[] = {
NULL,
};
-static const struct attribute_group zram_disk_attr_group = {
- .attrs = zram_disk_attrs,
-};
-
-static const struct attribute_group *zram_disk_attr_groups[] = {
- &zram_disk_attr_group,
- NULL,
-};
+ATTRIBUTE_GROUPS(zram_disk);
/*
* Allocate and initialize new zram device. the function returns
@@ -1981,7 +1974,7 @@ static int zram_add(void)
blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue);
- device_add_disk(NULL, zram->disk, zram_disk_attr_groups);
+ device_add_disk(NULL, zram->disk, zram_disk_groups);
strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
--
2.30.2
Provide a simple state machine to fix races with driver exit where we
remove the CPU multistate callbacks and re-initialization / creation of
new per CPU instances which should be managed by these callbacks.
The zram driver makes use of cpu hotplug multistate support, whereby it
associates a struct zcomp per CPU. Each struct zcomp represents a
compression algorithm in charge of managing compression streams per
CPU. Although a compiled zram driver only supports a fixed set of
compression algorithms, each zram device gets a struct zcomp allocated
per CPU. The "multi" in CPU hotplug multstate refers to these per
cpu struct zcomp instances. Each of these will have the CPU hotplug
callback called for it on CPU plug / unplug. The kernel's CPU hotplug
multistate keeps a linked list of these different structures so that
it will iterate over them on CPU transitions.
By default at driver initialization we will create just one zram device
(num_devices=1) and a zcomp structure then set for the now default
lzo-rle comrpession algorithm. At driver removal we first remove each
zram device, and so we destroy the associated struct zcomp per CPU. But
since we expose sysfs attributes to create new devices or reset /
initialize existing zram devices, we can easily end up re-initializing
a struct zcomp for a zram device before the exit routine of the module
removes the cpu hotplug callback. When this happens the kernel's CPU
hotplug will detect that at least one instance (struct zcomp for us)
exists. This can happen in the following situation:
CPU 1 CPU 2
disksize_store(...);
class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();
idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);
The warning comes up on cpuhp_remove_multi_state() when it sees that the
state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
In this case, that a struct zcom still exists, the driver allowed its
creation per CPU even though we could have just freed them per CPU
though a call on another CPU, and we are then later trying to remove the
hotplug callback.
Fix all this by providing a zram initialization boolean
protected the shared in the driver zram_index_mutex, which we
can use to annotate when sysfs attributes are safe to use or
not -- once the driver is properly initialized. When the driver
is going down we also are sure to not let userspace muck with
attributes which may affect each per cpu struct zcomp.
This also fixes a series of possible memory leaks. The
crashes and memory leaks can easily be caused by issuing
the zram02.sh script from the LTP project [0] in a loop
in two separate windows:
cd testcases/kernel/device-drivers/zram
while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
You end up with a splat as follows:
kernel: zram: Removed device: zram0
kernel: zram: Added device: zram0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: Adding 104857596k swap on /dev/zram0. <etc>
kernel: zram0: detected capacitky change from 209715200 to 0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: ------------[ cut here ]------------
kernel: Error: Removing state 63 which has instances left.
kernel: WARNING: CPU: 7 PID: 70457 at \
kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G \
E 5.12.0-rc1-next-20210304 #3
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \
BIOS 1.14.0-2 04/01/2014
kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Code: <etc>
kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
kernel: RBP: 0000000000000000i R08: 0000000000000000 R09: ffffa800c139bcb8
kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc>
kernel: CS: 0010 DS: 0000 ES 0000 CR0: 0000000080050033
kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel: __cpuhp_remove_state+0x2e/0x80
kernel: __do_sys_delete_module+0x190/0x2a0
kernel: do_syscall_64+0x33/0x80
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
The "Error: Removing state 63 which has instances left" refers
to the zram per CPU struct zcomp instances left.
[0] https://github.com/linux-test-project/ltp.git
Acked-by: Minchan Kim <minchan(a)kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org>
---
drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
1 file changed, 55 insertions(+), 8 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f61910c65f0f..b26abcb955cc 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
static int zram_major;
static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
+static bool zram_up;
+
/* Module params (documentation at end) */
static unsigned int num_devices = 1;
/*
@@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram)
comp = zram->comp;
disksize = zram->disksize;
zram->disksize = 0;
+ zram->comp = NULL;
set_capacity_and_notify(zram->disk, 0);
part_stat_set_all(zram->disk->part0, 0);
@@ -1724,9 +1727,18 @@ static ssize_t disksize_store(struct device *dev,
struct zram *zram = dev_to_zram(dev);
int err;
+ mutex_lock(&zram_index_mutex);
+
+ if (!zram_up) {
+ err = -ENODEV;
+ goto out;
+ }
+
disksize = memparse(buf, NULL);
- if (!disksize)
- return -EINVAL;
+ if (!disksize) {
+ err = -EINVAL;
+ goto out;
+ }
down_write(&zram->init_lock);
if (init_done(zram)) {
@@ -1754,12 +1766,16 @@ static ssize_t disksize_store(struct device *dev,
set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
up_write(&zram->init_lock);
+ mutex_unlock(&zram_index_mutex);
+
return len;
out_free_meta:
zram_meta_free(zram, disksize);
out_unlock:
up_write(&zram->init_lock);
+out:
+ mutex_unlock(&zram_index_mutex);
return err;
}
@@ -1775,8 +1791,17 @@ static ssize_t reset_store(struct device *dev,
if (ret)
return ret;
- if (!do_reset)
- return -EINVAL;
+ mutex_lock(&zram_index_mutex);
+
+ if (!zram_up) {
+ len = -ENODEV;
+ goto out;
+ }
+
+ if (!do_reset) {
+ len = -EINVAL;
+ goto out;
+ }
zram = dev_to_zram(dev);
bdev = zram->disk->part0;
@@ -1785,7 +1810,8 @@ static ssize_t reset_store(struct device *dev,
/* Do not reset an active device or claimed device */
if (bdev->bd_openers || zram->claim) {
mutex_unlock(&bdev->bd_disk->open_mutex);
- return -EBUSY;
+ len = -EBUSY;
+ goto out;
}
/* From now on, anyone can't open /dev/zram[0-9] */
@@ -1800,6 +1826,8 @@ static ssize_t reset_store(struct device *dev,
zram->claim = false;
mutex_unlock(&bdev->bd_disk->open_mutex);
+out:
+ mutex_unlock(&zram_index_mutex);
return len;
}
@@ -2010,6 +2038,10 @@ static ssize_t hot_add_show(struct class *class,
int ret;
mutex_lock(&zram_index_mutex);
+ if (!zram_up) {
+ mutex_unlock(&zram_index_mutex);
+ return -ENODEV;
+ }
ret = zram_add();
mutex_unlock(&zram_index_mutex);
@@ -2037,6 +2069,11 @@ static ssize_t hot_remove_store(struct class *class,
mutex_lock(&zram_index_mutex);
+ if (!zram_up) {
+ ret = -ENODEV;
+ goto out;
+ }
+
zram = idr_find(&zram_index_idr, dev_id);
if (zram) {
ret = zram_remove(zram);
@@ -2046,6 +2083,7 @@ static ssize_t hot_remove_store(struct class *class,
ret = -ENODEV;
}
+out:
mutex_unlock(&zram_index_mutex);
return ret ? ret : count;
}
@@ -2072,12 +2110,15 @@ static int zram_remove_cb(int id, void *ptr, void *data)
static void destroy_devices(void)
{
+ mutex_lock(&zram_index_mutex);
+ zram_up = false;
class_unregister(&zram_control_class);
idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
zram_debugfs_destroy();
idr_destroy(&zram_index_idr);
unregister_blkdev(zram_major, "zram");
cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
+ mutex_unlock(&zram_index_mutex);
}
static int __init zram_init(void)
@@ -2105,15 +2146,21 @@ static int __init zram_init(void)
return -EBUSY;
}
+ mutex_lock(&zram_index_mutex);
+
while (num_devices != 0) {
- mutex_lock(&zram_index_mutex);
ret = zram_add();
- mutex_unlock(&zram_index_mutex);
- if (ret < 0)
+ if (ret < 0) {
+ mutex_unlock(&zram_index_mutex);
goto out_error;
+ }
num_devices--;
}
+ zram_up = true;
+
+ mutex_unlock(&zram_index_mutex);
+
return 0;
out_error:
--
2.30.2
Now that sysfs has the deadlock race fixed with module removal,
enable the deadlock tests module removal tests. They were left
disabled by default as otherwise you would deadlock your system
./tools/testing/selftests/sysfs/sysfs.sh -t 0027
Running test: sysfs_test_0027 - run #0
Test for possible rmmod deadlock while writing x ... ok
./tools/testing/selftests/sysfs/sysfs.sh -t 0028
Running test: sysfs_test_0028 - run #0
Test for possible rmmod deadlock using rtnl_lock while writing x ... ok
Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org>
---
tools/testing/selftests/sysfs/sysfs.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
index f928635d0e35..4047ac48e764 100755
--- a/tools/testing/selftests/sysfs/sysfs.sh
+++ b/tools/testing/selftests/sysfs/sysfs.sh
@@ -60,8 +60,8 @@ ALL_TESTS="$ALL_TESTS 0023:1:1:test_dev_y:block"
ALL_TESTS="$ALL_TESTS 0024:1:1:test_dev_x:block"
ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
-ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
-ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
+ALL_TESTS="$ALL_TESTS 0027:1:1:test_dev_x:block" # deadlock test
+ALL_TESTS="$ALL_TESTS 0028:1:1:test_dev_x:block" # deadlock test with rntl_lock
ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store
ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex
ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex
--
2.30.2
If one ends up expanding on this line checkpatch will complain that the
combination S_IRWXU|S_IRUGO|S_IXUGO should just be replaced with the
octal 0755. Do that.
This makes no functional changes.
Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org>
---
fs/sysfs/dir.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 59dffd5ca517..b6b6796e1616 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -56,8 +56,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
kobject_get_ownership(kobj, &uid, &gid);
- kn = kernfs_create_dir_ns(parent, kobject_name(kobj),
- S_IRWXU | S_IRUGO | S_IXUGO, uid, gid,
+ kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
kobj, ns);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
--
2.30.2