September 2021 - Linux-kselftest-mirror

Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal

by Luis Chamberlain

On Tue, Sep 21, 2021 at 1:24 AM David Laight <David.Laight(a)aculab.com> wrote: > > From: Luis Chamberlain > > Sent: 17 September 2021 20:47 > > > > When sysfs attributes use a lock also used on module removal we can > > race to deadlock. This happens when for instance a sysfs file on > > a driver is used, then at the same time we have module removal call > > trigger. The module removal call code holds a lock, and then the sysfs > > file entry waits for the same lock. While holding the lock the module > > removal tries to remove the sysfs entries, but these cannot be removed > > yet as one is waiting for a lock. This won't complete as the lock is > > already held. Likewise module removal cannot complete, and so we deadlock. > > Isn't the real problem the race between a sysfs file action and the > removal of the sysfs node? Nope, that is taken care of by kernfs. > This isn't really related to module unload - except that may > well remove some sysfs nodes. Nope, the issue is a deadlock that can happen due to a shared lock on module removal and a driver sysfs operation. > This is the same problem as removing any other kind of driver callback. > There are three basic solutions: > 1) Use a global lock - not usually useful. > 2) Have the remove call sleep until any callbacks are complete. > 3) Have the remove just request removal and have a final > callback (from a different context). Kernfs already does a sort of combination of 1) and 2) but 1) is using atomic reference counts. > If the remove can sleep (as in 2) then there is a requirement > on the driver code to not hold any locks across the 'remove' > that can be acquired during the callbacks. And this is the part that kernfs has no control over since the removal and sysfs operation are implementation specific. > Now, for sysfs, you probably only want to sleep the remove code > while a read/write is in progress - not just because the node > is open. > That probably requires marking an open node 'invalid' and > deferring delete to close. This is already done by kernfs. > None of this requires a reference count on the module. You are missing the point to the other aspect of the try_module_get(), it lets you also check if module exit has been entered. By using try_module_get() you let the module exit trump proceeding with an operation, therefore also preventing any potential use of a shared lock on module exit and the driver specific sysfs operation. Luis

4 years, 2 months

1
0
0 0

[PATCH v7 03/12] selftests: add tests_sysfs module

by Luis Chamberlain

This adds a new selftest module which can be used to test sysfs, which would otherwise require using an existing driver. This lets us muck with a template driver to test breaking things without affecting system behaviour or requiring the dependencies of a real device driver. A series of 28 tests are added. Support for using two device types are supported: * misc * block Contrary to sysctls, sysfs requires a full write to happen at once, and so we reduce the digit tests to single writes. Two main sysfs knobs are provided for testing reading/storing, one which doesn't inclur any delays and another which can incur programmed delays. What locks are held, if any, are configurable, at module load time, or through dynamic configuration at run time. Since sysfs is a technically filesystem, but a pseudo one, which requires a kernel user, our test_sysfs module and respective test script embraces fstests format for tests in the kernel ring bufffer. Likewise, a scraper for kernel crashes is provided which matches what fstests does as well. Two tests are kept disabled as they currently cause a deadlock, and so this provides a mechanism to easily show proof and demo how the deadlock can happen: Demos the deadlock with a device specific lock ./tools/testing/selftests/sysfs/sysfs.sh -t 0027 Demos the deadlock with rtnl_lock() ./tools/testing/selftests/sysfs/sysfs.sh -t 0028 Two separate solutions to the deadlock issue have been proposed, and so now its a matter of either documenting this limitation or eventually adopting a generic fix. This selftests will shortly be expanded upon with more tests which require further kernel changes in order to provide better test coverage. Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org> --- MAINTAINERS | 7 + lib/Kconfig.debug | 12 + lib/Makefile | 1 + lib/test_sysfs.c | 943 ++++++++++++++++++ tools/testing/selftests/sysfs/Makefile | 12 + tools/testing/selftests/sysfs/config | 2 + tools/testing/selftests/sysfs/sysfs.sh | 1208 ++++++++++++++++++++++++ 7 files changed, 2185 insertions(+) create mode 100644 lib/test_sysfs.c create mode 100644 tools/testing/selftests/sysfs/Makefile create mode 100644 tools/testing/selftests/sysfs/config create mode 100755 tools/testing/selftests/sysfs/sysfs.sh diff --git a/MAINTAINERS b/MAINTAINERS index 15a5fd8323f7..28a34384f541 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -18166,6 +18166,13 @@ L: linux-mmc(a)vger.kernel.org S: Maintained F: drivers/mmc/host/sdhci-pci-dwc-mshc.c +SYSFS TEST DRIVER +M: Luis Chamberlain <mcgrof(a)kernel.org> +L: linux-kernel(a)vger.kernel.org +S: Maintained +F: lib/test_sysfs.c +F: tools/testing/selftests/sysfs/ + SYSTEM CONFIGURATION (SYSCON) M: Lee Jones <lee.jones(a)linaro.org> M: Arnd Bergmann <arnd(a)arndb.de> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index dbff322207ad..ae19bf1a21b8 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2343,6 +2343,18 @@ config TEST_SYSCTL If unsure, say N. +config TEST_SYSFS + tristate "sysfs test driver" + depends on SYSFS + depends on NET + depends on BLOCK + help + This builds the "test_sysfs" module. This driver enables to test the + sysfs file system safely without affecting production knobs which + might alter system functionality. + + If unsure, say N. + config BITFIELD_KUNIT tristate "KUnit test bitfield functions at runtime" depends on KUNIT diff --git a/lib/Makefile b/lib/Makefile index 2cfd33917ad5..5143d65f90d6 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -61,6 +61,7 @@ obj-$(CONFIG_TEST_FIRMWARE) += test_firmware.o obj-$(CONFIG_TEST_BITOPS) += test_bitops.o CFLAGS_test_bitops.o += -Werror obj-$(CONFIG_TEST_SYSCTL) += test_sysctl.o +obj-$(CONFIG_TEST_SYSFS) += test_sysfs.o obj-$(CONFIG_TEST_HASH) += test_hash.o test_siphash.o obj-$(CONFIG_TEST_IDA) += test_ida.o obj-$(CONFIG_KASAN_KUNIT_TEST) += test_kasan.o diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c new file mode 100644 index 000000000000..273fc3f39740 --- /dev/null +++ b/lib/test_sysfs.c @@ -0,0 +1,943 @@ +// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1 +/* + * sysfs test driver + * + * Copyright (C) 2021 Luis Chamberlain <mcgrof(a)kernel.org> + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or at your option any + * later version; or, when distributed separately from the Linux kernel or + * when incorporated into other software packages, subject to the following + * license: + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of copyleft-next (version 0.3.1 or later) as published + * at http://copyleft-next.org/. + */ + +/* + * This module allows us to add race conditions which we can test for + * against the sysfs filesystem. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/init.h> +#include <linux/list.h> +#include <linux/module.h> +#include <linux/printk.h> +#include <linux/fs.h> +#include <linux/miscdevice.h> +#include <linux/slab.h> +#include <linux/uaccess.h> +#include <linux/async.h> +#include <linux/delay.h> +#include <linux/vmalloc.h> +#include <linux/debugfs.h> +#include <linux/rtnetlink.h> +#include <linux/genhd.h> +#include <linux/blkdev.h> + +static bool enable_lock; +module_param(enable_lock, bool_enable_only, 0644); +MODULE_PARM_DESC(enable_lock, + "enable locking on reads / stores from the start"); + +static bool enable_lock_on_rmmod; +module_param(enable_lock_on_rmmod, bool_enable_only, 0644); +MODULE_PARM_DESC(enable_lock_on_rmmod, + "enable locking on rmmod"); + +static bool use_rtnl_lock; +module_param(use_rtnl_lock, bool_enable_only, 0644); +MODULE_PARM_DESC(use_rtnl_lock, + "use an rtnl_lock instead of the device mutex_lock"); + +static unsigned int write_delay_msec_y = 500; +module_param_named(write_delay_msec_y, write_delay_msec_y, uint, 0644); +MODULE_PARM_DESC(write_delay_msec_y, "msec write delay for writes to y"); + +static unsigned int test_devtype; +module_param_named(devtype, test_devtype, uint, 0644); +MODULE_PARM_DESC(devtype, "device type to register"); + +static bool enable_busy_alloc; +module_param(enable_busy_alloc, bool_enable_only, 0644); +MODULE_PARM_DESC(enable_busy_alloc, "do a fake allocation during writes"); + +static bool enable_debugfs; +module_param(enable_debugfs, bool_enable_only, 0644); +MODULE_PARM_DESC(enable_debugfs, "enable a few debugfs files"); + +static bool enable_verbose_writes; +module_param(enable_verbose_writes, bool_enable_only, 0644); +MODULE_PARM_DESC(enable_debugfs, "enable stores to print verbose information"); + +static unsigned int delay_rmmod_ms; +module_param_named(delay_rmmod_ms, delay_rmmod_ms, uint, 0644); +MODULE_PARM_DESC(delay_rmmod_ms, "if set how many ms to delay rmmod before device deletion"); + +static bool enable_verbose_rmmod; +module_param(enable_verbose_rmmod, bool_enable_only, 0644); +MODULE_PARM_DESC(enable_verbose_rmmod, "enable verbose print messages on rmmod"); + +static int sysfs_test_major; + +/** + * test_config - used for configuring how the sysfs test device will behave + * + * @enable_lock: if enabled a lock will be used when reading/storing variables + * @enable_lock_on_rmmod: if enabled a lock will be used when reading/storing + * sysfs attributes, but it will also be used to lock on rmmod. This is + * useful to test for a deadlock. + * @use_rtnl_lock: if enabled instead of configuration specific mutex, we'll + * use the rtnl_lock. If your test case is modifying this on the fly + * while doing other stores / reads, things will break as a lock can be + * left contending. Best is that tests use this knob serially, without + * allowing userspace to modify other knobs while this one changes. + * @write_delay_msec_y: the amount of delay to use when writing to y + * @enable_busy_alloc: if enabled we'll do a large allocation between + * writes. We immediately free right away. We also schedule to give the + * kernel some time to re-use any memory we don't need. This is intened + * to mimic typical driver behaviour. + */ +struct test_config { + bool enable_lock; + bool enable_lock_on_rmmod; + bool use_rtnl_lock; + unsigned int write_delay_msec_y; + bool enable_busy_alloc; +}; + +/** + * enum sysfs_test_devtype - sysfs device type + * @TESTDEV_TYPE_MISC: misc device type + * @TESTDEV_TYPE_BLOCK: use a block device for the sysfs test device. + */ +enum sysfs_test_devtype { + TESTDEV_TYPE_MISC = 0, + TESTDEV_TYPE_BLOCK, +}; + +/** + * sysfs_test_device - test device to help test sysfs + * + * @devtype: the type of device to use + * @config: configuration for the test + * @config_mutex: protects configuration of test + * @misc_dev: we use a misc device under the hood + * @disk: represents a disk when used as a block device + * @dev: pointer to misc_dev's own struct device + * @dev_idx: unique ID for test device + * @x: variable we can use to test read / store + * @y: slow variable we can use to test read / store + */ +struct sysfs_test_device { + enum sysfs_test_devtype devtype; + struct test_config config; + struct mutex config_mutex; + struct miscdevice misc_dev; + struct gendisk *disk; + struct device *dev; + int dev_idx; + int x; + int y; +}; + +static struct sysfs_test_device *first_test_dev; + +static struct miscdevice *dev_to_misc_dev(struct device *dev) +{ + return dev_get_drvdata(dev); +} + +static struct sysfs_test_device *misc_dev_to_test_dev(struct miscdevice *misc_dev) +{ + return container_of(misc_dev, struct sysfs_test_device, misc_dev); +} + +static struct sysfs_test_device *devblock_to_test_dev(struct device *dev) +{ + return (struct sysfs_test_device *)dev_to_disk(dev)->private_data; +} + +static struct sysfs_test_device *devmisc_to_testdev(struct device *dev) +{ + struct miscdevice *misc_dev; + + misc_dev = dev_to_misc_dev(dev); + return misc_dev_to_test_dev(misc_dev); +} + +static struct sysfs_test_device *dev_to_test_dev(struct device *dev) +{ + if (test_devtype == TESTDEV_TYPE_MISC) + return devmisc_to_testdev(dev); + else if (test_devtype == TESTDEV_TYPE_BLOCK) + return devblock_to_test_dev(dev); + return NULL; +} + +static void test_dev_config_lock(struct sysfs_test_device *test_dev) +{ + struct test_config *config; + + config = &test_dev->config; + if (config->enable_lock) { + if (config->use_rtnl_lock) + rtnl_lock(); + else + mutex_lock(&test_dev->config_mutex); + } +} + +static void test_dev_config_unlock(struct sysfs_test_device *test_dev) +{ + struct test_config *config; + + config = &test_dev->config; + if (config->enable_lock) { + if (config->use_rtnl_lock) + rtnl_unlock(); + else + mutex_unlock(&test_dev->config_mutex); + } +} + +static void test_dev_config_lock_rmmod(struct sysfs_test_device *test_dev) +{ + struct test_config *config; + + config = &test_dev->config; + if (config->enable_lock_on_rmmod) + test_dev_config_lock(test_dev); +} + +static void test_dev_config_unlock_rmmod(struct sysfs_test_device *test_dev) +{ + struct test_config *config; + + config = &test_dev->config; + if (config->enable_lock_on_rmmod) + test_dev_config_unlock(test_dev); +} + +static void free_test_dev_sysfs(struct sysfs_test_device *test_dev) +{ + if (test_dev) { + kfree_const(test_dev->misc_dev.name); + test_dev->misc_dev.name = NULL; + kfree(test_dev); + test_dev = NULL; + } +} + +static void test_sysfs_reset_vals(struct sysfs_test_device *test_dev) +{ + test_dev->x = 3; + test_dev->y = 4; +} + +static ssize_t config_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config = &test_dev->config; + int len = 0; + + test_dev_config_lock(test_dev); + + len += snprintf(buf, PAGE_SIZE, + "Configuration for: %s\n", + dev_name(dev)); + + len += snprintf(buf+len, PAGE_SIZE - len, + "x:\t%d\n", + test_dev->x); + + len += snprintf(buf+len, PAGE_SIZE - len, + "y:\t%d\n", + test_dev->y); + + len += snprintf(buf+len, PAGE_SIZE - len, + "enable_lock:\t%s\n", + config->enable_lock ? "true" : "false"); + + len += snprintf(buf+len, PAGE_SIZE - len, + "enable_lock_on_rmmmod:\t%s\n", + config->enable_lock_on_rmmod ? "true" : "false"); + + len += snprintf(buf+len, PAGE_SIZE - len, + "use_rtnl_lock:\t%s\n", + config->use_rtnl_lock ? "true" : "false"); + + len += snprintf(buf+len, PAGE_SIZE - len, + "write_delay_msec_y:\t%d\n", + config->write_delay_msec_y); + + len += snprintf(buf+len, PAGE_SIZE - len, + "enable_busy_alloc:\t%s\n", + config->enable_busy_alloc ? "true" : "false"); + + len += snprintf(buf+len, PAGE_SIZE - len, + "enable_debugfs:\t%s\n", + enable_debugfs ? "true" : "false"); + + len += snprintf(buf+len, PAGE_SIZE - len, + "enable_verbose_writes:\t%s\n", + enable_verbose_writes ? "true" : "false"); + + test_dev_config_unlock(test_dev); + + return len; +} +static DEVICE_ATTR_RO(config); + +static ssize_t reset_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config; + + config = &test_dev->config; + + /* + * We compromise and simplify this condition and do not use a lock + * here as the lock type can change. + */ + config->enable_lock = false; + config->enable_lock_on_rmmod = false; + config->use_rtnl_lock = false; + config->enable_busy_alloc = false; + test_sysfs_reset_vals(test_dev); + + dev_info(dev, "reset\n"); + + return count; +} +static DEVICE_ATTR_WO(reset); + +static void test_dev_busy_alloc(struct sysfs_test_device *test_dev) +{ + struct test_config *config; + char *ignore; + + config = &test_dev->config; + if (!config->enable_busy_alloc) + return; + + ignore = kzalloc(sizeof(struct sysfs_test_device) * 10, GFP_KERNEL); + kfree(ignore); + + schedule(); +} + +static ssize_t test_dev_x_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + int ret; + + test_dev_busy_alloc(test_dev); + test_dev_config_lock(test_dev); + + ret = kstrtoint(buf, 10, &test_dev->x); + if (ret) + count = ret; + + if (enable_verbose_writes) + dev_info(test_dev->dev, "wrote x = %d\n", test_dev->x); + + test_dev_config_unlock(test_dev); + + return count; +} + +static ssize_t test_dev_x_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + int ret; + + test_dev_config_lock(test_dev); + ret = snprintf(buf, PAGE_SIZE, "%d\n", test_dev->x); + test_dev_config_unlock(test_dev); + + return ret; +} +static DEVICE_ATTR_RW(test_dev_x); + +static ssize_t test_dev_y_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config; + int y; + int ret; + + test_dev_busy_alloc(test_dev); + test_dev_config_lock(test_dev); + + config = &test_dev->config; + + ret = kstrtoint(buf, 10, &y); + if (ret) + count = ret; + + msleep(config->write_delay_msec_y); + test_dev->y = test_dev->x + y + 7; + + if (enable_verbose_writes) + dev_info(test_dev->dev, "wrote y = %d\n", test_dev->y); + + test_dev_config_unlock(test_dev); + + return count; +} + +static ssize_t test_dev_y_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + int ret; + + test_dev_config_lock(test_dev); + ret = snprintf(buf, PAGE_SIZE, "%d\n", test_dev->y); + test_dev_config_unlock(test_dev); + + return ret; +} +static DEVICE_ATTR_RW(test_dev_y); + +static ssize_t config_enable_lock_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config; + int ret; + int val; + + config = &test_dev->config; + ret = kstrtoint(buf, 10, &val); + if (ret) + return ret; + + /* + * We compromise for simplicty and do not lock when changing + * locking configuration, with the assumption userspace tests + * will know this. + */ + if (val) + config->enable_lock = true; + else + config->enable_lock = false; + + return count; +} + +static ssize_t config_enable_lock_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config; + ssize_t ret; + + config = &test_dev->config; + + test_dev_config_lock(test_dev); + ret = snprintf(buf, PAGE_SIZE, "%d\n", config->enable_lock); + test_dev_config_unlock(test_dev); + + return ret; +} +static DEVICE_ATTR_RW(config_enable_lock); + +static ssize_t config_enable_lock_on_rmmod_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config; + int ret; + int val; + + config = &test_dev->config; + ret = kstrtoint(buf, 10, &val); + if (ret) + return ret; + + test_dev_config_lock(test_dev); + if (val) + config->enable_lock_on_rmmod = true; + else + config->enable_lock_on_rmmod = false; + test_dev_config_unlock(test_dev); + + return count; +} + +static ssize_t config_enable_lock_on_rmmod_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config; + ssize_t ret; + + config = &test_dev->config; + + test_dev_config_lock(test_dev); + ret = snprintf(buf, PAGE_SIZE, "%d\n", config->enable_lock_on_rmmod); + test_dev_config_unlock(test_dev); + + return ret; +} +static DEVICE_ATTR_RW(config_enable_lock_on_rmmod); + +static ssize_t config_use_rtnl_lock_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config; + int ret; + int val; + + config = &test_dev->config; + ret = kstrtoint(buf, 10, &val); + if (ret) + return ret; + + /* + * We compromise and simplify this condition and do not use a lock + * here as the lock type can change. + */ + if (val) + config->use_rtnl_lock = true; + else + config->use_rtnl_lock = false; + + return count; +} + +static ssize_t config_use_rtnl_lock_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config; + + config = &test_dev->config; + + return snprintf(buf, PAGE_SIZE, "%d\n", config->use_rtnl_lock); +} +static DEVICE_ATTR_RW(config_use_rtnl_lock); + +static ssize_t config_write_delay_msec_y_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config; + int ret; + int val; + + config = &test_dev->config; + ret = kstrtoint(buf, 10, &val); + if (ret) + return ret; + + test_dev_config_lock(test_dev); + config->write_delay_msec_y = val; + test_dev_config_unlock(test_dev); + + return count; +} + +static ssize_t config_write_delay_msec_y_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config; + + config = &test_dev->config; + + return snprintf(buf, PAGE_SIZE, "%d\n", config->write_delay_msec_y); +} +static DEVICE_ATTR_RW(config_write_delay_msec_y); + +static ssize_t config_enable_busy_alloc_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config; + int ret; + int val; + + config = &test_dev->config; + ret = kstrtoint(buf, 10, &val); + if (ret) + return ret; + + test_dev_config_lock(test_dev); + config->enable_busy_alloc = val; + test_dev_config_unlock(test_dev); + + return count; +} + +static ssize_t config_enable_busy_alloc_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct sysfs_test_device *test_dev = dev_to_test_dev(dev); + struct test_config *config; + + config = &test_dev->config; + + return snprintf(buf, PAGE_SIZE, "%d\n", config->enable_busy_alloc); +} +static DEVICE_ATTR_RW(config_enable_busy_alloc); + +#define TEST_SYSFS_DEV_ATTR(name) (&dev_attr_##name.attr) + +static struct attribute *test_dev_attrs[] = { + /* Generic driver knobs go here */ + TEST_SYSFS_DEV_ATTR(config), + TEST_SYSFS_DEV_ATTR(reset), + + /* These are used to test sysfs */ + TEST_SYSFS_DEV_ATTR(test_dev_x), + TEST_SYSFS_DEV_ATTR(test_dev_y), + + /* + * These are configuration knobs to modify how we test sysfs when + * doing reads / stores. + */ + TEST_SYSFS_DEV_ATTR(config_enable_lock), + TEST_SYSFS_DEV_ATTR(config_enable_lock_on_rmmod), + TEST_SYSFS_DEV_ATTR(config_use_rtnl_lock), + TEST_SYSFS_DEV_ATTR(config_write_delay_msec_y), + TEST_SYSFS_DEV_ATTR(config_enable_busy_alloc), + + NULL, +}; + +ATTRIBUTE_GROUPS(test_dev); + +static int sysfs_test_dev_alloc_miscdev(struct sysfs_test_device *test_dev) +{ + struct miscdevice *misc_dev; + + misc_dev = &test_dev->misc_dev; + misc_dev->minor = MISC_DYNAMIC_MINOR; + misc_dev->name = kasprintf(GFP_KERNEL, "test_sysfs%d", test_dev->dev_idx); + if (!misc_dev->name) { + pr_err("Cannot alloc misc_dev->name\n"); + return -ENOMEM; + } + misc_dev->groups = test_dev_groups; + + return 0; +} + +static int testdev_open(struct block_device *bdev, fmode_t mode) +{ + return -EINVAL; +} + +static blk_qc_t testdev_submit_bio(struct bio *bio) +{ + return BLK_QC_T_NONE; +} + +static void testdev_slot_free_notify(struct block_device *bdev, + unsigned long index) +{ +} + +static int testdev_rw_page(struct block_device *bdev, sector_t sector, + struct page *page, unsigned int op) +{ + return -EOPNOTSUPP; +} + +static const struct block_device_operations sysfs_testdev_ops = { + .open = testdev_open, + .submit_bio = testdev_submit_bio, + .swap_slot_free_notify = testdev_slot_free_notify, + .rw_page = testdev_rw_page, + .owner = THIS_MODULE +}; + +static int sysfs_test_dev_alloc_blockdev(struct sysfs_test_device *test_dev) +{ + int ret = -ENOMEM; + + test_dev->disk = blk_alloc_disk(NUMA_NO_NODE); + if (!test_dev->disk) { + pr_err("Error allocating disk structure for device %d\n", + test_dev->dev_idx); + goto out; + } + + test_dev->disk->major = sysfs_test_major; + test_dev->disk->first_minor = test_dev->dev_idx + 1; + test_dev->disk->fops = &sysfs_testdev_ops; + test_dev->disk->private_data = test_dev; + snprintf(test_dev->disk->disk_name, 16, "test_sysfs%d", + test_dev->dev_idx); + set_capacity(test_dev->disk, 0); + blk_queue_flag_set(QUEUE_FLAG_NONROT, test_dev->disk->queue); + blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, test_dev->disk->queue); + blk_queue_physical_block_size(test_dev->disk->queue, PAGE_SIZE); + blk_queue_max_discard_sectors(test_dev->disk->queue, UINT_MAX); + blk_queue_flag_set(QUEUE_FLAG_DISCARD, test_dev->disk->queue); + + return 0; +out: + return ret; +} + +static struct sysfs_test_device *alloc_test_dev_sysfs(int idx) +{ + struct sysfs_test_device *test_dev; + int ret; + + switch (test_devtype) { + case TESTDEV_TYPE_MISC: + fallthrough; + case TESTDEV_TYPE_BLOCK: + break; + default: + return NULL; + } + + test_dev = kzalloc(sizeof(struct sysfs_test_device), GFP_KERNEL); + if (!test_dev) + goto err_out; + + mutex_init(&test_dev->config_mutex); + test_dev->dev_idx = idx; + test_dev->devtype = test_devtype; + + if (test_dev->devtype == TESTDEV_TYPE_MISC) { + ret = sysfs_test_dev_alloc_miscdev(test_dev); + if (ret) + goto err_out_free; + } else if (test_dev->devtype == TESTDEV_TYPE_BLOCK) { + ret = sysfs_test_dev_alloc_blockdev(test_dev); + if (ret) + goto err_out_free; + } + return test_dev; + +err_out_free: + kfree(test_dev); + test_dev = NULL; +err_out: + return NULL; +} + +static int register_test_dev_sysfs_misc(struct sysfs_test_device *test_dev) +{ + int ret; + + ret = misc_register(&test_dev->misc_dev); + if (ret) + return ret; + + test_dev->dev = test_dev->misc_dev.this_device; + + return 0; +} + +static int register_test_dev_sysfs_block(struct sysfs_test_device *test_dev) +{ + device_add_disk(NULL, test_dev->disk, test_dev_groups); + test_dev->dev = disk_to_dev(test_dev->disk); + + return 0; +} + +static struct sysfs_test_device *register_test_dev_sysfs(void) +{ + struct sysfs_test_device *test_dev = NULL; + int ret; + + test_dev = alloc_test_dev_sysfs(0); + if (!test_dev) + goto out; + + if (test_dev->devtype == TESTDEV_TYPE_MISC) { + ret = register_test_dev_sysfs_misc(test_dev); + if (ret) { + pr_err("could not register misc device: %d\n", ret); + goto out_free_dev; + } + } else if (test_dev->devtype == TESTDEV_TYPE_BLOCK) { + ret = register_test_dev_sysfs_block(test_dev); + if (ret) { + pr_err("could not register block device: %d\n", ret); + goto out_free_dev; + } + } + + dev_info(test_dev->dev, "interface ready\n"); + +out: + return test_dev; +out_free_dev: + free_test_dev_sysfs(test_dev); + return NULL; +} + +static struct sysfs_test_device *register_test_dev_set_config(void) +{ + struct sysfs_test_device *test_dev; + struct test_config *config; + + test_dev = register_test_dev_sysfs(); + if (!test_dev) + return NULL; + + config = &test_dev->config; + + if (enable_lock) + config->enable_lock = true; + if (enable_lock_on_rmmod) + config->enable_lock_on_rmmod = true; + if (use_rtnl_lock) + config->use_rtnl_lock = true; + if (enable_busy_alloc) + config->enable_busy_alloc = true; + + config->write_delay_msec_y = write_delay_msec_y; + test_sysfs_reset_vals(test_dev); + + return test_dev; +} + +static void unregister_test_dev_sysfs_misc(struct sysfs_test_device *test_dev) +{ + misc_deregister(&test_dev->misc_dev); +} + +static void unregister_test_dev_sysfs_block(struct sysfs_test_device *test_dev) +{ + del_gendisk(test_dev->disk); + blk_cleanup_disk(test_dev->disk); +} + +static void unregister_test_dev_sysfs(struct sysfs_test_device *test_dev) +{ + test_dev_config_lock_rmmod(test_dev); + + dev_info(test_dev->dev, "removing interface\n"); + + if (test_dev->devtype == TESTDEV_TYPE_MISC) + unregister_test_dev_sysfs_misc(test_dev); + else if (test_dev->devtype == TESTDEV_TYPE_BLOCK) + unregister_test_dev_sysfs_block(test_dev); + + test_dev_config_unlock_rmmod(test_dev); + + free_test_dev_sysfs(test_dev); +} + +static struct dentry *debugfs_dir; + +/* When read represents how many times we have reset the first_test_dev */ +static u8 reset_first_test_dev; + +static ssize_t read_reset_first_test_dev(struct file *file, + char __user *user_buf, + size_t count, loff_t *ppos) +{ + ssize_t len; + char buf[32]; + + reset_first_test_dev++; + len = sprintf(buf, "%d\n", reset_first_test_dev); + return simple_read_from_buffer(user_buf, count, ppos, buf, len); +} + +static ssize_t write_reset_first_test_dev(struct file *file, + const char __user *user_buf, + size_t count, loff_t *ppos) +{ + if (!try_module_get(THIS_MODULE)) + return -ENODEV; + + if (!first_test_dev) { + module_put(THIS_MODULE); + return -ENODEV; + } + + dev_info(first_test_dev->dev, "going to reset first interface ...\n"); + + unregister_test_dev_sysfs(first_test_dev); + first_test_dev = register_test_dev_set_config(); + + dev_info(first_test_dev->dev, "first interface reset complete\n"); + + module_put(THIS_MODULE); + + return count; +} + +static const struct file_operations fops_reset_first_test_dev = { + .read = read_reset_first_test_dev, + .write = write_reset_first_test_dev, + .open = simple_open, + .owner = THIS_MODULE, + .llseek = default_llseek, +}; + +static int __init test_sysfs_init(void) +{ + first_test_dev = register_test_dev_set_config(); + if (!first_test_dev) + return -ENOMEM; + + if (!enable_debugfs) + return 0; + + debugfs_dir = debugfs_create_dir("test_sysfs", NULL); + if (!debugfs_dir) { + unregister_test_dev_sysfs(first_test_dev); + return -ENOMEM; + } + + debugfs_create_file("reset_first_test_dev", 0600, debugfs_dir, + NULL, &fops_reset_first_test_dev); + return 0; +} +module_init(test_sysfs_init); + +static void __exit test_sysfs_exit(void) +{ + if (enable_debugfs) + debugfs_remove(debugfs_dir); + if (delay_rmmod_ms) + msleep(delay_rmmod_ms); + unregister_test_dev_sysfs(first_test_dev); + if (enable_verbose_rmmod) + pr_info("unregister_test_dev_sysfs() completed\n"); + first_test_dev = NULL; +} +module_exit(test_sysfs_exit); + +MODULE_AUTHOR("Luis Chamberlain <mcgrof(a)kernel.org>"); +MODULE_LICENSE("GPL"); diff --git a/tools/testing/selftests/sysfs/Makefile b/tools/testing/selftests/sysfs/Makefile new file mode 100644 index 000000000000..fde99caa2338 --- /dev/null +++ b/tools/testing/selftests/sysfs/Makefile @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0-only +# Makefile for sysfs selftests. + +# No binaries, but make sure arg-less "make" doesn't trigger "run_tests". +all: + +TEST_PROGS := sysfs.sh + +include ../lib.mk + +# Nothing to clean up. +clean: diff --git a/tools/testing/selftests/sysfs/config b/tools/testing/selftests/sysfs/config new file mode 100644 index 000000000000..9196f452ecd5 --- /dev/null +++ b/tools/testing/selftests/sysfs/config @@ -0,0 +1,2 @@ +CONFIG_SYSFS=m +CONFIG_TEST_SYSFS=m diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh new file mode 100755 index 000000000000..b3f4c2236c7f --- /dev/null +++ b/tools/testing/selftests/sysfs/sysfs.sh @@ -0,0 +1,1208 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0-or-later +# Copyright (C) 2021 Luis Chamberlain <mcgrof(a)kernel.org> +# +# This program is free software; you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by the Free +# Software Foundation; either version 2 of the License, or at your option any +# later version; or, when distributed separately from the Linux kernel or +# when incorporated into other software packages, subject to the following +# license: +# +# This program is free software; you can redistribute it and/or modify it +# under the terms of copyleft-next (version 0.3.1 or later) as published +# at http://copyleft-next.org/. + +# This performs a series tests against the sysfs filesystem. + +# Kselftest framework requirement - SKIP code is 4. +ksft_skip=4 + +TEST_NAME="sysfs" +TEST_DRIVER="test_${TEST_NAME}" +TEST_DIR=$(dirname $0) +TEST_FILE=$(mktemp) + +# This represents +# +# TEST_ID:TEST_COUNT:ENABLED:TARGET +# +# TEST_ID: is the test id number +# TEST_COUNT: number of times we should run the test +# ENABLED: 1 if enabled, 0 otherwise +# TARGET: test target file required on the test_sysfs module +# +# Once these are enabled please leave them as-is. Write your own test, +# we have tons of space. +ALL_TESTS="0001:3:1:test_dev_x:misc" +ALL_TESTS="$ALL_TESTS 0002:3:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0003:3:1:test_dev_x:misc" +ALL_TESTS="$ALL_TESTS 0004:3:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0005:1:1:test_dev_x:misc" +ALL_TESTS="$ALL_TESTS 0006:1:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0007:1:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0008:1:1:test_dev_x:misc" +ALL_TESTS="$ALL_TESTS 0009:1:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0010:1:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0011:1:1:test_dev_x:misc" +ALL_TESTS="$ALL_TESTS 0012:1:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0013:1:1:test_dev_y:misc" +ALL_TESTS="$ALL_TESTS 0014:3:1:test_dev_x:block" # block equivalent set +ALL_TESTS="$ALL_TESTS 0015:3:1:test_dev_x:block" +ALL_TESTS="$ALL_TESTS 0016:3:1:test_dev_x:block" +ALL_TESTS="$ALL_TESTS 0017:3:1:test_dev_y:block" +ALL_TESTS="$ALL_TESTS 0018:1:1:test_dev_x:block" +ALL_TESTS="$ALL_TESTS 0019:1:1:test_dev_y:block" +ALL_TESTS="$ALL_TESTS 0020:1:1:test_dev_y:block" +ALL_TESTS="$ALL_TESTS 0021:1:1:test_dev_x:block" +ALL_TESTS="$ALL_TESTS 0022:1:1:test_dev_y:block" +ALL_TESTS="$ALL_TESTS 0023:1:1:test_dev_y:block" +ALL_TESTS="$ALL_TESTS 0024:1:1:test_dev_x:block" +ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block" +ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block" +ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test +ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock + +allow_user_defaults() +{ + if [ -z $DIR ]; then + case $TEST_DEV_TYPE in + misc) + DIR="/sys/devices/virtual/misc/${TEST_DRIVER}0" + ;; + block) + DIR="/sys/devices/virtual/block/${TEST_DRIVER}0" + ;; + *) + DIR="/sys/devices/virtual/misc/${TEST_DRIVER}0" + ;; + esac + fi + case $TEST_DEV_TYPE in + misc) + MODPROBE_TESTDEV_TYPE="" + ;; + block) + MODPROBE_TESTDEV_TYPE="devtype=1" + ;; + *) + MODPROBE_TESTDEV_TYPE="" + ;; + esac + if [ -z $SYSFS_DEBUGFS_DIR ]; then + SYSFS_DEBUGFS_DIR="/sys/kernel/debug/test_sysfs" + fi + if [ -z $PAGE_SIZE ]; then + PAGE_SIZE=$(getconf PAGESIZE) + fi + if [ -z $MAX_DIGITS ]; then + MAX_DIGITS=$(($PAGE_SIZE/8)) + fi + if [ -z $INT_MAX ]; then + INT_MAX=$(getconf INT_MAX) + fi + if [ -z $UINT_MAX ]; then + UINT_MAX=$(getconf UINT_MAX) + fi +} + +test_reqs() +{ + uid=$(id -u) + if [ $uid -ne 0 ]; then + echo $msg must be run as root >&2 + exit $ksft_skip + fi + + if ! which modprobe 2> /dev/null > /dev/null; then + echo "$0: You need modprobe installed" >&2 + exit $ksft_skip + fi + if ! which getconf 2> /dev/null > /dev/null; then + echo "$0: You need getconf installed" + exit $ksft_skip + fi + if ! which diff 2> /dev/null > /dev/null; then + echo "$0: You need diff installed" + exit $ksft_skip + fi + if ! which perl 2> /dev/null > /dev/null; then + echo "$0: You need perl installed" + exit $ksft_skip + fi +} + +call_modprobe() +{ + modprobe $TEST_DRIVER $MODPROBE_TESTDEV_TYPE $FIRST_MODPROBE_ARGS $MODPROBE_ARGS + return $? +} + +modprobe_reset() +{ + modprobe -q -r $TEST_DRIVER + call_modprobe + return $? +} + +modprobe_reset_enable_debugfs() +{ + FIRST_MODPROBE_ARGS="enable_debugfs=1" + modprobe_reset + unset FIRST_MODPROBE_ARGS +} + +modprobe_reset_enable_lock_on_rmmod() +{ + FIRST_MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 enable_verbose_writes=1" + modprobe_reset + unset FIRST_MODPROBE_ARGS +} + +modprobe_reset_enable_rtnl_lock_on_rmmod() +{ + FIRST_MODPROBE_ARGS="enable_lock=1 use_rtnl_lock=1 enable_lock_on_rmmod=1" + FIRST_MODPROBE_ARGS="$FIRST_MODPROBE_ARGS enable_verbose_writes=1" + modprobe_reset + unset FIRST_MODPROBE_ARGS +} + +load_req_mod() +{ + modprobe_reset + if [ ! -d $DIR ]; then + if ! modprobe -q -n $TEST_DRIVER; then + echo "$0: module $TEST_DRIVER not found [SKIP]" + echo "You must set CONFIG_TEST_SYSFS=m in your kernel" >&2 + exit $ksft_skip + fi + call_modprobe + if [ $? -ne 0 ]; then + echo "$0: modprobe $TEST_DRIVER failed." + exit + fi + fi +} + +config_reset() +{ + if ! echo -n "1" >"$DIR"/reset; then + echo "$0: reset should have worked" >&2 + exit 1 + fi +} + +debugfs_reset_first_test_dev_ignore_errors() +{ + echo -n "1" >"$SYSFS_DEBUGFS_DIR"/reset_first_test_dev +} + +set_orig() +{ + if [[ ! -z $TARGET ]] && [[ ! -z $ORIG ]]; then + if [ -f ${TARGET} ]; then + echo "${ORIG}" > "${TARGET}" + fi + fi +} + +set_test() +{ + echo "${TEST_STR}" > "${TARGET}" +} + +set_test_ignore_errors() +{ + echo "${TEST_STR}" > "${TARGET}" 2> /dev/null +} + +verify() +{ + local seen + seen=$(cat "$1") + target_short=$(basename $TARGET) + case $target_short in + test_dev_x) + if [ "${seen}" != "${TEST_STR}" ]; then + return 1 + fi + ;; + test_dev_y) + DIRNAME=$(dirname $1) + EXPECTED_RESULT="" + # If our target was the test file then what we write to it + # is the same as what that we expect when we read from it. + # When we write to test_dev_y directly though we expect + # a computed value which is driver specific. + if [[ "$DIRNAME" == "/tmp" ]]; then + let EXPECTED_RESULT="${TEST_STR}" + else + x=$(cat ${DIR}/test_dev_x) + let EXPECTED_RESULT="$x+${TEST_STR}+7" + fi + + if [[ "${seen}" != "${EXPECTED_RESULT}" ]]; then + return 1 + fi + ;; + *) + echo "Unsupported target type update test script: $target_short" + exit 1 + esac + return 0 +} + +verify_diff_w() +{ + echo "$TEST_STR" | diff -q -w -u - $1 > /dev/null + return $? +} + +test_rc() +{ + if [[ $rc != 0 ]]; then + echo "Failed test, return value: $rc" >&2 + exit $rc + fi +} + +test_finish() +{ + set_orig + rm -f "${TEST_FILE}" + + if [ ! -z ${old_strict} ]; then + echo ${old_strict} > ${WRITES_STRICT} + fi + exit $rc +} + +# kernfs requires us to write everything we want in one shot because +# There is no easy way for us to know if userspace is only doing a partial +# write, so we don't support them. We expect the entire buffer to come on +# the first write. If you're writing a value, first read the file, +# modify only the value you're changing, then write entire buffer back. +# Since we are only testing digits we just full single writes and old stuff. +# For more details, refer to kernfs_fop_write_iter(). +run_numerictests_single_write() +{ + echo "== Testing sysfs behavior against ${TARGET} ==" + + rc=0 + + echo -n "Writing test file ... " + echo "${TEST_STR}" > "${TEST_FILE}" + if ! verify "${TEST_FILE}"; then + echo "FAIL" >&2 + exit 1 + else + echo "ok" + fi + + echo -n "Checking the sysfs file is not set to test value ... " + if verify "${TARGET}"; then + echo "FAIL" >&2 + exit 1 + else + echo "ok" + fi + + echo -n "Writing to sysfs file from shell ... " + set_test + if ! verify "${TARGET}"; then + echo "FAIL" >&2 + exit 1 + else + echo "ok" + fi + + echo -n "Resetting sysfs file to original value ... " + set_orig + if verify "${TARGET}"; then + echo "FAIL" >&2 + exit 1 + else + echo "ok" + fi + + # Now that we've validated the sanity of "set_test" and "set_orig", + # we can use those functions to set starting states before running + # specific behavioral tests. + + echo -n "Writing to the entire sysfs file in a single write ... " + set_orig + dd if="${TEST_FILE}" of="${TARGET}" bs=4096 2>/dev/null + if ! verify "${TARGET}"; then + echo "FAIL" >&2 + rc=1 + else + echo "ok" + fi + + echo -n "Writing to the sysfs file with multiple long writes ... " + set_orig + (perl -e 'print "A" x 50;'; echo "${TEST_STR}") | \ + dd of="${TARGET}" bs=50 2>/dev/null + if verify "${TARGET}"; then + echo "FAIL" >&2 + rc=1 + else + echo "ok" + fi + test_rc +} + +reset_vals() +{ + echo -n 3 > $DIR/test_dev_x + echo -n 4 > $DIR/test_dev_x +} + +check_failure() +{ + echo -n "Testing that $1 fails as expected..." + reset_vals + TEST_STR="$1" + orig="$(cat $TARGET)" + echo -n "$TEST_STR" > $TARGET 2> /dev/null + + # write should fail and $TARGET should retain its original value + if [ $? = 0 ] || [ "$(cat $TARGET)" != "$orig" ]; then + echo "FAIL" >&2 + rc=1 + else + echo "ok" + fi + test_rc +} + +load_modreqs() +{ + export TEST_DEV_TYPE=$(get_test_type $1) + unset DIR + allow_user_defaults + load_req_mod +} + +target_exists() +{ + TARGET="${DIR}/$1" + TEST_ID="$2" + + if [ ! -f ${TARGET} ] ; then + echo "Target for test $TEST_ID: $TARGET does not exist, skipping test ..." + return 0 + fi + return 1 +} + +config_enable_lock() +{ + if ! echo -n 1 > $DIR/config_enable_lock; then + echo "$0: Unable to enable locks" >&2 + exit 1 + fi +} + +config_write_delay_msec_y() +{ + if ! echo -n $1 > $DIR/config_write_delay_msec_y ; then + echo "$0: Unable to set write_delay_msec_y to $1" >&2 + exit 1 + fi +} + +# Default filter for dmesg scanning. +# Ignore lockdep complaining about its own bugginess when scanning dmesg +# output, because we shouldn't be failing filesystem tests on account of +# lockdep. +_check_dmesg_filter() +{ + egrep -v -e "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low" \ + -e "BUG: MAX_STACK_TRACE_ENTRIES too low" +} + +check_dmesg() +{ + # filter out intentional WARNINGs or Oopses + local filter=${1:-_check_dmesg_filter} + + _dmesg_since_test_start | $filter >$seqres.dmesg + egrep -q -e "kernel BUG at" \ + -e "WARNING:" \ + -e "\bBUG:" \ + -e "Oops:" \ + -e "possible recursive locking detected" \ + -e "Internal error" \ + -e "(INFO|ERR): suspicious RCU usage" \ + -e "INFO: possible circular locking dependency detected" \ + -e "general protection fault:" \ + -e "BUG .* remaining" \ + -e "UBSAN:" \ + $seqres.dmesg + if [ $? -eq 0 ]; then + echo "something found in dmesg (see $seqres.dmesg)" + return 1 + else + if [ "$KEEP_DMESG" != "yes" ]; then + rm -f $seqres.dmesg + fi + return 0 + fi +} + +log_kernel_fstest_dmesg() +{ + export FSTYP="$1" + export seqnum="$FSTYP/$2" + export date_time=$(date +"%F %T") + echo "run fstests $seqnum at $date_time" > /dev/kmsg +} + +modprobe_loop() +{ + while true; do + call_modprobe > /dev/null 2>&1 + modprobe -r $TEST_DRIVER > /dev/null 2>&1 + done > /dev/null 2>&1 +} + +write_loop() +{ + while true; do + set_test_ignore_errors > /dev/null 2>&1 + TEST_STR=$(( $TEST_STR + 1 )) + done > /dev/null 2>&1 +} + +write_loop_reset() +{ + while true; do + set_test_ignore_errors > /dev/null 2>&1 + debugfs_reset_first_test_dev_ignore_errors > /dev/null 2>&1 + done > /dev/null 2>&1 +} + +write_loop_bg() +{ + BG_WRITES=1000 > /dev/null 2>&1 + while true; do + for i in $(seq 1 $BG_WRITES); do + set_test_ignore_errors > /dev/null 2>&1 & + TEST_STR=$(( $TEST_STR + 1 )) + done > /dev/null 2>&1 + wait + done > /dev/null 2>&1 + wait +} + +reset_loop() +{ + while true; do + debugfs_reset_first_test_dev_ignore_errors > /dev/null 2>&1 + done > /dev/null 2>&1 +} + +kill_trigger_loop() +{ + + local my_first_loop_pid=$1 + local my_second_loop_pid=$2 + local my_sleep_max=$3 + local my_loop=0 + + while true; do + sleep 1 + if [[ $my_loop -ge $my_sleep_max ]]; then + break + fi + let my_loop=$my_loop+1 + done + + kill -s TERM $my_first_loop_pid 2>&1 > /dev/null + kill -s TERM $my_second_loop_pid 2>&1 > /dev/null +} + +_dmesg_since_test_start() +{ + # search the dmesg log of last run of $seqnum for possible failures + # use sed \cregexpc address type, since $seqnum contains "/" + dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac +} + +sysfs_test_0001() +{ + TARGET="${DIR}/$(get_test_target 0001)" + config_reset + reset_vals + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + + run_numerictests_single_write +} + +sysfs_test_0002() +{ + TARGET="${DIR}/$(get_test_target 0002)" + config_reset + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + + run_numerictests_single_write +} + +sysfs_test_0003() +{ + TARGET="${DIR}/$(get_test_target 0003)" + config_reset + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + + config_enable_lock + + run_numerictests_single_write +} + +sysfs_test_0004() +{ + TARGET="${DIR}/$(get_test_target 0004)" + config_reset + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + + config_enable_lock + + run_numerictests_single_write +} + +sysfs_test_0005() +{ + TARGET="${DIR}/$(get_test_target 0005)" + modprobe_reset + config_reset + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + WAIT_TIME=2 + + echo -n "Loop writing x while loading/unloading the module... " + + modprobe_loop & + modprobe_pid=$! + + write_loop & + write_pid=$! + + kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 & + kill_pid=$! + + wait $kill_pid > /dev/null 2>&1 + + if [[ $? -eq 0 ]]; then + echo "ok" + else + echo "FAIL" >&2 + fi +} + +sysfs_test_0006() +{ + TARGET="${DIR}/$(get_test_target 0006)" + modprobe_reset + config_reset + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + WAIT_TIME=2 + + echo -n "Loop writing y while loading/unloading the module... " + modprobe_loop & + modprobe_pid=$! + + write_loop & + write_pid=$! + + kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 & + kill_pid=$! + + wait $kill_pid > /dev/null 2>&1 + + if [[ $? -eq 0 ]]; then + echo "ok" + else + echo "FAIL" >&2 + fi +} + +sysfs_test_0007() +{ + TARGET="${DIR}/$(get_test_target 0007)" + modprobe_reset + config_reset + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + WAIT_TIME=2 + + echo -n "Loop writing y with a larger delay while loading/unloading the module... " + + MODPROBE_ARGS="write_delay_msec_y=1500" + modprobe_loop > /dev/null 2>&1 & + modprobe_pid=$! + unset MODPROBE_ARGS + + write_loop & + write_pid=$! + + kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 & + kill_pid=$! + + wait $kill_pid > /dev/null 2>&1 + + if [[ $? -eq 0 ]]; then + echo "ok" + else + echo "FAIL" >&2 + fi +} + +sysfs_test_0008() +{ + TARGET="${DIR}/$(get_test_target 0008)" + modprobe_reset + config_reset + reset_vals + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + WAIT_TIME=2 + + echo -n "Loop busy writing x while loading/unloading the module... " + + modprobe_loop > /dev/null 2>&1 & + modprobe_pid=$! + + write_loop_bg > /dev/null 2>&1 & + write_pid=$! + + kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 & + kill_pid=$! + + wait $kill_pid > /dev/null 2>&1 + + if [[ $? -eq 0 ]]; then + echo "ok" + else + echo "FAIL" >&2 + fi +} + +sysfs_test_0009() +{ + TARGET="${DIR}/$(get_test_target 0009)" + modprobe_reset + config_reset + reset_vals + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + WAIT_TIME=2 + + echo -n "Loop busy writing y while loading/unloading the module... " + + modprobe_loop > /dev/null 2>&1 & + modprobe_pid=$! + + write_loop_bg > /dev/null 2>&1 & + write_pid=$! + + kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 & + kill_pid=$! + + wait $kill_pid > /dev/null 2>&1 + + if [[ $? -eq 0 ]]; then + echo "ok" + else + echo "FAIL" >&2 + fi +} + +sysfs_test_0010() +{ + TARGET="${DIR}/$(get_test_target 0010)" + modprobe_reset + config_reset + reset_vals + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + WAIT_TIME=2 + + echo -n "Loop busy writing y with a larger delay while loading/unloading the module... " + modprobe -q -r $TEST_DRIVER > /dev/null 2>&1 + + MODPROBE_ARGS="write_delay_msec_y=1500" + modprobe_loop > /dev/null 2>&1 & + modprobe_pid=$! + unset MODPROBE_ARGS + + write_loop_bg > /dev/null 2>&1 & + write_pid=$! + + kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 & + kill_pid=$! + + wait $kill_pid > /dev/null 2>&1 + + if [[ $? -eq 0 ]]; then + echo "ok" + else + echo "FAIL" >&2 + fi +} + +sysfs_test_0011() +{ + TARGET="${DIR}/$(get_test_target 0011)" + modprobe_reset_enable_debugfs + config_reset + reset_vals + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + WAIT_TIME=2 + + echo -n "Loop writing x and resetting ... " + + write_loop > /dev/null 2>&1 & + write_pid=$! + + reset_loop > /dev/null 2>&1 & + reset_pid=$! + + kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 & + kill_pid=$! + + wait $kill_pid > /dev/null 2>&1 + + if [[ $? -eq 0 ]]; then + echo "ok" + else + echo "FAIL" >&2 + fi +} + +sysfs_test_0012() +{ + TARGET="${DIR}/$(get_test_target 0012)" + modprobe_reset_enable_debugfs + config_reset + reset_vals + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + WAIT_TIME=2 + + echo -n "Loop writing y and resetting ... " + + write_loop > /dev/null 2>&1 & + write_pid=$! + + reset_loop > /dev/null 2>&1 & + reset_pid=$! + + kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 & + kill_pid=$! + + wait $kill_pid > /dev/null 2>&1 + + if [[ $? -eq 0 ]]; then + echo "ok" + else + echo "FAIL" >&2 + fi +} + +sysfs_test_0013() +{ + TARGET="${DIR}/$(get_test_target 0013)" + modprobe_reset_enable_debugfs + config_reset + reset_vals + config_write_delay_msec_y 1500 + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + WAIT_TIME=2 + + echo -n "Loop writing y with a larger delay and resetting ... " + + write_loop > /dev/null 2>&1 & + write_pid=$! + + reset_loop > /dev/null 2>&1 & + reset_pid=$! + + kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 & + kill_pid=$! + + wait $kill_pid > /dev/null 2>&1 + + if [[ $? -eq 0 ]]; then + echo "ok" + else + echo "FAIL" >&2 + fi +} + +sysfs_test_0014() +{ + sysfs_test_0001 +} + +sysfs_test_0015() +{ + sysfs_test_0002 +} + +sysfs_test_0016() +{ + sysfs_test_0003 +} + +sysfs_test_0017() +{ + sysfs_test_0004 +} + +sysfs_test_0018() +{ + sysfs_test_0005 +} + +sysfs_test_0019() +{ + sysfs_test_0006 +} + +sysfs_test_0020() +{ + sysfs_test_0007 +} + +sysfs_test_0021() +{ + sysfs_test_0008 +} + +sysfs_test_0022() +{ + sysfs_test_0009 +} + +sysfs_test_0023() +{ + sysfs_test_0010 +} + +sysfs_test_0024() +{ + sysfs_test_0011 +} + +sysfs_test_0025() +{ + sysfs_test_0012 +} + +sysfs_test_0026() +{ + sysfs_test_0013 +} + +sysfs_test_0027() +{ + TARGET="${DIR}/$(get_test_target 0027)" + modprobe_reset_enable_lock_on_rmmod + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + WAIT_TIME=2 + + echo -n "Test for possible rmmod deadlock while writing x ... " + + write_loop > /dev/null 2>&1 & + write_pid=$! + + MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 enable_verbose_writes=1" + modprobe_loop > /dev/null 2>&1 & + modprobe_pid=$! + unset MODPROBE_ARGS + + kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 & + kill_pid=$! + + wait $kill_pid > /dev/null 2>&1 + + if [[ $? -eq 0 ]]; then + echo "ok" + else + echo "FAIL" >&2 + fi +} + +sysfs_test_0028() +{ + TARGET="${DIR}/$(get_test_target 0028)" + modprobe_reset_enable_lock_on_rmmod + ORIG=$(cat "${TARGET}") + TEST_STR=$(( $ORIG + 1 )) + WAIT_TIME=2 + + echo -n "Test for possible rmmod deadlock using rtnl_lock while writing x ... " + + write_loop > /dev/null 2>&1 & + write_pid=$! + + MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 use_rtnl_lock=1 enable_verbose_writes=1" + modprobe_loop > /dev/null 2>&1 & + modprobe_pid=$! + unset MODPROBE_ARGS + + kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 & + kill_pid=$! + + wait $kill_pid > /dev/null 2>&1 + + if [[ $? -eq 0 ]]; then + echo "ok" + else + echo "FAIL" >&2 + fi +} + +test_gen_desc() +{ + echo -n "$1 x $(get_test_count $1)" +} + +list_tests() +{ + echo "Test ID list:" + echo + echo "TEST_ID x NUM_TEST" + echo "TEST_ID: Test ID" + echo "NUM_TESTS: Number of recommended times to run the test" + echo + echo "$(test_gen_desc 0001) - misc test writing x in different ways" + echo "$(test_gen_desc 0002) - misc test writing y in different ways" + echo "$(test_gen_desc 0003) - misc test writing x in different ways using a mutex lock" + echo "$(test_gen_desc 0004) - misc test writing y in different ways using a mutex lock" + echo "$(test_gen_desc 0005) - misc test writing x load and remove the test_sysfs module" + echo "$(test_gen_desc 0006) - misc writing y load and remove the test_sysfs module" + echo "$(test_gen_desc 0007) - misc test writing y larger delay, load, remove test_sysfs" + echo "$(test_gen_desc 0008) - misc test busy writing x remove test_sysfs module" + echo "$(test_gen_desc 0009) - misc test busy writing y remove the test_sysfs module" + echo "$(test_gen_desc 0010) - misc test busy writing y larger delay, remove test_sysfs" + echo "$(test_gen_desc 0011) - misc test writing x and resetting device" + echo "$(test_gen_desc 0012) - misc test writing y and resetting device" + echo "$(test_gen_desc 0013) - misc test writing y with a larger delay and resetting device" + echo "$(test_gen_desc 0014) - block test writing x in different ways" + echo "$(test_gen_desc 0015) - block test writing y in different ways" + echo "$(test_gen_desc 0016) - block test writing x in different ways using a mutex lock" + echo "$(test_gen_desc 0017) - block test writing y in different ways using a mutex lock" + echo "$(test_gen_desc 0018) - block test writing x load and remove the test_sysfs module" + echo "$(test_gen_desc 0019) - block test writing y load and remove the test_sysfs module" + echo "$(test_gen_desc 0020) - block test writing y larger delay, load, remove test_sysfs" + echo "$(test_gen_desc 0021) - block test busy writing x remove the test_sysfs module" + echo "$(test_gen_desc 0022) - block test busy writing y remove the test_sysfs module" + echo "$(test_gen_desc 0023) - block test busy writing y larger delay, remove test_sysfs" + echo "$(test_gen_desc 0024) - block test writing x and resetting device" + echo "$(test_gen_desc 0025) - block test writing y and resetting device" + echo "$(test_gen_desc 0026) - block test writing y larger delay and resetting device" + echo "$(test_gen_desc 0027) - test rmmod deadlock while writing x ... " + echo "$(test_gen_desc 0028) - test rmmod deadlock using rtnl_lock while writing x ..." +} + +usage() +{ + NUM_TESTS=$(grep -o ' ' <<<"$ALL_TESTS" | grep -c .) + let NUM_TESTS=$NUM_TESTS+1 + MAX_TEST=$(printf "%04d\n" $NUM_TESTS) + echo "Usage: $0 [ -t <4-number-digit> ] | [ -w <4-number-digit> ] |" + echo " [ -s <4-number-digit> ] | [ -c <4-number-digit> <test- count>" + echo " [ all ] [ -h | --help ] [ -l ]" + echo "" + echo "Valid tests: 0001-$MAX_TEST" + echo "" + echo " all Runs all tests (default)" + echo " -t Run test ID the number amount of times is recommended" + echo " -w Watch test ID run until it runs into an error" + echo " -c Run test ID once" + echo " -s Run test ID x test-count number of times" + echo " -l List all test ID list" + echo " -h|--help Help" + echo + echo "If an error every occurs execution will immediately terminate." + echo "If you are adding a new test try using -w <test-ID> first to" + echo "make sure the test passes a series of tests." + echo + echo Example uses: + echo + echo "$TEST_NAME.sh -- executes all tests" + echo "$TEST_NAME.sh -t 0002 -- Executes test ID 0002 number of times is recomended" + echo "$TEST_NAME.sh -w 0002 -- Watch test ID 0002 run until an error occurs" + echo "$TEST_NAME.sh -s 0002 -- Run test ID 0002 once" + echo "$TEST_NAME.sh -c 0002 3 -- Run test ID 0002 three times" + echo + list_tests + exit 1 +} + +test_num() +{ + re='^[0-9]+$' + if ! [[ $1 =~ $re ]]; then + usage + fi +} + +get_test_count() +{ + test_num $1 + TEST_NUM=$(echo $1 | sed 's/^0*//') + TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}') + echo ${TEST_DATA} | awk -F":" '{print $2}' +} + +get_test_enabled() +{ + test_num $1 + TEST_NUM=$(echo $1 | sed 's/^0*//') + TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}') + echo ${TEST_DATA} | awk -F":" '{print $3}' +} + +get_test_target() +{ + test_num $1 + TEST_NUM=$(echo $1 | sed 's/^0*//') + TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}') + echo ${TEST_DATA} | awk -F":" '{print $4}' +} + +get_test_type() +{ + test_num $1 + TEST_NUM=$(echo $1 | sed 's/^0*//') + TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}') + echo ${TEST_DATA} | awk -F":" '{print $5}' +} + +run_all_tests() +{ + for i in $ALL_TESTS ; do + TEST_ID=$(echo $i | awk -F":" '{print $1}') + ENABLED=$(get_test_enabled $TEST_ID) + TEST_COUNT=$(get_test_count $TEST_ID) + TEST_TARGET=$(get_test_target $TEST_ID) + if [[ $ENABLED -eq "1" ]]; then + test_case $TEST_ID $TEST_COUNT $TEST_TARGET + else + echo -n "Skipping test $TEST_ID as its disabled, likely " + echo "could crash your system ..." + fi + done +} + +watch_log() +{ + if [ $# -ne 3 ]; then + clear + fi + echo "Running test: $2 - run #$1" +} + +watch_case() +{ + i=0 + while [ 1 ]; do + if [ $# -eq 1 ]; then + test_num $1 + watch_log $i ${TEST_NAME}_test_$1 + log_kernel_fstest_dmesg sysfs $1 + RUN_TEST=${TEST_NAME}_test_$1 + $RUN_TEST + check_dmesg + if [[ $? -ne 0 ]]; then + exit 1 + fi + else + watch_log $i all + run_all_tests + fi + let i=$i+1 + done +} + +test_case() +{ + NUM_TESTS=$2 + + i=0 + + load_modreqs $1 + if target_exists $3 $1; then + return + fi + + while [[ $i -lt $NUM_TESTS ]]; do + test_num $1 + watch_log $i ${TEST_NAME}_test_$1 noclear + log_kernel_fstest_dmesg sysfs $1 + RUN_TEST=${TEST_NAME}_test_$1 + $RUN_TEST + let i=$i+1 + done + check_dmesg + if [[ $? -ne 0 ]]; then + exit 1 + fi +} + +parse_args() +{ + if [ $# -eq 0 ]; then + run_all_tests + else + if [[ "$1" = "all" ]]; then + run_all_tests + elif [[ "$1" = "-w" ]]; then + shift + watch_case $@ + elif [[ "$1" = "-t" ]]; then + shift + test_num $1 + test_case $1 $(get_test_count $1) $(get_test_target $1) + shift + elif [[ "$1" = "-c" ]]; then + shift + test_num $1 + test_num $2 + test_case $1 $2 $(get_test_target $1) + shift + shift + elif [[ "$1" = "-s" ]]; then + shift + test_case $1 1 $(get_test_target $1) + shift + elif [[ "$1" = "-l" ]]; then + list_tests + shift + elif [[ "$1" = "-h" || "$1" = "--help" ]]; then + usage + else + usage + fi + fi +} + +test_reqs +allow_user_defaults + +trap "test_finish" EXIT + +parse_args $@ + +exit 0 -- 2.30.2

4 years, 2 months

2
1
0 0

[PATCH v7 09/12] sysfs: fix deadlock race with module removal

by Luis Chamberlain

When sysfs attributes use a lock also used on module removal we can race to deadlock. This happens when for instance a sysfs file on a driver is used, then at the same time we have module removal call trigger. The module removal call code holds a lock, and then the sysfs file entry waits for the same lock. While holding the lock the module removal tries to remove the sysfs entries, but these cannot be removed yet as one is waiting for a lock. This won't complete as the lock is already held. Likewise module removal cannot complete, and so we deadlock. This can now be easily reproducible with our sysfs selftest as follows: ./tools/testing/selftests/sysfs/sysfs.sh -t 0027 To fix this we extend the struct kernfs_node with a module reference and use the try_module_get() after kernfs_get_active() is called which protects integrity and the existence of the kernfs node during the operation. So long as the kernfs node is protected with kernfs_get_active() we know we can rely on its contents. And, as now just documented in the previous patch, we also now know that once kernfs_get_active() is called the module is also guarded to exist and cannot be removed. If try_module_get() fails we fail the operation on the kernfs node. We use a try method as a full lock means we'd then make our sysfs attributes busy us out from possible module removal, and so userspace could force denying module removal, a silly form of "DOS" against module removal. A try lock on the module removal ensures we give priority to module removal and interacting with sysfs attributes only comes second. Using a full lock could mean for instance that if you don't stop poking at sysfs files you cannot remove a module. Races between removal of sysfs files and the module are not possible given sysfs files are created by the same module, and when a sysfs file is being used kernfs prevents removal of the sysfs file. So if module removal is actually happening the removal would have to wait until the sysfs file operation is complete. This deadlock was first reported with the zram driver, however the live patching folks have acknowledged they have observed this as well with live patching, when a live patch is removed. I was then able to reproduce easily by creating a dedicated selftests. A sketch of how this can happen follows: CPU A CPU B whatever_store() module_unload mutex_lock(foo) mutex_lock(foo) del_gendisk(zram->disk); device_del() device_remove_groups() In this situation whatever_store() is waiting for the mutex foo to become unlocked, but that won't happen until module removal is complete. But module removal won't complete until the sysfs file being poked completes which is waiting for a lock already held. Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org> --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +- fs/kernfs/dir.c | 41 ++++++++++++++++++-- fs/kernfs/file.c | 7 +++- fs/kernfs/kernfs-internal.h | 1 + fs/kernfs/symlink.c | 3 +- fs/sysfs/dir.c | 4 +- fs/sysfs/file.c | 10 +++-- fs/sysfs/group.c | 1 + include/linux/kernfs.h | 11 ++++-- include/linux/sysfs.h | 52 ++++++++++++++++++++------ kernel/cgroup/cgroup.c | 2 +- 11 files changed, 106 insertions(+), 30 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index b57b3db9a6a7..d40c486f46b9 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -207,7 +207,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft) struct kernfs_node *kn; int ret; - kn = __kernfs_create_file(parent_kn, rft->name, rft->mode, + kn = __kernfs_create_file(parent_kn, rft->name, rft->mode, NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, rft->kf_ops, rft, NULL, NULL); if (IS_ERR(kn)) @@ -2480,7 +2480,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name, struct kernfs_node *kn; int ret = 0; - kn = __kernfs_create_file(parent_kn, name, 0444, + kn = __kernfs_create_file(parent_kn, name, 0444, NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, &kf_mondata_ops, priv, NULL, NULL); if (IS_ERR(kn)) diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index ba581429bf7b..f463878a5bf4 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -14,6 +14,7 @@ #include <linux/slab.h> #include <linux/security.h> #include <linux/hash.h> +#include <linux/module.h> #include "kernfs-internal.h" @@ -414,12 +415,32 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn) */ struct kernfs_node *kernfs_get_active(struct kernfs_node *kn) { + int v; + if (unlikely(!kn)) return NULL; if (!atomic_inc_unless_negative(&kn->active)) return NULL; + /* + * If a module created the kernfs_node, the module cannot possibly be + * removed if the above atomic_inc_unless_negative() succeeded. So the + * try_module_get() below is not to protect the lifetime of the module + * as that is already guaranteed. The try_module_get() below is used + * to ensure that we don't deadlock in case a kernfs operation and + * module removal used a shared lock. + * + * We use a try method to avoid userspace DOS'ing module removal with + * access to kernfs ops. + */ + if (!try_module_get(kn->owner)) { + v = atomic_dec_return(&kn->active); + if (unlikely(v == KN_DEACTIVATED_BIAS)) + wake_up_all(&kernfs_root(kn)->deactivate_waitq); + return NULL; + } + if (kernfs_lockdep(kn)) rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_); return kn; @@ -442,6 +463,13 @@ void kernfs_put_active(struct kernfs_node *kn) if (kernfs_lockdep(kn)) rwsem_release(&kn->dep_map, _RET_IP_); v = atomic_dec_return(&kn->active); + + /* + * We prevent module exit *until* we know for sure all possible + * kernfs ops are done. + */ + module_put(kn->owner); + if (likely(v != KN_DEACTIVATED_BIAS)) return; @@ -571,6 +599,7 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry) static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, struct kernfs_node *parent, const char *name, umode_t mode, + struct module *owner, kuid_t uid, kgid_t gid, unsigned flags) { @@ -606,6 +635,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, kn->name = name; kn->mode = mode; + kn->owner = owner; kn->flags = flags; if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) { @@ -639,13 +669,14 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, struct kernfs_node *kernfs_new_node(struct kernfs_node *parent, const char *name, umode_t mode, + struct module *owner, kuid_t uid, kgid_t gid, unsigned flags) { struct kernfs_node *kn; kn = __kernfs_new_node(kernfs_root(parent), parent, - name, mode, uid, gid, flags); + name, mode, owner, uid, gid, flags); if (kn) { kernfs_get(parent); kn->parent = parent; @@ -926,7 +957,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops, root->id_highbits = 1; kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO, - GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, + NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR); if (!kn) { idr_destroy(&root->ino_idr); @@ -965,6 +996,7 @@ void kernfs_destroy_root(struct kernfs_root *root) * @parent: parent in which to create a new directory * @name: name of the new directory * @mode: mode of the new directory + * @owner: if set, the module owner of this directory * @uid: uid of the new directory * @gid: gid of the new directory * @priv: opaque data associated with the new directory @@ -974,6 +1006,7 @@ void kernfs_destroy_root(struct kernfs_root *root) */ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent, const char *name, umode_t mode, + struct module *owner, kuid_t uid, kgid_t gid, void *priv, const void *ns) { @@ -982,7 +1015,7 @@ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent, /* allocate */ kn = kernfs_new_node(parent, name, mode | S_IFDIR, - uid, gid, KERNFS_DIR); + owner, uid, gid, KERNFS_DIR); if (!kn) return ERR_PTR(-ENOMEM); @@ -1014,7 +1047,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent, /* allocate */ kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR, - GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR); + NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR); if (!kn) return ERR_PTR(-ENOMEM); diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c index 4479c6580333..73be39ddc856 100644 --- a/fs/kernfs/file.c +++ b/fs/kernfs/file.c @@ -974,6 +974,7 @@ const struct file_operations kernfs_file_fops = { * @uid: uid of the file * @gid: gid of the file * @size: size of the file + * @owner: if set, the module owner of the file * @ops: kernfs operations for the file * @priv: private data for the file * @ns: optional namespace tag of the file @@ -983,7 +984,9 @@ const struct file_operations kernfs_file_fops = { */ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, const char *name, - umode_t mode, kuid_t uid, kgid_t gid, + umode_t mode, + struct module *owner, + kuid_t uid, kgid_t gid, loff_t size, const struct kernfs_ops *ops, void *priv, const void *ns, @@ -996,7 +999,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, flags = KERNFS_FILE; kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG, - uid, gid, flags); + owner, uid, gid, flags); if (!kn) return ERR_PTR(-ENOMEM); diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h index 9e3abf597e2d..3551f62f1449 100644 --- a/fs/kernfs/kernfs-internal.h +++ b/fs/kernfs/kernfs-internal.h @@ -133,6 +133,7 @@ void kernfs_put_active(struct kernfs_node *kn); int kernfs_add_one(struct kernfs_node *kn); struct kernfs_node *kernfs_new_node(struct kernfs_node *parent, const char *name, umode_t mode, + struct module *owner, kuid_t uid, kgid_t gid, unsigned flags); diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c index 19a6c71c6ff5..1d781970bec7 100644 --- a/fs/kernfs/symlink.c +++ b/fs/kernfs/symlink.c @@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent, gid = target->iattr->ia_gid; } - kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK); + kn = kernfs_new_node(parent, name, S_IFLNK|0777, target->owner, uid, + gid, KERNFS_LINK); if (!kn) return ERR_PTR(-ENOMEM); diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c index b6b6796e1616..01a0a53ee18e 100644 --- a/fs/sysfs/dir.c +++ b/fs/sysfs/dir.c @@ -56,8 +56,8 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns) kobject_get_ownership(kobj, &uid, &gid); - kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid, - kobj, ns); + kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, NULL, uid, + gid, kobj, ns); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) sysfs_warn_dup(parent, kobject_name(kobj)); diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c index 42dcf96881b6..5edc6e180a97 100644 --- a/fs/sysfs/file.c +++ b/fs/sysfs/file.c @@ -291,8 +291,9 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent, key = attr->key ?: (struct lock_class_key *)&attr->skey; #endif - kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid, - PAGE_SIZE, ops, (void *)attr, ns, key); + kn = __kernfs_create_file(parent, attr->name, mode & 0777, attr->owner, + uid, gid, PAGE_SIZE, ops, (void *)attr, ns, + key); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) sysfs_warn_dup(parent, attr->name); @@ -326,8 +327,9 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent, key = attr->key ?: (struct lock_class_key *)&attr->skey; #endif - kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid, - battr->size, ops, (void *)attr, ns, key); + kn = __kernfs_create_file(parent, attr->name, mode & 0777, attr->owner, + uid, gid, battr->size, ops, (void *)attr, ns, + key); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) sysfs_warn_dup(parent, attr->name); diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c index eeb0e3099421..283607c22eef 100644 --- a/fs/sysfs/group.c +++ b/fs/sysfs/group.c @@ -135,6 +135,7 @@ static int internal_create_group(struct kobject *kobj, int update, } else { kn = kernfs_create_dir_ns(kobj->sd, grp->name, S_IRWXU | S_IRUGO | S_IXUGO, + grp->owner, uid, gid, kobj, NULL); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h index cd968ee2b503..8655abd53c85 100644 --- a/include/linux/kernfs.h +++ b/include/linux/kernfs.h @@ -160,6 +160,7 @@ struct kernfs_node { unsigned short flags; umode_t mode; + struct module *owner; struct kernfs_iattrs *iattr; }; @@ -369,12 +370,14 @@ void kernfs_destroy_root(struct kernfs_root *root); struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent, const char *name, umode_t mode, + struct module *owner, kuid_t uid, kgid_t gid, void *priv, const void *ns); struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent, const char *name); struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, const char *name, umode_t mode, + struct module *owner, kuid_t uid, kgid_t gid, loff_t size, const struct kernfs_ops *ops, @@ -471,13 +474,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { } static inline struct kernfs_node * kernfs_create_dir_ns(struct kernfs_node *parent, const char *name, - umode_t mode, kuid_t uid, kgid_t gid, + umode_t mode, struct module *owner, + kuid_t uid, kgid_t gid, void *priv, const void *ns) { return ERR_PTR(-ENOSYS); } static inline struct kernfs_node * __kernfs_create_file(struct kernfs_node *parent, const char *name, - umode_t mode, kuid_t uid, kgid_t gid, + umode_t mode, struct module *owner, + kuid_t uid, kgid_t gid, loff_t size, const struct kernfs_ops *ops, void *priv, const void *ns, struct lock_class_key *key) { return ERR_PTR(-ENOSYS); } @@ -564,7 +569,7 @@ static inline struct kernfs_node * kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode, void *priv) { - return kernfs_create_dir_ns(parent, name, mode, + return kernfs_create_dir_ns(parent, name, mode, parent->owner, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, priv, NULL); } diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h index e3f1e8ac1f85..babbabb460dc 100644 --- a/include/linux/sysfs.h +++ b/include/linux/sysfs.h @@ -30,6 +30,7 @@ enum kobj_ns_type; struct attribute { const char *name; umode_t mode; + struct module *owner; #ifdef CONFIG_DEBUG_LOCK_ALLOC bool ignore_lockdep:1; struct lock_class_key *key; @@ -80,6 +81,7 @@ do { \ * @attrs: Pointer to NULL terminated list of attributes. * @bin_attrs: Pointer to NULL terminated list of binary attributes. * Either attrs or bin_attrs or both must be provided. + * @module: If set, module responsible for this attribute group */ struct attribute_group { const char *name; @@ -89,6 +91,7 @@ struct attribute_group { struct bin_attribute *, int); struct attribute **attrs; struct bin_attribute **bin_attrs; + struct module *owner; }; /* @@ -100,38 +103,52 @@ struct attribute_group { #define __ATTR(_name, _mode, _show, _store) { \ .attr = {.name = __stringify(_name), \ - .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \ + .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \ + .owner = THIS_MODULE, \ + }, \ .show = _show, \ .store = _store, \ } #define __ATTR_PREALLOC(_name, _mode, _show, _store) { \ .attr = {.name = __stringify(_name), \ - .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\ + .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\ + .owner = THIS_MODULE, \ + }, \ .show = _show, \ .store = _store, \ } #define __ATTR_RO(_name) { \ - .attr = { .name = __stringify(_name), .mode = 0444 }, \ + .attr = { .name = __stringify(_name), \ + .mode = 0444, \ + .owner = THIS_MODULE, \ + }, \ .show = _name##_show, \ } #define __ATTR_RO_MODE(_name, _mode) { \ .attr = { .name = __stringify(_name), \ - .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \ + .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \ + .owner = THIS_MODULE, \ + }, \ .show = _name##_show, \ } #define __ATTR_RW_MODE(_name, _mode) { \ .attr = { .name = __stringify(_name), \ - .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \ + .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \ + .owner = THIS_MODULE, \ + }, \ .show = _name##_show, \ .store = _name##_store, \ } #define __ATTR_WO(_name) { \ - .attr = { .name = __stringify(_name), .mode = 0200 }, \ + .attr = { .name = __stringify(_name), \ + .mode = 0200, \ + .owner = THIS_MODULE, \ + }, \ .store = _name##_store, \ } @@ -141,8 +158,11 @@ struct attribute_group { #ifdef CONFIG_DEBUG_LOCK_ALLOC #define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) { \ - .attr = {.name = __stringify(_name), .mode = _mode, \ - .ignore_lockdep = true }, \ + .attr = {.name = __stringify(_name), \ + .mode = _mode, \ + .ignore_lockdep = true, \ + .owner = THIS_MODULE, \ + }, \ .show = _show, \ .store = _store, \ } @@ -159,6 +179,7 @@ static const struct attribute_group *_name##_groups[] = { \ #define ATTRIBUTE_GROUPS(_name) \ static const struct attribute_group _name##_group = { \ .attrs = _name##_attrs, \ + .owner = THIS_MODULE, \ }; \ __ATTRIBUTE_GROUPS(_name) @@ -199,20 +220,29 @@ struct bin_attribute { /* macros to create static binary attributes easier */ #define __BIN_ATTR(_name, _mode, _read, _write, _size) { \ - .attr = { .name = __stringify(_name), .mode = _mode }, \ + .attr = { .name = __stringify(_name), \ + .mode = _mode, \ + .owner = THIS_MODULE, \ + }, \ .read = _read, \ .write = _write, \ .size = _size, \ } #define __BIN_ATTR_RO(_name, _size) { \ - .attr = { .name = __stringify(_name), .mode = 0444 }, \ + .attr = { .name = __stringify(_name), \ + .mode = 0444, \ + .owner = THIS_MODULE, \ + }, \ .read = _name##_read, \ .size = _size, \ } #define __BIN_ATTR_WO(_name, _size) { \ - .attr = { .name = __stringify(_name), .mode = 0200 }, \ + .attr = { .name = __stringify(_name), \ + .mode = 0200, \ + .owner = THIS_MODULE, \ + }, \ .write = _name##_write, \ .size = _size, \ } diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 8afa8690d288..b9886db9ecdd 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3949,7 +3949,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp, key = &cft->lockdep_key; #endif kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name), - cgroup_file_mode(cft), + cgroup_file_mode(cft), NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, cft->kf_ops, cft, NULL, key); -- 2.30.2

4 years, 2 months

4
10
0 0

[RESEND PATCH net-next v7 0/3] Add RFC-7597 Section 5.1 PSID support

by Cole Dishington

Resend to add v7 to subject to detect cover letter in patchwork. Cole Dishington (3): net: netfilter: Add RFC-7597 Section 5.1 PSID support xtables API net: netfilter: Add RFC-7597 Section 5.1 PSID support selftests: netfilter: Add RFC-7597 Section 5.1 PSID selftests include/uapi/linux/netfilter/nf_nat.h | 3 +- net/netfilter/nf_nat_core.c | 39 +++- net/netfilter/nf_nat_masquerade.c | 27 ++- net/netfilter/xt_MASQUERADE.c | 44 ++++- .../netfilter/nat_masquerade_psid.sh | 182 ++++++++++++++++++ 5 files changed, 283 insertions(+), 12 deletions(-) create mode 100644 tools/testing/selftests/netfilter/nat_masquerade_psid.sh -- 2.33.0

4 years, 3 months

1
3
0 0

[PATCH] selftests: net: af_unix: Fix makefile to use TEST_GEN_PROGS

by Shuah Khan

Makefile uses TEST_PROGS instead of TEST_GEN_PROGS to define executables. TEST_PROGS is for shell scripts that need to be installed and run by the common lib.mk framework. The common framework doesn't touch TEST_PROGS when it does build and clean. As a result "make kselftest-clean" and "make clean" fail to remove executables. Run and install work because the common framework runs and installs TEST_PROGS. Build works because the Makefile defines "all" rule which is unnecessary if TEST_GEN_PROGS is used. Use TEST_GEN_PROGS so the common framework can handle build/run/ install/clean properly. Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org> --- tools/testing/selftests/net/af_unix/Makefile | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/tools/testing/selftests/net/af_unix/Makefile b/tools/testing/selftests/net/af_unix/Makefile index cfc7f4f97fd1..df341648f818 100644 --- a/tools/testing/selftests/net/af_unix/Makefile +++ b/tools/testing/selftests/net/af_unix/Makefile @@ -1,5 +1,2 @@ -##TEST_GEN_FILES := test_unix_oob -TEST_PROGS := test_unix_oob +TEST_GEN_PROGS := test_unix_oob include ../../lib.mk - -all: $(TEST_PROGS) -- 2.30.2

4 years, 3 months

2
1
0 0

[PATCH] selftests: net: af_unix: Fix incorrect args in test result msg

by Shuah Khan

Fix the args to fprintf(). Splitting the message ends up passing incorrect arg for "sigurg %d" and an extra arg overall. The test result message ends up incorrect. test_unix_oob.c: In function ‘main’: test_unix_oob.c:274:43: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘char *’ [-Wformat=] 274 | fprintf(stderr, "Test 3 failed, sigurg %d len %d OOB %c ", | ~^ | | | int | %s 275 | "atmark %d\n", signal_recvd, len, oob, atmark); | ~~~~~~~~~~~~~ | | | char * test_unix_oob.c:274:19: warning: too many arguments for format [-Wformat-extra-args] 274 | fprintf(stderr, "Test 3 failed, sigurg %d len %d OOB %c ", Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org> --- tools/testing/selftests/net/af_unix/test_unix_oob.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/net/af_unix/test_unix_oob.c b/tools/testing/selftests/net/af_unix/test_unix_oob.c index 0f3e3763f4f8..3dece8b29253 100644 --- a/tools/testing/selftests/net/af_unix/test_unix_oob.c +++ b/tools/testing/selftests/net/af_unix/test_unix_oob.c @@ -271,8 +271,9 @@ main(int argc, char **argv) read_oob(pfd, &oob); if (!signal_recvd || len != 127 || oob != '%' || atmark != 1) { - fprintf(stderr, "Test 3 failed, sigurg %d len %d OOB %c ", - "atmark %d\n", signal_recvd, len, oob, atmark); + fprintf(stderr, + "Test 3 failed, sigurg %d len %d OOB %c atmark %d\n", + signal_recvd, len, oob, atmark); die(1); } -- 2.30.2

4 years, 3 months

2
1
0 0

[PATCH v7 12/12] zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal

by Luis Chamberlain

The ATTRIBUTE_GROUPS is typically used to avoid boiler plate code which is used in many drivers. Embracing ATTRIBUTE_GROUPS was long due on the zram driver, however a recent fix for sysfs allows users of ATTRIBUTE_GROUPS to also associate a module to the group attribute. In zram's case this also means it allows us to fix a race which triggers a deadlock on the zram driver. This deadlock happens when a sysfs attribute use a lock also used on module removal. This happens when for instance a sysfs file on a driver is used, then at the same time we have module removal call trigger. The module removal call code holds a lock, and then the sysfs file entry waits for the same lock. While holding the lock the module removal tries to remove the sysfs entries, but these cannot be removed yet as one is waiting for a lock. This won't complete as the lock is already held. Likewise module removal cannot complete, and so we deadlock. Sysfs fixes this when the group attributes have a module associated to it, sysfs will *try* to get a refcount to the module when a shared lock is used, prior to mucking with a sysfs attribute. If this fails we just give up right away. This deadlock was first reported with the zram driver, a sketch of how this can happen follows: CPU A CPU B whatever_store() module_unload mutex_lock(foo) mutex_lock(foo) del_gendisk(zram->disk); device_del() device_remove_groups() In this situation whatever_store() is waiting for the mutex foo to become unlocked, but that won't happen until module removal is complete. But module removal won't complete until the sysfs file being poked completes which is waiting for a lock already held. This issue can be reproduced easily on the zram driver as follows: Loop 1 on one terminal: while true; do modprobe zram; modprobe -r zram; done Loop 2 on a second terminal: while true; do echo 1024 > /sys/block/zram0/disksize; echo 1 > /sys/block/zram0/reset; done Without this patch we end up in a deadlock, and the following stack trace is produced which hints to us what the issue was: INFO: task bash:888 blocked for more than 120 seconds. Tainted: G E 5.12.0-rc1-next-20210304+ #4 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:bash state:D stack: 0 pid: 888 ppid: 887 flags:<etc> Call Trace: __schedule+0x2e4/0x900 schedule+0x46/0xb0 schedule_preempt_disabled+0xa/0x10 __mutex_lock.constprop.0+0x2c3/0x490 ? _kstrtoull+0x35/0xd0 reset_store+0x6c/0x160 [zram] kernfs_fop_write_iter+0x124/0x1b0 new_sync_write+0x11c/0x1b0 vfs_write+0x1c2/0x260 ksys_write+0x5f/0xe0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f34f2c3df33 RSP: 002b:00007ffe751df6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f34f2c3df33 RDX: 0000000000000002 RSI: 0000561ccb06ec10 RDI: 0000000000000001 RBP: 0000561ccb06ec10 R08: 000000000000000a R09: 0000000000000001 R10: 0000561ccb157590 R11: 0000000000000246 R12: 0000000000000002 R13: 00007f34f2d0e6a0 R14: 0000000000000002 R15: 00007f34f2d0e8a0 INFO: task modprobe:1104 can't die for more than 120 seconds. task:modprobe state:D stack: 0 pid: 1104 ppid: 916 flags:<etc> Call Trace: __schedule+0x2e4/0x900 schedule+0x46/0xb0 __kernfs_remove.part.0+0x228/0x2b0 ? finish_wait+0x80/0x80 kernfs_remove_by_name_ns+0x50/0x90 remove_files+0x2b/0x60 sysfs_remove_group+0x38/0x80 sysfs_remove_groups+0x29/0x40 device_remove_attrs+0x4a/0x80 device_del+0x183/0x3e0 ? mutex_lock+0xe/0x30 del_gendisk+0x27a/0x2d0 zram_remove+0x8a/0xb0 [zram] ? hot_remove_store+0xf0/0xf0 [zram] zram_remove_cb+0xd/0x10 [zram] idr_for_each+0x5e/0xd0 destroy_devices+0x39/0x6f [zram] __do_sys_delete_module+0x190/0x2a0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f32adf727d7 RSP: 002b:00007ffc08bb38a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 000055eea23cbb10 RCX: 00007f32adf727d7 RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055eea23cbb78 RBP: 000055eea23cbb10 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f32adfe5ac0 R11: 0000000000000206 R12: 000055eea23cbb78 R13: 0000000000000000 R14: 0000000000000000 R15: 000055eea23cbc20 [0] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org> --- drivers/block/zram/zram_drv.c | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index b26abcb955cc..60a55ae8cd91 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -1902,14 +1902,7 @@ static struct attribute *zram_disk_attrs[] = { NULL, }; -static const struct attribute_group zram_disk_attr_group = { - .attrs = zram_disk_attrs, -}; - -static const struct attribute_group *zram_disk_attr_groups[] = { - &zram_disk_attr_group, - NULL, -}; +ATTRIBUTE_GROUPS(zram_disk); /* * Allocate and initialize new zram device. the function returns @@ -1981,7 +1974,7 @@ static int zram_add(void) blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX); blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue); - device_add_disk(NULL, zram->disk, zram_disk_attr_groups); + device_add_disk(NULL, zram->disk, zram_disk_groups); strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor)); -- 2.30.2

4 years, 3 months

1
0
0 0

[PATCH v7 11/12] zram: fix crashes with cpu hotplug multistate

by Luis Chamberlain

Provide a simple state machine to fix races with driver exit where we remove the CPU multistate callbacks and re-initialization / creation of new per CPU instances which should be managed by these callbacks. The zram driver makes use of cpu hotplug multistate support, whereby it associates a struct zcomp per CPU. Each struct zcomp represents a compression algorithm in charge of managing compression streams per CPU. Although a compiled zram driver only supports a fixed set of compression algorithms, each zram device gets a struct zcomp allocated per CPU. The "multi" in CPU hotplug multstate refers to these per cpu struct zcomp instances. Each of these will have the CPU hotplug callback called for it on CPU plug / unplug. The kernel's CPU hotplug multistate keeps a linked list of these different structures so that it will iterate over them on CPU transitions. By default at driver initialization we will create just one zram device (num_devices=1) and a zcomp structure then set for the now default lzo-rle comrpession algorithm. At driver removal we first remove each zram device, and so we destroy the associated struct zcomp per CPU. But since we expose sysfs attributes to create new devices or reset / initialize existing zram devices, we can easily end up re-initializing a struct zcomp for a zram device before the exit routine of the module removes the cpu hotplug callback. When this happens the kernel's CPU hotplug will detect that at least one instance (struct zcomp for us) exists. This can happen in the following situation: CPU 1 CPU 2 disksize_store(...); class_unregister(...); idr_for_each(...); zram_debugfs_destroy(); idr_destroy(...); unregister_blkdev(...); cpuhp_remove_multi_state(...); The warning comes up on cpuhp_remove_multi_state() when it sees that the state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list. In this case, that a struct zcom still exists, the driver allowed its creation per CPU even though we could have just freed them per CPU though a call on another CPU, and we are then later trying to remove the hotplug callback. Fix all this by providing a zram initialization boolean protected the shared in the driver zram_index_mutex, which we can use to annotate when sysfs attributes are safe to use or not -- once the driver is properly initialized. When the driver is going down we also are sure to not let userspace muck with attributes which may affect each per cpu struct zcomp. This also fixes a series of possible memory leaks. The crashes and memory leaks can easily be caused by issuing the zram02.sh script from the LTP project [0] in a loop in two separate windows: cd testcases/kernel/device-drivers/zram while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done You end up with a splat as follows: kernel: zram: Removed device: zram0 kernel: zram: Added device: zram0 kernel: zram0: detected capacity change from 0 to 209715200 kernel: Adding 104857596k swap on /dev/zram0. <etc> kernel: zram0: detected capacitky change from 209715200 to 0 kernel: zram0: detected capacity change from 0 to 209715200 kernel: ------------[ cut here ]------------ kernel: Error: Removing state 63 which has instances left. kernel: WARNING: CPU: 7 PID: 70457 at \ kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100 kernel: Modules linked in: zram(E-) zsmalloc(E) <etc> kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G \ E 5.12.0-rc1-next-20210304 #3 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \ BIOS 1.14.0-2 04/01/2014 kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100 kernel: Code: <etc> kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282 kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8 kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0 kernel: RBP: 0000000000000000i R08: 0000000000000000 R09: ffffa800c139bcb8 kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000 kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc> kernel: CS: 0010 DS: 0000 ES 0000 CR0: 0000000080050033 kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0 kernel: Call Trace: kernel: __cpuhp_remove_state+0x2e/0x80 kernel: __do_sys_delete_module+0x190/0x2a0 kernel: do_syscall_64+0x33/0x80 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae The "Error: Removing state 63 which has instances left" refers to the zram per CPU struct zcomp instances left. [0] https://github.com/linux-test-project/ltp.git Acked-by: Minchan Kim <minchan(a)kernel.org> Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org> --- drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++----- 1 file changed, 55 insertions(+), 8 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index f61910c65f0f..b26abcb955cc 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex); static int zram_major; static const char *default_compressor = CONFIG_ZRAM_DEF_COMP; +static bool zram_up; + /* Module params (documentation at end) */ static unsigned int num_devices = 1; /* @@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram) comp = zram->comp; disksize = zram->disksize; zram->disksize = 0; + zram->comp = NULL; set_capacity_and_notify(zram->disk, 0); part_stat_set_all(zram->disk->part0, 0); @@ -1724,9 +1727,18 @@ static ssize_t disksize_store(struct device *dev, struct zram *zram = dev_to_zram(dev); int err; + mutex_lock(&zram_index_mutex); + + if (!zram_up) { + err = -ENODEV; + goto out; + } + disksize = memparse(buf, NULL); - if (!disksize) - return -EINVAL; + if (!disksize) { + err = -EINVAL; + goto out; + } down_write(&zram->init_lock); if (init_done(zram)) { @@ -1754,12 +1766,16 @@ static ssize_t disksize_store(struct device *dev, set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT); up_write(&zram->init_lock); + mutex_unlock(&zram_index_mutex); + return len; out_free_meta: zram_meta_free(zram, disksize); out_unlock: up_write(&zram->init_lock); +out: + mutex_unlock(&zram_index_mutex); return err; } @@ -1775,8 +1791,17 @@ static ssize_t reset_store(struct device *dev, if (ret) return ret; - if (!do_reset) - return -EINVAL; + mutex_lock(&zram_index_mutex); + + if (!zram_up) { + len = -ENODEV; + goto out; + } + + if (!do_reset) { + len = -EINVAL; + goto out; + } zram = dev_to_zram(dev); bdev = zram->disk->part0; @@ -1785,7 +1810,8 @@ static ssize_t reset_store(struct device *dev, /* Do not reset an active device or claimed device */ if (bdev->bd_openers || zram->claim) { mutex_unlock(&bdev->bd_disk->open_mutex); - return -EBUSY; + len = -EBUSY; + goto out; } /* From now on, anyone can't open /dev/zram[0-9] */ @@ -1800,6 +1826,8 @@ static ssize_t reset_store(struct device *dev, zram->claim = false; mutex_unlock(&bdev->bd_disk->open_mutex); +out: + mutex_unlock(&zram_index_mutex); return len; } @@ -2010,6 +2038,10 @@ static ssize_t hot_add_show(struct class *class, int ret; mutex_lock(&zram_index_mutex); + if (!zram_up) { + mutex_unlock(&zram_index_mutex); + return -ENODEV; + } ret = zram_add(); mutex_unlock(&zram_index_mutex); @@ -2037,6 +2069,11 @@ static ssize_t hot_remove_store(struct class *class, mutex_lock(&zram_index_mutex); + if (!zram_up) { + ret = -ENODEV; + goto out; + } + zram = idr_find(&zram_index_idr, dev_id); if (zram) { ret = zram_remove(zram); @@ -2046,6 +2083,7 @@ static ssize_t hot_remove_store(struct class *class, ret = -ENODEV; } +out: mutex_unlock(&zram_index_mutex); return ret ? ret : count; } @@ -2072,12 +2110,15 @@ static int zram_remove_cb(int id, void *ptr, void *data) static void destroy_devices(void) { + mutex_lock(&zram_index_mutex); + zram_up = false; class_unregister(&zram_control_class); idr_for_each(&zram_index_idr, &zram_remove_cb, NULL); zram_debugfs_destroy(); idr_destroy(&zram_index_idr); unregister_blkdev(zram_major, "zram"); cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE); + mutex_unlock(&zram_index_mutex); } static int __init zram_init(void) @@ -2105,15 +2146,21 @@ static int __init zram_init(void) return -EBUSY; } + mutex_lock(&zram_index_mutex); + while (num_devices != 0) { - mutex_lock(&zram_index_mutex); ret = zram_add(); - mutex_unlock(&zram_index_mutex); - if (ret < 0) + if (ret < 0) { + mutex_unlock(&zram_index_mutex); goto out_error; + } num_devices--; } + zram_up = true; + + mutex_unlock(&zram_index_mutex); + return 0; out_error: -- 2.30.2

4 years, 3 months

1
0
0 0

[PATCH v7 10/12] test_sysfs: enable deadlock tests by default

by Luis Chamberlain

Now that sysfs has the deadlock race fixed with module removal, enable the deadlock tests module removal tests. They were left disabled by default as otherwise you would deadlock your system ./tools/testing/selftests/sysfs/sysfs.sh -t 0027 Running test: sysfs_test_0027 - run #0 Test for possible rmmod deadlock while writing x ... ok ./tools/testing/selftests/sysfs/sysfs.sh -t 0028 Running test: sysfs_test_0028 - run #0 Test for possible rmmod deadlock using rtnl_lock while writing x ... ok Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org> --- tools/testing/selftests/sysfs/sysfs.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh index f928635d0e35..4047ac48e764 100755 --- a/tools/testing/selftests/sysfs/sysfs.sh +++ b/tools/testing/selftests/sysfs/sysfs.sh @@ -60,8 +60,8 @@ ALL_TESTS="$ALL_TESTS 0023:1:1:test_dev_y:block" ALL_TESTS="$ALL_TESTS 0024:1:1:test_dev_x:block" ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block" ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block" -ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test -ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock +ALL_TESTS="$ALL_TESTS 0027:1:1:test_dev_x:block" # deadlock test +ALL_TESTS="$ALL_TESTS 0028:1:1:test_dev_x:block" # deadlock test with rntl_lock ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex -- 2.30.2

4 years, 3 months

1
0
0 0

[PATCH v7 08/12] fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755 sysfs_create_dir_ns()

by Luis Chamberlain

If one ends up expanding on this line checkpatch will complain that the combination S_IRWXU|S_IRUGO|S_IXUGO should just be replaced with the octal 0755. Do that. This makes no functional changes. Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org> --- fs/sysfs/dir.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c index 59dffd5ca517..b6b6796e1616 100644 --- a/fs/sysfs/dir.c +++ b/fs/sysfs/dir.c @@ -56,8 +56,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns) kobject_get_ownership(kobj, &uid, &gid); - kn = kernfs_create_dir_ns(parent, kobject_name(kobj), - S_IRWXU | S_IRUGO | S_IXUGO, uid, gid, + kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid, kobj, ns); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) -- 2.30.2

4 years, 3 months

1
0
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-kselftest-mirror September 2021