From: Jacky Bai <ping.bai@nxp.com>
[ Upstream commit 8051a993ce222a5158bccc6ac22ace9253dd71cb ]
Fix Coverity issue CID 250382: Resource leak (RESOURCE_LEAK). Add the missing kfree() on the error return paths.
Signed-off-by: Jacky Bai <ping.bai@nxp.com>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20231009083922.1942971-1-ping.bai@nxp.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/clocksource/timer-imx-gpt.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/drivers/clocksource/timer-imx-gpt.c b/drivers/clocksource/timer-imx-gpt.c
index 6ec6d79b237ce..fcc12f72e907f 100644
--- a/drivers/clocksource/timer-imx-gpt.c
+++ b/drivers/clocksource/timer-imx-gpt.c
@@ -489,12 +489,16 @@ static int __init mxc_timer_init_dt(struct device_node *np, enum imx_gpt_type t
 		return -ENOMEM;
 
 	imxtm->base = of_iomap(np, 0);
-	if (!imxtm->base)
-		return -ENXIO;
+	if (!imxtm->base) {
+		ret = -ENXIO;
+		goto err_kfree;
+	}
 
 	imxtm->irq = irq_of_parse_and_map(np, 0);
-	if (imxtm->irq <= 0)
-		return -EINVAL;
+	if (imxtm->irq <= 0) {
+		ret = -EINVAL;
+		goto err_kfree;
+	}
 
 	imxtm->clk_ipg = of_clk_get_by_name(np, "ipg");
 
@@ -507,11 +511,15 @@ static int __init mxc_timer_init_dt(struct device_node *np, enum imx_gpt_type t
 
 	ret = _mxc_timer_init(imxtm);
 	if (ret)
-		return ret;
+		goto err_kfree;
 
 	initialized = 1;
 
 	return 0;
+
+err_kfree:
+	kfree(imxtm);
+	return ret;
 }
 
 static int __init imx1_timer_init_dt(struct device_node *np)
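The single-exit cleanup shape used by the patch above can be tried outside the kernel. What follows is only a minimal userspace sketch, not the driver code: struct fake_timer, fake_timer_init(), its two failure flags and the live_allocations counter are hypothetical names made up for illustration, and calloc()/free() stand in for the kernel allocator and kfree().

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's timer object; not the kernel struct. */
struct fake_timer {
	int base;
	int irq;
};

/* Objects allocated and not yet freed; only here to make a leak observable. */
static int live_allocations;

/*
 * Allocate a timer and run two setup steps that may fail.
 * fail_map and fail_irq simulate the mapping and IRQ lookups failing.
 * On success the object is stored in *out and 0 is returned. On failure the
 * object is released through the single err_free label and a negative errno
 * value is returned.
 */
static int fake_timer_init(struct fake_timer **out, int fail_map, int fail_irq)
{
	struct fake_timer *tm;
	int ret;

	tm = calloc(1, sizeof(*tm));
	if (!tm)
		return -ENOMEM;
	live_allocations++;

	if (fail_map) {
		ret = -ENXIO;
		goto err_free;
	}
	tm->base = 1;

	if (fail_irq) {
		ret = -EINVAL;
		goto err_free;
	}
	tm->irq = 1;

	*out = tm;
	return 0;

err_free:
	/* Every failure after the allocation funnels through here. */
	free(tm);
	live_allocations--;
	return ret;
}
```

With this shape, fake_timer_init(&t, 1, 0) returns -ENXIO and leaves live_allocations at 0; with plain early returns in place of the gotos the counter would stay at 1, which is the kind of leak the Coverity report describes.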
From: Christian Brauner <christian.brauner@ubuntu.com>
[ Upstream commit 1c5976ef0f7ad76319df748ccb99a4c7ba2ba464 ]
Currently, registering a new binary type pins the binfmt_misc filesystem. Specifically, this means that as long as there is at least one binary type registered the binfmt_misc filesystem survives all umounts, i.e. the superblock is not destroyed. Meaning that a umount followed by another mount will end up with the same superblock and the same binary type handlers. This is a behavior we tend to discourage for any new filesystems (apart from a few special filesystems such as e.g. configfs or debugfs). A umount operation without the filesystem being pinned - by e.g. someone holding a file descriptor to an open file - should usually result in the destruction of the superblock and all associated resources. This makes introspection easier and leads to clearly defined, simple and clean semantics. An administrator can rely on the fact that a umount will guarantee a clean slate making it possible to reinitialize a filesystem. Right now all binary types would need to be explicitly deleted before that can happen.
This allows us to remove the heavy-handed calls to simple_pin_fs() and simple_release_fs() when creating and deleting binary types. This in turn allows us to replace the current brittle pinning mechanism abusing dget() which has caused a range of bugs judging from prior fixes in [2] and [3]. The additional dget() in load_misc_binary() pins the dentry but only does so in order to prevent ->evict_inode() from freeing the node when a user removes the binary type and kill_node() is run. Otherwise ->interpreter and ->interp_file would be freed, causing a UAF.
This isn't really nicely documented nor is it very clean because it relies on simple_pin_fs() pinning the filesystem as long as at least one binary type exists. Otherwise it would cause load_misc_binary() to hold on to a dentry belonging to a superblock that has been shutdown. Replace that implicit pinning with a clean and simple per-node refcount and get rid of the ugly dget() pinning. A similar mechanism exists for e.g. binderfs (cf. [4]). All the cleanup work can now be done in ->evict_inode().
In a follow-up patch we will make it possible to use binfmt_misc in sandboxes. We will use the cleaner semantics where a umount for the filesystem will cause the superblock and all resources to be deallocated. In preparation for this apply the same semantics to the initial binfmt_misc mount. Note, that this is a user-visible change and as such a uapi change but one that we can reasonably risk. We've discussed this in earlier versions of this patchset (cf. [1]).
The main user and provider of binfmt_misc is systemd. Systemd provides binfmt_misc via autofs since it is configurable as a kernel module and is used by a few exotic packages and users. As such a binfmt_misc mount is triggered when /proc/sys/fs/binfmt_misc is accessed and is only provided on demand. Other autofs on demand filesystems include EFI ESP which systemd umounts if the mountpoint stays idle for a certain amount of time. This doesn't apply to the binfmt_misc autofs mount which isn't touched once it is mounted, meaning this change can't accidentally wipe binary type handlers without someone having explicitly unmounted binfmt_misc. After speaking to systemd folks they don't expect this change to affect them.
In line with our general policy, if we see a regression for systemd or other users with this change we will switch back to the old behavior for the initial binfmt_misc mount and have binary types pin the filesystem again. But while we touch this code let's take the chance and let's improve on the status quo.
[1]: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu
[2]: commit 43a4f2619038 ("exec: binfmt_misc: fix race between load_misc_binary() and kill_node()")
[3]: commit 83f918274e4b ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
[4]: commit f0fe2c0f050d ("binder: prevent UAF for binderfs devices II")

Link: https://lore.kernel.org/r/20211028103114.2849140-1-brauner@kernel.org (v1)
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Henning Schild <henning.schild@siemens.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Laurent Vivier <laurent@vivier.eu>
Cc: linux-fsdevel@vger.kernel.org
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
---
/* v2 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Add more comments that explain what's going on.
  - Rename functions while changing them to better reflect what they are
    doing to make the code easier to understand.
  - In the first version when a specific binary type handler was removed
    either through a write to the entry's file or all binary type handlers
    were removed by a write to the binfmt_misc mount's status file all
    cleanup work happened during inode eviction. That includes removal of
    the relevant entries from the entry list. While that works fine I
    disliked that model after thinking about it for a bit. Because it means
    that there was a window where someone has already removed one or all
    binary handlers but they could still be safely reached from
    load_misc_binary() when it has managed to take the read_lock() on the
    entries list while inode eviction was already happening. Again, that's
    perfectly benign but it's cleaner to remove the binary handler from the
    list immediately, meaning that once the write to the entry's file or the
    binfmt_misc status file returns the binary type cannot be executed
    anymore. That gives stronger guarantees to the user.
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/binfmt_misc.c | 216 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 168 insertions(+), 48 deletions(-)
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index c19bf5c2fbec9..e768cd60ff999 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -58,12 +58,11 @@ typedef struct {
 	char *name;
 	struct dentry *dentry;
 	struct file *interp_file;
+	refcount_t users;		/* sync removal with load_misc_binary() */
 } Node;
 
 static DEFINE_RWLOCK(entries_lock);
 static struct file_system_type bm_fs_type;
-static struct vfsmount *bm_mnt;
-static int entry_count;
 
 /*
  * Max length of the register string. Determined by:
@@ -80,19 +79,23 @@ static int entry_count;
  */
 #define MAX_REGISTER_LENGTH 1920
 
-/*
- * Check if we support the binfmt
- * if we do, return the node, else NULL
- * locking is done in load_misc_binary
+/**
+ * search_binfmt_handler - search for a binary handler for @bprm
+ * @misc: handle to binfmt_misc instance
+ * @bprm: binary for which we are looking for a handler
+ *
+ * Search for a binary type handler for @bprm in the list of registered binary
+ * type handlers.
+ *
+ * Return: binary type list entry on success, NULL on failure
  */
-static Node *check_file(struct linux_binprm *bprm)
+static Node *search_binfmt_handler(struct linux_binprm *bprm)
 {
 	char *p = strrchr(bprm->interp, '.');
-	struct list_head *l;
+	Node *e;
 
 	/* Walk all the registered handlers. */
-	list_for_each(l, &entries) {
-		Node *e = list_entry(l, Node, list);
+	list_for_each_entry(e, &entries, list) {
 		char *s;
 		int j;
 
@@ -121,9 +124,49 @@ static Node *check_file(struct linux_binprm *bprm)
 		if (j == e->size)
 			return e;
 	}
+
 	return NULL;
 }
 
+/**
+ * get_binfmt_handler - try to find a binary type handler
+ * @misc: handle to binfmt_misc instance
+ * @bprm: binary for which we are looking for a handler
+ *
+ * Try to find a binfmt handler for the binary type. If one is found take a
+ * reference to protect against removal via bm_{entry,status}_write().
+ *
+ * Return: binary type list entry on success, NULL on failure
+ */
+static Node *get_binfmt_handler(struct linux_binprm *bprm)
+{
+	Node *e;
+
+	read_lock(&entries_lock);
+	e = search_binfmt_handler(bprm);
+	if (e)
+		refcount_inc(&e->users);
+	read_unlock(&entries_lock);
+	return e;
+}
+
+/**
+ * put_binfmt_handler - put binary handler node
+ * @e: node to put
+ *
+ * Free node syncing with load_misc_binary() and defer final free to
+ * load_misc_binary() in case it is using the binary type handler we were
+ * requested to remove.
+ */
+static void put_binfmt_handler(Node *e)
+{
+	if (refcount_dec_and_test(&e->users)) {
+		if (e->flags & MISC_FMT_OPEN_FILE)
+			filp_close(e->interp_file, NULL);
+		kfree(e);
+	}
+}
+
 /*
  * the loader itself
  */
@@ -138,12 +181,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
 	if (!enabled)
 		return retval;
 
-	/* to keep locking time low, we copy the interpreter string */
-	read_lock(&entries_lock);
-	fmt = check_file(bprm);
-	if (fmt)
-		dget(fmt->dentry);
-	read_unlock(&entries_lock);
+	fmt = get_binfmt_handler(bprm);
 	if (!fmt)
 		return retval;
 
@@ -237,7 +275,16 @@ static int load_misc_binary(struct linux_binprm *bprm)
 		goto error;
 
 ret:
-	dput(fmt->dentry);
+
+	/*
+	 * If we actually put the node here all concurrent calls to
+	 * load_misc_binary() will have finished. We also know
+	 * that for the refcount to be zero ->evict_inode() must have removed
+	 * the node to be deleted from the list. All that is left for us is to
+	 * close and free.
+	 */
+	put_binfmt_handler(fmt);
+
 	return retval;
 error:
 	if (fd_binary > 0)
@@ -598,30 +645,90 @@ static struct inode *bm_get_inode(struct super_block *sb, int mode)
 	return inode;
 }
 
+/**
+ * bm_evict_inode - cleanup data associated with @inode
+ * @inode: inode to which the data is attached
+ *
+ * Cleanup the binary type handler data associated with @inode if a binary type
+ * entry is removed or the filesystem is unmounted and the super block is
+ * shutdown.
+ *
+ * If the ->evict call was not caused by a super block shutdown but by a write
+ * to remove the entry or all entries via bm_{entry,status}_write() the entry
+ * will have already been removed from the list. We keep the list_empty() check
+ * to make that explicit.
+*/
 static void bm_evict_inode(struct inode *inode)
 {
 	Node *e = inode->i_private;
 
-	if (e && e->flags & MISC_FMT_OPEN_FILE)
-		filp_close(e->interp_file, NULL);
-
 	clear_inode(inode);
-	kfree(e);
+
+	if (e) {
+		write_lock(&entries_lock);
+		if (!list_empty(&e->list))
+			list_del_init(&e->list);
+		write_unlock(&entries_lock);
+		put_binfmt_handler(e);
+	}
 }
 
-static void kill_node(Node *e)
+/**
+ * unlink_binfmt_dentry - remove the dentry for the binary type handler
+ * @dentry: dentry associated with the binary type handler
+ *
+ * Do the actual filesystem work to remove a dentry for a registered binary
+ * type handler. Since binfmt_misc only allows simple files to be created
+ * directly under the root dentry of the filesystem we ensure that we are
+ * indeed passed a dentry directly beneath the root dentry, that the inode
+ * associated with the root dentry is locked, and that it is a regular file we
+ * are asked to remove.
+ */
+static void unlink_binfmt_dentry(struct dentry *dentry)
 {
-	struct dentry *dentry;
+	struct dentry *parent = dentry->d_parent;
+	struct inode *inode, *parent_inode;
+
+	/* All entries are immediate descendants of the root dentry. */
+	if (WARN_ON_ONCE(dentry->d_sb->s_root != parent))
+		return;
 
+	/* We only expect to be called on regular files. */
+	inode = d_inode(dentry);
+	if (WARN_ON_ONCE(!S_ISREG(inode->i_mode)))
+		return;
+
+	/* The parent inode must be locked. */
+	parent_inode = d_inode(parent);
+	if (WARN_ON_ONCE(!inode_is_locked(parent_inode)))
+		return;
+
+	if (simple_positive(dentry)) {
+		dget(dentry);
+		simple_unlink(parent_inode, dentry);
+		d_delete(dentry);
+		dput(dentry);
+	}
+}
+
+/**
+ * remove_binfmt_handler - remove a binary type handler
+ * @misc: handle to binfmt_misc instance
+ * @e: binary type handler to remove
+ *
+ * Remove a binary type handler from the list of binary type handlers and
+ * remove its associated dentry. This is called from
+ * binfmt_{entry,status}_write(). In the future, we might want to think about
+ * adding a proper ->unlink() method to binfmt_misc instead of forcing caller's
+ * to use writes to files in order to delete binary type handlers. But it has
+ * worked for so long that it's not a pressing issue.
+ */
+static void remove_binfmt_handler(Node *e)
+{
 	write_lock(&entries_lock);
 	list_del_init(&e->list);
 	write_unlock(&entries_lock);
-
-	dentry = e->dentry;
-	drop_nlink(d_inode(dentry));
-	d_drop(dentry);
-	dput(dentry);
-	simple_release_fs(&bm_mnt, &entry_count);
+	unlink_binfmt_dentry(e->dentry);
 }
 
 /* /<entry> */
@@ -648,8 +755,8 @@ bm_entry_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
 				size_t count, loff_t *ppos)
 {
-	struct dentry *root;
-	Node *e = file_inode(file)->i_private;
+	struct inode *inode = file_inode(file);
+	Node *e = inode->i_private;
 	int res = parse_command(buffer, count);
 
 	switch (res) {
@@ -663,13 +770,22 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
 		break;
 	case 3:
 		/* Delete this handler. */
-		root = file_inode(file)->i_sb->s_root;
-		inode_lock(d_inode(root));
+		inode = d_inode(inode->i_sb->s_root);
+		inode_lock(inode);
 
+		/*
+		 * In order to add new element or remove elements from the list
+		 * via bm_{entry,register,status}_write() inode_lock() on the
+		 * root inode must be held.
+		 * The lock is exclusive ensuring that the list can't be
+		 * modified. Only load_misc_binary() can access but does so
+		 * read-only. So we only need to take the write lock when we
+		 * actually remove the entry from the list.
+		 */
 		if (!list_empty(&e->list))
-			kill_node(e);
+			remove_binfmt_handler(e);
 
-		inode_unlock(d_inode(root));
+		inode_unlock(inode);
 		break;
 	default:
 		return res;
@@ -728,13 +844,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 	if (!inode)
 		goto out2;
 
-	err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
-	if (err) {
-		iput(inode);
-		inode = NULL;
-		goto out2;
-	}
-
+	refcount_set(&e->users, 1);
 	e->dentry = dget(dentry);
 	inode->i_private = e;
 	inode->i_fop = &bm_entry_operations;
@@ -778,7 +888,8 @@ static ssize_t bm_status_write(struct file *file, const char __user *buffer,
 		size_t count, loff_t *ppos)
 {
 	int res = parse_command(buffer, count);
-	struct dentry *root;
+	Node *e, *next;
+	struct inode *inode;
 
 	switch (res) {
 	case 1:
@@ -791,13 +902,22 @@ static ssize_t bm_status_write(struct file *file, const char __user *buffer,
 		break;
 	case 3:
 		/* Delete all handlers. */
-		root = file_inode(file)->i_sb->s_root;
-		inode_lock(d_inode(root));
+		inode = d_inode(file_inode(file)->i_sb->s_root);
+		inode_lock(inode);
 
-		while (!list_empty(&entries))
-			kill_node(list_first_entry(&entries, Node, list));
+		/*
+		 * In order to add new element or remove elements from the list
+		 * via bm_{entry,register,status}_write() inode_lock() on the
+		 * root inode must be held.
+		 * The lock is exclusive ensuring that the list can't be
+		 * modified. Only load_misc_binary() can access but does so
+		 * read-only. So we only need to take the write lock when we
+		 * actually remove the entry from the list.
+		 */
+		list_for_each_entry_safe(e, next, &entries, list)
+			remove_binfmt_handler(e);
 
-		inode_unlock(d_inode(root));
+		inode_unlock(inode);
 		break;
 	default:
 		return res;
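The per-node reference count that this patch introduces can be illustrated in isolation. The following is a rough userspace sketch under stated assumptions, not the kernel code: struct node, node_create(), node_get(), node_put() and the freed_nodes counter are hypothetical names, C11 atomics stand in for refcount_t, and the entries list and entries_lock that the real get path relies on are left out.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical handler node; the real Node in fs/binfmt_misc.c has more fields. */
struct node {
	atomic_int users;	/* number of holders, playing the role of Node::users */
};

/* Number of nodes actually freed; only here so the effect can be observed. */
static int freed_nodes;

/* Create a node with one reference, owned by the "filesystem" side. */
static struct node *node_create(void)
{
	struct node *n = malloc(sizeof(*n));

	if (n)
		atomic_init(&n->users, 1);
	return n;
}

/* Take an extra reference, as a loader would before using the node. */
static struct node *node_get(struct node *n)
{
	atomic_fetch_add(&n->users, 1);
	return n;
}

/* Drop a reference; whoever drops the last one frees the node. */
static void node_put(struct node *n)
{
	/* atomic_fetch_sub() returns the value before the decrement. */
	if (atomic_fetch_sub(&n->users, 1) == 1) {
		free(n);
		freed_nodes++;
	}
}
```

If a loader has called node_get() and the handler is then removed (the initial reference is dropped with node_put()), the node stays allocated until the loader's own node_put(). That is the ordering the commit message describes for load_misc_binary() racing with removal.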
And just to be clear, please drop this (and the binfmt_elf change) from all -stable versions, not just 6.6. :)
Thanks!
-Kees
On Mon, Nov 06, 2023 at 06:17:23PM -0500, Sasha Levin wrote:
> From: Christian Brauner <christian.brauner@ubuntu.com>
>
> [ Upstream commit 1c5976ef0f7ad76319df748ccb99a4c7ba2ba464 ]
>
> [...]
On Mon, Nov 06, 2023 at 04:05:46PM -0800, Kees Cook wrote:
And just to be clear, please drop this (and the binfmt_elf change) from all -stable versions, not just 6.6. :)
ack, thanks!
From: Ronald Wahl ronald.wahl@raritan.com
[ Upstream commit 6d3bc4c02d59996d1d3180d8ed409a9d7d5900e0 ]
On SAM9 hardware two cascaded 16 bit timers are used to form a 32 bit high resolution timer that is used as scheduler clock when the kernel has been configured that way (CONFIG_ATMEL_CLOCKSOURCE_TCB).
The driver initially triggers a reset-to-zero of the two timers but this reset is only performed on the next rising clock. For the first timer this is ok - it will be in the next 60ns (16MHz clock). For the chained second timer this will only happen after the first timer overflows, i.e. after 2^16 clocks (~4ms with a 16MHz clock). In other words, the scheduler clock resets to 0 after the first 2^16 clock cycles.
The scheduler apparently does not cope with this and misbehaves for the rest of its lifetime, e.g. some tasks are scheduled with a long delay. Why that is, and whether there are additional requirements for this behaviour, has not been analysed further.
There is a simple fix that resets the second timer together with the first: set the ATMEL_TC_ASWTRG_SET bit in the Channel Mode Register (CMR) of the first timer. This also raises the TIOA line (the clock input of the second timer) when a software trigger or SYNC is issued.
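The timing figures quoted above can be checked with a few lines of arithmetic. The sketch below is a standalone userspace calculation, not driver code: the helper names `tick_ns()` and `cascade_reset_delay_ms()` are made up for illustration, and 16 MHz is simply the example clock rate used in this changelog.

```c
#include <stdint.h>

/* Duration of one input clock period in nanoseconds. */
static double tick_ns(double clk_hz)
{
	return 1e9 / clk_hz;
}

/*
 * Without ASWTRG_SET the chained second 16-bit timer is clocked by
 * TIOA0, so it only sees its first rising edge (and performs the
 * pending reset) once the first timer has run through 2^16 counts.
 */
static double cascade_reset_delay_ms(double clk_hz)
{
	const uint32_t first_timer_period = UINT32_C(1) << 16;

	return (double)first_timer_period * 1e3 / clk_hz;
}
```

For a 16 MHz clock this gives 62.5 ns per tick and about 4.1 ms until the second timer is reset, in line with the rounded values in the text.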
Signed-off-by: Ronald Wahl ronald.wahl@raritan.com
Acked-by: Alexandre Belloni alexandre.belloni@bootlin.com
Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org
Link: https://lore.kernel.org/r/20231007161803.31342-1-rwahl@gmx.de
Signed-off-by: Sasha Levin sashal@kernel.org
---
drivers/clocksource/tcb_clksrc.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/clocksource/tcb_clksrc.c b/drivers/clocksource/tcb_clksrc.c
index 9de47d4d2d9ef..e489730331a23 100644
--- a/drivers/clocksource/tcb_clksrc.c
+++ b/drivers/clocksource/tcb_clksrc.c
@@ -294,6 +294,7 @@ static void __init tcb_setup_dual_chan(struct atmel_tc *tc, int mck_divisor_idx)
 writel(mck_divisor_idx /* likely divide-by-8 */
 | ATMEL_TC_WAVE
 | ATMEL_TC_WAVESEL_UP /* free-run */
+ | ATMEL_TC_ASWTRG_SET /* TIOA0 rises at software trigger */
 | ATMEL_TC_ACPA_SET /* TIOA0 rises at 0 */
 | ATMEL_TC_ACPC_CLEAR, /* (duty cycle 50%) */
 tcaddr + ATMEL_TC_REG(0, CMR));
From: "Mike Rapoport (IBM)" rppt@kernel.org
[ Upstream commit a1e2b8b36820d8c91275f207e77e91645b7c6836 ]
Qi Zheng reported crashes in a production environment and provided a simplified example as a reproducer:
| For example, if we use Qemu to start a two NUMA node kernel,
| one of the nodes has 2M memory (less than NODE_MIN_SIZE),
| and the other node has 2G, then we will encounter the
| following panic:
|
| BUG: kernel NULL pointer dereference, address: 0000000000000000
| <...>
| RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
| <...>
| Call Trace:
| <TASK>
| deactivate_slab()
| bootstrap()
| kmem_cache_init()
| start_kernel()
| secondary_startup_64_no_verify()
The crashes happen because of inconsistency between the nodemask that has nodes with less than 4MB as memoryless, and the actual memory fed into the core mm.
The commit:
9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring empty node in SRAT parsing")
... that introduced minimal size of a NUMA node does not explain why a node size cannot be less than 4MB and what boot failures this restriction might fix.
Fixes have been submitted to the core MM code to tighten up the memory topologies it accepts and to not crash on weird input:
mm: page_alloc: skip memoryless nodes entirely
mm: memory_hotplug: drop memoryless node from fallback lists
Andrew has accepted them into the -mm tree, but there are no stable SHA1's yet.
This patch drops the limitation for minimal node size on x86:
- which works around the crash without the fixes to the core MM.
- makes x86 topologies less weird,
- removes an arbitrary and undocumented limitation on NUMA topologies.
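To make the effect of the removed check concrete, here is a simplified userspace model of it. It is a sketch rather than the kernel code: `struct mem_range`, `node_data_allocated_before()` and `node_data_allocated_after()` are hypothetical names, and only the `NODE_MIN_SIZE` value and the two conditions are taken from the diff in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define NODE_MIN_SIZE (4ULL * 1024 * 1024)	/* value removed by the patch */

struct mem_range {		/* hypothetical stand-in for a node's span */
	uint64_t start;
	uint64_t end;
};

/* Old behaviour: empty nodes and nodes below 4MB were skipped. */
static bool node_data_allocated_before(struct mem_range r)
{
	if (r.start >= r.end)
		return false;
	if (r.end && (r.end - r.start) < NODE_MIN_SIZE)
		return false;
	return true;
}

/* New behaviour: only empty nodes are skipped. */
static bool node_data_allocated_after(struct mem_range r)
{
	return r.start < r.end;
}
```

For the 2M node from the reproducer, the old check returns false even though that memory is still fed into the core mm, which is the inconsistency described above; with the check dropped, the node is treated like any other non-empty node.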
[ mingo: Improved changelog clarity. ]
Reported-by: Qi Zheng zhengqi.arch@bytedance.com
Tested-by: Mario Casquero mcasquer@redhat.com
Signed-off-by: Mike Rapoport (IBM) rppt@kernel.org
Signed-off-by: Ingo Molnar mingo@kernel.org
Acked-by: David Hildenbrand david@redhat.com
Acked-by: Michal Hocko mhocko@suse.com
Cc: Dave Hansen dave.hansen@linux.intel.com
Cc: Rik van Riel riel@surriel.com
Link: https://lore.kernel.org/r/ZS+2qqjEO5/867br@gmail.com
Signed-off-by: Sasha Levin sashal@kernel.org
---
arch/x86/include/asm/numa.h | 7 -------
arch/x86/mm/numa.c | 7 -------
2 files changed, 14 deletions(-)
diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index bbfde3d2662f4..4bcd9d0c7bee7 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -11,13 +11,6 @@
 #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
-/*
- * Too small node sizes may confuse the VM badly. Usually they
- * result from BIOS bugs. So dont recognize nodes as standalone
- * NUMA entities that have less than this amount of RAM listed:
- */
-#define NODE_MIN_SIZE (4*1024*1024)
-
 extern int numa_off;
 /*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 15661129794c0..53b733b2fba10 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -585,13 +585,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 if (start >= end)
 continue;
- /*
- * Don't confuse VM with a node that doesn't have the
- * minimum amount of memory:
- */
- if (end && (end - start) < NODE_MIN_SIZE)
- continue;
-
 alloc_node_data(nid);
 }
linux-stable-mirror@lists.linaro.org