From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Dongliang reported:
I found that in the latest version, the nodes of tracefs are now
created dynamically.
This causes a problem where the gid I specified in the mount
parameters does not apply to all files, as in the following situation:
/data/tmp/events # mount | grep tracefs
tracefs on /data/tmp type tracefs (rw,seclabel,relatime,gid=3012)
gid 3012 = readtracefs
/data/tmp # ls -lh
total 0
-r--r----- 1 root readtracefs 0 1970-01-01 08:00 README
-r--r----- 1 root readtracefs 0 1970-01-01 08:00 available_events
ums9621_1h10:/data/tmp/events # ls -lh
total 0
drwxr-xr-x 2 root root 0 2023-12-19 00:56 alarmtimer
drwxr-xr-x 2 root root 0 2023-12-19 00:56 asoc
This prevents certain applications from accessing tracefs properly. I
tried to avoid this issue by making the following modifications.
To fix this, have newly created files and directories default to the
ownership of the parent dentry unless the ownership was previously set
by the user.
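The net effect, condensed from the diff below (a sketch only, not the
verbatim kernel function; the surrounding eventfs definitions are
assumed):

	/* A uid/gid explicitly set by the user is preserved; */
	/* otherwise the new inode inherits from its parent.  */
	if (attr->mode & EVENTFS_SAVE_UID)
		inode->i_uid = attr->uid;
	else
		inode->i_uid = d_inode(dentry->d_parent)->i_uid;

	if (attr->mode & EVENTFS_SAVE_GID)
		inode->i_gid = attr->gid;
	else
		inode->i_gid = d_inode(dentry->d_parent)->i_gid;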
Link: https://lore.kernel.org/linux-trace-kernel/1703063706-30539-1-git-send-emai…
Link: https://lore.kernel.org/linux-trace-kernel/20231220105017.1489d790@gandalf.…
Cc: stable(a)vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Hongyu Jin <hongyu.jin(a)unisoc.com>
Fixes: 28e12c09f5aa0 ("eventfs: Save ownership and mode")
Acked-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Reported-by: Dongliang Cui <cuidongliang390(a)gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
fs/tracefs/event_inode.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 43e237864a42..2ccc849a5bda 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -148,7 +148,8 @@ static const struct file_operations eventfs_file_operations = {
.release = eventfs_release,
};
-static void update_inode_attr(struct inode *inode, struct eventfs_attr *attr, umode_t mode)
+static void update_inode_attr(struct dentry *dentry, struct inode *inode,
+ struct eventfs_attr *attr, umode_t mode)
{
if (!attr) {
inode->i_mode = mode;
@@ -162,9 +163,13 @@ static void update_inode_attr(struct inode *inode, struct eventfs_attr *attr, um
if (attr->mode & EVENTFS_SAVE_UID)
inode->i_uid = attr->uid;
+ else
+ inode->i_uid = d_inode(dentry->d_parent)->i_uid;
if (attr->mode & EVENTFS_SAVE_GID)
inode->i_gid = attr->gid;
+ else
+ inode->i_gid = d_inode(dentry->d_parent)->i_gid;
}
/**
@@ -206,7 +211,7 @@ static struct dentry *create_file(const char *name, umode_t mode,
return eventfs_failed_creating(dentry);
/* If the user updated the directory's attributes, use them */
- update_inode_attr(inode, attr, mode);
+ update_inode_attr(dentry, inode, attr, mode);
inode->i_op = &eventfs_file_inode_operations;
inode->i_fop = fop;
@@ -242,7 +247,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
return eventfs_failed_creating(dentry);
/* If the user updated the directory's attributes, use them */
- update_inode_attr(inode, &ei->attr, S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO);
+ update_inode_attr(dentry, inode, &ei->attr,
+ S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO);
inode->i_op = &eventfs_root_dir_inode_operations;
inode->i_fop = &eventfs_file_operations;
--
2.42.0
Dear Stable,
> Lee pointed out an issue found by syzkaller [0] hitting a BUG in the
> prog array map poke update in the prog_array_map_poke_run function,
> due to an error value returned from the bpf_arch_text_poke function.
>
> There's a race window where bpf_arch_text_poke can fail due to missing
> bpf program kallsym symbols, which is accounted for with the check for
> -EINVAL in that BUG_ON call.
>
> The problem is that in such a case we won't update the tail call jump,
> causing an imbalance for the next tail call update check, which will
> fail with -EBUSY in bpf_arch_text_poke.
>
> I'm hitting the following race during program load:
>
> CPU 0 CPU 1
>
> bpf_prog_load
> bpf_check
> do_misc_fixups
> prog_array_map_poke_track
>
> map_update_elem
> bpf_fd_array_map_update_elem
> prog_array_map_poke_run
>
> bpf_arch_text_poke returns -EINVAL
>
> bpf_prog_kallsyms_add
>
> After bpf_arch_text_poke (CPU 1) fails to update the tail call jump, the next
> poke update fails on the expected jump instruction check in bpf_arch_text_poke
> with -EBUSY and triggers the BUG_ON in prog_array_map_poke_run.
>
> A similar race exists on program unload.
>
> Fix this by moving the update into the bpf_arch_poke_desc_update
> function, which makes sure we call __bpf_arch_text_poke, skipping the
> bpf address check.
>
> Each architecture has a slightly different approach wrt looking up the
> bpf address in bpf_arch_text_poke, so instead of splitting the function
> or adding a new 'checkip' argument as in the previous version, it seems
> best to move the whole map_poke_run update into arch-specific code.
>
> [0] https://syzkaller.appspot.com/bug?extid=97a4fe20470e9bc30810
>
> Cc: Lee Jones <lee(a)kernel.org>
> Cc: Maciej Fijalkowski <maciej.fijalkowski(a)intel.com>
> Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
> Reported-by: syzbot+97a4fe20470e9bc30810(a)syzkaller.appspotmail.com
> Acked-by: Yonghong Song <yonghong.song(a)linux.dev>
> Signed-off-by: Jiri Olsa <jolsa(a)kernel.org>
> ---
> arch/x86/net/bpf_jit_comp.c | 46 +++++++++++++++++++++++++++++
> include/linux/bpf.h | 3 ++
> kernel/bpf/arraymap.c | 58 +++++++------------------------------
> 3 files changed, 59 insertions(+), 48 deletions(-)
Please could we have this backported?
Guided by the Fixes: tag.
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 8c10d9abc239..e89e415aa743 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -3025,3 +3025,49 @@ void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp
> #endif
> WARN(1, "verification of programs using bpf_throw should have failed\n");
> }
> +
> +void bpf_arch_poke_desc_update(struct bpf_jit_poke_descriptor *poke,
> + struct bpf_prog *new, struct bpf_prog *old)
> +{
> + u8 *old_addr, *new_addr, *old_bypass_addr;
> + int ret;
> +
> + old_bypass_addr = old ? NULL : poke->bypass_addr;
> + old_addr = old ? (u8 *)old->bpf_func + poke->adj_off : NULL;
> + new_addr = new ? (u8 *)new->bpf_func + poke->adj_off : NULL;
> +
> + /*
> + * On program loading or teardown, the program's kallsym entry
> + * might not be in place, so we use __bpf_arch_text_poke to skip
> + * the kallsyms check.
> + */
> + if (new) {
> + ret = __bpf_arch_text_poke(poke->tailcall_target,
> + BPF_MOD_JUMP,
> + old_addr, new_addr);
> + BUG_ON(ret < 0);
> + if (!old) {
> + ret = __bpf_arch_text_poke(poke->tailcall_bypass,
> + BPF_MOD_JUMP,
> + poke->bypass_addr,
> + NULL);
> + BUG_ON(ret < 0);
> + }
> + } else {
> + ret = __bpf_arch_text_poke(poke->tailcall_bypass,
> + BPF_MOD_JUMP,
> + old_bypass_addr,
> + poke->bypass_addr);
> + BUG_ON(ret < 0);
> + /* let other CPUs finish the execution of program
> + * so that it will not possible to expose them
> + * to invalid nop, stack unwind, nop state
> + */
> + if (!ret)
> + synchronize_rcu();
> + ret = __bpf_arch_text_poke(poke->tailcall_target,
> + BPF_MOD_JUMP,
> + old_addr, NULL);
> + BUG_ON(ret < 0);
> + }
> +}
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 6762dac3ef76..cff5bb08820e 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -3175,6 +3175,9 @@ enum bpf_text_poke_type {
> int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
> void *addr1, void *addr2);
>
> +void bpf_arch_poke_desc_update(struct bpf_jit_poke_descriptor *poke,
> + struct bpf_prog *new, struct bpf_prog *old);
> +
> void *bpf_arch_text_copy(void *dst, void *src, size_t len);
> int bpf_arch_text_invalidate(void *dst, size_t len);
>
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 2058e89b5ddd..c85ff9162a5c 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -1012,11 +1012,16 @@ static void prog_array_map_poke_untrack(struct bpf_map *map,
> mutex_unlock(&aux->poke_mutex);
> }
>
> +void __weak bpf_arch_poke_desc_update(struct bpf_jit_poke_descriptor *poke,
> + struct bpf_prog *new, struct bpf_prog *old)
> +{
> + WARN_ON_ONCE(1);
> +}
> +
> static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
> struct bpf_prog *old,
> struct bpf_prog *new)
> {
> - u8 *old_addr, *new_addr, *old_bypass_addr;
> struct prog_poke_elem *elem;
> struct bpf_array_aux *aux;
>
> @@ -1025,7 +1030,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
>
> list_for_each_entry(elem, &aux->poke_progs, list) {
> struct bpf_jit_poke_descriptor *poke;
> - int i, ret;
> + int i;
>
> for (i = 0; i < elem->aux->size_poke_tab; i++) {
> poke = &elem->aux->poke_tab[i];
> @@ -1044,21 +1049,10 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
> * activated, so tail call updates can arrive from here
> * while JIT is still finishing its final fixup for
> * non-activated poke entries.
> - * 3) On program teardown, the program's kallsym entry gets
> - * removed out of RCU callback, but we can only untrack
> - * from sleepable context, therefore bpf_arch_text_poke()
> - * might not see that this is in BPF text section and
> - * bails out with -EINVAL. As these are unreachable since
> - * RCU grace period already passed, we simply skip them.
> - * 4) Also programs reaching refcount of zero while patching
> + * 3) Also programs reaching refcount of zero while patching
> * is in progress is okay since we're protected under
> * poke_mutex and untrack the programs before the JIT
> - * buffer is freed. When we're still in the middle of
> - * patching and suddenly kallsyms entry of the program
> - * gets evicted, we just skip the rest which is fine due
> - * to point 3).
> - * 5) Any other error happening below from bpf_arch_text_poke()
> - * is a unexpected bug.
> + * buffer is freed.
> */
> if (!READ_ONCE(poke->tailcall_target_stable))
> continue;
> @@ -1068,39 +1062,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
> poke->tail_call.key != key)
> continue;
>
> - old_bypass_addr = old ? NULL : poke->bypass_addr;
> - old_addr = old ? (u8 *)old->bpf_func + poke->adj_off : NULL;
> - new_addr = new ? (u8 *)new->bpf_func + poke->adj_off : NULL;
> -
> - if (new) {
> - ret = bpf_arch_text_poke(poke->tailcall_target,
> - BPF_MOD_JUMP,
> - old_addr, new_addr);
> - BUG_ON(ret < 0 && ret != -EINVAL);
> - if (!old) {
> - ret = bpf_arch_text_poke(poke->tailcall_bypass,
> - BPF_MOD_JUMP,
> - poke->bypass_addr,
> - NULL);
> - BUG_ON(ret < 0 && ret != -EINVAL);
> - }
> - } else {
> - ret = bpf_arch_text_poke(poke->tailcall_bypass,
> - BPF_MOD_JUMP,
> - old_bypass_addr,
> - poke->bypass_addr);
> - BUG_ON(ret < 0 && ret != -EINVAL);
> - /* let other CPUs finish the execution of program
> - * so that it will not possible to expose them
> - * to invalid nop, stack unwind, nop state
> - */
> - if (!ret)
> - synchronize_rcu();
> - ret = bpf_arch_text_poke(poke->tailcall_target,
> - BPF_MOD_JUMP,
> - old_addr, NULL);
> - BUG_ON(ret < 0 && ret != -EINVAL);
> - }
> + bpf_arch_poke_desc_update(poke, new, old);
> }
> }
> }
> --
> 2.43.0
>
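For context, the patch relies on the __weak bpf_arch_poke_desc_update
stub in arraymap.c being overridden by the strong x86 definition at
link time. A minimal two-file sketch of that override mechanism
(hypothetical function names, not from the patch):

/* generic.c: weak default, akin to the WARN_ON_ONCE stub above. */
#include <stdio.h>

void __attribute__((weak)) poke_update(void)
{
	puts("generic weak fallback");
}

int main(void)
{
	poke_update();
	return 0;
}

/* arch.c: strong definition, akin to the x86 one; when this object
 * file is linked in, it silently replaces the weak default. */
#include <stdio.h>

void poke_update(void)
{
	puts("arch override");
}

Building both files together ("cc generic.c arch.c") prints "arch
override"; dropping arch.c falls back to the weak default, so every
architecture keeps building even without its own implementation.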
--
Lee Jones [李琼斯]
Syzkaller reports "memory leak in p9pdu_readf" in 5.10 stable releases.
I've attached reproducers in Bugzilla [1].
The problem has been fixed by the following patch, which can be applied
to the 5.10 branch.
The fix is already present in all stable branches starting from 5.15.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=218235
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
On 32-bit systems, we'll lose the top bits of index because arithmetic
will be performed in unsigned long instead of unsigned long long. This
affects files over 4GB in size.
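For illustration, a standalone sketch of the truncation (not kernel
code: uint32_t stands in for the 32-bit unsigned long, the values are
made up, and the alignment mask is omitted):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const unsigned int PAGE_SHIFT = 12;	/* typical 4K pages */
	uint32_t index = 0x100001;		/* page index just past 4GB */

	/* Old code: the shift wraps in 32-bit arithmetic before widening. */
	uint64_t truncated = (uint32_t)(index << PAGE_SHIFT);

	/* Fixed code: widen first, then shift; the top bits survive. */
	uint64_t correct = (uint64_t)index << PAGE_SHIFT;

	/* Prints truncated=0x1000 correct=0x100001000 */
	printf("truncated=0x%jx correct=0x%jx\n",
	       (uintmax_t)truncated, (uintmax_t)correct);
	return 0;
}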
Fixes: 6100e34b2526 ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages")
Cc: stable(a)vger.kernel.org
Signed-off-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
---
mm/memory-failure.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 82e15baabb48..455093f73a70 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1704,7 +1704,7 @@ static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
* mapping being torn down is communicated in siginfo, see
* kill_proc()
*/
- loff_t start = (index << PAGE_SHIFT) & ~(size - 1);
+ loff_t start = ((loff_t)index << PAGE_SHIFT) & ~(size - 1);
unmap_mapping_range(mapping, start, size, 0);
}
--
2.42.0