December 2021 - Linux-stable-mirror

[PATCH] hwmon: (dell-smm) Fix warning on /proc/i8k creation error

by Armin Wolf

commit dbd3e6eaf3d813939b28e8a66e29d81cdc836445 upstream.

The removal function is called regardless of whether /proc/i8k was
created successfully or not, the latter causing a WARN() on module
removal. Fix that by only calling the removal function if /proc/i8k
was created successfully.

Since the original patch depends on the driver registering a
platform device, the backported patch stores the return value of
proc_create() and only calls remove_proc_entry() on exit if
proc_create() was successful.

Tested on an Inspiron 3505 for kernel 5.10.

Cc: <stable(a)vger.kernel.org> # 5.10.x
Signed-off-by: Armin Wolf <W_Armin(a)gmx.de>
---
 drivers/hwmon/dell-smm-hwmon.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/hwmon/dell-smm-hwmon.c b/drivers/hwmon/dell-smm-hwmon.c
index 63b74e781c5d..87f401100466 100644
--- a/drivers/hwmon/dell-smm-hwmon.c
+++ b/drivers/hwmon/dell-smm-hwmon.c
@@ -603,15 +603,18 @@ static const struct proc_ops i8k_proc_ops = {
 	.proc_ioctl	= i8k_ioctl,
 };
 
+static struct proc_dir_entry *entry;
+
 static void __init i8k_init_procfs(void)
 {
 	/* Register the proc entry */
-	proc_create("i8k", 0, NULL, &i8k_proc_ops);
+	entry = proc_create("i8k", 0, NULL, &i8k_proc_ops);
 }
 
 static void __exit i8k_exit_procfs(void)
 {
-	remove_proc_entry("i8k", NULL);
+	if (entry)
+		remove_proc_entry("i8k", NULL);
 }
 
 #else
-- 
2.30.2
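
The idea behind this backport (remember whether creation succeeded, and guard the teardown on that) can be sketched in plain user-space C. This is only an illustration under stated assumptions: fake_create() and fake_remove() are hypothetical stand-ins for proc_create() and remove_proc_entry(), not kernel APIs, and the counter exists only so the behaviour can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct proc_dir_entry. */
struct fake_entry {
	const char *name;
};

static struct fake_entry storage;
static int remove_calls;	/* counts teardown calls for the demo */

/* Hypothetical stand-in for proc_create(): returns NULL on failure. */
static struct fake_entry *fake_create(const char *name, bool fail)
{
	assert(name != NULL);
	if (fail)
		return NULL;
	storage.name = name;
	return &storage;
}

/* Hypothetical stand-in for remove_proc_entry(). */
static void fake_remove(const char *name)
{
	(void)name;
	remove_calls++;
}

/* Saved result of the creation step, as in the patch. */
static struct fake_entry *entry;

static void demo_init(bool fail)
{
	entry = fake_create("i8k", fail);
}

static void demo_exit(void)
{
	/* Only tear down what was actually created. */
	if (entry)
		fake_remove("i8k");
}
```

Calling demo_init(true) followed by demo_exit() leaves remove_calls untouched, which mirrors the module-removal path that used to trigger the WARN().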

Re: [PATCH] [fuse] alloc_page nofs avoid deadlock

by Ed Tsai

On Tue, 2021-09-28 at 23:25 +0800, Miklos Szeredi wrote: > On Fri, Sep 24, 2021 at 09:52:35AM +0200, Miklos Szeredi wrote: > > On Fri, 24 Sept 2021 at 05:52, Ed Tsai <ed.tsai(a)mediatek.com> > > wrote: > > > > > > On Wed, 2021-08-18 at 17:24 +0800, Miklos Szeredi wrote: > > > > On Tue, 13 Jul 2021 at 04:42, Ed Tsai <ed.tsai(a)mediatek.com> > > > > wrote: > > > > > > > > > > On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote: > > > > > > On Thu, 3 Jun 2021 at 14:52, chenguanyou < > > > > > > chenguanyou9338(a)gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > ABA deadlock > > > > > > > > > > > > > > PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND: > > > > > > > "Thread-21" > > > > > > > 0 [ffffff802d16b400] __switch_to at ffffff8008086a4c > > > > > > > 1 [ffffff802d16b470] __schedule at ffffff80091ffe58 > > > > > > > 2 [ffffff802d16b4d0] schedule at ffffff8009200348 > > > > > > > 3 [ffffff802d16b4f0] bit_wait at ffffff8009201098 > > > > > > > 4 [ffffff802d16b510] __wait_on_bit at ffffff8009200a34 > > > > > > > 5 [ffffff802d16b5b0] inode_wait_for_writeback at > > > > > > > ffffff800830e1e8 > > > > > > > 6 [ffffff802d16b5e0] evict at ffffff80082fb15c > > > > > > > 7 [ffffff802d16b620] iput at ffffff80082f9270 > > > > > > > 8 [ffffff802d16b680] dentry_unlink_inode at > > > > > > > ffffff80082f4c90 > > > > > > > 9 [ffffff802d16b6a0] __dentry_kill at ffffff80082f1710 > > > > > > > 10 [ffffff802d16b6d0] shrink_dentry_list at > > > > > > > ffffff80082f1c34 > > > > > > > 11 [ffffff802d16b750] prune_dcache_sb at ffffff80082f18a8 > > > > > > > 12 [ffffff802d16b770] super_cache_scan at > > > > > > > ffffff80082d55ac > > > > > > > 13 [ffffff802d16b860] shrink_slab at ffffff8008266170 > > > > > > > 14 [ffffff802d16b900] shrink_node at ffffff800826b420 > > > > > > > 15 [ffffff802d16b980] do_try_to_free_pages at > > > > > > > ffffff8008268460 > > > > > > > 16 [ffffff802d16ba60] try_to_free_pages at > > > > > > > ffffff80082680d0 > > > > > > > 17 [ffffff802d16bbe0] 
__alloc_pages_nodemask at > > > > > > > ffffff8008256514 > > > > > > > 18 [ffffff802d16bc60] fuse_copy_fill at ffffff8008438268 > > > > > > > 19 [ffffff802d16bd00] fuse_dev_do_read at > > > > > > > ffffff8008437654 > > > > > > > 20 [ffffff802d16bdc0] fuse_dev_splice_read at > > > > > > > ffffff8008436f40 > > > > > > > 21 [ffffff802d16be60] sys_splice at ffffff8008315d18 > > > > > > > 22 [ffffff802d16bff0] __sys_trace at ffffff8008084014 > > > > > > > > > > > > > > PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND: > > > > > > > "kworker/u16:8" > > > > > > > 0 [ffffff802e793650] __switch_to at ffffff8008086a4c > > > > > > > 1 [ffffff802e7936c0] __schedule at ffffff80091ffe58 > > > > > > > 2 [ffffff802e793720] schedule at ffffff8009200348 > > > > > > > 3 [ffffff802e793770] __fuse_request_send at > > > > > > > ffffff8008435760 > > > > > > > 4 [ffffff802e7937b0] fuse_simple_request at > > > > > > > ffffff8008435b14 > > > > > > > 5 [ffffff802e793930] fuse_flush_times at ffffff800843a7a0 > > > > > > > 6 [ffffff802e793950] fuse_write_inode at ffffff800843e4dc > > > > > > > 7 [ffffff802e793980] __writeback_single_inode at > > > > > > > ffffff8008312740 > > > > > > > 8 [ffffff802e793aa0] writeback_sb_inodes at > > > > > > > ffffff80083117e4 > > > > > > > 9 [ffffff802e793b00] __writeback_inodes_wb at > > > > > > > ffffff8008311d98 > > > > > > > 10 [ffffff802e793c00] wb_writeback at ffffff8008310cfc > > > > > > > 11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8 > > > > > > > 12 [ffffff802e793d90] process_one_work at > > > > > > > ffffff80080e4fac > > > > > > > 13 [ffffff802e793e00] worker_thread at ffffff80080e5670 > > > > > > > 14 [ffffff802e793e60] kthread at ffffff80080eb650 > > > > > > > > > > > > The issue is real. > > > > > > > > > > > > The fix, however, is not the right one. The fundamental > > > > > > problem > > > > > > is > > > > > > that fuse_write_inode() blocks on a request to userspace. 
> > > > > > > > > > > > This is the same issue that fuse_writepage/fuse_writepages > > > > > > face. In > > > > > > that case the solution was to copy the page contents to a > > > > > > temporary > > > > > > buffer and return immediately as if the writeback already > > > > > > completed. > > > > > > > > > > > > Something similar needs to be done here: send the > > > > > > FUSE_SETATTR > > > > > > request > > > > > > asynchronously and return immediately from > > > > > > fuse_write_inode(). The > > > > > > tricky part is to make sure that multiple time updates for > > > > > > the > > > > > > same > > > > > > inode aren't mixed up... > > > > > > > > > > > > Thanks, > > > > > > Miklos > > > > > > > > > > Dear Szeredi, > > > > > > > > > > Writeback thread calls fuse_write_inode() and wait for user > > > > > Daemon > > > > > to > > > > > complete this write inode request. The user daemon will > > > > > alloc_page() > > > > > after taking this request, and a deadlock could happen when > > > > > we try > > > > > to > > > > > shrink dentry list under memory pressure. > > > > > > > > > > We (Mediatek) glad to work on this issue for mainline and > > > > > also LTS. > > > > > So > > > > > another problem is that we should not change the protocol or > > > > > feature > > > > > for stable kernel. > > > > > > > > > > Use GFP_NOFS | __GFP_HIGHMEM can really avoid this by skip > > > > > the > > > > > dentry > > > > > shirnker. It works but degrade the alloc_page success rate. > > > > > In a > > > > > more > > > > > fundamental way, we could cache the contents and return > > > > > immediately. > > > > > But how to ensure the request will be done successfully, > > > > > e.g., > > > > > always > > > > > retry if it fails from daemon. > > > > > > > > Key is where the the dirty metadata is flushed. 
To prevent > > > > deadlock > > > > it must not be flushed from memory reclaim, so must make sure > > > > that it > > > > is flushed on close(2) and munmap(2) and not dirtied after > > > > that. > > > > > > > > I'm working on this currently and hope to get it ready for the > > > > next > > > > merge window. > > > > > > > > Thanks, > > > > Miklos > > > > > > Hi Miklos, > > > > > > I'm not sure whether it has already been resolved in mainline. > > > If it still WIP, please cc me on future emails. > > > > Hi, > > > > This is taking a bit longer, unfortunately, but I already have > > something in testing and currently cleaning it up for review. Hope > > to > > post a series today or early next week. > > > Here's a minimal patch. It's been through some iterations and some > testing, but > more review and testing is definitely welcome. > > Chenguanyou, can you please verify that it fixes the deadlock? > > Thanks, > Miklos > > --- > From: Miklos Szeredi <mszeredi(a)redhat.com> > Subject: fuse: make sure reclaim doesn't write the inode > > In writeback cache mode mtime/ctime updates are cached, and flushed > to the > server using the ->write_inode() callback. > > Closing the file will result in a dirty inode being immediately > written, > but in other cases the inode can remain dirty after all references > are > dropped. This result in the inode being written back from reclaim, > which > can deadlock on a regular allocation while the request is being > served. > > The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, > because > serving a request involves unrelated userspace process(es). > > Instead do the same as for dirty pages: make sure the inode is > written > before the last reference is gone. 
> > - fuse_vma_close(): flush times in addition to the dirty pages > > - fallocate(2)/copy_file_range(2): these call file_update_time() or > file_modified(), so flush the inode before returning from the call > > - unlink(2), link(2) and rename(2): these call fuse_update_ctime(), > so > flush the ctime directly from this helper > > Reported-by: chenguanyou <chenguanyou(a)xiaomi.com> > Signed-off-by: Miklos Szeredi <mszeredi(a)redhat.com> > --- > fs/fuse/dir.c | 8 ++++++++ > fs/fuse/file.c | 24 +++++++++++++++++++++--- > fs/fuse/fuse_i.h | 1 + > 3 files changed, 30 insertions(+), 3 deletions(-) > > --- a/fs/fuse/dir.c > +++ b/fs/fuse/dir.c > @@ -738,12 +738,20 @@ static int fuse_symlink(struct user_name > return create_new_entry(fm, &args, dir, entry, S_IFLNK); > } > > +void fuse_flush_time_update(struct inode *inode) > +{ > + int err = sync_inode_metadata(inode, 1); > + > + mapping_set_error(inode->i_mapping, err); > +} > + > void fuse_update_ctime(struct inode *inode) > { > fuse_invalidate_attr(inode); > if (!IS_NOCMTIME(inode)) { > inode->i_ctime = current_time(inode); > mark_inode_dirty_sync(inode); > + fuse_flush_time_update(inode); > } > } > > --- a/fs/fuse/file.c > +++ b/fs/fuse/file.c > @@ -1847,6 +1847,17 @@ int fuse_write_inode(struct inode *inode > struct fuse_file *ff; > int err; > > + /* > + * Inode is always written before the last reference is dropped > and > + * hence this should not be reached from reclaim. > + * > + * Writing back the inode from reclaim can deadlock if the > request > + * processing itself needs an allocation. Allocations > triggering > + * reclaim while serving a request can't be prevented, because > it can > + * involve any number of unrelated userspace processes. 
> +	 */
> +	WARN_ON(wbc->for_reclaim);
> +
>  	ff = __fuse_write_file_get(fi);
>  	err = fuse_flush_times(inode, ff);
>  	if (ff)
> @@ -2339,12 +2350,15 @@ static int fuse_launder_page(struct page
>  }
>
>  /*
> - * Write back dirty pages now, because there may not be any suitable
> - * open files later
> + * Write back dirty data/metadata now (there may not be any suitable
> + * open files later for data)
>   */
>  static void fuse_vma_close(struct vm_area_struct *vma)
>  {
> -	filemap_write_and_wait(vma->vm_file->f_mapping);
> +	int err;
> +
> +	err = write_inode_now(vma->vm_file->f_mapping->host, 1);
> +	mapping_set_error(vma->vm_file->f_mapping, err);
>  }
>
>  /*
> @@ -3001,6 +3015,8 @@ static long fuse_file_fallocate(struct f
>  	if (lock_inode)
>  		inode_unlock(inode);
>
> +	fuse_flush_time_update(inode);
> +
>  	return err;
>  }
>
> @@ -3110,6 +3126,8 @@ static ssize_t __fuse_copy_file_range(st
>  	inode_unlock(inode_out);
>  	file_accessed(file_in);
>
> +	fuse_flush_time_update(inode_out);
> +
>  	return err;
>  }
>
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1145,6 +1145,7 @@ int fuse_allow_current_process(struct fu
>
>  u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
>
> +void fuse_flush_time_update(struct inode *inode);
>  void fuse_update_ctime(struct inode *inode);
>
>  int fuse_update_attributes(struct inode *inode, struct file *file);

Hi Miklos, Greg,

This deadlock can be triggered under high memory pressure, and the
patch has been merged as commit 5c791fe ("fuse: make sure reclaim
doesn't write the inode"). Can we take it into the LTS versions?

Best,
Ed Tsai

[PATCH] bpf: fix panic due to oob in bpf_prog_test_run_skb

by Connor O'Brien

From: Daniel Borkmann <daniel(a)iogearbox.net>

commit 6e6fddc78323533be570873abb728b7e0ba7e024 upstream.

syzkaller triggered several panics similar to the below:

  [...]
  [ 248.851531] BUG: KASAN: use-after-free in _copy_to_user+0x5c/0x90
  [ 248.857656] Read of size 985 at addr ffff8808017ffff2 by task a.out/1425
  [...]
  [ 248.865902] CPU: 1 PID: 1425 Comm: a.out Not tainted 4.18.0-rc4+ #13
  [ 248.865903] Hardware name: Supermicro SYS-5039MS-H12TRF/X11SSE-F, BIOS 2.1a 03/08/2018
  [ 248.865905] Call Trace:
  [ 248.865910] dump_stack+0xd6/0x185
  [ 248.865911] ? show_regs_print_info+0xb/0xb
  [ 248.865913] ? printk+0x9c/0xc3
  [ 248.865915] ? kmsg_dump_rewind_nolock+0xe4/0xe4
  [ 248.865919] print_address_description+0x6f/0x270
  [ 248.865920] kasan_report+0x25b/0x380
  [ 248.865922] ? _copy_to_user+0x5c/0x90
  [ 248.865924] check_memory_region+0x137/0x190
  [ 248.865925] kasan_check_read+0x11/0x20
  [ 248.865927] _copy_to_user+0x5c/0x90
  [ 248.865930] bpf_test_finish.isra.8+0x4f/0xc0
  [ 248.865932] bpf_prog_test_run_skb+0x6a0/0xba0
  [...]

After scrubbing the BPF prog a bit from the noise, turns out it called
bpf_skb_change_head() for the lwt_xmit prog with headroom of 2. Nothing
wrong in that, however, this was run with repeat >> 0 in
bpf_prog_test_run_skb() and the same skb thus keeps changing until the
pskb_expand_head() called from skb_cow() keeps bailing out in atomic
alloc context with -ENOMEM. So upon return we'll basically have 0
headroom left yet blindly do the __skb_push() of 14 bytes and keep
copying data from there in bpf_test_finish() out of bounds. Fix to
check if we have enough headroom and if pskb_expand_head() fails, bail
out with error.

Another bug independent of this fix (but related in triggering above)
is that BPF_PROG_TEST_RUN should be reworked to reset the skb/xdp
buffer to its original state from input as otherwise repeating the
same test in a loop won't work for benchmarking when underlying input
buffer is getting changed by the prog each time and reused for the
next run leading to unexpected results.

Fixes: 1cf1cae963c2 ("bpf: introduce BPF_PROG_TEST_RUN command")
Reported-by: syzbot+709412e651e55ed96498(a)syzkaller.appspotmail.com
Reported-by: syzbot+54f39d6ab58f39720a55(a)syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
[connoro: drop test_verifier.c changes not applicable to 4.14]
Signed-off-by: Connor O'Brien <connoro(a)google.com>
---
Hello,

This is a backport for the 4.14 stable tree.

Thanks,
Connor

 net/bpf/test_run.c                          | 17 ++++++++++++++---
 tools/testing/selftests/bpf/test_verifier.c | 18 ++++++++++++++++++
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 6be41a44d688..4f3c08583d8c 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -96,6 +96,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
 	u32 size = kattr->test.data_size_in;
 	u32 repeat = kattr->test.repeat;
 	u32 retval, duration;
+	int hh_len = ETH_HLEN;
 	struct sk_buff *skb;
 	void *data;
 	int ret;
@@ -131,12 +132,22 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
 	skb_reset_network_header(skb);
 
 	if (is_l2)
-		__skb_push(skb, ETH_HLEN);
+		__skb_push(skb, hh_len);
 	if (is_direct_pkt_access)
 		bpf_compute_data_end(skb);
 	retval = bpf_test_run(prog, skb, repeat, &duration);
-	if (!is_l2)
-		__skb_push(skb, ETH_HLEN);
+	if (!is_l2) {
+		if (skb_headroom(skb) < hh_len) {
+			int nhead = HH_DATA_ALIGN(hh_len - skb_headroom(skb));
+
+			if (pskb_expand_head(skb, nhead, 0, GFP_USER)) {
+				kfree_skb(skb);
+				return -ENOMEM;
+			}
+		}
+		memset(__skb_push(skb, hh_len), 0, hh_len);
+	}
+
 	size = skb->len;
 	/* bpf program can never convert linear skb to non-linear */
 	if (WARN_ON_ONCE(skb_is_nonlinear(skb)))
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index d4f611546fc0..0846345fe1e5 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -4334,6 +4334,24 @@ static struct bpf_test tests[] = {
 		.result = ACCEPT,
 		.prog_type = BPF_PROG_TYPE_LWT_XMIT,
 	},
+	{
+		"make headroom for LWT_XMIT",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+			BPF_MOV64_IMM(BPF_REG_2, 34),
+			BPF_MOV64_IMM(BPF_REG_3, 0),
+			BPF_EMIT_CALL(BPF_FUNC_skb_change_head),
+			/* split for s390 to succeed */
+			BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_2, 42),
+			BPF_MOV64_IMM(BPF_REG_3, 0),
+			BPF_EMIT_CALL(BPF_FUNC_skb_change_head),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_LWT_XMIT,
+	},
 	{
 		"invalid access of tc_classid for LWT_IN",
 		.insns = {
-- 
2.34.1.173.g76aa8bc2d0-goog
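
The "check headroom, expand if needed, bail out on failure" step described in this patch can be modelled outside the kernel. The following is a rough user-space sketch, not the kernel code: struct toy_buf, toy_expand_head() and toy_push_header() are made-up names that only imitate the roles of an skb, pskb_expand_head() and __skb_push().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy packet buffer: the payload starts at buf + headroom. */
struct toy_buf {
	unsigned char *buf;
	size_t headroom;	/* free bytes in front of the payload */
	size_t len;		/* payload length */
};

/* Grow the front of the buffer by nhead bytes; returns 0 or -1. */
static int toy_expand_head(struct toy_buf *b, size_t nhead)
{
	unsigned char *n = malloc(nhead + b->headroom + b->len);

	if (!n)
		return -1;
	memcpy(n + nhead + b->headroom, b->buf + b->headroom, b->len);
	free(b->buf);
	b->buf = n;
	b->headroom += nhead;
	return 0;
}

/* Prepend a zeroed header of hh_len bytes, expanding first if needed. */
static int toy_push_header(struct toy_buf *b, size_t hh_len)
{
	assert(b && b->buf);
	if (b->headroom < hh_len &&
	    toy_expand_head(b, hh_len - b->headroom))
		return -1;	/* bail out instead of writing out of bounds */
	b->headroom -= hh_len;
	b->len += hh_len;
	memset(b->buf + b->headroom, 0, hh_len);
	return 0;
}
```

With only 2 bytes of headroom and a 14-byte header, toy_push_header() first grows the buffer by 12 bytes, so the header never lands in front of the allocation.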

[PATCH] bpf: Fix integer overflow in argument calculation for bpf_map_area_alloc

by Connor O'Brien

From: Bui Quang Minh <minhquangbui99(a)gmail.com>

commit 7dd5d437c258bbf4cc15b35229e5208b87b8b4e0 upstream.

On 32-bit architectures, the result of sizeof() is a 32-bit integer, so
the expression becomes a multiplication of two 32-bit integers, which
can potentially lead to integer overflow. As a result,
bpf_map_area_alloc() allocates less memory than needed.

Fix this by casting 1 operand to u64.

Fixes: 0d2c4f964050 ("bpf: Eliminate rlimit-based memory accounting for sockmap and sockhash maps")
Fixes: 99c51064fb06 ("devmap: Use bpf_map_area_alloc() for allocating hash buckets")
Fixes: 546ac1ffb70d ("bpf: add devmap, a map for storing net device references")
Signed-off-by: Bui Quang Minh <minhquangbui99(a)gmail.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Link: https://lore.kernel.org/bpf/20210613143440.71975-1-minhquangbui99@gmail.com
Signed-off-by: Connor O'Brien <connoro(a)google.com>
---
Hello,

This is for the 5.4 and 5.10 kernels.

Thanks,
Connor

 kernel/bpf/devmap.c | 4 ++--
 net/core/sock_map.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 6684696fa457..4b2819b0a05a 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -94,7 +94,7 @@ static struct hlist_head *dev_map_create_hash(unsigned int entries,
 	int i;
 	struct hlist_head *hash;
 
-	hash = bpf_map_area_alloc(entries * sizeof(*hash), numa_node);
+	hash = bpf_map_area_alloc((u64) entries * sizeof(*hash), numa_node);
 	if (hash != NULL)
 		for (i = 0; i < entries; i++)
 			INIT_HLIST_HEAD(&hash[i]);
@@ -159,7 +159,7 @@ static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr)
 
 		spin_lock_init(&dtab->index_lock);
 	} else {
-		dtab->netdev_map = bpf_map_area_alloc(dtab->map.max_entries *
+		dtab->netdev_map = bpf_map_area_alloc((u64) dtab->map.max_entries *
 						      sizeof(struct bpf_dtab_netdev *),
 						      dtab->map.numa_node);
 		if (!dtab->netdev_map)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index df52061f99f7..2646e8f98f67 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -48,7 +48,7 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 	if (err)
 		goto free_stab;
 
-	stab->sks = bpf_map_area_alloc(stab->map.max_entries *
+	stab->sks = bpf_map_area_alloc((u64) stab->map.max_entries *
 				       sizeof(struct sock *),
 				       stab->map.numa_node);
 	if (stab->sks)
-- 
2.34.1.173.g76aa8bc2d0-goog
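
The overflow this patch guards against can be reproduced with fixed-width types on any machine. The sketch below is an illustration only: size_32bit(), size_widened() and demo_overflow() are made-up helper names, with uint32_t standing in for a 32-bit size_t and the (uint64_t) cast playing the role of the (u64) cast in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Size computed in 32-bit arithmetic, as on a 32-bit architecture. */
static uint64_t size_32bit(uint32_t entries, uint32_t elem_size)
{
	uint32_t bytes = entries * elem_size;	/* wraps modulo 2^32 */

	return bytes;
}

/* Size computed after widening one operand, mirroring the (u64) cast. */
static uint64_t size_widened(uint32_t entries, uint32_t elem_size)
{
	return (uint64_t)entries * elem_size;
}

/* Self-check: 0x40000001 entries of 4 bytes need just over 4 GiB. */
static void demo_overflow(void)
{
	assert(size_32bit(0x40000001u, 4) == 4);
	assert(size_widened(0x40000001u, 4) == 0x100000004ULL);
}
```

In the wrapped case the allocator would be asked for 4 bytes while later code indexes over a billion entries, which is the under-allocation the commit message describes.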

[PATCH 5.4] selinux: fix race condition when computing ocontext SIDs

by Vijay Balakrishna

From: Ondrej Mosnacek <omosnace(a)redhat.com>

commit cbfcd13be5cb2a07868afe67520ed181956579a7 upstream.

Current code contains a lot of racy patterns when converting an
ocontext's context structure to an SID. This is being done in a "lazy"
fashion, such that the SID is looked up in the SID table only when it's
first needed and then cached in the "sid" field of the ocontext
structure. However, this is done without any locking or memory barriers
and is thus unsafe.

Between commits 24ed7fdae669 ("selinux: use separate table for initial
SID lookup") and 66f8e2f03c02 ("selinux: sidtab reverse lookup hash
table"), this race condition led to an actual observable bug, because a
pointer to the shared sid field was passed directly to
sidtab_context_to_sid(), which was using this location to also store an
intermediate value, which could have been read by other threads and
interpreted as an SID. In practice this caused e.g. new mounts to get a
wrong (seemingly random) filesystem context, leading to strange
denials. This bug has been spotted in the wild at least twice, see [1]
and [2].

Fix the race condition by making all the racy functions use a common
helper that ensures the ocontext::sid accesses are made safely using
the appropriate SMP constructs.

Note that security_netif_sid() was populating the sid field of both
contexts stored in the ocontext, but only the first one was actually
used. The SELinux wiki's documentation on the "netifcon" policy
statement [3] suggests that using only the first context is
intentional. I kept only the handling of the first context here, as
there is really no point in doing the SID lookup for the unused one.

I wasn't able to reproduce the bug mentioned above on any kernel that
includes commit 66f8e2f03c02, even though it has been reported that the
issue occurs with that commit, too, just less frequently. Thus, I
wasn't able to verify that this patch fixes the issue, but it makes
sense to avoid the race condition regardless.

[1] https://github.com/containers/container-selinux/issues/89 [2] https://lists.fedoraproject.org/archives/list/selinux@lists.fedoraproject.o… [3] https://selinuxproject.org/page/NetworkStatements#netifcon Cc: stable(a)vger.kernel.org Cc: Xinjie Zheng <xinjie(a)google.com> Reported-by: Sujithra Periasamy <sujithra(a)google.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Ondrej Mosnacek <omosnace(a)redhat.com> Signed-off-by: Paul Moore <paul(a)paul-moore.com> (cherry picked from commit cbfcd13be5cb2a07868afe67520ed181956579a7) [vijayb: Backport contextual differences are due to v5.10 RCU related changes are not in 5.4] Signed-off-by: Vijay Balakrishna <vijayb(a)linux.microsoft.com> --- We have kernel crashes with stack traces related to selinux security context to sid in 5.4 -- https://lore.kernel.org/all/af058f59-ce8a-7648-25e8-f8b8a2dbb0ba@linux.micr… Unfortunately we don't have a on-demand repro. We are hoping this patch would help in addressing a possible race in 5.4. [ 6.222870] Unable to handle kernel access to user memory outside uaccess routines at virtual address 000000000000000c [ 6.222875] Mem abort info: [ 6.222876] ESR = 0x96000004 [ 6.222878] EC = 0x25: DABT (current EL), IL = 32 bits [ 6.222879] SET = 0, FnV = 0 [ 6.222881] EA = 0, S1PTW = 0 [ 6.222881] Data abort info: [ 6.222883] ISV = 0, ISS = 0x00000004 [ 6.222884] CM = 0, WnR = 0 [ 6.222887] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000965148000 [ 6.222888] [000000000000000c] pgd=0000000000000000 [ 6.222893] Internal error: Oops: 96000004 [#1] SMP [ 6.227931] Modules linked in: bnxt_en pcie_iproc_platform pcie_iproc diagbe(O) [ 6.235480] CPU: 6 PID: 1 Comm: systemd Tainted: G O 5.4.144-xx #1 [ 6.244632] Hardware name: Overlake (DT) [ 6.248677] pstate: 80400005 (Nzcv daif +PAN -UAO) [ 6.253629] pc : sidtab_context_to_sid+0x154/0x600 [ 6.258570] lr : sidtab_context_to_sid+0x150/0x600 [ 6.263510] sp : ffff80001005b7e0 [ 6.266928] x29: ffff80001005b7e0 x28: 0000000000000000 [ 
6.272406] x27: 0000000000000000 x26: ffff80001005b8d8 [ 6.277884] x25: ffff80001005b8f0 x24: ffff250b25230000 [ 6.283362] x23: ffff80001005b9a4 x22: ffffd429fedb9808 [ 6.288841] x21: ffff80001005b8c0 x20: 0000000000000118 [ 6.294319] x19: 0000000000000000 x18: 0000000000000000 [ 6.299797] x17: 0000000000000000 x16: 0000000000000000 [ 6.305275] x15: 0000000000000000 x14: 0000000000000000 [ 6.310753] x13: 0000000000000000 x12: 0000000000000010 [ 6.316231] x11: 0000000000000010 x10: 0101010101010101 [ 6.321710] x9 : fffffffffffffffe x8 : 7f7f7f7f7f7f7f7f [ 6.327188] x7 : fefefefefeff735e x6 : 0000808080808080 [ 6.332667] x5 : 0000000000000000 x4 : ffff250b25230000 [ 6.338144] x3 : ffff80001005b8c0 x2 : 0000000000000000 [ 6.343622] x1 : 0000000000000119 x0 : 0000000000000000 [ 6.349100] Call trace: [ 6.351625] sidtab_context_to_sid+0x154/0x600 [ 6.356207] security_context_to_sid_core.isra.21+0x190/0x250 [ 6.362133] security_context_to_sid+0x54/0x68 [ 6.366715] selinux_kernfs_init_security+0xd0/0x210 [ 6.371838] security_kernfs_init_security+0x40/0x60 [ 6.376961] __kernfs_new_node+0x174/0x218 [ 6.381185] kernfs_new_node+0x60/0x90 [ 6.385051] __kernfs_create_file+0x60/0x300 [ 6.389457] cgroup_addrm_files+0x14c/0x308 [ 6.393770] css_populate_dir+0x7c/0x168 [ 6.397815] cgroup_apply_control_enable+0x100/0x348 [ 6.402934] cgroup_mkdir+0x380/0x520 [ 6.406710] kernfs_iop_mkdir+0x94/0xf0 [ 6.410666] vfs_mkdir+0xf4/0x1c0 [ 6.414084] do_mkdirat+0x98/0x110 [ 6.417590] __arm64_sys_mkdirat+0x28/0x38 [ 6.421817] el0_svc_handler+0x90/0x138 [ 6.425773] el0_svc+0x8/0x208 [ 6.428925] Code: 2a1403e1 aa1803e0 97fffd81 aa0003fc (b9400c00) [ 6.435219] ---[ end trace bb81d12a8eb77133 ]--- --- security/selinux/ss/services.c | 159 ++++++++++++++++++--------------- 1 file changed, 87 insertions(+), 72 deletions(-) diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c index f62adf3cfce8..a0afe49309c8 100644 --- a/security/selinux/ss/services.c +++ 
b/security/selinux/ss/services.c @@ -2250,6 +2250,43 @@ size_t security_policydb_len(struct selinux_state *state) return len; } +/** + * ocontext_to_sid - Helper to safely get sid for an ocontext + * @sidtab: SID table + * @c: ocontext structure + * @index: index of the context entry (0 or 1) + * @out_sid: pointer to the resulting SID value + * + * For all ocontexts except OCON_ISID the SID fields are populated + * on-demand when needed. Since updating the SID value is an SMP-sensitive + * operation, this helper must be used to do that safely. + * + * WARNING: This function may return -ESTALE, indicating that the caller + * must retry the operation after re-acquiring the policy pointer! + */ +static int ocontext_to_sid(struct sidtab *sidtab, struct ocontext *c, + size_t index, u32 *out_sid) +{ + int rc; + u32 sid; + + /* Ensure the associated sidtab entry is visible to this thread. */ + sid = smp_load_acquire(&c->sid[index]); + if (!sid) { + rc = sidtab_context_to_sid(sidtab, &c->context[index], &sid); + if (rc) + return rc; + + /* + * Ensure the new sidtab entry is visible to other threads + * when they see the SID. + */ + smp_store_release(&c->sid[index], sid); + } + *out_sid = sid; + return 0; +} + /** * security_port_sid - Obtain the SID for a port. 
* @protocol: protocol number @@ -2262,10 +2299,12 @@ int security_port_sid(struct selinux_state *state, struct policydb *policydb; struct sidtab *sidtab; struct ocontext *c; - int rc = 0; + int rc; read_lock(&state->ss->policy_rwlock); +retry: + rc = 0; policydb = &state->ss->policydb; sidtab = state->ss->sidtab; @@ -2279,14 +2318,11 @@ int security_port_sid(struct selinux_state *state, } if (c) { - if (!c->sid[0]) { - rc = sidtab_context_to_sid(sidtab, - &c->context[0], - &c->sid[0]); - if (rc) - goto out; - } - *out_sid = c->sid[0]; + rc = ocontext_to_sid(sidtab, c, 0, out_sid); + if (rc == -ESTALE) + goto retry; + if (rc) + goto out; } else { *out_sid = SECINITSID_PORT; } @@ -2308,10 +2344,12 @@ int security_ib_pkey_sid(struct selinux_state *state, struct policydb *policydb; struct sidtab *sidtab; struct ocontext *c; - int rc = 0; + int rc; read_lock(&state->ss->policy_rwlock); +retry: + rc = 0; policydb = &state->ss->policydb; sidtab = state->ss->sidtab; @@ -2326,14 +2364,11 @@ int security_ib_pkey_sid(struct selinux_state *state, } if (c) { - if (!c->sid[0]) { - rc = sidtab_context_to_sid(sidtab, - &c->context[0], - &c->sid[0]); - if (rc) - goto out; - } - *out_sid = c->sid[0]; + rc = ocontext_to_sid(sidtab, c, 0, out_sid); + if (rc == -ESTALE) + goto retry; + if (rc) + goto out; } else *out_sid = SECINITSID_UNLABELED; @@ -2354,10 +2389,12 @@ int security_ib_endport_sid(struct selinux_state *state, struct policydb *policydb; struct sidtab *sidtab; struct ocontext *c; - int rc = 0; + int rc; read_lock(&state->ss->policy_rwlock); +retry: + rc = 0; policydb = &state->ss->policydb; sidtab = state->ss->sidtab; @@ -2373,14 +2410,11 @@ int security_ib_endport_sid(struct selinux_state *state, } if (c) { - if (!c->sid[0]) { - rc = sidtab_context_to_sid(sidtab, - &c->context[0], - &c->sid[0]); - if (rc) - goto out; - } - *out_sid = c->sid[0]; + rc = ocontext_to_sid(sidtab, c, 0, out_sid); + if (rc == -ESTALE) + goto retry; + if (rc) + goto out; } else *out_sid = 
SECINITSID_UNLABELED; @@ -2399,11 +2433,13 @@ int security_netif_sid(struct selinux_state *state, { struct policydb *policydb; struct sidtab *sidtab; - int rc = 0; + int rc; struct ocontext *c; read_lock(&state->ss->policy_rwlock); +retry: + rc = 0; policydb = &state->ss->policydb; sidtab = state->ss->sidtab; @@ -2415,19 +2451,11 @@ int security_netif_sid(struct selinux_state *state, } if (c) { - if (!c->sid[0] || !c->sid[1]) { - rc = sidtab_context_to_sid(sidtab, - &c->context[0], - &c->sid[0]); - if (rc) - goto out; - rc = sidtab_context_to_sid(sidtab, - &c->context[1], - &c->sid[1]); - if (rc) - goto out; - } - *if_sid = c->sid[0]; + rc = ocontext_to_sid(sidtab, c, 0, if_sid); + if (rc == -ESTALE) + goto retry; + if (rc) + goto out; } else *if_sid = SECINITSID_NETIF; @@ -2469,6 +2497,7 @@ int security_node_sid(struct selinux_state *state, read_lock(&state->ss->policy_rwlock); +retry: policydb = &state->ss->policydb; sidtab = state->ss->sidtab; @@ -2511,14 +2540,11 @@ int security_node_sid(struct selinux_state *state, } if (c) { - if (!c->sid[0]) { - rc = sidtab_context_to_sid(sidtab, - &c->context[0], - &c->sid[0]); - if (rc) - goto out; - } - *out_sid = c->sid[0]; + rc = ocontext_to_sid(sidtab, c, 0, out_sid); + if (rc == -ESTALE) + goto retry; + if (rc) + goto out; } else { *out_sid = SECINITSID_NODE; } @@ -2677,7 +2703,7 @@ static inline int __security_genfs_sid(struct selinux_state *state, u16 sclass; struct genfs *genfs; struct ocontext *c; - int rc, cmp = 0; + int cmp = 0; while (path[0] == '/' && path[1] == '/') path++; @@ -2691,9 +2717,8 @@ static inline int __security_genfs_sid(struct selinux_state *state, break; } - rc = -ENOENT; if (!genfs || cmp) - goto out; + return -ENOENT; for (c = genfs->head; c; c = c->next) { len = strlen(c->u.name); @@ -2702,20 +2727,10 @@ static inline int __security_genfs_sid(struct selinux_state *state, break; } - rc = -ENOENT; if (!c) - goto out; - - if (!c->sid[0]) { - rc = sidtab_context_to_sid(sidtab, &c->context[0], 
&c->sid[0]); - if (rc) - goto out; - } + return -ENOENT; - *sid = c->sid[0]; - rc = 0; -out: - return rc; + return ocontext_to_sid(sidtab, c, 0, sid); } /** @@ -2750,13 +2765,15 @@ int security_fs_use(struct selinux_state *state, struct super_block *sb) { struct policydb *policydb; struct sidtab *sidtab; - int rc = 0; + int rc; struct ocontext *c; struct superblock_security_struct *sbsec = sb->s_security; const char *fstype = sb->s_type->name; read_lock(&state->ss->policy_rwlock); +retry: + rc = 0; policydb = &state->ss->policydb; sidtab = state->ss->sidtab; @@ -2769,13 +2786,11 @@ int security_fs_use(struct selinux_state *state, struct super_block *sb) if (c) { sbsec->behavior = c->v.behavior; - if (!c->sid[0]) { - rc = sidtab_context_to_sid(sidtab, &c->context[0], - &c->sid[0]); - if (rc) - goto out; - } - sbsec->sid = c->sid[0]; + rc = ocontext_to_sid(sidtab, c, 0, &sbsec->sid); + if (rc == -ESTALE) + goto retry; + if (rc) + goto out; } else { rc = __security_genfs_sid(state, fstype, "/", SECCLASS_DIR, &sbsec->sid); -- 2.30.2


FAILED: patch "[PATCH] staging: most: dim2: use device release method" failed to apply to 5.15-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.

thanks,

greg k-h

------------------ original commit in Linus's tree ------------------

From d445aa402d60014a37a199fae2bba379696b007d Mon Sep 17 00:00:00 2001
From: Nikita Yushchenko <nikita.yoush(a)cogentembedded.com>
Date: Tue, 5 Oct 2021 17:34:50 +0300
Subject: [PATCH] staging: most: dim2: use device release method

Commit 723de0f9171e ("staging: most: remove device from interface
structure") moved registration of driver-provided struct device to
the most subsystem. This updated dim2 driver as well.

However, struct device passed to register_device() becomes refcounted,
and must not be explicitly deallocated, but must provide release method
instead. Which is incompatible with managing it via devres.

This patch makes the device structure allocated without devres, adds
device release method, and moves device destruction there.

Fixes: 723de0f9171e ("staging: most: remove device from interface structure")
Signed-off-by: Nikita Yushchenko <nikita.yoush(a)cogentembedded.com>
Link: https://lore.kernel.org/r/20211005143448.8660-2-nikita.yoush@cogentembedded…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>

diff --git a/drivers/staging/most/dim2/dim2.c b/drivers/staging/most/dim2/dim2.c
index 96cb5280a385..bd102329d8c8 100644
--- a/drivers/staging/most/dim2/dim2.c
+++ b/drivers/staging/most/dim2/dim2.c
@@ -727,6 +727,23 @@ static int get_dim2_clk_speed(const char *clock_speed, u8 *val)
 	return -EINVAL;
 }

+static void dim2_release(struct device *d)
+{
+	struct dim2_hdm *dev = container_of(d, struct dim2_hdm, dev);
+	unsigned long flags;
+
+	kthread_stop(dev->netinfo_task);
+
+	spin_lock_irqsave(&dim_lock, flags);
+	dim_shutdown();
+	spin_unlock_irqrestore(&dim_lock, flags);
+
+	if (dev->disable_platform)
+		dev->disable_platform(to_platform_device(d->parent));
+
+	kfree(dev);
+}
+
 /*
  * dim2_probe - dim2 probe handler
  * @pdev: platform device structure
@@ -748,7 +765,7 @@ static int dim2_probe(struct platform_device *pdev)

 	enum { MLB_INT_IDX, AHB0_INT_IDX };

-	dev = devm_kzalloc(&pdev->dev, sizeof(*dev), GFP_KERNEL);
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
 	if (!dev)
 		return -ENOMEM;

@@ -760,19 +777,21 @@ static int dim2_probe(struct platform_device *pdev)
 				      "microchip,clock-speed", &clock_speed);
 	if (ret) {
 		dev_err(&pdev->dev, "missing dt property clock-speed\n");
-		return ret;
+		goto err_free_dev;
 	}

 	ret = get_dim2_clk_speed(clock_speed, &dev->clk_speed);
 	if (ret) {
 		dev_err(&pdev->dev, "bad dt property clock-speed\n");
-		return ret;
+		goto err_free_dev;
 	}

 	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
 	dev->io_base = devm_ioremap_resource(&pdev->dev, res);
-	if (IS_ERR(dev->io_base))
-		return PTR_ERR(dev->io_base);
+	if (IS_ERR(dev->io_base)) {
+		ret = PTR_ERR(dev->io_base);
+		goto err_free_dev;
+	}

 	of_id = of_match_node(dim2_of_match, pdev->dev.of_node);
 	pdata = of_id->data;
@@ -780,7 +799,7 @@ static int dim2_probe(struct platform_device *pdev)
 	if (pdata->enable) {
 		ret = pdata->enable(pdev);
 		if (ret)
-			return ret;
+			goto err_free_dev;
 	}
 	dev->disable_platform = pdata->disable;
 	if (pdata->fcnt)
@@ -875,24 +894,19 @@ static int dim2_probe(struct platform_device *pdev)
 	dev->most_iface.request_netinfo = request_netinfo;
 	dev->most_iface.driver_dev = &pdev->dev;
 	dev->most_iface.dev = &dev->dev;
-	dev->dev.init_name = "dim2_state";
+	dev->dev.init_name = dev->name;
 	dev->dev.parent = &pdev->dev;
+	dev->dev.release = dim2_release;

-	ret = most_register_interface(&dev->most_iface);
-	if (ret) {
-		dev_err(&pdev->dev, "failed to register MOST interface\n");
-		goto err_stop_thread;
-	}
-
-	return 0;
+	return most_register_interface(&dev->most_iface);

-err_stop_thread:
-	kthread_stop(dev->netinfo_task);
 err_shutdown_dim:
 	dim_shutdown();
 err_disable_platform:
 	if (dev->disable_platform)
 		dev->disable_platform(pdev);
+err_free_dev:
+	kfree(dev);

 	return ret;
 }
@@ -906,17 +920,8 @@ static int dim2_probe(struct platform_device *pdev)
 static int dim2_remove(struct platform_device *pdev)
 {
 	struct dim2_hdm *dev = platform_get_drvdata(pdev);
-	unsigned long flags;

 	most_deregister_interface(&dev->most_iface);
-	kthread_stop(dev->netinfo_task);
-
-	spin_lock_irqsave(&dim_lock, flags);
-	dim_shutdown();
-	spin_unlock_irqrestore(&dim_lock, flags);
-
-	if (dev->disable_platform)
-		dev->disable_platform(pdev);

 	return 0;
 }
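
For readers unfamiliar with the ownership rule the commit message relies on, here is a minimal userspace sketch of a refcounted object whose memory is freed only from its release method. It is not the dim2 driver code and not the kernel's device model API: struct obj, obj_init(), obj_get(), obj_put(), struct widget and widget_release() are hypothetical names made up for illustration, and the plain integer refcount is not thread-safe.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted object such as struct device. */
struct obj {
	int refcount;
	void (*release)(struct obj *o);	/* runs when the last reference is dropped */
};

/* Hypothetical driver-private structure embedding the refcounted object. */
struct widget {
	int id;
	struct obj obj;
};

static int released_count;	/* counts release calls, for demonstration only */

static void obj_init(struct obj *o, void (*release)(struct obj *o))
{
	o->refcount = 1;
	o->release = release;
}

static void obj_get(struct obj *o)
{
	o->refcount++;
}

static void obj_put(struct obj *o)
{
	if (--o->refcount == 0)
		o->release(o);
}

/* Recover the containing structure, in the spirit of container_of(). */
static struct widget *to_widget(struct obj *o)
{
	return (struct widget *)((char *)o - offsetof(struct widget, obj));
}

/* Release method: the only place where the memory is freed. */
static void widget_release(struct obj *o)
{
	released_count++;
	free(to_widget(o));
}

static struct widget *widget_create(int id)
{
	struct widget *w = calloc(1, sizeof(*w));

	if (!w)
		return NULL;
	w->id = id;
	obj_init(&w->obj, widget_release);
	return w;
}
```

The creator never calls free() directly. It drops its reference with obj_put(), and the memory goes away only once every other holder has done the same, which is why a devres-managed allocation (freed on driver unbind regardless of outstanding references) does not fit this model.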


[PATCH 5.10] KVM: x86: Ignore sparse banks size for an "all CPUs", non-sparse IPI req

by Vitaly Kuznetsov

From: Sean Christopherson <seanjc(a)google.com>

commit 3244867af8c065e51969f1bffe732d3ebfd9a7d2 upstream.

Do not bail early if there are no bits set in the sparse banks for a
non-sparse, a.k.a. "all CPUs", IPI request. Per the Hyper-V spec, it is
legal to have a variable length of '0', e.g. VP_SET's BankContents in
this case, if the request can be serviced without the extra info.

  It is possible that for a given invocation of a hypercall that does
  accept variable sized input headers that all the header input fits
  entirely within the fixed size header. In such cases the variable sized
  input header is zero-sized and the corresponding bits in the hypercall
  input should be set to zero.

Bailing early results in KVM failing to send IPIs to all CPUs as expected
by the guest.

Fixes: 214ff83d4473 ("KVM: x86: hyperv: implement PV IPI send hypercalls")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets(a)redhat.com>
Message-Id: <20211207220926.718794-2-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets(a)redhat.com>
---
 arch/x86/kvm/hyperv.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index bb39f493447c..328f37e4fd3a 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1641,11 +1641,13 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *current_vcpu, u64 ingpa, u64 outgpa,

 		all_cpus = send_ipi_ex.vp_set.format == HV_GENERIC_SET_ALL;

+		if (all_cpus)
+			goto check_and_send_ipi;
+
 		if (!sparse_banks_len)
 			goto ret_success;

-		if (!all_cpus &&
-		    kvm_read_guest(kvm,
+		if (kvm_read_guest(kvm,
 				   ingpa + offsetof(struct hv_send_ipi_ex,
 						    vp_set.bank_contents),
 				   sparse_banks,
@@ -1653,6 +1655,7 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *current_vcpu, u64 ingpa, u64 outgpa,
 			return HV_STATUS_INVALID_HYPERCALL_INPUT;
 	}

+check_and_send_ipi:
 	if ((vector < HV_IPI_LOW_VECTOR) || (vector > HV_IPI_HIGH_VECTOR))
 		return HV_STATUS_INVALID_HYPERCALL_INPUT;

--
2.33.1
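
The reordered checks can be modelled in isolation. The sketch below is only an illustration of the control flow after the patch, not KVM code: send_ipi_ex_model(), enum ipi_result, the bank_read_ok flag (standing in for a successful kvm_read_guest()) and the vector bounds are all assumptions introduced here.

```c
#include <stdbool.h>

/* Hypothetical outcomes; the real function returns Hyper-V status codes. */
enum ipi_result { IPI_SUCCESS_NOOP, IPI_INVALID_INPUT, IPI_SENT };

/* Illustrative vector bounds, assumed for this sketch. */
#define IPI_LOW_VECTOR  0x10
#define IPI_HIGH_VECTOR 0xff

static enum ipi_result send_ipi_ex_model(bool all_cpus,
					 unsigned int sparse_banks_len,
					 bool bank_read_ok,
					 unsigned int vector)
{
	/* An "all CPUs" request needs no bank data, so skip the bank checks. */
	if (all_cpus)
		goto check_and_send_ipi;

	/* A sparse request with no banks selects no CPUs: nothing to do. */
	if (!sparse_banks_len)
		return IPI_SUCCESS_NOOP;

	/* Only a sparse request actually reads the bank contents. */
	if (!bank_read_ok)
		return IPI_INVALID_INPUT;

check_and_send_ipi:
	if (vector < IPI_LOW_VECTOR || vector > IPI_HIGH_VECTOR)
		return IPI_INVALID_INPUT;

	return IPI_SENT;
}
```

With the old ordering, the first call in a sequence such as send_ipi_ex_model(true, 0, ...) would have hit the empty-banks early return and sent nothing, which is the behaviour the commit message describes as a bug.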


Re: [LKP] Re: [fget] 054aa8d439: will-it-scale.per_thread_ops -5.7% regression

by Linus Torvalds

On Mon, Dec 13, 2021 at 10:37 AM Linus Torvalds
<torvalds(a)linux-foundation.org> wrote:
>
> So I'll just apply the patch. Thanks for the report and the testing

Done, it's commit e386dfc56f83 ("fget: clarify and improve
__fget_files() implementation") in my tree now.

I didn't mark it as "Fixes:" or for stable, because I can't imagine
that it matters in real life.

But then it struck me that Greg has mentioned that he ends up getting
a lot of performance regression reports for people testing stable and
they can be distracting.

So I'm adding a stable cc here just so people are aware of this as a
"yeah, will-it-scale.poll2 performance regression has been reported,
has a fix available if somebody cares".

Linus


stable-rc/queue/4.14 baseline: 110 runs, 1 regressions (v4.14.258-7-g93489bfff549)

by kernelci.org bot

stable-rc/queue/4.14 baseline: 110 runs, 1 regressions (v4.14.258-7-g93489bfff549)

Regressions Summary
-------------------

platform | arch | lab           | compiler | defconfig           | regressions
---------+------+---------------+----------+---------------------+------------
panda    | arm  | lab-collabora | gcc-10   | omap2plus_defconfig | 1

  Details:  https://kernelci.org/test/job/stable-rc/branch/queue%2F4.14/kernel/v4.14.25…

  Test:     baseline
  Tree:     stable-rc
  Branch:   queue/4.14
  Describe: v4.14.258-7-g93489bfff549
  URL:      https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
  SHA:      93489bfff5495e498b3932e011b0221ff242e0b7


Test Regressions
----------------

platform | arch | lab           | compiler | defconfig           | regressions
---------+------+---------------+----------+---------------------+------------
panda    | arm  | lab-collabora | gcc-10   | omap2plus_defconfig | 1

  Details:     https://kernelci.org/test/plan/id/61b9a662b9fba185e3397136

  Results:     4 PASS, 1 FAIL, 1 SKIP
  Full config: omap2plus_defconfig
  Compiler:    gcc-10 (arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110)
  Plain log:   https://storage.kernelci.org//stable-rc/queue-4.14/v4.14.258-7-g93489bfff54…
  HTML log:    https://storage.kernelci.org//stable-rc/queue-4.14/v4.14.258-7-g93489bfff54…
  Rootfs:      http://storage.kernelci.org/images/rootfs/buildroot/buildroot-baseline/2021…

  * baseline.dmesg.emerg: https://kernelci.org/test/case/id/61b9a662b9fba185e3397139
        failing since 1 day (last pass: v4.14.257-33-gcf9830f3ce18, first fail: v4.14.257-53-gbe1979ab4cd9)
        2 lines

    2021-12-15T08:24:48.380136 kern :emerg : BUG: spinlock bad magic on CPU#0, udevd/95
    2021-12-15T08:24:48.389438 kern :emerg : lock: emif_lock+0x0/0xffffed3c [emif], .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
    2021-12-15T08:24:48.405831 [ 19.961425] <LAVA_SIGNAL_TESTCASE TEST_CASE_ID=emerg RESULT=fail UNITS=lines MEASUREMENT=2>


+ mm-fix-panic-in-__alloc_pages.patch added to -mm tree

by akpm＠linux-foundation.org

The patch titled
     Subject: mm: fix panic in __alloc_pages
has been added to the -mm tree. Its filename is
     mm-fix-panic-in-__alloc_pages.patch

This patch should soon appear at
    https://ozlabs.org/~akpm/mmots/broken-out/mm-fix-panic-in-__alloc_pages.pat…
and later at
    https://ozlabs.org/~akpm/mmotm/broken-out/mm-fix-panic-in-__alloc_pages.pat…

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alexey Makhalov <amakhalov(a)vmware.com>
Subject: mm: fix panic in __alloc_pages

There is a kernel panic caused by pcpu_alloc_pages() passing offlined and
uninitialized node to alloc_pages_node() leading to panic by NULL
dereferencing uninitialized NODE_DATA(nid).

 CPU2 has been hot-added
 BUG: unable to handle page fault for address: 0000000000001608
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP PTI
 CPU: 0 PID: 1 Comm: systemd Tainted: G E 5.15.0-rc7+ #11
 Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW

 RIP: 0010:__alloc_pages+0x127/0x290
 Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
 RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
 RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
 RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
 R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
 FS: 00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
 Call Trace:
  pcpu_alloc_pages.constprop.0+0xe4/0x1c0
  pcpu_populate_chunk+0x33/0xb0
  pcpu_alloc+0x4d3/0x6f0
  __alloc_percpu_gfp+0xd/0x10
  alloc_mem_cgroup_per_node_info+0x54/0xb0
  mem_cgroup_alloc+0xed/0x2f0
  mem_cgroup_css_alloc+0x33/0x2f0
  css_create+0x3a/0x1f0
  cgroup_apply_control_enable+0x12b/0x150
  cgroup_mkdir+0xdd/0x110
  kernfs_iop_mkdir+0x4f/0x80
  vfs_mkdir+0x178/0x230
  do_mkdirat+0xfd/0x120
  __x64_sys_mkdir+0x47/0x70
  ? syscall_exit_to_user_mode+0x21/0x50
  do_syscall_64+0x43/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Panic can be easily reproduced by disabling udev rule for automatic
onlining hot added CPU followed by CPU with memoryless node (NUMA node
with CPU only) hot add.

Hot adding CPU and memoryless node does not bring the node to online
state. Memoryless node will be onlined only during the onlining its CPU.

Node can be in one of the following states:

1. not present.(nid == NUMA_NO_NODE)
2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
   NODE_DATA(nid) == NULL)
3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
   NODE_DATA(nid) != NULL)

Percpu code is doing allocations for all possible CPUs. The issue happens
when it serves hot added but not yet onlined CPU when its node is in 2nd
state. This node is not ready to use, fallback to numa_mem_id().

Link: https://lkml.kernel.org/r/20211108202325.20304-1-amakhalov@vmware.com
Signed-off-by: Alexey Makhalov <amakhalov(a)vmware.com>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Dennis Zhou <dennis(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/percpu-vm.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

--- a/mm/percpu-vm.c~mm-fix-panic-in-__alloc_pages
+++ a/mm/percpu-vm.c
@@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_
 			    gfp_t gfp)
 {
 	unsigned int cpu, tcpu;
-	int i;
+	int i, nid;

 	gfp |= __GFP_HIGHMEM;

 	for_each_possible_cpu(cpu) {
+		nid = cpu_to_node(cpu);
+		if (nid == NUMA_NO_NODE || !node_online(nid))
+			nid = numa_mem_id();
+
 		for (i = page_start; i < page_end; i++) {
 			struct page **pagep = &pages[pcpu_page_idx(cpu, i)];

-			*pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+			*pagep = alloc_pages_node(nid, gfp, 0);
 			if (!*pagep)
 				goto err;
 		}
_

Patches currently in -mm which might be from amakhalov(a)vmware.com are

mm-fix-panic-in-__alloc_pages.patch
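
The node selection rule from the patch can be sketched in plain C against the three node states listed in the commit message. This is a userspace model under stated assumptions, not kernel code: struct topology, pick_alloc_node(), NO_NODE and the fixed-size arrays are hypothetical, with local_mem_node standing in for what numa_mem_id() would return.

```c
#include <stdbool.h>

#define NO_NODE   (-1)	/* stand-in for NUMA_NO_NODE */
#define MAX_CPUS  8
#define MAX_NODES 4

/* Hypothetical snapshot of the CPU-to-node mapping and node states. */
struct topology {
	int cpu_node[MAX_CPUS];		/* node of each possible CPU, or NO_NODE */
	bool node_online[MAX_NODES];	/* true once the node has been onlined */
	int local_mem_node;		/* stand-in for numa_mem_id() */
};

/*
 * Pick the node to allocate from on behalf of a CPU: use the CPU's own
 * node only if it is present and online, otherwise fall back to the
 * local memory node.
 */
static int pick_alloc_node(const struct topology *t, int cpu)
{
	int nid = t->cpu_node[cpu];

	if (nid == NO_NODE || !t->node_online[nid])
		nid = t->local_mem_node;

	return nid;
}
```

The NO_NODE test comes first so that the array lookup is never attempted with a negative index, mirroring the order of the two conditions in the patch.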

