From: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp
[ Upstream commit f123dc86388cb669c3d6322702dc441abc35c31e ]
syzbot is reporting sleep in atomic context in SysV filesystem [1], for sb_bread() is called with rw_spinlock held.
A "write_lock(&pointers_lock) => read_lock(&pointers_lock) deadlock" bug and a "sb_bread() with write_lock(&pointers_lock)" bug were introduced by "Replace BKL for chain locking with sysvfs-private rwlock" in Linux 2.5.12.
Then, "[PATCH] err1-40: sysvfs locking fix" in Linux 2.6.8 fixed the former bug by moving pointers_lock lock to the callers, but instead introduced a "sb_bread() with read_lock(&pointers_lock)" bug (which made this problem easier to hit).
Al Viro suggested that why not to do like get_branch()/get_block()/ find_shared() in Minix filesystem does. And doing like that is almost a revert of "[PATCH] err1-40: sysvfs locking fix" except that get_branch() from with find_shared() is called without write_lock(&pointers_lock).
Reported-by: syzbot syzbot+69b40dc5fd40f32c199f@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=69b40dc5fd40f32c199f Suggested-by: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp Link: https://lore.kernel.org/r/0d195f93-a22a-49a2-0020-103534d6f7f6@I-love.SAKURA... Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/sysv/itree.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/fs/sysv/itree.c b/fs/sysv/itree.c index 9925cfe571595..17c7d76770a0a 100644 --- a/fs/sysv/itree.c +++ b/fs/sysv/itree.c @@ -82,9 +82,6 @@ static inline sysv_zone_t *block_end(struct buffer_head *bh) return (sysv_zone_t*)((char*)bh->b_data + bh->b_size); }
-/* - * Requires read_lock(&pointers_lock) or write_lock(&pointers_lock) - */ static Indirect *get_branch(struct inode *inode, int depth, int offsets[], @@ -104,15 +101,18 @@ static Indirect *get_branch(struct inode *inode, bh = sb_bread(sb, block); if (!bh) goto failure; + read_lock(&pointers_lock); if (!verify_chain(chain, p)) goto changed; add_chain(++p, bh, (sysv_zone_t*)bh->b_data + *++offsets); + read_unlock(&pointers_lock); if (!p->key) goto no_block; } return NULL;
changed: + read_unlock(&pointers_lock); brelse(bh); *err = -EAGAIN; goto no_block; @@ -218,9 +218,7 @@ static int get_block(struct inode *inode, sector_t iblock, struct buffer_head *b goto out;
reread: - read_lock(&pointers_lock); partial = get_branch(inode, depth, offsets, chain, &err); - read_unlock(&pointers_lock);
/* Simplest case - block found, no allocation needed */ if (!partial) { @@ -290,9 +288,9 @@ static Indirect *find_shared(struct inode *inode, *top = 0; for (k = depth; k > 1 && !offsets[k-1]; k--) ; + partial = get_branch(inode, k, offsets, chain, &err);
write_lock(&pointers_lock); - partial = get_branch(inode, k, offsets, chain, &err); if (!partial) partial = chain + k-1; /*
From: "Paul E. McKenney" paulmck@kernel.org
[ Upstream commit 2eb52fa8900e642b3b5054c4bf9776089d2a935f ]
The context-switch-time check for RCU Tasks Trace quiescence expects current->trc_reader_special.b.need_qs to be zero, and if so, updates it to TRC_NEED_QS_CHECKED. This is backwards, because if this value is zero, there is no RCU Tasks Trace grace period in flight, an thus no need for a quiescent state. Instead, when a grace period starts, this field is set to TRC_NEED_QS.
This commit therefore changes the check from zero to TRC_NEED_QS.
Reported-by: Steven Rostedt rostedt@goodmis.org Signed-off-by: Paul E. McKenney paulmck@kernel.org Tested-by: Steven Rostedt (Google) rostedt@goodmis.org Signed-off-by: Boqun Feng boqun.feng@gmail.com Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/rcupdate.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index d2507168b9c7b..e7474d7833424 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -205,9 +205,9 @@ void rcu_tasks_trace_qs_blkd(struct task_struct *t); do { \ int ___rttq_nesting = READ_ONCE((t)->trc_reader_nesting); \ \ - if (likely(!READ_ONCE((t)->trc_reader_special.b.need_qs)) && \ + if (unlikely(READ_ONCE((t)->trc_reader_special.b.need_qs) == TRC_NEED_QS) && \ likely(!___rttq_nesting)) { \ - rcu_trc_cmpxchg_need_qs((t), 0, TRC_NEED_QS_CHECKED); \ + rcu_trc_cmpxchg_need_qs((t), TRC_NEED_QS, TRC_NEED_QS_CHECKED); \ } else if (___rttq_nesting && ___rttq_nesting != INT_MIN && \ !READ_ONCE((t)->trc_reader_special.b.blocked)) { \ rcu_tasks_trace_qs_blkd(t); \
From: "Paul E. McKenney" paulmck@kernel.org
[ Upstream commit bfe93930ea1ea3c6c115a7d44af6e4fea609067e ]
Holding a mutex across synchronize_rcu_tasks() and acquiring that same mutex in code called from do_exit() after its call to exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop() results in deadlock. This is by design, because tasks that are far enough into do_exit() are no longer present on the tasks list, making it a bit difficult for RCU Tasks to find them, let alone wait on them to do a voluntary context switch. However, such deadlocks are becoming more frequent. In addition, lockdep currently does not detect such deadlocks and they can be difficult to reproduce.
In addition, if a task voluntarily context switches during that time (for example, if it blocks acquiring a mutex), then this task is in an RCU Tasks quiescent state. And with some adjustments, RCU Tasks could just as well take advantage of that fact.
This commit therefore adds the data structures that will be needed to rely on these quiescent states and to eliminate these deadlocks.
Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin chenzhongjin@huawei.com Reported-by: Yang Jihong yangjihong1@huawei.com Signed-off-by: Paul E. McKenney paulmck@kernel.org Tested-by: Yang Jihong yangjihong1@huawei.com Tested-by: Chen Zhongjin chenzhongjin@huawei.com Reviewed-by: Frederic Weisbecker frederic@kernel.org Signed-off-by: Boqun Feng boqun.feng@gmail.com Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/sched.h | 2 ++ kernel/rcu/tasks.h | 2 ++ 2 files changed, 4 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 0cac69902ec58..ffcd100de169c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -848,6 +848,8 @@ struct task_struct { u8 rcu_tasks_idx; int rcu_tasks_idle_cpu; struct list_head rcu_tasks_holdout_list; + int rcu_tasks_exit_cpu; + struct list_head rcu_tasks_exit_list; #endif /* #ifdef CONFIG_TASKS_RCU */
#ifdef CONFIG_TASKS_TRACE_RCU diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index b5d5b6cf093a7..919c22698569e 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -30,6 +30,7 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp); * @rtp_irq_work: IRQ work queue for deferred wakeups. * @barrier_q_head: RCU callback for barrier operation. * @rtp_blkd_tasks: List of tasks blocked as readers. + * @rtp_exit_list: List of tasks in the latter portion of do_exit(). * @cpu: CPU number corresponding to this entry. * @rtpp: Pointer to the rcu_tasks structure. */ @@ -42,6 +43,7 @@ struct rcu_tasks_percpu { struct irq_work rtp_irq_work; struct rcu_head barrier_q_head; struct list_head rtp_blkd_tasks; + struct list_head rtp_exit_list; int cpu; struct rcu_tasks *rtpp; };
From: "Paul E. McKenney" paulmck@kernel.org
[ Upstream commit 6b70399f9ef3809f6e308fd99dd78b072c1bd05c ]
This commit continues the elimination of deadlocks involving do_exit() and RCU tasks by causing exit_tasks_rcu_start() to add the current task to a per-CPU list and causing exit_tasks_rcu_stop() to remove the current task from whatever list it is on. These lists will be used to track tasks that are exiting, while still accounting for any RCU-tasks quiescent states that these tasks pass though.
[ paulmck: Apply Frederic Weisbecker feedback. ]
Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin chenzhongjin@huawei.com Reported-by: Yang Jihong yangjihong1@huawei.com Signed-off-by: Paul E. McKenney paulmck@kernel.org Tested-by: Yang Jihong yangjihong1@huawei.com Tested-by: Chen Zhongjin chenzhongjin@huawei.com Reviewed-by: Frederic Weisbecker frederic@kernel.org Signed-off-by: Boqun Feng boqun.feng@gmail.com Signed-off-by: Sasha Levin sashal@kernel.org --- kernel/rcu/tasks.h | 43 +++++++++++++++++++++++++++++++++---------- 1 file changed, 33 insertions(+), 10 deletions(-)
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index 919c22698569e..76ad0bc2b3312 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -1014,25 +1014,48 @@ EXPORT_SYMBOL_GPL(show_rcu_tasks_classic_gp_kthread); #endif // !defined(CONFIG_TINY_RCU)
/* - * Contribute to protect against tasklist scan blind spot while the - * task is exiting and may be removed from the tasklist. See - * corresponding synchronize_srcu() for further details. + * Protect against tasklist scan blind spot while the task is exiting and + * may be removed from the tasklist. Do this by adding the task to yet + * another list. + * + * Note that the task will remove itself from this list, so there is no + * need for get_task_struct(), except in the case where rcu_tasks_pertask() + * adds it to the holdout list, in which case rcu_tasks_pertask() supplies + * the needed get_task_struct(). */ -void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu) +void exit_tasks_rcu_start(void) { - current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu); + unsigned long flags; + struct rcu_tasks_percpu *rtpcp; + struct task_struct *t = current; + + WARN_ON_ONCE(!list_empty(&t->rcu_tasks_exit_list)); + preempt_disable(); + rtpcp = this_cpu_ptr(rcu_tasks.rtpcpu); + t->rcu_tasks_exit_cpu = smp_processor_id(); + raw_spin_lock_irqsave_rcu_node(rtpcp, flags); + if (!rtpcp->rtp_exit_list.next) + INIT_LIST_HEAD(&rtpcp->rtp_exit_list); + list_add(&t->rcu_tasks_exit_list, &rtpcp->rtp_exit_list); + raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags); + preempt_enable(); }
/* - * Contribute to protect against tasklist scan blind spot while the - * task is exiting and may be removed from the tasklist. See - * corresponding synchronize_srcu() for further details. + * Remove the task from the "yet another list" because do_exit() is now + * non-preemptible, allowing synchronize_rcu() to wait beyond this point. */ -void exit_tasks_rcu_stop(void) __releases(&tasks_rcu_exit_srcu) +void exit_tasks_rcu_stop(void) { + unsigned long flags; + struct rcu_tasks_percpu *rtpcp; struct task_struct *t = current;
- __srcu_read_unlock(&tasks_rcu_exit_srcu, t->rcu_tasks_idx); + WARN_ON_ONCE(list_empty(&t->rcu_tasks_exit_list)); + rtpcp = per_cpu_ptr(rcu_tasks.rtpcpu, t->rcu_tasks_exit_cpu); + raw_spin_lock_irqsave_rcu_node(rtpcp, flags); + list_del_init(&t->rcu_tasks_exit_list); + raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags); }
/*
From: Roman Smirnov r.smirnov@omp.ru
[ Upstream commit 93f52fbeaf4b676b21acfe42a5152620e6770d02 ]
The expression dst->nr_samples + src->nr_samples may have zero value on overflow. It is necessary to add a check to avoid division by zero.
Found by Linux Verification Center (linuxtesting.org) with Svace.
Signed-off-by: Roman Smirnov r.smirnov@omp.ru Reviewed-by: Sergey Shtylyov s.shtylyov@omp.ru Link: https://lore.kernel.org/r/20240305134509.23108-1-r.smirnov@omp.ru Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- block/blk-stat.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-stat.c b/block/blk-stat.c index da9407b7d4abf..41be89ecaf20e 100644 --- a/block/blk-stat.c +++ b/block/blk-stat.c @@ -28,7 +28,7 @@ void blk_rq_stat_init(struct blk_rq_stat *stat) /* src is a per-cpu stat, mean isn't initialized */ void blk_rq_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src) { - if (!src->nr_samples) + if (dst->nr_samples + src->nr_samples <= dst->nr_samples) return;
dst->min = min(dst->min, src->min);
From: Baolin Wang baolin.wang@linux.alibaba.com
[ Upstream commit 8b3d838139bcd1e552f1899191f734264ce2a1a5 ]
We met a kernel crash issue when running stress-ng testing, and the system crashes when printing the dentry name in dump_mapping().
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 pc : dentry_name+0xd8/0x224 lr : pointer+0x22c/0x370 sp : ffff800025f134c0 ...... Call trace: dentry_name+0xd8/0x224 pointer+0x22c/0x370 vsnprintf+0x1ec/0x730 vscnprintf+0x2c/0x60 vprintk_store+0x70/0x234 vprintk_emit+0xe0/0x24c vprintk_default+0x3c/0x44 vprintk_func+0x84/0x2d0 printk+0x64/0x88 __dump_page+0x52c/0x530 dump_page+0x14/0x20 set_migratetype_isolate+0x110/0x224 start_isolate_page_range+0xc4/0x20c offline_pages+0x124/0x474 memory_block_offline+0x44/0xf4 memory_subsys_offline+0x3c/0x70 device_offline+0xf0/0x120 ......
The root cause is that, one thread is doing page migration, and we will use the target page's ->mapping field to save 'anon_vma' pointer between page unmap and page move, and now the target page is locked and refcount is 1.
Currently, there is another stress-ng thread performing memory hotplug, attempting to offline the target page that is being migrated. It discovers that the refcount of this target page is 1, preventing the offline operation, thus proceeding to dump the page. However, page_mapping() of the target page may return an incorrect file mapping to crash the system in dump_mapping(), since the target page->mapping only saves 'anon_vma' pointer without setting PAGE_MAPPING_ANON flag.
The page migration issue has been fixed by commit d1adb25df711 ("mm: migrate: fix getting incorrect page mapping during page migration"). In addition, Matthew suggested we should also improve dump_mapping()'s robustness to resilient against the kernel crash [1].
With checking the 'dentry.parent' and 'dentry.d_name.name' used by dentry_name(), I can see dump_mapping() will output the invalid dentry instead of crashing the system when this issue is reproduced again.
[12211.189128] page:fffff7de047741c0 refcount:1 mapcount:0 mapping:ffff989117f55ea0 index:0x1 pfn:0x211dd07 [12211.189144] aops:0x0 ino:1 invalid dentry:74786574206e6870 [12211.189148] flags: 0x57ffffc0000001(locked|node=1|zone=2|lastcpupid=0x1fffff) [12211.189150] page_type: 0xffffffff() [12211.189153] raw: 0057ffffc0000001 0000000000000000 dead000000000122 ffff989117f55ea0 [12211.189154] raw: 0000000000000001 0000000000000001 00000001ffffffff 0000000000000000 [12211.189155] page dumped because: unmovable page
[1] https://lore.kernel.org/all/ZXxn%2F0oixJxxAnpF@casper.infradead.org/
Suggested-by: Matthew Wilcox willy@infradead.org Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Link: https://lore.kernel.org/r/937ab1f87328516821d39be672b6bc18861d9d3e.170539142... Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/inode.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/inode.c b/fs/inode.c index 8cfda7a6d5900..5ad22efb5def4 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -589,7 +589,8 @@ void dump_mapping(const struct address_space *mapping) }
dentry_ptr = container_of(dentry_first, struct dentry, d_u.d_alias); - if (get_kernel_nofault(dentry, dentry_ptr)) { + if (get_kernel_nofault(dentry, dentry_ptr) || + !dentry.d_parent || !dentry.d_name.name) { pr_warn("aops:%ps ino:%lx invalid dentry:%px\n", a_ops, ino, dentry_ptr); return;
From: Keith Busch kbusch@kernel.org
[ Upstream commit 7e80eb792bd7377a20f204943ac31c77d859be89 ]
The memory allocated for the identification is freed on failure. Set it to NULL so the caller doesn't have a pointer to that freed address.
Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Keith Busch kbusch@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/nvme/host/core.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 0c088db944706..20c79cc67ce54 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1363,8 +1363,10 @@ static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
error = nvme_submit_sync_cmd(dev->admin_q, &c, *id, sizeof(struct nvme_id_ctrl)); - if (error) + if (error) { kfree(*id); + *id = NULL; + } return error; }
@@ -1493,6 +1495,7 @@ static int nvme_identify_ns(struct nvme_ctrl *ctrl, unsigned nsid, if (error) { dev_warn(ctrl->device, "Identify namespace failed (%d)\n", error); kfree(*id); + *id = NULL; } return error; }
linux-stable-mirror@lists.linaro.org