From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
[ Upstream commit f123dc86388cb669c3d6322702dc441abc35c31e ]
syzbot is reporting a sleep-in-atomic-context warning in the SysV filesystem [1], because sb_bread() is called with an rw_spinlock held.
A "write_lock(&pointers_lock) => read_lock(&pointers_lock) deadlock" bug and a "sb_bread() with write_lock(&pointers_lock)" bug were introduced by "Replace BKL for chain locking with sysvfs-private rwlock" in Linux 2.5.12.
Then, "[PATCH] err1-40: sysvfs locking fix" in Linux 2.6.8 fixed the former bug by moving pointers_lock lock to the callers, but instead introduced a "sb_bread() with read_lock(&pointers_lock)" bug (which made this problem easier to hit).
Al Viro suggested doing what get_branch()/get_block()/find_shared() in the Minix filesystem do. Doing so is almost a revert of "[PATCH] err1-40: sysvfs locking fix", except that get_branch() called from find_shared() runs without write_lock(&pointers_lock).
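For illustration only (not part of the patch), a minimal sketch of the resulting locking pattern; read_then_verify() and still_valid() are made-up names standing in for get_branch()'s loop body and verify_chain(). The sleeping sb_bread() runs with no lock held, and the rwlock only covers the short, non-sleeping verification step:

#include <linux/buffer_head.h>
#include <linux/err.h>
#include <linux/spinlock.h>

static DEFINE_RWLOCK(pointers_lock);

/* Hypothetical sketch: read a block without the lock, then verify under it. */
static struct buffer_head *read_then_verify(struct super_block *sb, sector_t block,
					    bool (*still_valid)(void *cookie), void *cookie)
{
	struct buffer_head *bh = sb_bread(sb, block);	/* may sleep: no spinlock held */

	if (!bh)
		return ERR_PTR(-EIO);
	read_lock(&pointers_lock);			/* atomic section: nothing here may sleep */
	if (!still_valid(cookie)) {
		read_unlock(&pointers_lock);
		brelse(bh);
		return ERR_PTR(-EAGAIN);		/* chain changed under us; caller retries */
	}
	read_unlock(&pointers_lock);
	return bh;
}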
Reported-by: syzbot <syzbot+69b40dc5fd40f32c199f@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=69b40dc5fd40f32c199f
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/r/0d195f93-a22a-49a2-0020-103534d6f7f6@I-love.SAKURA...
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/sysv/itree.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/fs/sysv/itree.c b/fs/sysv/itree.c
index 410ab2a44d2f6..19bcb51a22036 100644
--- a/fs/sysv/itree.c
+++ b/fs/sysv/itree.c
@@ -83,9 +83,6 @@ static inline sysv_zone_t *block_end(struct buffer_head *bh)
 	return (sysv_zone_t*)((char*)bh->b_data + bh->b_size);
 }
 
-/*
- * Requires read_lock(&pointers_lock) or write_lock(&pointers_lock)
- */
 static Indirect *get_branch(struct inode *inode,
 			    int depth,
 			    int offsets[],
@@ -105,15 +102,18 @@ static Indirect *get_branch(struct inode *inode,
 		bh = sb_bread(sb, block);
 		if (!bh)
 			goto failure;
+		read_lock(&pointers_lock);
 		if (!verify_chain(chain, p))
 			goto changed;
 		add_chain(++p, bh, (sysv_zone_t*)bh->b_data + *++offsets);
+		read_unlock(&pointers_lock);
 		if (!p->key)
 			goto no_block;
 	}
 	return NULL;
 
 changed:
+	read_unlock(&pointers_lock);
 	brelse(bh);
 	*err = -EAGAIN;
 	goto no_block;
@@ -219,9 +219,7 @@ static int get_block(struct inode *inode, sector_t iblock, struct buffer_head *b
 		goto out;
 
 reread:
-	read_lock(&pointers_lock);
 	partial = get_branch(inode, depth, offsets, chain, &err);
-	read_unlock(&pointers_lock);
 
 	/* Simplest case - block found, no allocation needed */
 	if (!partial) {
@@ -291,9 +289,9 @@ static Indirect *find_shared(struct inode *inode,
 	*top = 0;
 	for (k = depth; k > 1 && !offsets[k-1]; k--)
 		;
+	partial = get_branch(inode, k, offsets, chain, &err);
 
 	write_lock(&pointers_lock);
-	partial = get_branch(inode, k, offsets, chain, &err);
 	if (!partial)
 		partial = chain + k-1;
 	/*
From: "Paul E. McKenney" <paulmck@kernel.org>
[ Upstream commit 2eb52fa8900e642b3b5054c4bf9776089d2a935f ]
The context-switch-time check for RCU Tasks Trace quiescence expects current->trc_reader_special.b.need_qs to be zero, and if so, updates it to TRC_NEED_QS_CHECKED. This is backwards, because if this value is zero, there is no RCU Tasks Trace grace period in flight, and thus no need for a quiescent state. Instead, when a grace period starts, this field is set to TRC_NEED_QS.
This commit therefore changes the check from zero to TRC_NEED_QS.
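As a reading aid only (a sketch of the intended state machine, not code taken from the patch), the values involved and the corrected transition look roughly like this:

/*
 * need_qs == 0                   - no Tasks Trace grace period is waiting on this task
 * need_qs == TRC_NEED_QS         - a grace period has started and is waiting on it
 * need_qs == TRC_NEED_QS_CHECKED - its quiescent state has already been recorded
 *
 * So a voluntary context switch outside any read-side critical section
 * should only try to advance TRC_NEED_QS -> TRC_NEED_QS_CHECKED,
 * never 0 -> TRC_NEED_QS_CHECKED:
 */
if (unlikely(READ_ONCE(t->trc_reader_special.b.need_qs) == TRC_NEED_QS) &&
    likely(!READ_ONCE(t->trc_reader_nesting)))
	rcu_trc_cmpxchg_need_qs(t, TRC_NEED_QS, TRC_NEED_QS_CHECKED);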
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 include/linux/rcupdate.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0746b1b0b6639..16f519914415e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -184,9 +184,9 @@ void rcu_tasks_trace_qs_blkd(struct task_struct *t);
 	do {									\
 		int ___rttq_nesting = READ_ONCE((t)->trc_reader_nesting);	\
 										\
-		if (likely(!READ_ONCE((t)->trc_reader_special.b.need_qs)) &&	\
+		if (unlikely(READ_ONCE((t)->trc_reader_special.b.need_qs) == TRC_NEED_QS) && \
 		    likely(!___rttq_nesting)) {					\
-			rcu_trc_cmpxchg_need_qs((t), 0, TRC_NEED_QS_CHECKED);	\
+			rcu_trc_cmpxchg_need_qs((t), TRC_NEED_QS, TRC_NEED_QS_CHECKED); \
 		} else if (___rttq_nesting && ___rttq_nesting != INT_MIN &&	\
 			   !READ_ONCE((t)->trc_reader_special.b.blocked)) {	\
 			rcu_tasks_trace_qs_blkd(t);				\
From: "Paul E. McKenney" <paulmck@kernel.org>
[ Upstream commit bfe93930ea1ea3c6c115a7d44af6e4fea609067e ]
Holding a mutex across synchronize_rcu_tasks() and acquiring that same mutex in code called from do_exit() after its call to exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop() results in deadlock. This is by design, because tasks that are far enough into do_exit() are no longer present on the tasks list, making it a bit difficult for RCU Tasks to find them, let alone wait on them to do a voluntary context switch. However, such deadlocks are becoming more frequent. In addition, lockdep currently does not detect such deadlocks and they can be difficult to reproduce.
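For illustration, the deadlock being described has roughly the following shape; my_mutex, updater() and exit_time_callback() are made-up names used only for this sketch:

#include <linux/mutex.h>
#include <linux/rcupdate.h>

static DEFINE_MUTEX(my_mutex);		/* made-up lock shared by both paths */

/* Updater: holds the mutex across a Tasks-RCU grace period. */
static void updater(void)
{
	mutex_lock(&my_mutex);
	synchronize_rcu_tasks();	/* must wait for every not-yet-finished exiting task */
	mutex_unlock(&my_mutex);
}

/*
 * Exit path: a callback invoked from do_exit() after exit_tasks_rcu_start()
 * but before exit_tasks_rcu_stop() wants the same mutex.
 */
static void exit_time_callback(void)
{
	mutex_lock(&my_mutex);		/* blocks behind updater()... */
	/* ...while updater()'s grace period waits for this exiting task: deadlock. */
	mutex_unlock(&my_mutex);
}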
In addition, if a task voluntarily context switches during that time (for example, if it blocks acquiring a mutex), then this task is in an RCU Tasks quiescent state. And with some adjustments, RCU Tasks could just as well take advantage of that fact.
This commit therefore adds the data structures that will be needed to rely on these quiescent states and to eliminate these deadlocks.
Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 include/linux/sched.h | 2 ++
 kernel/rcu/tasks.h    | 2 ++
 2 files changed, 4 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffe8f618ab869..5eeebed2dd9ba 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -858,6 +858,8 @@ struct task_struct {
 	u8 rcu_tasks_idx;
 	int rcu_tasks_idle_cpu;
 	struct list_head rcu_tasks_holdout_list;
+	int rcu_tasks_exit_cpu;
+	struct list_head rcu_tasks_exit_list;
 #endif /* #ifdef CONFIG_TASKS_RCU */
 
 #ifdef CONFIG_TASKS_TRACE_RCU
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 732ad5b39946a..b7d5f27570532 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -32,6 +32,7 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
  * @rtp_irq_work: IRQ work queue for deferred wakeups.
  * @barrier_q_head: RCU callback for barrier operation.
  * @rtp_blkd_tasks: List of tasks blocked as readers.
+ * @rtp_exit_list: List of tasks in the latter portion of do_exit().
  * @cpu: CPU number corresponding to this entry.
  * @rtpp: Pointer to the rcu_tasks structure.
  */
@@ -46,6 +47,7 @@ struct rcu_tasks_percpu {
 	struct irq_work rtp_irq_work;
 	struct rcu_head barrier_q_head;
 	struct list_head rtp_blkd_tasks;
+	struct list_head rtp_exit_list;
 	int cpu;
 	struct rcu_tasks *rtpp;
 };
From: "Paul E. McKenney" <paulmck@kernel.org>
[ Upstream commit 46faf9d8e1d52e4a91c382c6c72da6bd8e68297b ]
Holding a mutex across synchronize_rcu_tasks() and acquiring that same mutex in code called from do_exit() after its call to exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop() results in deadlock. This is by design, because tasks that are far enough into do_exit() are no longer present on the tasks list, making it a bit difficult for RCU Tasks to find them, let alone wait on them to do a voluntary context switch. However, such deadlocks are becoming more frequent. In addition, lockdep currently does not detect such deadlocks and they can be difficult to reproduce.
In addition, if a task voluntarily context switches during that time (for example, if it blocks acquiring a mutex), then this task is in an RCU Tasks quiescent state. And with some adjustments, RCU Tasks could just as well take advantage of that fact.
This commit therefore initializes the data structures that will be needed to rely on these quiescent states and to eliminate these deadlocks.
Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 init/init_task.c   | 1 +
 kernel/fork.c      | 1 +
 kernel/rcu/tasks.h | 2 ++
 3 files changed, 4 insertions(+)
diff --git a/init/init_task.c b/init/init_task.c
index 7ecb458eb3da6..4daee6d761c86 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -147,6 +147,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.rcu_tasks_holdout = false,
 	.rcu_tasks_holdout_list = LIST_HEAD_INIT(init_task.rcu_tasks_holdout_list),
 	.rcu_tasks_idle_cpu = -1,
+	.rcu_tasks_exit_list = LIST_HEAD_INIT(init_task.rcu_tasks_exit_list),
 #endif
 #ifdef CONFIG_TASKS_TRACE_RCU
 	.trc_reader_nesting = 0,
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d944e92a43ff..af7203be1d2d1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1976,6 +1976,7 @@ static inline void rcu_copy_process(struct task_struct *p)
 	p->rcu_tasks_holdout = false;
 	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
 	p->rcu_tasks_idle_cpu = -1;
+	INIT_LIST_HEAD(&p->rcu_tasks_exit_list);
 #endif /* #ifdef CONFIG_TASKS_RCU */
 #ifdef CONFIG_TASKS_TRACE_RCU
 	p->trc_reader_nesting = 0;
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index b7d5f27570532..4a5d562e31892 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -277,6 +277,8 @@ static void cblist_init_generic(struct rcu_tasks *rtp)
 		rtpcp->rtpp = rtp;
 		if (!rtpcp->rtp_blkd_tasks.next)
 			INIT_LIST_HEAD(&rtpcp->rtp_blkd_tasks);
+		if (!rtpcp->rtp_exit_list.next)
+			INIT_LIST_HEAD(&rtpcp->rtp_exit_list);
 	}
pr_info("%s: Setting shift to %d and lim to %d rcu_task_cb_adjust=%d.\n", rtp->name,
From: "Paul E. McKenney" <paulmck@kernel.org>
[ Upstream commit 6b70399f9ef3809f6e308fd99dd78b072c1bd05c ]
This commit continues the elimination of deadlocks involving do_exit() and RCU Tasks by causing exit_tasks_rcu_start() to add the current task to a per-CPU list and causing exit_tasks_rcu_stop() to remove the current task from whatever list it is on. These lists will be used to track tasks that are exiting, while still accounting for any RCU Tasks quiescent states that these tasks pass through.
[ paulmck: Apply Frederic Weisbecker feedback. ]
Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/rcu/tasks.h | 43 +++++++++++++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 10 deletions(-)
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 4a5d562e31892..68a8adf7de8e9 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1151,25 +1151,48 @@ struct task_struct *get_rcu_tasks_gp_kthread(void)
 EXPORT_SYMBOL_GPL(get_rcu_tasks_gp_kthread);
 
 /*
- * Contribute to protect against tasklist scan blind spot while the
- * task is exiting and may be removed from the tasklist. See
- * corresponding synchronize_srcu() for further details.
+ * Protect against tasklist scan blind spot while the task is exiting and
+ * may be removed from the tasklist. Do this by adding the task to yet
+ * another list.
+ *
+ * Note that the task will remove itself from this list, so there is no
+ * need for get_task_struct(), except in the case where rcu_tasks_pertask()
+ * adds it to the holdout list, in which case rcu_tasks_pertask() supplies
+ * the needed get_task_struct().
  */
-void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
+void exit_tasks_rcu_start(void)
 {
-	current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
+	unsigned long flags;
+	struct rcu_tasks_percpu *rtpcp;
+	struct task_struct *t = current;
+
+	WARN_ON_ONCE(!list_empty(&t->rcu_tasks_exit_list));
+	preempt_disable();
+	rtpcp = this_cpu_ptr(rcu_tasks.rtpcpu);
+	t->rcu_tasks_exit_cpu = smp_processor_id();
+	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
+	if (!rtpcp->rtp_exit_list.next)
+		INIT_LIST_HEAD(&rtpcp->rtp_exit_list);
+	list_add(&t->rcu_tasks_exit_list, &rtpcp->rtp_exit_list);
+	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
+	preempt_enable();
 }
 
 /*
- * Contribute to protect against tasklist scan blind spot while the
- * task is exiting and may be removed from the tasklist. See
- * corresponding synchronize_srcu() for further details.
+ * Remove the task from the "yet another list" because do_exit() is now
+ * non-preemptible, allowing synchronize_rcu() to wait beyond this point.
  */
-void exit_tasks_rcu_stop(void) __releases(&tasks_rcu_exit_srcu)
+void exit_tasks_rcu_stop(void)
 {
+	unsigned long flags;
+	struct rcu_tasks_percpu *rtpcp;
 	struct task_struct *t = current;
 
-	__srcu_read_unlock(&tasks_rcu_exit_srcu, t->rcu_tasks_idx);
+	WARN_ON_ONCE(list_empty(&t->rcu_tasks_exit_list));
+	rtpcp = per_cpu_ptr(rcu_tasks.rtpcpu, t->rcu_tasks_exit_cpu);
+	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
+	list_del_init(&t->rcu_tasks_exit_list);
+	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
 }
/*
From: "Paul E. McKenney" <paulmck@kernel.org>
[ Upstream commit 1612160b91272f5b1596f499584d6064bf5be794 ]
Holding a mutex across synchronize_rcu_tasks() and acquiring that same mutex in code called from do_exit() after its call to exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop() results in deadlock. This is by design, because tasks that are far enough into do_exit() are no longer present on the tasks list, making it a bit difficult for RCU Tasks to find them, let alone wait on them to do a voluntary context switch. However, such deadlocks are becoming more frequent. In addition, lockdep currently does not detect such deadlocks and they can be difficult to reproduce.
In addition, if a task voluntarily context switches during that time (for example, if it blocks acquiring a mutex), then this task is in an RCU Tasks quiescent state. And with some adjustments, RCU Tasks could just as well take advantage of that fact.
This commit therefore eliminates these deadlocks by replacing the SRCU-based wait for do_exit() completion with per-CPU lists of tasks currently exiting. A given task will be on one of these per-CPU lists for the same period of time that it would previously have spent in the SRCU read-side critical section. These lists enable RCU Tasks to find the tasks that have already been removed from the tasks list, but that must nevertheless be waited upon.
The RCU Tasks grace period gathers any of these do_exit() tasks that it must wait on, and adds them to the list of holdouts. Per-CPU locking and get_task_struct() are used to synchronize addition to and removal from these lists.
Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/rcu/tasks.h | 44 ++++++++++++++++++++++++++++----------------
 1 file changed, 28 insertions(+), 16 deletions(-)
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 68a8adf7de8e9..4dc355b2ac229 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -146,8 +146,6 @@ static struct rcu_tasks rt_name =					\
 }
 
 #ifdef CONFIG_TASKS_RCU
-/* Track exiting tasks in order to allow them to be waited for. */
-DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 
 /* Report delay in synchronize_srcu() completion in rcu_tasks_postscan(). */
 static void tasks_rcu_exit_srcu_stall(struct timer_list *unused);
@@ -855,10 +853,12 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 //	number of voluntary context switches, and add that task to the
 //	holdout list.
 // rcu_tasks_postscan():
-//	Invoke synchronize_srcu() to ensure that all tasks that were
-//	in the process of exiting (and which thus might not know to
-//	synchronize with this RCU Tasks grace period) have completed
-//	exiting.
+//	Gather per-CPU lists of tasks in do_exit() to ensure that all
+//	tasks that were in the process of exiting (and which thus might
+//	not know to synchronize with this RCU Tasks grace period) have
+//	completed exiting.  The synchronize_rcu() in rcu_tasks_postgp()
+//	will take care of any tasks stuck in the non-preemptible region
+//	of do_exit() following its call to exit_tasks_rcu_stop().
 // check_all_holdout_tasks(), repeatedly until holdout list is empty:
 //	Scans the holdout list, attempting to identify a quiescent state
 //	for each task on the list.  If there is a quiescent state, the
@@ -871,8 +871,10 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 // with interrupts disabled.
 //
 // For each exiting task, the exit_tasks_rcu_start() and
-// exit_tasks_rcu_finish() functions begin and end, respectively, the SRCU
-// read-side critical sections waited for by rcu_tasks_postscan().
+// exit_tasks_rcu_finish() functions add and remove, respectively, the
+// current task to a per-CPU list of tasks that rcu_tasks_postscan() must
+// wait on.  This is necessary because rcu_tasks_postscan() must wait on
+// tasks that have already been removed from the global list of tasks.
 //
 // Pre-grace-period update-side code is ordered before the grace
 // via the raw_spin_lock.*rcu_node().  Pre-grace-period read-side code
@@ -936,9 +938,13 @@ static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
 	}
 }
 
+void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
+DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
+
 /* Processing between scanning taskslist and draining the holdout list. */
 static void rcu_tasks_postscan(struct list_head *hop)
 {
+	int cpu;
 	int rtsi = READ_ONCE(rcu_task_stall_info);
 
 	if (!IS_ENABLED(CONFIG_TINY_RCU)) {
@@ -952,9 +958,9 @@ static void rcu_tasks_postscan(struct list_head *hop)
 	 * this, divide the fragile exit path part in two intersecting
 	 * read side critical sections:
 	 *
-	 * 1) An _SRCU_ read side starting before calling exit_notify(),
-	 *    which may remove the task from the tasklist, and ending after
-	 *    the final preempt_disable() call in do_exit().
+	 * 1) A task_struct list addition before calling exit_notify(),
+	 *    which may remove the task from the tasklist, with the
+	 *    removal after the final preempt_disable() call in do_exit().
 	 *
 	 * 2) An _RCU_ read side starting with the final preempt_disable()
 	 *    call in do_exit() and ending with the final call to schedule()
@@ -963,7 +969,17 @@ static void rcu_tasks_postscan(struct list_head *hop)
 	 * This handles the part 1). And postgp will handle part 2) with a
 	 * call to synchronize_rcu().
 	 */
-	synchronize_srcu(&tasks_rcu_exit_srcu);
+
+	for_each_possible_cpu(cpu) {
+		struct rcu_tasks_percpu *rtpcp = per_cpu_ptr(rcu_tasks.rtpcpu, cpu);
+		struct task_struct *t;
+
+		raw_spin_lock_irq_rcu_node(rtpcp);
+		list_for_each_entry(t, &rtpcp->rtp_exit_list, rcu_tasks_exit_list)
+			if (list_empty(&t->rcu_tasks_holdout_list))
+				rcu_tasks_pertask(t, hop);
+		raw_spin_unlock_irq_rcu_node(rtpcp);
+	}
 
 	if (!IS_ENABLED(CONFIG_TINY_RCU))
 		del_timer_sync(&tasks_rcu_exit_srcu_stall_timer);
@@ -1031,7 +1047,6 @@ static void rcu_tasks_postgp(struct rcu_tasks *rtp)
 	 *
 	 * In addition, this synchronize_rcu() waits for exiting tasks
 	 * to complete their final preempt_disable() region of execution,
-	 * cleaning up after synchronize_srcu(&tasks_rcu_exit_srcu),
 	 * enforcing the whole region before tasklist removal until
 	 * the final schedule() with TASK_DEAD state to be an RCU TASKS
 	 * read side critical section.
@@ -1039,9 +1054,6 @@ static void rcu_tasks_postgp(struct rcu_tasks *rtp)
 	synchronize_rcu();
 }
 
-void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
-DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
-
 static void tasks_rcu_exit_srcu_stall(struct timer_list *unused)
 {
 #ifndef CONFIG_TINY_RCU
From: "Paul E. McKenney" <paulmck@kernel.org>
[ Upstream commit 0bb11a372fc8d7006b4d0f42a2882939747bdbff ]
The current code will scan the entirety of each per-CPU list of exiting tasks in ->rtp_exit_list with interrupts disabled. This is normally just fine, because each CPU typically won't have very many tasks in this state. However, if a large number of tasks block late in do_exit(), these lists could be arbitrarily long. Low probability, perhaps, but it really could happen.
This commit therefore occasionally re-enables interrupts while traversing these lists, inserting a dummy element to hold the current place in the list. In kernels built with CONFIG_PREEMPT_RT=y, this re-enabling happens after each list element is processed, otherwise every one-to-two jiffies.
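For illustration only, a minimal sketch of the place-holder trick under a plain spinlock; struct exiting_task and scan_with_pauses() are made-up names (the real code walks task_struct entries under rcu_node-style locks), and the jiffies-based pacing is omitted, so this version pauses after every element as the CONFIG_PREEMPT_RT=y case does:

#include <linux/list.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

/* Made-up element type standing in for task_struct. */
struct exiting_task {
	struct list_head node;
};

/*
 * Park a bare list_head at the current position, drop the lock so
 * interrupts (and preemption) can get in, then resume from the
 * place-holder and remove it.  This is only safe because nothing else
 * traverses the list, so no other code can mistake the bare list_head
 * for a real element.
 */
static void scan_with_pauses(struct list_head *head, spinlock_t *lock,
			     void (*process)(struct exiting_task *))
{
	struct exiting_task *cur, *next;
	struct list_head place_holder;

	spin_lock_irq(lock);
	list_for_each_entry_safe(cur, next, head, node) {
		process(cur);

		/* Hold our place, then briefly let interrupts back in. */
		list_add(&place_holder, &cur->node);
		spin_unlock_irq(lock);
		cond_resched();
		spin_lock_irq(lock);

		/* Resume after the place-holder and discard it. */
		next = list_entry(place_holder.next, struct exiting_task, node);
		list_del(&place_holder);
	}
	spin_unlock_irq(lock);
}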
[ paulmck: Apply Frederic Weisbecker feedback. ]
Link: https://lore.kernel.org/all/ZdeI_-RfdLR8jlsm@localhost.localdomain/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/rcu/tasks.h | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 4dc355b2ac229..7e05edf144225 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -971,13 +971,33 @@ static void rcu_tasks_postscan(struct list_head *hop)
 	 */
 
 	for_each_possible_cpu(cpu) {
+		unsigned long j = jiffies + 1;
 		struct rcu_tasks_percpu *rtpcp = per_cpu_ptr(rcu_tasks.rtpcpu, cpu);
 		struct task_struct *t;
+		struct task_struct *t1;
+		struct list_head tmp;
 
 		raw_spin_lock_irq_rcu_node(rtpcp);
-		list_for_each_entry(t, &rtpcp->rtp_exit_list, rcu_tasks_exit_list)
+		list_for_each_entry_safe(t, t1, &rtpcp->rtp_exit_list, rcu_tasks_exit_list) {
 			if (list_empty(&t->rcu_tasks_holdout_list))
 				rcu_tasks_pertask(t, hop);
+
+			// RT kernels need frequent pauses, otherwise
+			// pause at least once per pair of jiffies.
+			if (!IS_ENABLED(CONFIG_PREEMPT_RT) && time_before(jiffies, j))
+				continue;
+
+			// Keep our place in the list while pausing.
+			// Nothing else traverses this list, so adding a
+			// bare list_head is OK.
+			list_add(&tmp, &t->rcu_tasks_exit_list);
+			raw_spin_unlock_irq_rcu_node(rtpcp);
+			cond_resched(); // For CONFIG_PREEMPT=n kernels
+			raw_spin_lock_irq_rcu_node(rtpcp);
+			t1 = list_entry(tmp.next, struct task_struct, rcu_tasks_exit_list);
+			list_del(&tmp);
+			j = jiffies + 1;
+		}
 		raw_spin_unlock_irq_rcu_node(rtpcp);
 	}
From: Roman Smirnov <r.smirnov@omp.ru>
[ Upstream commit 93f52fbeaf4b676b21acfe42a5152620e6770d02 ]
The expression dst->nr_samples + src->nr_samples can overflow and wrap around to zero. A check is needed to avoid the resulting division by zero.
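A userspace sketch (uint32_t stands in for the kernel's counter type) of why the new comparison covers both the empty-source case and wraparound: with unsigned arithmetic the sum is <= dst exactly when src is zero or the addition overflowed, so bailing out guarantees the divisor used when re-computing the mean is non-zero.

#include <stdint.h>
#include <stdio.h>

static int merge_would_be_bogus(uint32_t dst_samples, uint32_t src_samples)
{
	uint32_t sum = dst_samples + src_samples;	/* may wrap around */

	return sum <= dst_samples;
}

int main(void)
{
	printf("%d\n", merge_would_be_bogus(10, 0));		/* 1: nothing to merge */
	printf("%d\n", merge_would_be_bogus(10, 5));		/* 0: safe to merge */
	printf("%d\n", merge_would_be_bogus(UINT32_MAX, 2));	/* 1: sum wrapped to 1 */
	return 0;
}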
Found by Linux Verification Center (linuxtesting.org) with Svace.
Signed-off-by: Roman Smirnov <r.smirnov@omp.ru>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/20240305134509.23108-1-r.smirnov@omp.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 block/blk-stat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-stat.c b/block/blk-stat.c
index 7ff76ae6c76a9..e42c263e53fb9 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -27,7 +27,7 @@ void blk_rq_stat_init(struct blk_rq_stat *stat)
 /* src is a per-cpu stat, mean isn't initialized */
 void blk_rq_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
 {
-	if (!src->nr_samples)
+	if (dst->nr_samples + src->nr_samples <= dst->nr_samples)
 		return;
dst->min = min(dst->min, src->min);
From: Baolin Wang <baolin.wang@linux.alibaba.com>
[ Upstream commit 8b3d838139bcd1e552f1899191f734264ce2a1a5 ]
We hit a kernel crash while running stress-ng tests: the system crashes when printing the dentry name in dump_mapping().
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
pc : dentry_name+0xd8/0x224
lr : pointer+0x22c/0x370
sp : ffff800025f134c0
......
Call trace:
 dentry_name+0xd8/0x224
 pointer+0x22c/0x370
 vsnprintf+0x1ec/0x730
 vscnprintf+0x2c/0x60
 vprintk_store+0x70/0x234
 vprintk_emit+0xe0/0x24c
 vprintk_default+0x3c/0x44
 vprintk_func+0x84/0x2d0
 printk+0x64/0x88
 __dump_page+0x52c/0x530
 dump_page+0x14/0x20
 set_migratetype_isolate+0x110/0x224
 start_isolate_page_range+0xc4/0x20c
 offline_pages+0x124/0x474
 memory_block_offline+0x44/0xf4
 memory_subsys_offline+0x3c/0x70
 device_offline+0xf0/0x120
......
The root cause is that one thread is doing page migration, which uses the target page's ->mapping field to save the 'anon_vma' pointer between page unmap and page move; at this point the target page is locked and its refcount is 1.
Meanwhile, another stress-ng thread performing memory hotplug attempts to offline the target page that is being migrated. It finds that the refcount of the target page is 1, which prevents the offline operation, and therefore proceeds to dump the page. However, page_mapping() of the target page may return an incorrect file mapping and crash the system in dump_mapping(), since the target page->mapping only holds the 'anon_vma' pointer without the PAGE_MAPPING_ANON flag set.
The page migration issue has been fixed by commit d1adb25df711 ("mm: migrate: fix getting incorrect page mapping during page migration"). In addition, Matthew suggested that we should also improve dump_mapping()'s robustness so that it is resilient against such kernel crashes [1].
With the 'dentry.d_parent' and 'dentry.d_name.name' pointers used by dentry_name() checked, dump_mapping() outputs the invalid dentry instead of crashing the system when this issue is reproduced again.
[12211.189128] page:fffff7de047741c0 refcount:1 mapcount:0 mapping:ffff989117f55ea0 index:0x1 pfn:0x211dd07
[12211.189144] aops:0x0 ino:1 invalid dentry:74786574206e6870
[12211.189148] flags: 0x57ffffc0000001(locked|node=1|zone=2|lastcpupid=0x1fffff)
[12211.189150] page_type: 0xffffffff()
[12211.189153] raw: 0057ffffc0000001 0000000000000000 dead000000000122 ffff989117f55ea0
[12211.189154] raw: 0000000000000001 0000000000000001 00000001ffffffff 0000000000000000
[12211.189155] page dumped because: unmovable page
[1] https://lore.kernel.org/all/ZXxn%2F0oixJxxAnpF@casper.infradead.org/
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/937ab1f87328516821d39be672b6bc18861d9d3e.170539142...
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/inode.c b/fs/inode.c
index 91048c4c9c9e7..6d0d542303638 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -588,7 +588,8 @@ void dump_mapping(const struct address_space *mapping)
 	}
 
 	dentry_ptr = container_of(dentry_first, struct dentry, d_u.d_alias);
-	if (get_kernel_nofault(dentry, dentry_ptr)) {
+	if (get_kernel_nofault(dentry, dentry_ptr) ||
+	    !dentry.d_parent || !dentry.d_name.name) {
 		pr_warn("aops:%ps ino:%lx invalid dentry:%px\n",
 			a_ops, ino, dentry_ptr);
 		return;
From: Zqiang <qiang.zhang1211@gmail.com>
[ Upstream commit dda98810b552fc6bf650f4270edeebdc2f28bd3f ]
For the kernels built with CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y and CONFIG_RCU_LAZY=y, the following scenarios will trigger WARN_ON_ONCE() in the rcu_nocb_bypass_lock() and rcu_nocb_wait_contended() functions:
CPU2                                              CPU11
kthread
rcu_nocb_cb_kthread                               ksys_write
rcu_do_batch                                      vfs_write
rcu_torture_timer_cb                              proc_sys_write
__kmem_cache_free                                 proc_sys_call_handler
kmemleak_free                                     drop_caches_sysctl_handler
delete_object_full                                drop_slab
__delete_object                                   shrink_slab
put_object                                        lazy_rcu_shrink_scan
call_rcu                                          rcu_nocb_flush_bypass
__call_rcu_commn                                    rcu_nocb_bypass_lock
                                                    raw_spin_trylock(&rdp->nocb_bypass_lock) fail
                                                    atomic_inc(&rdp->nocb_lock_contended);
rcu_nocb_wait_contended                             WARN_ON_ONCE(smp_processor_id() != rdp->cpu);
WARN_ON_ONCE(atomic_read(&rdp->nocb_lock_contended))
                        |
                        |_ _ _ _ _ _ _ _ _ _same rdp and rdp->cpu != 11_ _ _ _ _ _ _ _ _ __|
Reproduce this bug with "echo 3 > /proc/sys/vm/drop_caches".
This commit therefore uses rcu_nocb_try_flush_bypass() instead of rcu_nocb_flush_bypass() in lazy_rcu_shrink_scan(). If the nocb_bypass queue is being flushed, then rcu_nocb_try_flush_bypass will return directly.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/rcu/tree_nocb.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 4efbf7333d4e1..d430b4656f59e 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1383,7 +1383,7 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 			rcu_nocb_unlock_irqrestore(rdp, flags);
 			continue;
 		}
-		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false));
+		rcu_nocb_try_flush_bypass(rdp, jiffies);
 		rcu_nocb_unlock_irqrestore(rdp, flags);
 		wake_nocb_gp(rdp, false);
 		sc->nr_to_scan -= _count;
From: Keith Busch <kbusch@kernel.org>
[ Upstream commit 7e80eb792bd7377a20f204943ac31c77d859be89 ]
The memory allocated for the identification is freed on failure. Set it to NULL so the caller doesn't have a pointer to that freed address.
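A userspace sketch of the pattern (fetch_id(), device_command() and id_blob are made-up names): on failure, free the buffer and also clear the caller's out-pointer so the stale address cannot be used or double-freed.

#include <stdlib.h>

struct id_blob { char data[4096]; };	/* stands in for the identification data */

static int device_command(struct id_blob *out)
{
	(void)out;
	return -1;	/* pretend the identify command failed */
}

static int fetch_id(struct id_blob **id)
{
	int error;

	*id = malloc(sizeof(**id));
	if (!*id)
		return -1;

	error = device_command(*id);
	if (error) {
		free(*id);
		*id = NULL;	/* don't hand the caller a dangling pointer */
	}
	return error;
}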
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/nvme/host/core.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0a96362912ced..39ee3036bd516 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1398,8 +1398,10 @@ static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
 
 	error = nvme_submit_sync_cmd(dev->admin_q, &c, *id,
 			sizeof(struct nvme_id_ctrl));
-	if (error)
+	if (error) {
 		kfree(*id);
+		*id = NULL;
+	}
 	return error;
 }
 
@@ -1528,6 +1530,7 @@ int nvme_identify_ns(struct nvme_ctrl *ctrl, unsigned nsid,
 	if (error) {
 		dev_warn(ctrl->device, "Identify namespace failed (%d)\n", error);
 		kfree(*id);
+		*id = NULL;
 	}
 	return error;
 }