Re: [PATCH] mm: mglru: Fix soft lockup attributed to scanning folios

7 Mar 2024

On Thu,  7 Mar 2024 11:19:52 +0800 Yafang Shao laoar.shao@gmail.com wrote:
...
After we enabled mglru on our 384C1536GB production servers, we
encountered frequent soft lockups attributed to scanning folios.
The soft lockup as follows,
...
There were a total of 22 tasks waiting for this spinlock
(RDI: ffff99d2b6ff9050):
crash> foreach RU bt | grep -B 8  queued_spin_lock_slowpath |  grep "RDI: ffff99d2b6ff9050" | wc -l
 22
If we're holding the lock for this long then there's a possibility of
getting hit by the NMI watchdog also.
...
Additionally, two other threads were also engaged in scanning folios, one
with 19 waiters and the other with 15 waiters.
To address this issue under heavy reclaim conditions, we introduced a
hotfix version of the fix, incorporating cond_resched() in scan_folios().
Following the application of this hotfix to our servers, the soft lockup
issue ceased.
...

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4367,6 +4367,10 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 
   		if (!--remaining || max(isolated, skipped_zone) >= MIN_LRU_BATCH)
   			break;


	spin_unlock_irq(&lruvec->lru_lock);


	cond_resched();


	spin_lock_irq(&lruvec->lru_lock);

}

Presumably wrapping this with `if (need_resched())' will save some work.
This lock is held for a reason.  I'd like to see an analysis of why
this change is safe.

    

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH] mm: mglru: Fix soft lockup attributed to scanning folios