On Thu, 7 Mar 2024 11:19:52 +0800 Yafang Shao laoar.shao@gmail.com wrote:
After we enabled mglru on our 384C1536GB production servers, we encountered frequent soft lockups attributed to scanning folios.
The soft lockup as follows,
...
There were a total of 22 tasks waiting for this spinlock (RDI: ffff99d2b6ff9050):
crash> foreach RU bt | grep -B 8 queued_spin_lock_slowpath | grep "RDI: ffff99d2b6ff9050" | wc -l 22
If we're holding the lock for this long then there's a possibility of getting hit by the NMI watchdog also.
Additionally, two other threads were also engaged in scanning folios, one with 19 waiters and the other with 15 waiters.
To address this issue under heavy reclaim conditions, we introduced a hotfix version of the fix, incorporating cond_resched() in scan_folios(). Following the application of this hotfix to our servers, the soft lockup issue ceased.
...
--- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -4367,6 +4367,10 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc, if (!--remaining || max(isolated, skipped_zone) >= MIN_LRU_BATCH) break;
spin_unlock_irq(&lruvec->lru_lock);
cond_resched();
}spin_lock_irq(&lruvec->lru_lock);
Presumably wrapping this with `if (need_resched())' will save some work.
This lock is held for a reason. I'd like to see an analysis of why this change is safe.