Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation

9 Oct 2025

On Thu, Oct 9, 2025 at 5:10 AM Chris Li <chrisl@kernel.org> wrote:
...
Hi Kairui,
First of all, your title is a bit misleading:
"do not perform synchronous discard during allocation"
You still do the synchronous discard, just limited to order 0 failing.
Also your commit did not describe the behavior change of this patch.
The behavior change is that it now prefers to allocate from the
fragment list before waiting for the discard, which I feel is not
justified.
After reading your patch, I feel that you still do the synchronous
discard, just now you do it with less lock held.
I suggest you just fix the lock held issue without changing the
discard ordering behavior.
On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
...
From: Kairui Song <kasong@tencent.com>
Since commit 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation
fast path"), swap allocation is protected by a local lock, which means
we can't do any sleeping calls during allocation.
However, the discard routine is not well taken care of. When the swap
allocator failed to find any usable cluster, it would look at the
pending discard cluster and try to issue some blocking discards. It may
not necessarily sleep, but the cond_resched at the bio layer indicates
this is wrong when combined with a local lock. And the bio GFP flag used
for discard bio is also wrong (not atomic).
If the lock is the issue, let's fix the lock issue.
...
It's arguable whether this synchronous discard is helpful at all. In
most cases, the async discard is good enough. And the swap allocator
organizes the clusters very differently since the recent change, so it
is very rare to see discard clusters piling up.
Very rare does not mean this never happens. If you have a cluster on
the discarding queue, I think it is better to wait for the discard to
complete before using the fragmented list, to reduce the
fragmentation. So it seems the real issue is holding a lock while
doing the block discard?
...
So far, no issues have been observed or reported with typical SSD setups
under months of high pressure. This issue was found during my code
review. But by hacking the kernel a bit: adding an mdelay(100) in the
async discard path, this issue will be observable with WARNING triggered
by the wrong GFP and cond_resched in the bio layer.
I think that makes an assumption about how slow the SSD discard is. Some
SSDs can be really slow. We want our kernel to work for those slow
discard SSD cases as well.
...
So let's fix this issue in a safe way: remove the synchronous discard in
the swap allocation path. And when order 0 is failing with all cluster
list drained on all swap devices, try to do a discard following the swap
I don't feel that changing the discard behavior is justified here, the
real fix is discarding with less lock held. Am I missing something?
If I understand correctly, we should be able to keep the current
discard ordering behavior, discard before the fragment list. But with
less lock held as your current patch does.
I suggest the allocation here detects there is a discard pending and
running out of free blocks. Return there and indicate the need to
discard. The caller performs the discard without holding the lock,
similar to what you do with the order == 0 case.
Thanks for the suggestion. Right, that sounds even better. My initial
thought was that maybe we can just remove this discard completely since
it rarely helps, and if the SSD is really that slow, OOM under heavy
pressure might even be an acceptable behaviour. But to make it safer,
I made it do discard only when order 0 is failing so the code is
simpler.
Let me send a V2 to handle the discard carefully to reduce potential impact.
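
The flow suggested above (the allocator reports a pending discard instead of
issuing it, and the caller runs the discard with no lock held, then retries)
can be sketched as a small userspace model. This is only an illustration
under stated assumptions: every name below (model_dev, model_try_alloc,
model_do_discard, model_alloc, ALLOC_NEED_DISCARD) is hypothetical and is not
taken from mm/swapfile.c, and a plain flag stands in for the local lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum alloc_result { ALLOC_OK, ALLOC_FAIL, ALLOC_NEED_DISCARD };

struct model_dev {
	int free_clusters;	/* clusters ready for allocation */
	int pending_discard;	/* clusters queued for discard */
};

/* Single-threaded stand-in for the local lock taken during allocation. */
static bool lock_held;

static void model_lock(void)   { assert(!lock_held); lock_held = true; }
static void model_unlock(void) { assert(lock_held);  lock_held = false; }

/* Blocking work in the real system; refuses to run with the lock held. */
static int model_do_discard(struct model_dev *dev)
{
	if (lock_held)
		return -1;
	dev->free_clusters += dev->pending_discard;
	dev->pending_discard = 0;
	return 0;
}

/* Non-blocking allocation attempt; the caller must hold the lock. */
static enum alloc_result model_try_alloc(struct model_dev *dev)
{
	assert(lock_held);
	if (dev->free_clusters > 0) {
		dev->free_clusters--;
		return ALLOC_OK;
	}
	/* Out of free clusters: report a pending discard, do not issue it. */
	return dev->pending_discard ? ALLOC_NEED_DISCARD : ALLOC_FAIL;
}

/* Caller: allocate under the lock, discard outside it, then retry. */
static enum alloc_result model_alloc(struct model_dev *dev)
{
	enum alloc_result res;

	for (;;) {
		model_lock();
		res = model_try_alloc(dev);
		model_unlock();

		if (res != ALLOC_NEED_DISCARD)
			return res;
		if (model_do_discard(dev) < 0)
			return ALLOC_FAIL;
	}
}
```

The loop terminates because a successful discard empties the pending count,
so the next attempt returns either ALLOC_OK or ALLOC_FAIL. In this model the
discard is still tried before any fragment-list fallback, which is the
ordering the review asks to preserve.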
...
...
device priority list. If any discards released some cluster, try the
allocation again. This way, we can still avoid OOM due to swap failure
if the hardware is very slow and memory pressure is extremely high.
Cc: stable@vger.kernel.org
Fixes: 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path")
Signed-off-by: Kairui Song <kasong@tencent.com>

mm/swapfile.c | 40 +++++++++++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index cb2392ed8e0e..0d1924f6f495 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
                        goto done;
        }

-       /*
-        * We don't have free cluster but have some clusters in discarding,
-        * do discard now and reclaim them.
-        */
-       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
-               goto new_cluster;
-

Assume you follow my suggestion.
Change this to some function to detect if there is a pending discard
on this device. Return to the caller indicating that you need a
discard for this device that has a pending discard.
Checking `!list_empty(&si->discard_clusters)` should be good enough.
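
A minimal userspace sketch of such a pending-discard check follows. It
assumes a simplified stand-in for the kernel's struct list_head, and the
names model_swap_info and model_has_pending_discard are hypothetical. It
leaves locking out entirely, so it says nothing about which lock a real
check would need to hold.

```c
#include <stdbool.h>

/* Minimal userspace stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* An empty list is one whose head points back at itself. */
static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Hypothetical, trimmed-down model of a swap device. */
struct model_swap_info {
	struct list_head discard_clusters;	/* clusters waiting for discard */
};

/* True when at least one cluster is queued for discard on this device. */
static bool model_has_pending_discard(const struct model_swap_info *si)
{
	return !list_empty(&si->discard_clusters);
}
```

The allocator could return such a result to its caller, which would then
issue the discard outside the locked section.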

    
