Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

17 Dec 2023

On Thu, Dec 14, 2023 at 4:51 PM Yu Zhao yuzhao@google.com wrote:
...
On Thu, Dec 14, 2023 at 11:38 AM Kairui Song ryncsn@gmail.com wrote:
...
On Thu, Dec 14, 2023 at 11:09, Yu Zhao yuzhao@google.com wrote:
...
On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
...
On Tue, Dec 12, 2023 at 8:03 PM Kairui Song ryncsn@gmail.com wrote:
...
On Tue, Dec 12, 2023 at 14:52, Kairui Song ryncsn@gmail.com wrote:
...
On Tue, Dec 12, 2023 at 06:07, Yu Zhao yuzhao@google.com wrote:
>
> On Fri, Dec 8, 2023 at 1:24 AM Kairui Song ryncsn@gmail.com wrote:
> >
> > On Fri, Dec 8, 2023 at 14:14, Yu Zhao yuzhao@google.com wrote:
> > >
> > > Unmapped folios accessed through file descriptors can be
> > > underprotected. Those folios are added to the oldest generation based
> > > on:
> > > 1. The fact that they are less costly to reclaim (no need to walk the
> > >    rmap and flush the TLB) and have less impact on performance (don't
> > >    cause major PFs and can be non-blocking if needed again).
> > > 2. The observation that they are likely to be single-use. E.g., for
> > >    client use cases like Android, its apps parse configuration files
> > >    and store the data in heap (anon); for server use cases like MySQL,
> > >    it reads from InnoDB files and holds the cached data for tables in
> > >    buffer pools (anon).
> > >
> > > However, the oldest generation can be very short lived, and if so, it
> > > doesn't provide the PID controller with enough time to respond to a
> > > surge of refaults. (Note that the PID controller uses weighted
> > > refaults and those from evicted generations only take a half of the
> > > whole weight.) In other words, for a short lived generation, the
> > > moving average smooths out the spike quickly.
> > >
> > > To fix the problem:
> > > 1. For folios that are already on LRU, if they can be beyond the
> > >    tracking range of tiers, i.e., five accesses through file
> > >    descriptors, move them to the second oldest generation to give them
> > >    more time to age. (Note that tiers are used by the PID controller
> > >    to statistically determine whether folios accessed multiple times
> > >    through file descriptors are worth protecting.)
> > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > >    that they are not too close to the tail. The effect of this is
> > >    similar to the above.
> > >
> > > On Android, launching 55 apps sequentially:
> > >                            Before     After      Change
> > >   workingset_refault_anon  25641024   25598972   0%
> > >   workingset_refault_file  115016834  106178438  -8%
> >
> > Hi Yu,
> >
> > Thank you for your amazing work on MGLRU.
> >
> > I believe this is similar to the issue I was trying to resolve previously:
> > https://lwn.net/Articles/945266/
> > The idea is to use refault distance to decide whether a page should be
> > placed in the oldest generation or some other generation. In my tests
> > this worked very well, and we have been using refault distance for
> > MGLRU in multiple workloads.
> >
> > There are a few issues left in my previous RFC series, e.g., anon pages
> > in MGLRU shouldn't be considered. I wanted to collect feedback or test
> > cases, but unfortunately the series didn't get much attention
> > upstream.
> >
> > I think both this patch and my previous series aim to solve the
> > underprotected file page issue. I did a quick test using this
> > series, and for the MongoDB test, refault distance still seems to be
> > the better solution (I'm not saying these two optimizations are
> > mutually exclusive, just that they conflict somewhat in implementation
> > and solve a similar problem):
> >
> > Previous result:
> > ==================================================================
> > Execution Results after 905 seconds
> > ------------------------------------------------------------------
> >                   Executed        Time (µs)       Rate
> >   STOCK_LEVEL     2542            27121571486.2   0.09 txn/s
> > ------------------------------------------------------------------
> >   TOTAL           2542            27121571486.2   0.09 txn/s
> >
> > This patch:
> > ==================================================================
> > Execution Results after 900 seconds
> > ------------------------------------------------------------------
> >                   Executed        Time (µs)       Rate
> >   STOCK_LEVEL     1594            27061522574.4   0.06 txn/s
> > ------------------------------------------------------------------
> >   TOTAL           1594            27061522574.4   0.06 txn/s
> >
> > The unpatched version is always around 500.
>
> Thanks for the test results!
>
> > I think there are a few points here:
> > - Refault distance makes use of page shadows, so it can better
> > distinguish evicted pages with different access patterns (re-access
> > distance).
> > - Throttled refault distance can help hold part of the workingset when
> > memory is too small to hold the whole workingset.
> >
> > So maybe part of this patch and bits of the previous series can be
> > combined to work better on this issue. What do you think?
>
> I'll try to find some time this week to look at your RFC. It'd be a
Hi Yu,
I'm working on V4 of the RFC now. It only updates some comments and
skips anon page re-activation in the refault path for MGLRU, which was
not very helpful; these are only tiny adjustments.
I also found it easier to test with fio, using the following test script:
#!/bin/bash
swapoff -a
modprobe brd rd_nr=1 rd_size=16777216
mkfs.ext4 /dev/ram0
mount /dev/ram0 /mnt
mkdir -p /sys/fs/cgroup/benchmark
cd /sys/fs/cgroup/benchmark
echo 4G > memory.max
echo $$ > cgroup.procs
echo 3 > /proc/sys/vm/drop_caches
fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
          --buffered=1 --ioengine=io_uring --iodepth=128 \
          --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
          --rw=randread --random_distribution=zipf:0.5 --norandommap \
          --time_based --ramp_time=5m --runtime=5m --group_reporting
zipf:0.5 is used here to simulate a cached read with slight bias
towards certain pages.
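
To get a feel for how slight that bias is, here is a rough sketch under an idealized Zipf model, where the k-th most popular page is read with probability proportional to 1/k^theta. It is an illustration only and does not reproduce fio's actual generator. The helper name `zipf_top_share` and the 4 KiB page size are assumptions, not something from the thread.

```bash
# Idealized Zipf model (an assumption, not fio's generator): the k-th most
# popular page is read with probability proportional to 1 / k^theta.
# zipf_top_share is a hypothetical helper name.
zipf_top_share() {
    local pages=$1 theta=$2 top_fraction=$3
    awk -v n="$pages" -v theta="$theta" -v frac="$top_fraction" 'BEGIN {
        top = int(n * frac)
        for (k = 1; k <= n; k++) {
            w = 1 / (k ^ theta)
            total += w
            if (k <= top)
                hot += w
        }
        # Percentage of all reads that land on the hottest pages.
        printf "%.1f\n", 100 * hot / total
    }'
}

# One 1 GiB fio file split into 4 KiB pages (assumed) gives 262144 pages.
zipf_top_share 262144 0.5 0.10
# The same call with theta=0 models a uniform pattern for comparison.
zipf_top_share 262144 0 0.10
```

Under these assumptions the hottest 10% of pages receive a little under a third of the reads, compared with 10% for a uniform pattern, which fits the description of a slight bias.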
Unpatched 6.7-rc4:
Run status group 0 (all jobs):
   READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
(6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
Patched with RFC v4:
Run status group 0 (all jobs):
   READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
(7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
Patched with this series:
Run status group 0 (all jobs):
   READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
(7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
MGLRU off:
Run status group 0 (all jobs):
   READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
(6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
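
For easier comparison, the zipf bandwidth figures above can be expressed relative to the unpatched 6.7-rc4 kernel. This is a small sketch; `pct_change` is a hypothetical helper name, and the inputs are the MiB/s values reported above.

```bash
# pct_change is a hypothetical helper: percent change of $2 relative to $1.
pct_change() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%+.1f%%\n", (new - base) / base * 100 }'
}

base=6548   # Unpatched 6.7-rc4, MiB/s
echo "RFC v4:      $(pct_change "$base" 7270)"   # +11.0%
echo "This series: $(pct_change "$base" 7098)"   # +8.4%
echo "MGLRU off:   $(pct_change "$base" 6525)"   # -0.4%
```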

If I change zipf:0.5 to random:

Unpatched 6.7-rc4:
Run status group 0 (all jobs):
   READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
(6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
Patched with RFC v4:
Run status group 0 (all jobs):
   READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
(6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
Patched with this series:
Run status group 0 (all jobs):
   READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
(6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
MGLRU off:
Run status group 0 (all jobs):
   READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
(5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
fio uses a ramdisk, so LRU accuracy has a smaller impact. The MongoDB
test I provided before uses a SATA SSD, so the impact is much higher.
I'll provide a script to set up the test case and run it. It's more
complex to set up than fio since it involves setting up multiple
replicas, auth, and hundreds of GB of test fixtures. I'm currently
occupied by some other tasks, but I will try my best to send it out as
soon as possible.
Thanks! Apparently your RFC did show better IOPS with both access
patterns, which was a surprise to me because it had higher refaults
and usually higher refaults result in worse performance.
So I'm still trying to figure out why it turned out the opposite. My
current guess is that:

1. It had a very small but stable inactive LRU list, which was able to
   fit into the L3 cache entirely.
2. It counted few folios as workingset and therefore incurred less
   overhead from CONFIG_PSI and/or CONFIG_TASK_DELAY_ACCT.
Did you save workingset_refault_file when you ran the test? If so, can
you check the difference between this series and your RFC?
It seems I was right about #1 above. After I scaled your test up by 20x,
I saw my series performed ~5% faster with zipf and ~9% faster with random
accesses.
Hi Yu,
Thank you so much for testing and sharing this result.
I'm not sure about #1: the ramdisk size and the accessed data are far
larger than L3 (16M on my CPU) even in the down-scaled test, and both
random and zipf show similar results.
It's the LRU list, not the pages. IOW, the kernel data structure, not
the content in LRU pages. Does it make sense?
FYI. Willy just reminded me that he explained it a lot better than I
did: https://lore.kernel.org/linux-mm/ZTc7SHQ4RbPkD3eZ@casper.infradead.org/
...
...
...
IOW, I changed rd_size from 16GB to 320GB, memory.max from 4GB to 80GB,
--numjobs from 12 to 60 and --size from 1GB to 4GB.
Would you be able to try a larger configuration like above instead?
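
The arithmetic behind the 20x figure can be sketched as follows. The numbers are the ones quoted above. The variable and function names are illustrative, and the sketch assumes each fio job reads its own file of --size, so the total data set is numjobs times size. brd's rd_size parameter is given in KiB, which is why the original script passes 16777216 for a 16G ramdisk.

```bash
# All numbers come from the thread; names are illustrative. Assumes each
# fio job reads its own file of --size, so the data set is numjobs * size.
gib_to_kib() { echo $(( $1 * 1024 * 1024 )); }   # brd rd_size is in KiB

small_data=$(( 12 * 1 ))    # --numjobs=12, --size=1G
large_data=$(( 60 * 4 ))    # --numjobs=60, --size=4G

echo "data set:   ${small_data}G -> ${large_data}G ($(( large_data / small_data ))x)"
echo "memory.max: 4G -> 80G ($(( 80 / 4 ))x)"
echo "ramdisk:    16G -> 320G ($(( 320 / 16 ))x)"
echo "rd_size:    $(gib_to_kib 16) -> $(gib_to_kib 320) KiB"
```

Data set, memory limit, and ramdisk all grow by the same factor, so the ratio of data to memory stays at 3:1 in both configurations.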
