Re: [PATCH v3 0/7] mm: workingset reporting

26 Aug 2024

      On Tue, Aug 20, 2024 at 6:00 AM Gregory Price gourry@gourry.net wrote:
...
On Tue, Aug 13, 2024 at 09:56:11AM -0700, Yuanchu Xie wrote:
...
This patch series provides workingset reporting of user pages in
lruvecs, of which coldness can be tracked by accessed bits and fd
references. However, the concept of workingset applies generically to
all types of memory, which could be kernel slab caches, discardable
userspace caches (databases), or CXL.mem. Therefore, data sources might
come from slab shrinkers, device drivers, or the userspace. IMO, the
kernel should provide a set of workingset interfaces that should be
generic enough to accommodate the various use cases, and be extensible
to potential future use cases. The current proposed interfaces are not
sufficient in that regard, but I would like to start somewhere, solicit
feedback, and iterate.
... snip ...
...
Use cases
Promotion/Demotion
If different mechanisms are used for promition and demotion, workingset
information can help connect the two and avoid pages being migrated back
and forth.
For example, given a promotion hot page threshold defined in reaccess
distance of N seconds (promote pages accessed more often than every N
seconds). The threshold N should be set so that ~80% (e.g.) of pages on
the fast memory node passes the threshold. This calculation can be done
with workingset reports.
To be directly useful for promotion policies, the workingset report
interfaces need to be extended to report hotness and gather hotness
information from the devices[1].
[1]
https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-...
Sysfs and Cgroup Interfaces
The interfaces are detailed in the patches that introduce them. The main
idea here is we break down the workingset per-node per-memcg into time
intervals (ms), e.g.
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
I realize this does not generalize well to hotness information, but I
lack the intuition for an abstraction that presents hotness in a useful
way. Based on a recent proposal for move_phys_pages[2], it seems like
userspace tiering software would like to move specific physical pages,
instead of informing the kernel "move x number of hot pages to y
device". Please advise.
[2]
https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge....
Just as a note on this work, this is really a testing interface.  The
end-goal is not to merge such an interface that is user-facing like
move_phys_pages, but instead to have something like a triggered kernel
task that has a directive of "Promote X pages from Device A".
This work is more of an open collaboration for prototyping such that we
don't have to plumb it through the kernel from the start and assess the
usefulness of the hardware hotness collection mechanism.
Understood. I think we previously had this exchange and I forgot to
remove the mentions from the cover letter.
...

More generally on promotion, I have been considering recently a problem
with promoting unmapped pagecache pages - since they are not subject to
NUMA hint faults.  I started looking at PG_accessed and PG_workingset as
a potential mechanism to trigger promotion - but i'm starting to see a
pattern of competing priorities between reclaim (LRU/MGLRU) logic and
promotion logic.
In this case, IMO hardware support would be good as it could provide
the kernel with exactly what pages are hot, and it would not care
whether a page is mapped or not. I recall there being some CXL
proposal on this, but I'm not sure whether it has settled into a
standard yet.
...
Reclaim is triggered largely under memory pressure - which means co-opting
reclaim logic for promotion is at best logically confusing, and at worst
likely to introduce regressions.  The LRU/MGLRU logic is written largely
for reclaim, not promotion.  This makes hacking promotion in after the
fact rather dubious - the design choices don't match.
One example: if a page moves from inactive->active (or old->young), we
could treat this as a page "becoming hot" and mark it for promotion, but
this potentially punishes pages on the "active/younger" lists which are
themselves hotter.
To avoid punishing pages on the "young" list, one could insert the
page into a "less young" generation, but it would be difficult to have
a fixed policy for this in the kernel, so it may be best for this to
be configurable via BPF. One could insert the page in the middle of
the active/inactive list, but that would in effect create multiple
generations.
...
I'm starting to think separate demotion/reclaim and promotion components
are warranted. This could take the form of a separate kernel worker that
occasionally gets scheduled to manage a promotion list, or even the
addition of a PG_promote flag to decouple reclaim and promotion logic
completely.  Separating the structures entirely would be good to allow
both demotion/reclaim and promotion to occur concurrently (although this
seems problematic under memory pressure).
Would like to know your thoughts here.  If we can decide to segregate
promotion and demotion logic, it might go a long way to simplify the
existing interfaces and formalize transactions between the two.
The two systems still have to interact, so separating the two would
essentially create a new policy that decides whether the
demotion/reclaim or the promotion policy is in effect. If promotion
could figure out where to insert the page in terms of generations,
wouldn't that be simpler?
...
(also if you're going to LPC, might be worth a chat in person)
I cannot make it to LPC. :( Sadness
Yuanchu

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH v3 0/7] mm: workingset reporting

Use cases

Sysfs and Cgroup Interfaces