Changes from PATCH v1 -> v2:
- Updated selftest to use ksft_test_result_code instead of switch-case
(Muhammad Usama Anjum)
- Included more use cases in the cover letter
(Huang, Ying)
- Added documentation for sysfs and memcg interfaces
- Added an aging-specific struct lru_gen_mm_walk in struct pglist_data
to avoid allocating for each lruvec.
Changes from RFC v3 -> PATCH v1:
- Updated selftest to use ksft_print_msg instead of fprintf(stderr, ...)
(Muhammad Usama Anjum)
- Included more detail in patch skipping pmd_young with force_scan
(Huang, Ying)
- Deferred reaccess histogram as a followup
- Removed per-memcg page age interval configs for simplicity
Changes from RFC v2 -> RFC v3:
- Update to v6.8
- Added an aging kernel thread (gated behind config)
- Added basic selftests for sysfs interface files
- Track swapped out pages for reaccesses
- Refactoring and cleanup
- Dropped the virtio-balloon extension to make things manageable
Changes from RFC v1 -> RFC v2:
- Refactored the patchs into smaller pieces
- Renamed interfaces and functions from wss to wsr (Working Set Reporting)
- Fixed build errors when CONFIG_WSR is not set
- Changed working_set_num_bins to u8 for virtio-balloon
- Added support for per-NUMA node reporting for virtio-balloon
[rfc v1]
https://lore.kernel.org/linux-mm/20230509185419.1088297-1-yuanchu@google.co…
[rfc v2]
https://lore.kernel.org/linux-mm/20230621180454.973862-1-yuanchu@google.com/
[rfc v3]
https://lore.kernel.org/linux-mm/20240327213108.2384666-1-yuanchu@google.co…
This patch series provides workingset reporting of user pages in
lruvecs, of which coldness can be tracked by accessed bits and fd
references. However, the concept of workingset applies generically to
all types of memory, which could be kernel slab caches, discardable
userspace caches (databases), or CXL.mem. Therefore, data sources might
come from slab shrinkers, device drivers, or the userspace. IMO, the
kernel should provide a set of workingset interfaces that should be
generic enough to accommodate the various use cases, and be extensible
to potential future use cases. The current proposed interfaces are not
sufficient in that regard, but I would like to start somewhere, solicit
feedback, and iterate.
Use cases
==========
Job scheduling
On overcommitted hosts, workingset information allows the job scheduler
to right-size each job and land more jobs on the same host or NUMA node,
and in the case of a job with increasing workingset, policy decisions
can be made to migrate other jobs off the host/NUMA node, or oom-kill
the misbehaving job. If the job shape is very different from the machine
shape, knowing the workingset per-node can also help inform page
allocation policies.
Proactive reclaim
Workingset information allows the a container manager to proactively
reclaim memory while not impacting a job's performance. While PSI may
provide a reactive measure of when a proactive reclaim has reclaimed too
much, workingset reporting allows the policy to be more accurate and
flexible.
Ballooning (similar to proactive reclaim)
While this patch series does not extend the virtio-balloon device,
balloon policies benefit from workingset to more precisely determine
the size of the memory balloon. On desktops/laptops/mobile devices where
memory is scarce and overcommitted, the balloon sizing in multiple VMs
running on the same device can be orchestrated with workingset reports
from each one.
Promotion/Demotion
If different mechanisms are used for promition and demotion, workingset
information can help connect the two and avoid pages being migrated back
and forth.
For example, given a promotion hot page threshold defined in reaccess
distance of N seconds (promote pages accessed more often than every N
seconds). The threshold N should be set so that ~80% (e.g.) of pages on
the fast memory node passes the threshold. This calculation can be done
with workingset reports.
To be directly useful for promotion policies, the workingset report
interfaces need to be extended to report hotness and gather hotness
information from the devices[1].
[1]
https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements…
Sysfs and Cgroup Interfaces
==========
The interfaces are detailed in the patches that introduce them. The main
idea here is we break down the workingset per-node per-memcg into time
intervals (ms), e.g.
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
I realize this does not generalize well to hotness information, but I
lack the intuition for an abstraction that presents hotness in a useful
way. Based on a recent proposal for move_phys_pages[2], it seems like
userspace tiering software would like to move specific physical pages,
instead of informing the kernel "move x number of hot pages to y
device". Please advise.
[2]
https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge…
Implementation
==========
Currently, the reporting of user pages is based off of MGLRU, and
therefore requires CONFIG_LRU_GEN=y. We would benefit from more MGLRU
generations for a more fine-grained workingset report. I will make the
generation count configurable in the next version. The workingset
reporting mechanism is gated behind CONFIG_WORKINGSET_REPORT, and the
aging thread is behind CONFIG_WORKINGSET_REPORT_AGING.
Yuanchu Xie (8):
mm: multi-gen LRU: ignore non-leaf pmd_young for force_scan=true
mm: aggregate working set information into histograms
mm: use refresh interval to rate-limit workingset report aggregation
mm: report workingset during memory pressure driven scanning
mm: extend working set reporting to memcgs
mm: add kernel aging thread for workingset reporting
selftest: test system-wide workingset reporting
Docs/admin-guide/mm/workingset_report: document sysfs and memcg
interfaces
Documentation/admin-guide/mm/index.rst | 1 +
.../admin-guide/mm/workingset_report.rst | 105 ++++
drivers/base/node.c | 6 +
include/linux/memcontrol.h | 5 +
include/linux/mmzone.h | 9 +
include/linux/workingset_report.h | 97 +++
mm/Kconfig | 15 +
mm/Makefile | 2 +
mm/internal.h | 18 +
mm/memcontrol.c | 184 +++++-
mm/mm_init.c | 2 +
mm/mmzone.c | 2 +
mm/vmscan.c | 58 +-
mm/workingset_report.c | 561 ++++++++++++++++++
mm/workingset_report_aging.c | 127 ++++
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 3 +
tools/testing/selftests/mm/run_vmtests.sh | 5 +
.../testing/selftests/mm/workingset_report.c | 306 ++++++++++
.../testing/selftests/mm/workingset_report.h | 39 ++
.../selftests/mm/workingset_report_test.c | 329 ++++++++++
21 files changed, 1869 insertions(+), 6 deletions(-)
create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
create mode 100644 include/linux/workingset_report.h
create mode 100644 mm/workingset_report.c
create mode 100644 mm/workingset_report_aging.c
create mode 100644 tools/testing/selftests/mm/workingset_report.c
create mode 100644 tools/testing/selftests/mm/workingset_report.h
create mode 100644 tools/testing/selftests/mm/workingset_report_test.c
--
2.45.1.467.gbab1589fc0-goog
The variable are never referenced in the code, just remove it
that this problem was discovered by reading code
Signed-off-by: Zhu Jun <zhujun2(a)cmss.chinamobile.com>
---
tools/testing/selftests/dma/dma_map_benchmark.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
index 3fcea00961c0..c91b485ca99a 100644
--- a/tools/testing/selftests/dma/dma_map_benchmark.c
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -33,7 +33,6 @@ int main(int argc, char **argv)
int granule = 1;
int cmd = DMA_MAP_BENCHMARK;
- char *p;
while ((opt = getopt(argc, argv, "t:s:n:b:d:x:g:")) != -1) {
switch (opt) {
--
2.17.1
This variable is never referenced in the code, just remove them
that this problem was discovered by reading the code
Signed-off-by: Zhu Jun <zhujun2(a)cmss.chinamobile.com>
---
Changes in v2:
- modify commit info
tools/testing/selftests/breakpoints/step_after_suspend_test.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/tools/testing/selftests/breakpoints/step_after_suspend_test.c b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
index b8703c499d28..dfec31fb9b30 100644
--- a/tools/testing/selftests/breakpoints/step_after_suspend_test.c
+++ b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
@@ -130,7 +130,6 @@ int run_test(int cpu)
void suspend(void)
{
int power_state_fd;
- struct sigevent event = {};
int timerfd;
int err;
struct itimerspec spec = {};
--
2.17.1
Main function return value is int type, so add return
value in the end that this problem was discovered by reading the code
Signed-off-by: Zhu Jun <zhujun2(a)cmss.chinamobile.com>
---
Changes in v2:
- modify commit info
tools/testing/selftests/breakpoints/step_after_suspend_test.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/breakpoints/step_after_suspend_test.c b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
index dfec31fb9b30..b473131fce3e 100644
--- a/tools/testing/selftests/breakpoints/step_after_suspend_test.c
+++ b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
@@ -166,7 +166,7 @@ int main(int argc, char **argv)
bool succeeded = true;
unsigned int tests = 0;
cpu_set_t available_cpus;
- int err;
+ int err = 0;
int cpu;
ksft_print_header();
@@ -222,4 +222,6 @@ int main(int argc, char **argv)
ksft_exit_pass();
else
ksft_exit_fail();
+
+ return err;
}
--
2.17.1
From: Geliang Tang <tanggeliang(a)kylinos.cn>
Resend patch 1 out of "skip ENOTSUPP BPF selftests" set as Eduard
suggested. Together with two other cleanups.
Geliang Tang (3):
selftests/bpf: Null checks for links in bpf_tcp_ca
selftests/bpf: Check ASSERT_OK(err) in dummy_st_ops
selftests/bpf: Close obj in error paths in xdp_adjust_tail
.../selftests/bpf/prog_tests/bpf_tcp_ca.c | 21 +++++++++++++------
.../selftests/bpf/prog_tests/dummy_st_ops.c | 8 +++++--
.../bpf/prog_tests/xdp_adjust_tail.c | 2 +-
3 files changed, 22 insertions(+), 9 deletions(-)
--
2.43.0
On Thu, Jul 04, 2024 at 04:36:04PM +0200, Arnd Bergmann wrote:
> #define __ARCH_WANT_SYS_CLONE
> +#define __ARCH_WANT_NEW_STAT
>
> -#ifndef __COMPAT_SYSCALL_NR
> -#include <uapi/asm/unistd.h>
> -#endif
> +#include <asm/unistd_64.h>
It looks like this is causing widespread build breakage in kselftest in
-next for arm64, there are *many* errors in the form:
In file included from test_signals_utils.c:14:
/build/stage/build-work/usr/include/asm/unistd.h:2:10: fatal error: unistd_64.h: No such file or directory
2 | #include <unistd_64.h>
| ^~~~~~~~~~~~~
which obviously looks like it's tied to the above but I've not fully
understood the patch/series yet. Build log at:
https://builds.sirena.org.uk/82d01fe6ee52086035b201cfa1410a3b04384257/arm64…
A bisect appears to confirm that it's this commit, which is in -next as
6e4a077c0b607c674536908c5b68f1c31e4e26ec.
git bisect start
# status: waiting for both good and bad commits
# bad: [82d01fe6ee52086035b201cfa1410a3b04384257] Add linux-next specific files for 20240709
git bisect bad 82d01fe6ee52086035b201cfa1410a3b04384257
# status: waiting for good commit(s), bad commit known
# good: [037206cd4cb43d535453723140fde1bcde0b296e] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
git bisect good 037206cd4cb43d535453723140fde1bcde0b296e
# bad: [2ae3e655fc40f1b6620194b90dcf9a4515257918] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git
git bisect bad 2ae3e655fc40f1b6620194b90dcf9a4515257918
# bad: [4f2a367612d46dff2068582feadfbdd8e1c0443f] Merge branch 'fs-next' of linux-next
git bisect bad 4f2a367612d46dff2068582feadfbdd8e1c0443f
# bad: [d3da7ed72840f3660f90966490adfd499d96ea8f] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git
git bisect bad d3da7ed72840f3660f90966490adfd499d96ea8f
# good: [6355edbb3dfe322f0748b1eb3987973a568bbb42] Merge tag 'v6.11-rockchip-dts64-2' of https://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into soc/dt
git bisect good 6355edbb3dfe322f0748b1eb3987973a568bbb42
# good: [2073cda629a47f2ebe2afcd3cb8b3000d5cd13d1] mm: optimization on page allocation when CMA enabled
git bisect good 2073cda629a47f2ebe2afcd3cb8b3000d5cd13d1
# good: [91a2b5b12867f77dc68d2d15ec7381e6e43820cb] Merge branch 'perf-tools-next' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git
git bisect good 91a2b5b12867f77dc68d2d15ec7381e6e43820cb
# bad: [b8c38a39b6ee44b02ee563b60439f417fec441ad] Merge branch 'for-next/perf' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git
git bisect bad b8c38a39b6ee44b02ee563b60439f417fec441ad
# good: [c100216635e922f43d9e783da918a749995350ca] Merge branch 'for-next/vcpu-hotplug' into for-next/core
git bisect good c100216635e922f43d9e783da918a749995350ca
# bad: [fafb823fc82dfb746cc9043b1573c4b29ef1d52a] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux.git
git bisect bad fafb823fc82dfb746cc9043b1573c4b29ef1d52a
# bad: [8d46f9dd06378e346a562c75bc2a260a03abe807] csky: convert to generic syscall table
git bisect bad 8d46f9dd06378e346a562c75bc2a260a03abe807
# good: [57029ba74296a4dafe35f147e88d56d8ae7b69da] kbuild: add syscall table generation to scripts/Makefile.asm-headers
git bisect good 57029ba74296a4dafe35f147e88d56d8ae7b69da
# good: [ea0130bf3c45f276b1f9e005eeb255a80a10358b] arm64: convert unistd_32.h to syscall.tbl format
git bisect good ea0130bf3c45f276b1f9e005eeb255a80a10358b
# bad: [b2595bdb3eb3fe24137d0bd07a51bc622f068a81] arm64: rework compat syscall macros
git bisect bad b2595bdb3eb3fe24137d0bd07a51bc622f068a81
# bad: [6e4a077c0b607c674536908c5b68f1c31e4e26ec] arm64: generate 64-bit syscall.tbl
git bisect bad 6e4a077c0b607c674536908c5b68f1c31e4e26ec
# first bad commit: [6e4a077c0b607c674536908c5b68f1c31e4e26ec] arm64: generate 64-bit syscall.tbl