Changes from PATCH v2 -> v3: - Fixed typos in commit messages and documentation (Lance Yang, Randy Dunlap) - Split out the force_scan patch to be reviewed separately - Added benchmarks from Ghait Ouled Amar Ben Cheikh - Fixed reported compile error without CONFIG_MEMCG
Changes from PATCH v1 -> v2: - Updated selftest to use ksft_test_result_code instead of switch-case (Muhammad Usama Anjum) - Included more use cases in the cover letter (Huang, Ying) - Added documentation for sysfs and memcg interfaces - Added an aging-specific struct lru_gen_mm_walk in struct pglist_data to avoid allocating for each lruvec.
Changes from RFC v3 -> PATCH v1: - Updated selftest to use ksft_print_msg instead of fprintf(stderr, ...) (Muhammad Usama Anjum) - Included more detail in patch skipping pmd_young with force_scan (Huang, Ying) - Deferred reaccess histogram as a followup - Removed per-memcg page age interval configs for simplicity
Changes from RFC v2 -> RFC v3: - Update to v6.8 - Added an aging kernel thread (gated behind config) - Added basic selftests for sysfs interface files - Track swapped out pages for reaccesses - Refactoring and cleanup - Dropped the virtio-balloon extension to make things manageable
Changes from RFC v1 -> RFC v2: - Refactored the patchs into smaller pieces - Renamed interfaces and functions from wss to wsr (Working Set Reporting) - Fixed build errors when CONFIG_WSR is not set - Changed working_set_num_bins to u8 for virtio-balloon - Added support for per-NUMA node reporting for virtio-balloon
[rfc v1] https://lore.kernel.org/linux-mm/20230509185419.1088297-1-yuanchu@google.com... [rfc v2] https://lore.kernel.org/linux-mm/20230621180454.973862-1-yuanchu@google.com/ [rfc v3] https://lore.kernel.org/linux-mm/20240327213108.2384666-1-yuanchu@google.com...
This patch series provides workingset reporting of user pages in lruvecs, of which coldness can be tracked by accessed bits and fd references. However, the concept of workingset applies generically to all types of memory, which could be kernel slab caches, discardable userspace caches (databases), or CXL.mem. Therefore, data sources might come from slab shrinkers, device drivers, or the userspace. IMO, the kernel should provide a set of workingset interfaces that should be generic enough to accommodate the various use cases, and be extensible to potential future use cases. The current proposed interfaces are not sufficient in that regard, but I would like to start somewhere, solicit feedback, and iterate.
Use cases ========== Job scheduling On overcommitted hosts, workingset information allows the job scheduler to right-size each job and land more jobs on the same host or NUMA node, and in the case of a job with increasing workingset, policy decisions can be made to migrate other jobs off the host/NUMA node, or oom-kill the misbehaving job. If the job shape is very different from the machine shape, knowing the workingset per-node can also help inform page allocation policies.
Proactive reclaim Workingset information allows the a container manager to proactively reclaim memory while not impacting a job's performance. While PSI may provide a reactive measure of when a proactive reclaim has reclaimed too much, workingset reporting allows the policy to be more accurate and flexible.
Ballooning (similar to proactive reclaim) While this patch series does not extend the virtio-balloon device, balloon policies benefit from workingset to more precisely determine the size of the memory balloon. On desktops/laptops/mobile devices where memory is scarce and overcommitted, the balloon sizing in multiple VMs running on the same device can be orchestrated with workingset reports from each one.
Promotion/Demotion If different mechanisms are used for promition and demotion, workingset information can help connect the two and avoid pages being migrated back and forth. For example, given a promotion hot page threshold defined in reaccess distance of N seconds (promote pages accessed more often than every N seconds). The threshold N should be set so that ~80% (e.g.) of pages on the fast memory node passes the threshold. This calculation can be done with workingset reports. To be directly useful for promotion policies, the workingset report interfaces need to be extended to report hotness and gather hotness information from the devices[1].
[1] https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-...
Sysfs and Cgroup Interfaces ========== The interfaces are detailed in the patches that introduce them. The main idea here is we break down the workingset per-node per-memcg into time intervals (ms), e.g.
1000 anon=137368 file=24530 20000 anon=34342 file=0 30000 anon=353232 file=333608 40000 anon=407198 file=206052 9223372036854775807 anon=4925624 file=892892
I realize this does not generalize well to hotness information, but I lack the intuition for an abstraction that presents hotness in a useful way. Based on a recent proposal for move_phys_pages[2], it seems like userspace tiering software would like to move specific physical pages, instead of informing the kernel "move x number of hot pages to y device". Please advise.
[2] https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge....
Implementation ========== Currently, the reporting of user pages is based off of MGLRU, and therefore requires CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more fine-grained workingset report. I will make the generation count configurable in the next version. The workingset reporting mechanism is gated behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind CONFIG_WORKINGSET_REPORT_AGING.
Benchmarks ========== Ghait Ouled Amar Ben Cheikh has implemented a simple "reclaim everything colder than 10 seconds every 40 seconds" policy and ran Linux compile and redis from the phoronix test suite. The results are in his repo: https://github.com/miloudi98/WMO
Yuanchu Xie (7): mm: aggregate working set information into histograms mm: use refresh interval to rate-limit workingset report aggregation mm: report workingset during memory pressure driven scanning mm: extend working set reporting to memcgs mm: add kernel aging thread for workingset reporting selftest: test system-wide workingset reporting Docs/admin-guide/mm/workingset_report: document sysfs and memcg interfaces
Documentation/admin-guide/mm/index.rst | 1 + .../admin-guide/mm/workingset_report.rst | 105 ++++ drivers/base/node.c | 6 + include/linux/memcontrol.h | 21 + include/linux/mmzone.h | 9 + include/linux/workingset_report.h | 97 +++ mm/Kconfig | 15 + mm/Makefile | 2 + mm/internal.h | 18 + mm/memcontrol.c | 184 +++++- mm/mm_init.c | 2 + mm/mmzone.c | 2 + mm/vmscan.c | 56 +- mm/workingset_report.c | 561 ++++++++++++++++++ mm/workingset_report_aging.c | 127 ++++ tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 3 + tools/testing/selftests/mm/run_vmtests.sh | 5 + .../testing/selftests/mm/workingset_report.c | 306 ++++++++++ .../testing/selftests/mm/workingset_report.h | 39 ++ .../selftests/mm/workingset_report_test.c | 330 +++++++++++ 21 files changed, 1885 insertions(+), 5 deletions(-) create mode 100644 Documentation/admin-guide/mm/workingset_report.rst create mode 100644 include/linux/workingset_report.h create mode 100644 mm/workingset_report.c create mode 100644 mm/workingset_report_aging.c create mode 100644 tools/testing/selftests/mm/workingset_report.c create mode 100644 tools/testing/selftests/mm/workingset_report.h create mode 100644 tools/testing/selftests/mm/workingset_report_test.c
Hierarchically aggregate all memcgs' MGLRU generations and their page counts into working set page age histograms. The histograms break down the system's working set per-node, per-anon/file.
The sysfs interfaces are as follows: /sys/devices/system/node/nodeX/workingset_report/page_age A per-node page age histogram, showing an aggregate of the node's lruvecs. The information is extracted from MGLRU's per-generation page counters. Reading this file causes a hierarchical aging of all lruvecs, scanning pages and creates a new generation in each lruvec. For example: 1000 anon=0 file=0 2000 anon=0 file=0 100000 anon=5533696 file=5566464 18446744073709551615 anon=0 file=0
/sys/devices/system/node/nodeX/workingset_report/page_age_interval A comma separated list of time in milliseconds that configures what the page age histogram uses for aggregation.
Signed-off-by: Yuanchu Xie yuanchu@google.com Change-Id: I88fb6183f511f5a697c91a14619ee8e71e98e553 --- drivers/base/node.c | 6 + include/linux/mmzone.h | 9 + include/linux/workingset_report.h | 79 ++++++ mm/Kconfig | 9 + mm/Makefile | 1 + mm/internal.h | 4 + mm/memcontrol.c | 2 + mm/mm_init.c | 2 + mm/mmzone.c | 2 + mm/vmscan.c | 10 +- mm/workingset_report.c | 451 ++++++++++++++++++++++++++++++ 11 files changed, 571 insertions(+), 4 deletions(-) create mode 100644 include/linux/workingset_report.h create mode 100644 mm/workingset_report.c
diff --git a/drivers/base/node.c b/drivers/base/node.c index eb72580288e6..ba5b8720dbfa 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -20,6 +20,8 @@ #include <linux/pm_runtime.h> #include <linux/swap.h> #include <linux/slab.h> +#include <linux/memcontrol.h> +#include <linux/workingset_report.h>
static const struct bus_type node_subsys = { .name = "node", @@ -626,6 +628,7 @@ static int register_node(struct node *node, int num) } else { hugetlb_register_node(node); compaction_register_node(node); + wsr_init_sysfs(node); }
return error; @@ -642,6 +645,9 @@ void unregister_node(struct node *node) { hugetlb_unregister_node(node); compaction_unregister_node(node); + wsr_remove_sysfs(node); + wsr_destroy_lruvec(mem_cgroup_lruvec(NULL, NODE_DATA(node->dev.id))); + wsr_destroy_pgdat(NODE_DATA(node->dev.id)); node_remove_accesses(node); node_remove_caches(node); device_unregister(&node->dev); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 41458892bc8a..57d485f73363 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -24,6 +24,7 @@ #include <linux/local_lock.h> #include <linux/zswap.h> #include <asm/page.h> +#include <linux/workingset_report.h>
/* Free memory management - zoned buddy allocator. */ #ifndef CONFIG_ARCH_FORCE_MAX_ORDER @@ -633,6 +634,9 @@ struct lruvec { struct lru_gen_mm_state mm_state; #endif #endif /* CONFIG_LRU_GEN */ +#ifdef CONFIG_WORKINGSET_REPORT + struct wsr_state wsr; +#endif /* CONFIG_WORKINGSET_REPORT */ #ifdef CONFIG_MEMCG struct pglist_data *pgdat; #endif @@ -1405,6 +1409,11 @@ typedef struct pglist_data { struct lru_gen_memcg memcg_lru; #endif
+#ifdef CONFIG_WORKINGSET_REPORT + struct mutex wsr_update_mutex; + struct wsr_report_bins __rcu *wsr_page_age_bins; +#endif + CACHELINE_PADDING(_pad2_);
/* Per-node vmstats */ diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h new file mode 100644 index 000000000000..d7c2ee14ec87 --- /dev/null +++ b/include/linux/workingset_report.h @@ -0,0 +1,79 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_WORKINGSET_REPORT_H +#define _LINUX_WORKINGSET_REPORT_H + +#include <linux/types.h> +#include <linux/mutex.h> + +struct mem_cgroup; +struct pglist_data; +struct node; +struct lruvec; + +#ifdef CONFIG_WORKINGSET_REPORT + +#define WORKINGSET_REPORT_MIN_NR_BINS 2 +#define WORKINGSET_REPORT_MAX_NR_BINS 32 + +#define WORKINGSET_INTERVAL_MAX ((unsigned long)-1) +#define ANON_AND_FILE 2 + +struct wsr_report_bin { + unsigned long idle_age; + unsigned long nr_pages[ANON_AND_FILE]; +}; + +struct wsr_report_bins { + /* excludes the WORKINGSET_INTERVAL_MAX bin */ + unsigned long nr_bins; + /* last bin contains WORKINGSET_INTERVAL_MAX */ + unsigned long idle_age[WORKINGSET_REPORT_MAX_NR_BINS]; + struct rcu_head rcu; +}; + +struct wsr_page_age_histo { + unsigned long timestamp; + struct wsr_report_bin bins[WORKINGSET_REPORT_MAX_NR_BINS]; +}; + +struct wsr_state { + /* breakdown of workingset by page age */ + struct mutex page_age_lock; + struct wsr_page_age_histo *page_age; +}; + +void wsr_init_lruvec(struct lruvec *lruvec); +void wsr_destroy_lruvec(struct lruvec *lruvec); +void wsr_init_pgdat(struct pglist_data *pgdat); +void wsr_destroy_pgdat(struct pglist_data *pgdat); +void wsr_init_sysfs(struct node *node); +void wsr_remove_sysfs(struct node *node); + +/* + * Returns true if the wsr is configured to be refreshed. + * The next refresh time is stored in refresh_time. + */ +bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, + struct pglist_data *pgdat); +#else +static inline void wsr_init_lruvec(struct lruvec *lruvec) +{ +} +static inline void wsr_destroy_lruvec(struct lruvec *lruvec) +{ +} +static inline void wsr_init_pgdat(struct pglist_data *pgdat) +{ +} +static inline void wsr_destroy_pgdat(struct pglist_data *pgdat) +{ +} +static inline void wsr_init_sysfs(struct node *node) +{ +} +static inline void wsr_remove_sysfs(struct node *node) +{ +} +#endif /* CONFIG_WORKINGSET_REPORT */ + +#endif /* _LINUX_WORKINGSET_REPORT_H */ diff --git a/mm/Kconfig b/mm/Kconfig index b72e7d040f78..951b4e7300d2 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1263,6 +1263,15 @@ config IOMMU_MM_DATA config EXECMEM bool
+config WORKINGSET_REPORT + bool "Working set reporting" + depends on LRU_GEN && SYSFS + help + Report system and per-memcg working set to userspace. + + This option exports stats and events giving the user more insight + into its memory working set. + source "mm/damon/Kconfig"
endmenu diff --git a/mm/Makefile b/mm/Makefile index d2915f8c9dc0..62cef72ce7f8 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -98,6 +98,7 @@ obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o obj-$(CONFIG_PAGE_COUNTER) += page_counter.o obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o +obj-$(CONFIG_WORKINGSET_REPORT) += workingset_report.o ifdef CONFIG_SWAP obj-$(CONFIG_MEMCG) += swap_cgroup.o endif diff --git a/mm/internal.h b/mm/internal.h index b4d86436565b..f7d790f5d41c 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -406,11 +406,15 @@ extern unsigned long highest_memmap_pfn; /* * in mm/vmscan.c: */ +struct scan_control; bool isolate_lru_page(struct page *page); bool folio_isolate_lru(struct folio *folio); void putback_lru_page(struct page *page); void folio_putback_lru(struct folio *folio); extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason); +bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq, bool can_swap, + bool force_scan); +void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs);
/* * in mm/rmap.c: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f29157288b7d..5d16fd9d2c3f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -61,6 +61,7 @@ #include <linux/seq_buf.h> #include <linux/sched/isolation.h> #include <linux/kmemleak.h> +#include <linux/workingset_report.h> #include "internal.h" #include <net/sock.h> #include <net/ip.h> @@ -3504,6 +3505,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) if (!pn) return;
+ wsr_destroy_lruvec(&pn->lruvec); free_percpu(pn->lruvec_stats_percpu); kfree(pn->lruvec_stats); kfree(pn); diff --git a/mm/mm_init.c b/mm/mm_init.c index 75c3bd42799b..d8fe19fc1cef 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -30,6 +30,7 @@ #include <linux/crash_dump.h> #include <linux/execmem.h> #include <linux/vmstat.h> +#include <linux/workingset_report.h> #include "internal.h" #include "slab.h" #include "shuffle.h" @@ -1378,6 +1379,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
pgdat_page_ext_init(pgdat); lruvec_init(&pgdat->__lruvec); + wsr_init_pgdat(pgdat); }
static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid, diff --git a/mm/mmzone.c b/mm/mmzone.c index c01896eca736..477cd5ac1d78 100644 --- a/mm/mmzone.c +++ b/mm/mmzone.c @@ -90,6 +90,8 @@ void lruvec_init(struct lruvec *lruvec) */ list_del(&lruvec->lists[LRU_UNEVICTABLE]);
+ wsr_init_lruvec(lruvec); + lru_gen_init_lruvec(lruvec); }
diff --git a/mm/vmscan.c b/mm/vmscan.c index cfa839284b92..8ab1b456d2cf 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -56,6 +56,7 @@ #include <linux/khugepaged.h> #include <linux/rculist_nulls.h> #include <linux/random.h> +#include <linux/workingset_report.h>
#include <asm/tlbflush.h> #include <asm/div64.h> @@ -270,8 +271,7 @@ static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg) } #endif
-static void set_task_reclaim_state(struct task_struct *task, - struct reclaim_state *rs) +void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs) { /* Check for an overwrite */ WARN_ON_ONCE(rs && task->reclaim_state); @@ -3857,8 +3857,8 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, return success; }
-static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq, - bool can_swap, bool force_scan) +bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq, bool can_swap, + bool force_scan) { bool success; struct lru_gen_mm_walk *walk; @@ -5631,6 +5631,8 @@ static int __init init_lru_gen(void) if (sysfs_create_group(mm_kobj, &lru_gen_attr_group)) pr_err("lru_gen: failed to create sysfs group\n");
+ wsr_init_sysfs(NULL); + debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops); debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);
diff --git a/mm/workingset_report.c b/mm/workingset_report.c new file mode 100644 index 000000000000..a4dcf62fcd96 --- /dev/null +++ b/mm/workingset_report.c @@ -0,0 +1,451 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/export.h> +#include <linux/lockdep.h> +#include <linux/jiffies.h> +#include <linux/kernfs.h> +#include <linux/memcontrol.h> +#include <linux/rcupdate.h> +#include <linux/mutex.h> +#include <linux/err.h> +#include <linux/atomic.h> +#include <linux/node.h> +#include <linux/mmzone.h> +#include <linux/mm.h> +#include <linux/mm_inline.h> +#include <linux/workingset_report.h> + +#include "internal.h" + +void wsr_init_pgdat(struct pglist_data *pgdat) +{ + mutex_init(&pgdat->wsr_update_mutex); + RCU_INIT_POINTER(pgdat->wsr_page_age_bins, NULL); +} + +void wsr_destroy_pgdat(struct pglist_data *pgdat) +{ + struct wsr_report_bins __rcu *bins; + + mutex_lock(&pgdat->wsr_update_mutex); + bins = rcu_replace_pointer(pgdat->wsr_page_age_bins, NULL, + lockdep_is_held(&pgdat->wsr_update_mutex)); + kfree_rcu(bins, rcu); + mutex_unlock(&pgdat->wsr_update_mutex); + mutex_destroy(&pgdat->wsr_update_mutex); +} + +void wsr_init_lruvec(struct lruvec *lruvec) +{ + struct wsr_state *wsr = &lruvec->wsr; + + memset(wsr, 0, sizeof(*wsr)); + mutex_init(&wsr->page_age_lock); +} + +void wsr_destroy_lruvec(struct lruvec *lruvec) +{ + struct wsr_state *wsr = &lruvec->wsr; + + mutex_destroy(&wsr->page_age_lock); + kfree(wsr->page_age); + memset(wsr, 0, sizeof(*wsr)); +} + +static int workingset_report_intervals_parse(char *src, + struct wsr_report_bins *bins) +{ + int err = 0, i = 0; + char *cur, *next = strim(src); + + if (*next == '\0') + return 0; + + while ((cur = strsep(&next, ","))) { + unsigned int interval; + + err = kstrtouint(cur, 0, &interval); + if (err) + goto out; + + bins->idle_age[i] = msecs_to_jiffies(interval); + if (i > 0 && bins->idle_age[i] <= bins->idle_age[i - 1]) { + err = -EINVAL; + goto out; + } + + if (++i == WORKINGSET_REPORT_MAX_NR_BINS) { + err = -ERANGE; + goto out; + } + } + + if (i && i < WORKINGSET_REPORT_MIN_NR_BINS - 1) { + err = -ERANGE; + goto out; + } + + bins->nr_bins = i; + bins->idle_age[i] = WORKINGSET_INTERVAL_MAX; +out: + return err ?: i; +} + +static unsigned long get_gen_start_time(const struct lru_gen_folio *lrugen, + unsigned long seq, + unsigned long max_seq, + unsigned long curr_timestamp) +{ + int younger_gen; + + if (seq == max_seq) + return curr_timestamp; + younger_gen = lru_gen_from_seq(seq + 1); + return READ_ONCE(lrugen->timestamps[younger_gen]); +} + +static void collect_page_age_type(const struct lru_gen_folio *lrugen, + struct wsr_report_bin *bin, + unsigned long max_seq, unsigned long min_seq, + unsigned long curr_timestamp, int type) +{ + unsigned long seq; + + for (seq = max_seq; seq + 1 > min_seq; seq--) { + int gen, zone; + unsigned long gen_end, gen_start, size = 0; + + gen = lru_gen_from_seq(seq); + + for (zone = 0; zone < MAX_NR_ZONES; zone++) + size += max( + READ_ONCE(lrugen->nr_pages[gen][type][zone]), + 0L); + + gen_start = get_gen_start_time(lrugen, seq, max_seq, + curr_timestamp); + gen_end = READ_ONCE(lrugen->timestamps[gen]); + + while (bin->idle_age != WORKINGSET_INTERVAL_MAX && + time_before(gen_end + bin->idle_age, curr_timestamp)) { + unsigned long gen_in_bin = (long)gen_start - + (long)curr_timestamp + + (long)bin->idle_age; + unsigned long gen_len = (long)gen_start - (long)gen_end; + + if (!gen_len) + break; + if (gen_in_bin) { + unsigned long split_bin = + size / gen_len * gen_in_bin; + + bin->nr_pages[type] += split_bin; + size -= split_bin; + } + gen_start = curr_timestamp - bin->idle_age; + bin++; + } + bin->nr_pages[type] += size; + } +} + +/* + * proportionally aggregate Multi-gen LRU bins into a working set report + * MGLRU generations: + * current time + * | max_seq timestamp + * | | max_seq - 1 timestamp + * | | | unbounded + * | | | | + * -------------------------------- + * | max_seq | ... | ... | min_seq + * -------------------------------- + * + * Bins: + * + * current time + * | current - idle_age[0] + * | | current - idle_age[1] + * | | | unbounded + * | | | | + * ------------------------------ + * | bin 0 | ... | ... | bin n-1 + * ------------------------------ + * + * Assume the heuristic that pages are in the MGLRU generation + * through uniform accesses, so we can aggregate them + * proportionally into bins. + */ +static void collect_page_age(struct wsr_page_age_histo *page_age, + const struct lruvec *lruvec) +{ + int type; + const struct lru_gen_folio *lrugen = &lruvec->lrugen; + unsigned long curr_timestamp = jiffies; + unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq); + unsigned long min_seq[ANON_AND_FILE] = { + READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_ANON]), + READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_FILE]), + }; + struct wsr_report_bin *bin = &page_age->bins[0]; + + for (type = 0; type < ANON_AND_FILE; type++) + collect_page_age_type(lrugen, bin, max_seq, min_seq[type], + curr_timestamp, type); +} + +/* First step: hierarchically scan child memcgs. */ +static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root, + struct pglist_data *pgdat) +{ + struct mem_cgroup *memcg; + unsigned int flags; + struct reclaim_state rs = { 0 }; + + set_task_reclaim_state(current, &rs); + flags = memalloc_noreclaim_save(); + + memcg = mem_cgroup_iter(root, NULL, NULL); + do { + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq); + + /* + * setting can_swap=true and force_scan=true ensures + * proper workingset stats when the system cannot swap. + */ + try_to_inc_max_seq(lruvec, max_seq, true, true); + cond_resched(); + } while ((memcg = mem_cgroup_iter(root, memcg, NULL))); + + memalloc_noreclaim_restore(flags); + set_task_reclaim_state(current, NULL); +} + +/* Second step: aggregate child memcgs into the page age histogram. */ +static void refresh_aggregate(struct wsr_page_age_histo *page_age, + struct mem_cgroup *root, + struct pglist_data *pgdat) +{ + struct mem_cgroup *memcg; + struct wsr_report_bin *bin; + + for (bin = page_age->bins; + bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) { + bin->nr_pages[0] = 0; + bin->nr_pages[1] = 0; + } + /* the last used bin has idle_age == WORKINGSET_INTERVAL_MAX. */ + bin->nr_pages[0] = 0; + bin->nr_pages[1] = 0; + + memcg = mem_cgroup_iter(root, NULL, NULL); + do { + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + + collect_page_age(page_age, lruvec); + cond_resched(); + } while ((memcg = mem_cgroup_iter(root, memcg, NULL))); + WRITE_ONCE(page_age->timestamp, jiffies); +} + +static void copy_node_bins(struct pglist_data *pgdat, + struct wsr_page_age_histo *page_age) +{ + struct wsr_report_bins *node_page_age_bins; + int i = 0; + + rcu_read_lock(); + node_page_age_bins = rcu_dereference(pgdat->wsr_page_age_bins); + if (!node_page_age_bins) + goto nocopy; + for (i = 0; i < node_page_age_bins->nr_bins; ++i) + page_age->bins[i].idle_age = node_page_age_bins->idle_age[i]; + +nocopy: + page_age->bins[i].idle_age = WORKINGSET_INTERVAL_MAX; + rcu_read_unlock(); +} + +bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, + struct pglist_data *pgdat) +{ + struct wsr_page_age_histo *page_age; + + if (!READ_ONCE(wsr->page_age)) + return false; + + refresh_scan(wsr, root, pgdat); + mutex_lock(&wsr->page_age_lock); + page_age = READ_ONCE(wsr->page_age); + if (page_age) { + copy_node_bins(pgdat, page_age); + refresh_aggregate(page_age, root, pgdat); + } + mutex_unlock(&wsr->page_age_lock); + return !!page_age; +} +EXPORT_SYMBOL_GPL(wsr_refresh_report); + +static struct pglist_data *kobj_to_pgdat(struct kobject *kobj) +{ + int nid = IS_ENABLED(CONFIG_NUMA) ? kobj_to_dev(kobj)->id : + first_memory_node; + + return NODE_DATA(nid); +} + +static struct wsr_state *kobj_to_wsr(struct kobject *kobj) +{ + return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr; +} + +static ssize_t page_age_intervals_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct wsr_report_bins *bins; + int len = 0; + struct pglist_data *pgdat = kobj_to_pgdat(kobj); + + rcu_read_lock(); + bins = rcu_dereference(pgdat->wsr_page_age_bins); + if (bins) { + int i; + int nr_bins = bins->nr_bins; + + for (i = 0; i < bins->nr_bins; ++i) { + len += sysfs_emit_at( + buf, len, "%u", + jiffies_to_msecs(bins->idle_age[i])); + if (i + 1 < nr_bins) + len += sysfs_emit_at(buf, len, ","); + } + } + len += sysfs_emit_at(buf, len, "\n"); + rcu_read_unlock(); + + return len; +} + +static ssize_t page_age_intervals_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *src, size_t len) +{ + struct wsr_report_bins *bins = NULL, __rcu *old; + char *buf = NULL; + int err = 0; + struct pglist_data *pgdat = kobj_to_pgdat(kobj); + + buf = kstrdup(src, GFP_KERNEL); + if (!buf) { + err = -ENOMEM; + goto failed; + } + + bins = + kzalloc(sizeof(struct wsr_report_bins), GFP_KERNEL); + + if (!bins) { + err = -ENOMEM; + goto failed; + } + + err = workingset_report_intervals_parse(buf, bins); + if (err < 0) + goto failed; + + if (err == 0) { + kfree(bins); + bins = NULL; + } + + mutex_lock(&pgdat->wsr_update_mutex); + old = rcu_replace_pointer(pgdat->wsr_page_age_bins, bins, + lockdep_is_held(&pgdat->wsr_update_mutex)); + mutex_unlock(&pgdat->wsr_update_mutex); + kfree_rcu(old, rcu); + kfree(buf); + return len; +failed: + kfree(bins); + kfree(buf); + + return err; +} + +static struct kobj_attribute page_age_intervals_attr = + __ATTR_RW(page_age_intervals); + +static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr, + char *buf) +{ + struct wsr_report_bin *bin; + int ret = 0; + struct wsr_state *wsr = kobj_to_wsr(kobj); + + + mutex_lock(&wsr->page_age_lock); + if (!wsr->page_age) + wsr->page_age = + kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL); + mutex_unlock(&wsr->page_age_lock); + + wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj)); + + mutex_lock(&wsr->page_age_lock); + if (!wsr->page_age) + goto unlock; + for (bin = wsr->page_age->bins; + bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) + ret += sysfs_emit_at(buf, ret, "%u anon=%lu file=%lu\n", + jiffies_to_msecs(bin->idle_age), + bin->nr_pages[0] * PAGE_SIZE, + bin->nr_pages[1] * PAGE_SIZE); + + ret += sysfs_emit_at(buf, ret, "%lu anon=%lu file=%lu\n", + WORKINGSET_INTERVAL_MAX, + bin->nr_pages[0] * PAGE_SIZE, + bin->nr_pages[1] * PAGE_SIZE); + +unlock: + mutex_unlock(&wsr->page_age_lock); + return ret; +} + +static struct kobj_attribute page_age_attr = __ATTR_RO(page_age); + +static struct attribute *workingset_report_attrs[] = { + &page_age_intervals_attr.attr, &page_age_attr.attr, NULL +}; + +static const struct attribute_group workingset_report_attr_group = { + .name = "workingset_report", + .attrs = workingset_report_attrs, +}; + +void wsr_init_sysfs(struct node *node) +{ + struct kobject *kobj = node ? &node->dev.kobj : mm_kobj; + struct wsr_state *wsr; + + if (IS_ENABLED(CONFIG_NUMA) && !node) + return; + + wsr = kobj_to_wsr(kobj); + + if (sysfs_create_group(kobj, &workingset_report_attr_group)) + pr_warn("Workingset report failed to create sysfs files\n"); +} +EXPORT_SYMBOL_GPL(wsr_init_sysfs); + +void wsr_remove_sysfs(struct node *node) +{ + struct kobject *kobj = &node->dev.kobj; + struct wsr_state *wsr; + + if (IS_ENABLED(CONFIG_NUMA) && !node) + return; + + wsr = kobj_to_wsr(kobj); + sysfs_remove_group(kobj, &workingset_report_attr_group); +} +EXPORT_SYMBOL_GPL(wsr_remove_sysfs);
The refresh interval is a rate limiting factor to workingset page age histogram reads. When a workingset report is generated, a timestamp is noted, and the same report will be read until it expires beyond the refresh interval, at which point a new report is generated.
Sysfs interface /sys/devices/system/node/nodeX/workingset_report/refresh_interval time in milliseconds specifying how long the report is valid for
Signed-off-by: Yuanchu Xie yuanchu@google.com Change-Id: Iabacec2845a7fac7280ffc5f4faed55186cfa7a2 --- include/linux/workingset_report.h | 1 + mm/workingset_report.c | 84 +++++++++++++++++++++++++------ 2 files changed, 70 insertions(+), 15 deletions(-)
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index d7c2ee14ec87..8bae6a600410 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -37,6 +37,7 @@ struct wsr_page_age_histo { };
struct wsr_state { + unsigned long refresh_interval; /* breakdown of workingset by page age */ struct mutex page_age_lock; struct wsr_page_age_histo *page_age; diff --git a/mm/workingset_report.c b/mm/workingset_report.c index a4dcf62fcd96..fe553c0a653e 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -195,7 +195,8 @@ static void collect_page_age(struct wsr_page_age_histo *page_age,
/* First step: hierarchically scan child memcgs. */ static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root, - struct pglist_data *pgdat) + struct pglist_data *pgdat, + unsigned long refresh_interval) { struct mem_cgroup *memcg; unsigned int flags; @@ -208,12 +209,15 @@ static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root, do { struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq); + int gen = lru_gen_from_seq(max_seq); + unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
/* * setting can_swap=true and force_scan=true ensures * proper workingset stats when the system cannot swap. */ - try_to_inc_max_seq(lruvec, max_seq, true, true); + if (time_is_before_jiffies(birth + refresh_interval)) + try_to_inc_max_seq(lruvec, max_seq, true, true); cond_resched(); } while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
@@ -270,17 +274,25 @@ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, struct pglist_data *pgdat) { struct wsr_page_age_histo *page_age; + unsigned long refresh_interval = READ_ONCE(wsr->refresh_interval);
if (!READ_ONCE(wsr->page_age)) return false;
- refresh_scan(wsr, root, pgdat); + if (!refresh_interval) + return false; + mutex_lock(&wsr->page_age_lock); page_age = READ_ONCE(wsr->page_age); - if (page_age) { - copy_node_bins(pgdat, page_age); - refresh_aggregate(page_age, root, pgdat); - } + if (!page_age) + goto unlock; + if (page_age->timestamp && + time_is_after_jiffies(page_age->timestamp + refresh_interval)) + goto unlock; + refresh_scan(wsr, root, pgdat, refresh_interval); + copy_node_bins(pgdat, page_age); + refresh_aggregate(page_age, root, pgdat); +unlock: mutex_unlock(&wsr->page_age_lock); return !!page_age; } @@ -299,6 +311,52 @@ static struct wsr_state *kobj_to_wsr(struct kobject *kobj) return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr; }
+static ssize_t refresh_interval_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct wsr_state *wsr = kobj_to_wsr(kobj); + unsigned int interval = READ_ONCE(wsr->refresh_interval); + + return sysfs_emit(buf, "%u\n", jiffies_to_msecs(interval)); +} + +static ssize_t refresh_interval_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t len) +{ + unsigned int interval; + int err; + struct wsr_state *wsr = kobj_to_wsr(kobj); + + err = kstrtouint(buf, 0, &interval); + if (err) + return err; + + mutex_lock(&wsr->page_age_lock); + if (interval && !wsr->page_age) { + struct wsr_page_age_histo *page_age = + kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL); + + if (!page_age) { + err = -ENOMEM; + goto unlock; + } + wsr->page_age = page_age; + } + if (!interval && wsr->page_age) { + kfree(wsr->page_age); + wsr->page_age = NULL; + } + + WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(interval)); +unlock: + mutex_unlock(&wsr->page_age_lock); + return err ?: len; +} + +static struct kobj_attribute refresh_interval_attr = + __ATTR_RW(refresh_interval); + static ssize_t page_age_intervals_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -382,13 +440,6 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr, int ret = 0; struct wsr_state *wsr = kobj_to_wsr(kobj);
- - mutex_lock(&wsr->page_age_lock); - if (!wsr->page_age) - wsr->page_age = - kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL); - mutex_unlock(&wsr->page_age_lock); - wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj));
mutex_lock(&wsr->page_age_lock); @@ -414,7 +465,10 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr, static struct kobj_attribute page_age_attr = __ATTR_RO(page_age);
static struct attribute *workingset_report_attrs[] = { - &page_age_intervals_attr.attr, &page_age_attr.attr, NULL + &refresh_interval_attr.attr, + &page_age_intervals_attr.attr, + &page_age_attr.attr, + NULL };
static const struct attribute_group workingset_report_attr_group = {
When a node reaches its low watermarks and wakes up kswapd, notify all userspace programs waiting on the workingset page age histogram of the memory pressure, so a userspace agent can read the workingset report in time and make policy decisions, such as logging, oom-killing, or migration.
Sysfs interface: /sys/devices/system/node/nodeX/workingset_report/report_threshold time in milliseconds that specifies how often the userspace agent can be notified for node memory pressure.
Signed-off-by: Yuanchu Xie yuanchu@google.com Change-Id: Ib1fdb726da921e8d467822a99da8f384c6181951 --- include/linux/workingset_report.h | 4 +++ mm/internal.h | 12 ++++++++ mm/vmscan.c | 46 +++++++++++++++++++++++++++++++ mm/workingset_report.c | 43 ++++++++++++++++++++++++++++- 4 files changed, 104 insertions(+), 1 deletion(-)
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index 8bae6a600410..2ec8b927b200 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -37,7 +37,11 @@ struct wsr_page_age_histo { };
struct wsr_state { + unsigned long report_threshold; unsigned long refresh_interval; + + struct kernfs_node *page_age_sys_file; + /* breakdown of workingset by page age */ struct mutex page_age_lock; struct wsr_page_age_histo *page_age; diff --git a/mm/internal.h b/mm/internal.h index f7d790f5d41c..19af882c506e 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -416,6 +416,18 @@ bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq, bool can_swap, bool force_scan); void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs);
+#ifdef CONFIG_WORKINGSET_REPORT +/* + * in mm/wsr.c + */ +void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat); +#else +static inline void notify_workingset(struct mem_cgroup *memcg, + struct pglist_data *pgdat) +{ +} +#endif + /* * in mm/rmap.c: */ diff --git a/mm/vmscan.c b/mm/vmscan.c index 8ab1b456d2cf..b07fd2016c75 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2574,6 +2574,15 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, return can_demote(pgdat->node_id, sc); }
+#ifdef CONFIG_WORKINGSET_REPORT +static void try_to_report_workingset(struct pglist_data *pgdat, struct scan_control *sc); +#else +static inline void try_to_report_workingset(struct pglist_data *pgdat, + struct scan_control *sc) +{ +} +#endif + #ifdef CONFIG_LRU_GEN
#ifdef CONFIG_LRU_GEN_ENABLED @@ -4000,6 +4009,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
set_initial_priority(pgdat, sc);
+ try_to_report_workingset(pgdat, sc); + memcg = mem_cgroup_iter(NULL, NULL, NULL); do { struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); @@ -5640,6 +5651,38 @@ static int __init init_lru_gen(void) }; late_initcall(init_lru_gen);
+#ifdef CONFIG_WORKINGSET_REPORT +static void try_to_report_workingset(struct pglist_data *pgdat, + struct scan_control *sc) +{ + struct mem_cgroup *memcg = sc->target_mem_cgroup; + struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr; + unsigned long threshold = READ_ONCE(wsr->report_threshold); + + if (sc->priority == DEF_PRIORITY) + return; + + if (!threshold) + return; + + if (!mutex_trylock(&wsr->page_age_lock)) + return; + + if (!wsr->page_age) { + mutex_unlock(&wsr->page_age_lock); + return; + } + + if (time_is_after_jiffies(wsr->page_age->timestamp + threshold)) { + mutex_unlock(&wsr->page_age_lock); + return; + } + + mutex_unlock(&wsr->page_age_lock); + notify_workingset(memcg, pgdat); +} +#endif /* CONFIG_WORKINGSET_REPORT */ + #else /* !CONFIG_LRU_GEN */
static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) @@ -6191,6 +6234,9 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) if (zone->zone_pgdat == last_pgdat) continue; last_pgdat = zone->zone_pgdat; + + if (!sc->proactive) + try_to_report_workingset(zone->zone_pgdat, sc); shrink_node(zone->zone_pgdat, sc); }
diff --git a/mm/workingset_report.c b/mm/workingset_report.c index fe553c0a653e..801ac8e5c1da 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -311,6 +311,33 @@ static struct wsr_state *kobj_to_wsr(struct kobject *kobj) return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr; }
+static ssize_t report_threshold_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct wsr_state *wsr = kobj_to_wsr(kobj); + unsigned int threshold = READ_ONCE(wsr->report_threshold); + + return sysfs_emit(buf, "%u\n", jiffies_to_msecs(threshold)); +} + +static ssize_t report_threshold_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t len) +{ + unsigned int threshold; + struct wsr_state *wsr = kobj_to_wsr(kobj); + + if (kstrtouint(buf, 0, &threshold)) + return -EINVAL; + + WRITE_ONCE(wsr->report_threshold, msecs_to_jiffies(threshold)); + + return len; +} + +static struct kobj_attribute report_threshold_attr = + __ATTR_RW(report_threshold); + static ssize_t refresh_interval_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -465,6 +492,7 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr, static struct kobj_attribute page_age_attr = __ATTR_RO(page_age);
static struct attribute *workingset_report_attrs[] = { + &report_threshold_attr.attr, &refresh_interval_attr.attr, &page_age_intervals_attr.attr, &page_age_attr.attr, @@ -486,8 +514,13 @@ void wsr_init_sysfs(struct node *node)
wsr = kobj_to_wsr(kobj);
- if (sysfs_create_group(kobj, &workingset_report_attr_group)) + if (sysfs_create_group(kobj, &workingset_report_attr_group)) { pr_warn("Workingset report failed to create sysfs files\n"); + return; + } + + wsr->page_age_sys_file = + kernfs_walk_and_get(kobj->sd, "workingset_report/page_age"); } EXPORT_SYMBOL_GPL(wsr_init_sysfs);
@@ -500,6 +533,14 @@ void wsr_remove_sysfs(struct node *node) return;
wsr = kobj_to_wsr(kobj); + kernfs_put(wsr->page_age_sys_file); sysfs_remove_group(kobj, &workingset_report_attr_group); } EXPORT_SYMBOL_GPL(wsr_remove_sysfs); + +void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat) +{ + struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr; + + kernfs_notify(wsr->page_age_sys_file); +}
Break down the system-wide working set reporting into per-memcg reports, which aggregages its children hierarchically. The per-node working set reporting histograms and refresh/report threshold files are presented as memcg files, showing a report containing all the nodes.
The per-node page age interval is configurable in sysfs and not available per-memcg, while the refresh interval and report threshold are configured per-memcg.
Memcg interface: /sys/fs/cgroup/.../memory.workingset.page_age The memcg equivalent of the sysfs workingset page age histogram breaks down the workingset of this memcg and its children into page age intervals. Each node is prefixed with a node header and a newline. Non-proactive direct reclaim on this memcg can also wake up userspace agents that are waiting on this file. e.g. N0 1000 anon=0 file=0 2000 anon=0 file=0 3000 anon=0 file=0 4000 anon=0 file=0 5000 anon=0 file=0 18446744073709551615 anon=0 file=0
/sys/fs/cgroup/.../memory.workingset.refresh_interval The memcg equivalent of the sysfs refresh interval. A per-node number of how much time a page age histogram is valid for, in milliseconds. e.g. echo N0=2000 > memory.workingset.refresh_interval
/sys/fs/cgroup/.../memory.workingset.report_threshold The memcg equivalent of the sysfs report threshold. A per-node number of how often userspace agent waiting on the page age histogram can be woken up, in milliseconds. e.g. echo N0=1000 > memory.workingset.report_threshold
Signed-off-by: Yuanchu Xie yuanchu@google.com Change-Id: I6c8a39cddd248c031c85f144787c9044912f3c21 --- include/linux/memcontrol.h | 21 ++++ include/linux/workingset_report.h | 6 +- mm/internal.h | 2 + mm/memcontrol.c | 178 +++++++++++++++++++++++++++++- mm/workingset_report.c | 12 +- 5 files changed, 214 insertions(+), 5 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0e5bf25d324f..6d67f1072943 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -315,6 +315,11 @@ struct mem_cgroup { spinlock_t event_list_lock; #endif /* CONFIG_MEMCG_V1 */
+#ifdef CONFIG_WORKINGSET_REPORT + /* memory.workingset.page_age file */ + struct cgroup_file workingset_page_age_file; +#endif + struct mem_cgroup_per_node *nodeinfo[]; };
@@ -1063,6 +1068,16 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
void split_page_memcg(struct page *head, int old_order, int new_order);
+static inline struct cgroup_file * +mem_cgroup_page_age_file(struct mem_cgroup *memcg) +{ +#ifdef CONFIG_WORKINGSET_REPORT + return &memcg->workingset_page_age_file; +#else + return NULL; +#endif +} + #else /* CONFIG_MEMCG */
#define MEM_CGROUP_ID_SHIFT 0 @@ -1470,6 +1485,12 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) static inline void split_page_memcg(struct page *head, int old_order, int new_order) { } + +static inline struct cgroup_file * +mem_cgroup_page_age_file(struct mem_cgroup *memcg) +{ + return NULL; +} #endif /* CONFIG_MEMCG */
/* diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index 2ec8b927b200..ae412d408037 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -9,6 +9,7 @@ struct mem_cgroup; struct pglist_data; struct node; struct lruvec; +struct cgroup_file;
#ifdef CONFIG_WORKINGSET_REPORT
@@ -40,7 +41,10 @@ struct wsr_state { unsigned long report_threshold; unsigned long refresh_interval;
- struct kernfs_node *page_age_sys_file; + union { + struct kernfs_node *page_age_sys_file; + struct cgroup_file *page_age_cgroup_file; + };
/* breakdown of workingset by page age */ struct mutex page_age_lock; diff --git a/mm/internal.h b/mm/internal.h index 19af882c506e..cc54358ed582 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -421,6 +421,8 @@ void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs); * in mm/wsr.c */ void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat); +int workingset_report_intervals_parse(char *src, + struct wsr_report_bins *bins); #else static inline void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5d16fd9d2c3f..56848dc94d48 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4318,6 +4318,162 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, return nbytes; }
+#ifdef CONFIG_WORKINGSET_REPORT +static int memory_ws_refresh_interval_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr = + &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + + seq_printf(m, "N%d=%u ", nid, + jiffies_to_msecs(READ_ONCE(wsr->refresh_interval))); + } + seq_putc(m, '\n'); + + return 0; +} + +static ssize_t memory_wsr_threshold_parse(char *buf, size_t nbytes, + unsigned int *nid_out, + unsigned int *msecs) +{ + char *node, *threshold; + unsigned int nid; + int err; + + buf = strstrip(buf); + threshold = buf; + node = strsep(&threshold, "="); + + if (*node != 'N') + return -EINVAL; + + err = kstrtouint(node + 1, 0, &nid); + if (err) + return err; + + if (nid >= nr_node_ids || !node_state(nid, N_MEMORY)) + return -EINVAL; + + err = kstrtouint(threshold, 0, msecs); + if (err) + return err; + + *nid_out = nid; + + return nbytes; +} + +static ssize_t memory_ws_refresh_interval_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + unsigned int nid, msecs; + struct wsr_state *wsr; + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + ssize_t ret = memory_wsr_threshold_parse(buf, nbytes, &nid, &msecs); + + if (ret < 0) + return ret; + + wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + + mutex_lock(&wsr->page_age_lock); + if (msecs && !wsr->page_age) { + struct wsr_page_age_histo *page_age = + kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL); + + if (!page_age) { + ret = -ENOMEM; + goto unlock; + } + wsr->page_age = page_age; + } + if (!msecs && wsr->page_age) { + kfree(wsr->page_age); + wsr->page_age = NULL; + } + + WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(msecs)); +unlock: + mutex_unlock(&wsr->page_age_lock); + return ret; +} + +static int memory_ws_report_threshold_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr = + &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + + seq_printf(m, "N%d=%u ", nid, + jiffies_to_msecs(READ_ONCE(wsr->report_threshold))); + } + seq_putc(m, '\n'); + + return 0; +} + +static ssize_t memory_ws_report_threshold_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + unsigned int nid, msecs; + struct wsr_state *wsr; + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + ssize_t ret = memory_wsr_threshold_parse(buf, nbytes, &nid, &msecs); + + if (ret < 0) + return ret; + + wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + WRITE_ONCE(wsr->report_threshold, msecs_to_jiffies(msecs)); + return ret; +} + +static int memory_ws_page_age_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr = + &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + struct wsr_report_bin *bin; + + if (!READ_ONCE(wsr->page_age)) + continue; + + wsr_refresh_report(wsr, memcg, NODE_DATA(nid)); + mutex_lock(&wsr->page_age_lock); + if (!wsr->page_age) + goto unlock; + seq_printf(m, "N%d\n", nid); + for (bin = wsr->page_age->bins; + bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) + seq_printf(m, "%u anon=%lu file=%lu\n", + jiffies_to_msecs(bin->idle_age), + bin->nr_pages[0] * PAGE_SIZE, + bin->nr_pages[1] * PAGE_SIZE); + + seq_printf(m, "%lu anon=%lu file=%lu\n", WORKINGSET_INTERVAL_MAX, + bin->nr_pages[0] * PAGE_SIZE, + bin->nr_pages[1] * PAGE_SIZE); + +unlock: + mutex_unlock(&wsr->page_age_lock); + } + + return 0; +} +#endif + static struct cftype memory_files[] = { { .name = "current", @@ -4386,7 +4542,27 @@ static struct cftype memory_files[] = { .flags = CFTYPE_NS_DELEGATABLE, .write = memory_reclaim, }, - { } /* terminate */ +#ifdef CONFIG_WORKINGSET_REPORT + { + .name = "workingset.refresh_interval", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_ws_refresh_interval_show, + .write = memory_ws_refresh_interval_write, + }, + { + .name = "workingset.report_threshold", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_ws_report_threshold_show, + .write = memory_ws_report_threshold_write, + }, + { + .name = "workingset.page_age", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .file_offset = offsetof(struct mem_cgroup, workingset_page_age_file), + .seq_show = memory_ws_page_age_show, + }, +#endif + {} /* terminate */ };
struct cgroup_subsys memory_cgrp_subsys = { diff --git a/mm/workingset_report.c b/mm/workingset_report.c index 801ac8e5c1da..333e51e3ee12 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -37,9 +37,12 @@ void wsr_destroy_pgdat(struct pglist_data *pgdat) void wsr_init_lruvec(struct lruvec *lruvec) { struct wsr_state *wsr = &lruvec->wsr; + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
memset(wsr, 0, sizeof(*wsr)); mutex_init(&wsr->page_age_lock); + if (memcg && !mem_cgroup_is_root(memcg)) + wsr->page_age_cgroup_file = mem_cgroup_page_age_file(memcg); }
void wsr_destroy_lruvec(struct lruvec *lruvec) @@ -51,8 +54,8 @@ void wsr_destroy_lruvec(struct lruvec *lruvec) memset(wsr, 0, sizeof(*wsr)); }
-static int workingset_report_intervals_parse(char *src, - struct wsr_report_bins *bins) +int workingset_report_intervals_parse(char *src, + struct wsr_report_bins *bins) { int err = 0, i = 0; char *cur, *next = strim(src); @@ -542,5 +545,8 @@ void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat) { struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr;
- kernfs_notify(wsr->page_age_sys_file); + if (mem_cgroup_is_root(memcg)) + kernfs_notify(wsr->page_age_sys_file); + else + cgroup_file_notify(wsr->page_age_cgroup_file); }
For reliable and timely aging on memcgs, one has to read the page age histograms on time. A kernel thread makes it easier by aging memcgs with valid refresh_interval when they can be refreshed, and also reduces the latency of any userspace consumers of the page age histogram.
The kerne aging thread is gated behind CONFIG_WORKINGSET_REPORT_AGING. Debugging stats may be added in the future for when aging cannot keep up with the configured refresh_interval.
Signed-off-by: Yuanchu Xie yuanchu@google.com Change-Id: Ia34a5d6b7eb91e39fab5f2d4ca2f18a9c370890b --- include/linux/workingset_report.h | 11 ++- mm/Kconfig | 6 ++ mm/Makefile | 1 + mm/memcontrol.c | 8 +- mm/workingset_report.c | 15 +++- mm/workingset_report_aging.c | 127 ++++++++++++++++++++++++++++++ 6 files changed, 162 insertions(+), 6 deletions(-) create mode 100644 mm/workingset_report_aging.c
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index ae412d408037..9294023db5a8 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -63,7 +63,16 @@ void wsr_remove_sysfs(struct node *node); * The next refresh time is stored in refresh_time. */ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, - struct pglist_data *pgdat); + struct pglist_data *pgdat, unsigned long *refresh_time); + +#ifdef CONFIG_WORKINGSET_REPORT_AGING +void wsr_wakeup_aging_thread(void); +#else /* CONFIG_WORKINGSET_REPORT_AGING */ +static inline void wsr_wakeup_aging_thread(void) +{ +} +#endif /* CONFIG_WORKINGSET_REPORT_AGING */ + #else static inline void wsr_init_lruvec(struct lruvec *lruvec) { diff --git a/mm/Kconfig b/mm/Kconfig index 951b4e7300d2..c1dfece5b71c 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1272,6 +1272,12 @@ config WORKINGSET_REPORT This option exports stats and events giving the user more insight into its memory working set.
+config WORKINGSET_REPORT_AGING + bool "Workingset report kernel aging thread" + depends on WORKINGSET_REPORT + help + Performs aging on memcgs with their configured refresh intervals. + source "mm/damon/Kconfig"
endmenu diff --git a/mm/Makefile b/mm/Makefile index 62cef72ce7f8..4a64b2454b6a 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -99,6 +99,7 @@ obj-$(CONFIG_PAGE_COUNTER) += page_counter.o obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o obj-$(CONFIG_WORKINGSET_REPORT) += workingset_report.o +obj-$(CONFIG_WORKINGSET_REPORT_AGING) += workingset_report_aging.o ifdef CONFIG_SWAP obj-$(CONFIG_MEMCG) += swap_cgroup.o endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 56848dc94d48..52123b6cc9ce 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4373,12 +4373,12 @@ static ssize_t memory_ws_refresh_interval_write(struct kernfs_open_file *of, { unsigned int nid, msecs; struct wsr_state *wsr; + unsigned long old_interval; struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); ssize_t ret = memory_wsr_threshold_parse(buf, nbytes, &nid, &msecs);
if (ret < 0) return ret; - wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr;
mutex_lock(&wsr->page_age_lock); @@ -4397,9 +4397,13 @@ static ssize_t memory_ws_refresh_interval_write(struct kernfs_open_file *of, wsr->page_age = NULL; }
+ old_interval = READ_ONCE(wsr->refresh_interval); WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(msecs)); unlock: mutex_unlock(&wsr->page_age_lock); + if (ret > 0 && msecs && + (!old_interval || jiffies_to_msecs(old_interval) > msecs)) + wsr_wakeup_aging_thread(); return ret; }
@@ -4450,7 +4454,7 @@ static int memory_ws_page_age_show(struct seq_file *m, void *v) if (!READ_ONCE(wsr->page_age)) continue;
- wsr_refresh_report(wsr, memcg, NODE_DATA(nid)); + wsr_refresh_report(wsr, memcg, NODE_DATA(nid), NULL); mutex_lock(&wsr->page_age_lock); if (!wsr->page_age) goto unlock; diff --git a/mm/workingset_report.c b/mm/workingset_report.c index 333e51e3ee12..6f64d89fe70d 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -274,7 +274,7 @@ static void copy_node_bins(struct pglist_data *pgdat, }
bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, - struct pglist_data *pgdat) + struct pglist_data *pgdat, unsigned long *refresh_time) { struct wsr_page_age_histo *page_age; unsigned long refresh_interval = READ_ONCE(wsr->refresh_interval); @@ -291,10 +291,14 @@ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, goto unlock; if (page_age->timestamp && time_is_after_jiffies(page_age->timestamp + refresh_interval)) - goto unlock; + goto time; refresh_scan(wsr, root, pgdat, refresh_interval); copy_node_bins(pgdat, page_age); refresh_aggregate(page_age, root, pgdat); + +time: + if (refresh_time) + *refresh_time = page_age->timestamp + refresh_interval; unlock: mutex_unlock(&wsr->page_age_lock); return !!page_age; @@ -357,6 +361,7 @@ static ssize_t refresh_interval_store(struct kobject *kobj, unsigned int interval; int err; struct wsr_state *wsr = kobj_to_wsr(kobj); + unsigned long old_interval = 0;
err = kstrtouint(buf, 0, &interval); if (err) @@ -378,9 +383,13 @@ static ssize_t refresh_interval_store(struct kobject *kobj, wsr->page_age = NULL; }
+ old_interval = READ_ONCE(wsr->refresh_interval); WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(interval)); unlock: mutex_unlock(&wsr->page_age_lock); + if (!err && interval && + (!old_interval || jiffies_to_msecs(old_interval) > interval)) + wsr_wakeup_aging_thread(); return err ?: len; }
@@ -470,7 +479,7 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr, int ret = 0; struct wsr_state *wsr = kobj_to_wsr(kobj);
- wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj)); + wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj), NULL);
mutex_lock(&wsr->page_age_lock); if (!wsr->page_age) diff --git a/mm/workingset_report_aging.c b/mm/workingset_report_aging.c new file mode 100644 index 000000000000..91ad5020778a --- /dev/null +++ b/mm/workingset_report_aging.c @@ -0,0 +1,127 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Workingset report kernel aging thread + * + * Performs aging on behalf of memcgs with their configured refresh interval. + * While a userspace program can periodically read the page age breakdown + * per-memcg and trigger aging, the kernel performing aging is less overhead, + * more consistent, and more reliable for the use case where every memcg should + * be aged according to their refresh interval. + */ +#define pr_fmt(fmt) "workingset report aging: " fmt + +#include <linux/jiffies.h> +#include <linux/module.h> +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/kthread.h> +#include <linux/memcontrol.h> +#include <linux/swap.h> +#include <linux/wait.h> +#include <linux/mmzone.h> +#include <linux/workingset_report.h> + +static DECLARE_WAIT_QUEUE_HEAD(aging_wait); +static bool refresh_pending; + +static bool do_aging_node(int nid, unsigned long *next_wake_time) +{ + struct mem_cgroup *memcg; + bool should_wait = true; + struct pglist_data *pgdat = NODE_DATA(nid); + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + struct wsr_state *wsr = &lruvec->wsr; + unsigned long refresh_time; + + /* use returned time to decide when to wake up next */ + if (wsr_refresh_report(wsr, memcg, pgdat, &refresh_time)) { + if (should_wait) { + should_wait = false; + *next_wake_time = refresh_time; + } else if (time_before(refresh_time, *next_wake_time)) { + *next_wake_time = refresh_time; + } + } + + cond_resched(); + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); + + return should_wait; +} + +static int do_aging(void *unused) +{ + while (!kthread_should_stop()) { + int nid; + long timeout_ticks; + unsigned long next_wake_time; + bool should_wait = true; + + WRITE_ONCE(refresh_pending, false); + for_each_node_state(nid, N_MEMORY) { + unsigned long node_next_wake_time; + + if (do_aging_node(nid, &node_next_wake_time)) + continue; + if (should_wait) { + should_wait = false; + next_wake_time = node_next_wake_time; + } else if (time_before(node_next_wake_time, + next_wake_time)) { + next_wake_time = node_next_wake_time; + } + } + + if (should_wait) { + wait_event_interruptible(aging_wait, refresh_pending); + continue; + } + + /* sleep until next aging */ + timeout_ticks = next_wake_time - jiffies; + if (timeout_ticks > 0 && + timeout_ticks != MAX_SCHEDULE_TIMEOUT) { + schedule_timeout_idle(timeout_ticks); + continue; + } + } + return 0; +} + +/* Invoked when refresh_interval shortens or changes to a non-zero value. */ +void wsr_wakeup_aging_thread(void) +{ + WRITE_ONCE(refresh_pending, true); + wake_up_interruptible(&aging_wait); +} + +static struct task_struct *aging_thread; + +static int aging_init(void) +{ + struct task_struct *task; + + task = kthread_run(do_aging, NULL, "kagingd"); + + if (IS_ERR(task)) { + pr_err("Failed to create aging kthread\n"); + return PTR_ERR(task); + } + + aging_thread = task; + pr_info("module loaded\n"); + return 0; +} + +static void aging_exit(void) +{ + kthread_stop(aging_thread); + aging_thread = NULL; + pr_info("module unloaded\n"); +} + +module_init(aging_init); +module_exit(aging_exit);
A basic test that verifies the working set size of a simple memory accessor. It should work with or without the aging thread.
When running tests with run_vmtests.sh, file workingset report testing requires an environment variable WORKINGSET_REPORT_TEST_FILE_PATH to store a temporary file, which is passed into the test invocation as a parameter.
Signed-off-by: Yuanchu Xie yuanchu@google.com Change-Id: I1a55364164b9a67934f8500aaf77df4372edaa55 --- tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 3 + tools/testing/selftests/mm/run_vmtests.sh | 5 + .../testing/selftests/mm/workingset_report.c | 306 ++++++++++++++++ .../testing/selftests/mm/workingset_report.h | 39 +++ .../selftests/mm/workingset_report_test.c | 330 ++++++++++++++++++ 6 files changed, 684 insertions(+) create mode 100644 tools/testing/selftests/mm/workingset_report.c create mode 100644 tools/testing/selftests/mm/workingset_report.h create mode 100644 tools/testing/selftests/mm/workingset_report_test.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index da030b43e43b..e5cd0085ab74 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -51,3 +51,4 @@ hugetlb_madv_vs_map mseal_test seal_elf droppable +workingset_report_test diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 7b8a5def54a1..8d1a7d24ecd1 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -77,6 +77,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv TEST_GEN_FILES += hugetlb_madv_vs_map TEST_GEN_FILES += hugetlb_dio TEST_GEN_FILES += droppable +TEST_GEN_FILES += workingset_report_test
ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty @@ -135,6 +136,8 @@ $(TEST_GEN_FILES): vm_util.c thp_settings.c $(OUTPUT)/uffd-stress: uffd-common.c $(OUTPUT)/uffd-unit-tests: uffd-common.c
+$(OUTPUT)/workingset_report_test: workingset_report.c + ifeq ($(ARCH),x86_64) BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32)) BINARIES_64 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_64)) diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index 03ac4f2e1cce..c4a81f17a3a3 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -75,6 +75,8 @@ separated by spaces: read-only VMAs - mdwe test prctl(PR_SET_MDWE, ...) +- workingset_report + test workingset reporting
example: ./run_vmtests.sh -t "hmm mmap ksm" EOF @@ -453,6 +455,9 @@ CATEGORY="mkdirty" run_test ./mkdirty
CATEGORY="mdwe" run_test ./mdwe_test
+CATEGORY="workingset_report" run_test ./workingset_report_test \ + "${WORKINGSET_REPORT_TEST_FILE_PATH}" + echo "SUMMARY: PASS=${count_pass} SKIP=${count_skip} FAIL=${count_fail}" | tap_prefix echo "1..${count_total}" | tap_output
diff --git a/tools/testing/selftests/mm/workingset_report.c b/tools/testing/selftests/mm/workingset_report.c new file mode 100644 index 000000000000..ee4dda5c371d --- /dev/null +++ b/tools/testing/selftests/mm/workingset_report.c @@ -0,0 +1,306 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "workingset_report.h" + +#include <stddef.h> +#include <stdlib.h> +#include <stdio.h> +#include <stdbool.h> +#include <unistd.h> +#include <string.h> +#include <sys/mman.h> +#include <sys/wait.h> + +#include "../kselftest.h" + +#define SYSFS_NODE_ONLINE "/sys/devices/system/node/online" +#define PROC_DROP_CACHES "/proc/sys/vm/drop_caches" + +/* Returns read len on success, or -errno on failure. */ +static ssize_t read_text(const char *path, char *buf, size_t max_len) +{ + ssize_t len; + int fd, err; + size_t bytes_read = 0; + + if (!max_len) + return -EINVAL; + + fd = open(path, O_RDONLY); + if (fd < 0) + return -errno; + + while (bytes_read < max_len - 1) { + len = read(fd, buf + bytes_read, max_len - 1 - bytes_read); + + if (len <= 0) + break; + bytes_read += len; + } + + buf[bytes_read] = '\0'; + + err = -errno; + close(fd); + return len < 0 ? err : bytes_read; +} + +/* Returns written len on success, or -errno on failure. */ +static ssize_t write_text(const char *path, const char *buf, ssize_t max_len) +{ + int fd, len, err; + size_t bytes_written = 0; + + fd = open(path, O_WRONLY | O_APPEND); + if (fd < 0) + return -errno; + + while (bytes_written < max_len) { + len = write(fd, buf + bytes_written, max_len - bytes_written); + + if (len < 0) + break; + bytes_written += len; + } + + err = -errno; + close(fd); + return len < 0 ? err : bytes_written; +} + +static long read_num(const char *path) +{ + char buf[21]; + + if (read_text(path, buf, sizeof(buf)) <= 0) + return -1; + return (long)strtoul(buf, NULL, 10); +} + +static int write_num(const char *path, unsigned long n) +{ + char buf[21]; + + sprintf(buf, "%lu", n); + if (write_text(path, buf, strlen(buf)) < 0) + return -1; + return 0; +} + +long sysfs_get_refresh_interval(int nid) +{ + char file[128]; + + snprintf(file, sizeof(file), + "/sys/devices/system/node/node%d/workingset_report/refresh_interval", + nid); + return read_num(file); +} + +int sysfs_set_refresh_interval(int nid, long interval) +{ + char file[128]; + + snprintf(file, sizeof(file), + "/sys/devices/system/node/node%d/workingset_report/refresh_interval", + nid); + return write_num(file, interval); +} + +int sysfs_get_page_age_intervals_str(int nid, char *buf, int len) +{ + char path[128]; + + snprintf(path, sizeof(path), + "/sys/devices/system/node/node%d/workingset_report/page_age_intervals", + nid); + return read_text(path, buf, len); + +} + +int sysfs_set_page_age_intervals_str(int nid, const char *buf, int len) +{ + char path[128]; + + snprintf(path, sizeof(path), + "/sys/devices/system/node/node%d/workingset_report/page_age_intervals", + nid); + return write_text(path, buf, len); +} + +int sysfs_set_page_age_intervals(int nid, const char *const intervals[], + int nr_intervals) +{ + char file[128]; + char buf[1024]; + int i; + int err, len = 0; + + for (i = 0; i < nr_intervals; ++i) { + err = snprintf(buf + len, sizeof(buf) - len, "%s", intervals[i]); + + if (err < 0) + return err; + len += err; + + if (i < nr_intervals - 1) { + err = snprintf(buf + len, sizeof(buf) - len, ","); + if (err < 0) + return err; + len += err; + } + } + + snprintf(file, sizeof(file), + "/sys/devices/system/node/node%d/workingset_report/page_age_intervals", + nid); + return write_text(file, buf, len); +} + +int get_nr_nodes(void) +{ + char buf[22]; + char *found; + + if (read_text(SYSFS_NODE_ONLINE, buf, sizeof(buf)) <= 0) + return -1; + found = strstr(buf, "-"); + if (found) + return (int)strtoul(found + 1, NULL, 10) + 1; + return (long)strtoul(buf, NULL, 10) + 1; +} + +int drop_pagecache(void) +{ + return write_num(PROC_DROP_CACHES, 1); +} + +ssize_t sysfs_page_age_read(int nid, char *buf, size_t len) + +{ + char file[128]; + + snprintf(file, sizeof(file), + "/sys/devices/system/node/node%d/workingset_report/page_age", + nid); + return read_text(file, buf, len); +} + +/* + * Finds the first occurrence of "N<nid>\n" + * Modifies buf to terminate before the next occurrence of "N". + * Returns a substring of buf starting after "N<nid>\n" + */ +char *page_age_split_node(char *buf, int nid, char **next) +{ + char node_str[5]; + char *found; + int node_str_len; + + node_str_len = snprintf(node_str, sizeof(node_str), "N%u\n", nid); + + /* find the node prefix first */ + found = strstr(buf, node_str); + if (!found) { + ksft_print_msg("cannot find '%s' in page_idle_age", node_str); + return NULL; + } + found += node_str_len; + + *next = strchr(found, 'N'); + if (*next) + *(*next - 1) = '\0'; + + return found; +} + +ssize_t page_age_read(const char *buf, const char *interval, int pagetype) +{ + static const char * const type[ANON_AND_FILE] = { "anon=", "file=" }; + char *found; + + found = strstr(buf, interval); + if (!found) { + ksft_print_msg("cannot find %s in page_age", interval); + return -1; + } + found = strstr(found, type[pagetype]); + if (!found) { + ksft_print_msg("cannot find %s in page_age", type[pagetype]); + return -1; + } + found += strlen(type[pagetype]); + return (long)strtoul(found, NULL, 10); +} + +static const char *TEMP_FILE = "/tmp/workingset_selftest"; +void cleanup_file_workingset(void) +{ + remove(TEMP_FILE); +} + +int alloc_file_workingset(void *arg) +{ + int err = 0; + char *ptr; + int fd; + int ppid; + char *mapped; + size_t size = (size_t)arg; + size_t page_size = getpagesize(); + + ppid = getppid(); + + fd = open(TEMP_FILE, O_RDWR | O_CREAT); + if (fd < 0) { + err = -errno; + ksft_perror("failed to open temp file\n"); + goto cleanup; + } + + if (fallocate(fd, 0, 0, size) < 0) { + err = -errno; + ksft_perror("fallocate"); + goto cleanup; + } + + mapped = (char *)mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + fd, 0); + if (mapped == NULL) { + err = -errno; + ksft_perror("mmap"); + goto cleanup; + } + + while (getppid() == ppid) { + sync(); + for (ptr = mapped; ptr < mapped + size; ptr += page_size) + *ptr = *ptr ^ 0xFF; + } + +cleanup: + cleanup_file_workingset(); + return err; +} + +int alloc_anon_workingset(void *arg) +{ + char *buf, *ptr; + int ppid = getppid(); + size_t size = (size_t)arg; + size_t page_size = getpagesize(); + + buf = malloc(size); + + if (!buf) { + ksft_print_msg("cannot allocate anon workingset"); + exit(1); + } + + while (getppid() == ppid) { + for (ptr = buf; ptr < buf + size; ptr += page_size) + *ptr = *ptr ^ 0xFF; + } + + free(buf); + return 0; +} diff --git a/tools/testing/selftests/mm/workingset_report.h b/tools/testing/selftests/mm/workingset_report.h new file mode 100644 index 000000000000..c5c281e4069b --- /dev/null +++ b/tools/testing/selftests/mm/workingset_report.h @@ -0,0 +1,39 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef WORKINGSET_REPORT_H_ +#define WORKINGSET_REPORT_H_ + +#ifndef _GNU_SOURCE +#define _GNU_SOURCE +#endif + +#include <fcntl.h> +#include <sys/stat.h> +#include <errno.h> +#include <stdint.h> +#include <sys/types.h> + +#define PAGETYPE_ANON 0 +#define PAGETYPE_FILE 1 +#define ANON_AND_FILE 2 + +int get_nr_nodes(void); +int drop_pagecache(void); + +long sysfs_get_refresh_interval(int nid); +int sysfs_set_refresh_interval(int nid, long interval); + +int sysfs_get_page_age_intervals_str(int nid, char *buf, int len); +int sysfs_set_page_age_intervals_str(int nid, const char *buf, int len); + +int sysfs_set_page_age_intervals(int nid, const char *const intervals[], + int nr_intervals); + +char *page_age_split_node(char *buf, int nid, char **next); +ssize_t sysfs_page_age_read(int nid, char *buf, size_t len); +ssize_t page_age_read(const char *buf, const char *interval, int pagetype); + +int alloc_file_workingset(void *arg); +void cleanup_file_workingset(void); +int alloc_anon_workingset(void *arg); + +#endif /* WORKINGSET_REPORT_H_ */ diff --git a/tools/testing/selftests/mm/workingset_report_test.c b/tools/testing/selftests/mm/workingset_report_test.c new file mode 100644 index 000000000000..89ff4e9d746e --- /dev/null +++ b/tools/testing/selftests/mm/workingset_report_test.c @@ -0,0 +1,330 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "workingset_report.h" + +#include <stdlib.h> +#include <stdio.h> +#include <signal.h> +#include <time.h> + +#include "../clone3/clone3_selftests.h" + +#define REFRESH_INTERVAL 5000 +#define MB(x) (x << 20) + +static void sleep_ms(int milliseconds) +{ + struct timespec ts; + + ts.tv_sec = milliseconds / 1000; + ts.tv_nsec = (milliseconds % 1000) * 1000000; + nanosleep(&ts, NULL); +} + +/* + * Checks if two given values differ by less than err% of their sum. + */ +static inline int values_close(long a, long b, int err) +{ + return labs(a - b) <= (a + b) / 100 * err; +} + +static const char * const PAGE_AGE_INTERVALS[] = { + "6000", "10000", "15000", "18446744073709551615", +}; +#define NR_PAGE_AGE_INTERVALS (ARRAY_SIZE(PAGE_AGE_INTERVALS)) + +static int set_page_age_intervals_all_nodes(const char *intervals, int nr_nodes) +{ + int i; + + for (i = 0; i < nr_nodes; ++i) { + int err = sysfs_set_page_age_intervals_str( + i, &intervals[i * 1024], strlen(&intervals[i * 1024])); + + if (err < 0) + return err; + } + return 0; +} + +static int get_page_age_intervals_all_nodes(char *intervals, int nr_nodes) +{ + int i; + + for (i = 0; i < nr_nodes; ++i) { + int err = sysfs_get_page_age_intervals_str( + i, &intervals[i * 1024], 1024); + + if (err < 0) + return err; + } + return 0; +} + +static int set_refresh_interval_all_nodes(const long *interval, int nr_nodes) +{ + int i; + + for (i = 0; i < nr_nodes; ++i) { + int err = sysfs_set_refresh_interval(i, interval[i]); + + if (err < 0) + return err; + } + return 0; +} + +static int get_refresh_interval_all_nodes(long *interval, int nr_nodes) +{ + int i; + + for (i = 0; i < nr_nodes; ++i) { + long val = sysfs_get_refresh_interval(i); + + if (val < 0) + return val; + interval[i] = val; + } + return 0; +} + +static pid_t clone_and_run(int fn(void *arg), void *arg) +{ + pid_t pid; + + struct __clone_args args = { + .exit_signal = SIGCHLD, + }; + + pid = sys_clone3(&args, sizeof(struct __clone_args)); + + if (pid == 0) + exit(fn(arg)); + + return pid; +} + +static int read_workingset(int pagetype, int nid, + unsigned long page_age[NR_PAGE_AGE_INTERVALS]) +{ + int i, err; + char buf[4096]; + + err = sysfs_page_age_read(nid, buf, sizeof(buf)); + if (err < 0) + return err; + + for (i = 0; i < NR_PAGE_AGE_INTERVALS; ++i) { + err = page_age_read(buf, PAGE_AGE_INTERVALS[i], pagetype); + if (err < 0) + return err; + page_age[i] = err; + } + + return 0; +} + +static ssize_t read_interval_all_nodes(int pagetype, int interval) +{ + int i, err; + unsigned long page_age[NR_PAGE_AGE_INTERVALS]; + ssize_t ret = 0; + int nr_nodes = get_nr_nodes(); + + for (i = 0; i < nr_nodes; ++i) { + err = read_workingset(pagetype, i, page_age); + if (err < 0) + return err; + + ret += page_age[interval]; + } + + return ret; +} + +#define TEST_SIZE MB(500l) + +static int run_test(int f(void)) +{ + int i, err, test_result; + long *old_refresh_intervals; + long *new_refresh_intervals; + char *old_page_age_intervals; + int nr_nodes = get_nr_nodes(); + + if (nr_nodes <= 0) { + ksft_print_msg("failed to get nr_nodes\n"); + return KSFT_FAIL; + } + + old_refresh_intervals = calloc(nr_nodes, sizeof(long)); + new_refresh_intervals = calloc(nr_nodes, sizeof(long)); + old_page_age_intervals = calloc(nr_nodes, 1024); + + if (!(old_refresh_intervals && new_refresh_intervals && + old_page_age_intervals)) { + ksft_print_msg("failed to allocate memory for intervals\n"); + return KSFT_FAIL; + } + + err = get_refresh_interval_all_nodes(old_refresh_intervals, nr_nodes); + if (err < 0) { + ksft_print_msg("failed to read refresh interval\n"); + return KSFT_FAIL; + } + + err = get_page_age_intervals_all_nodes(old_page_age_intervals, nr_nodes); + if (err < 0) { + ksft_print_msg("failed to read page age interval\n"); + return KSFT_FAIL; + } + + for (i = 0; i < nr_nodes; ++i) + new_refresh_intervals[i] = REFRESH_INTERVAL; + + for (i = 0; i < nr_nodes; ++i) { + err = sysfs_set_page_age_intervals(i, PAGE_AGE_INTERVALS, + NR_PAGE_AGE_INTERVALS - 1); + if (err < 0) { + ksft_print_msg("failed to set page age interval\n"); + test_result = KSFT_FAIL; + goto fail; + } + } + + err = set_refresh_interval_all_nodes(new_refresh_intervals, nr_nodes); + if (err < 0) { + ksft_print_msg("failed to set refresh interval\n"); + test_result = KSFT_FAIL; + goto fail; + } + + sync(); + drop_pagecache(); + + test_result = f(); + +fail: + err = set_refresh_interval_all_nodes(old_refresh_intervals, nr_nodes); + if (err < 0) { + ksft_print_msg("failed to restore refresh interval\n"); + test_result = KSFT_FAIL; + } + err = set_page_age_intervals_all_nodes(old_page_age_intervals, nr_nodes); + if (err < 0) { + ksft_print_msg("failed to restore page age interval\n"); + test_result = KSFT_FAIL; + } + return test_result; +} + +static char *file_test_path; +static int test_file(void) +{ + ssize_t ws_size_ref, ws_size_test; + int ret = KSFT_FAIL, i; + pid_t pid = 0; + + if (!file_test_path) { + ksft_print_msg("Set a path to test file workingset\n"); + return KSFT_SKIP; + } + + ws_size_ref = read_interval_all_nodes(PAGETYPE_FILE, 0); + if (ws_size_ref < 0) + goto cleanup; + + pid = clone_and_run(alloc_file_workingset, (void *)TEST_SIZE); + if (pid < 0) + goto cleanup; + + read_interval_all_nodes(PAGETYPE_FILE, 0); + sleep_ms(REFRESH_INTERVAL); + + for (i = 0; i < 3; ++i) { + sleep_ms(REFRESH_INTERVAL); + ws_size_test = read_interval_all_nodes(PAGETYPE_FILE, 0); + ws_size_test += read_interval_all_nodes(PAGETYPE_FILE, 1); + if (ws_size_test < 0) + goto cleanup; + + if (!values_close(ws_size_test - ws_size_ref, TEST_SIZE, 10)) { + ksft_print_msg( + "file working set size difference too large: actual=%ld, expected=%ld\n", + ws_size_test - ws_size_ref, TEST_SIZE); + goto cleanup; + } + } + ret = KSFT_PASS; + +cleanup: + if (pid > 0) + kill(pid, SIGKILL); + cleanup_file_workingset(); + return ret; +} + +static int test_anon(void) +{ + ssize_t ws_size_ref, ws_size_test; + pid_t pid = 0; + int ret = KSFT_FAIL, i; + + ws_size_ref = read_interval_all_nodes(PAGETYPE_ANON, 0); + ws_size_ref += read_interval_all_nodes(PAGETYPE_ANON, 1); + if (ws_size_ref < 0) + goto cleanup; + + pid = clone_and_run(alloc_anon_workingset, (void *)TEST_SIZE); + if (pid < 0) + goto cleanup; + + sleep_ms(REFRESH_INTERVAL); + read_interval_all_nodes(PAGETYPE_ANON, 0); + + for (i = 0; i < 5; ++i) { + sleep_ms(REFRESH_INTERVAL); + ws_size_test = read_interval_all_nodes(PAGETYPE_ANON, 0); + ws_size_test += read_interval_all_nodes(PAGETYPE_ANON, 1); + if (ws_size_test < 0) + goto cleanup; + + if (!values_close(ws_size_test - ws_size_ref, TEST_SIZE, 10)) { + ksft_print_msg( + "anon working set size difference too large: actual=%ld, expected=%ld\n", + ws_size_test - ws_size_ref, TEST_SIZE); + goto cleanup; + } + } + ret = KSFT_PASS; + +cleanup: + if (pid > 0) + kill(pid, SIGKILL); + return ret; +} + + +#define T(x) { x, #x } +struct workingset_test { + int (*fn)(void); + const char *name; +} tests[] = { + T(test_anon), + T(test_file), +}; +#undef T + +int main(int argc, char **argv) +{ + int i, err; + + if (argc > 1) + file_test_path = argv[1]; + + for (i = 0; i < ARRAY_SIZE(tests); i++) { + err = run_test(tests[i].fn); + ksft_test_result_code(err, tests[i].name, NULL); + } + return 0; +}
Add workingset reporting documentation for better discoverability of its sysfs and memcg interfaces. Also document the required kernel config to enable workingset reporting.
Change-Id: Ib9dfc9004473baa6ef26ca7277d220b6199517de Signed-off-by: Yuanchu Xie yuanchu@google.com --- Documentation/admin-guide/mm/index.rst | 1 + .../admin-guide/mm/workingset_report.rst | 105 ++++++++++++++++++ 2 files changed, 106 insertions(+) create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 8b35795b664b..61a2a347fc91 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -41,4 +41,5 @@ the Linux memory management. swap_numa transhuge userfaultfd + workingset_report zswap diff --git a/Documentation/admin-guide/mm/workingset_report.rst b/Documentation/admin-guide/mm/workingset_report.rst new file mode 100644 index 000000000000..ddcc0c33a8df --- /dev/null +++ b/Documentation/admin-guide/mm/workingset_report.rst @@ -0,0 +1,105 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +Workingset Report +================= +Workingset report provides a view of memory coldness in user-defined +time intervals, i.e. X bytes are Y milliseconds cold. It breaks down +the user pages in the system per-NUMA node, per-memcg, for both +anonymous and file pages into histograms that look like: +:: + + 1000 anon=137368 file=24530 + 20000 anon=34342 file=0 + 30000 anon=353232 file=333608 + 40000 anon=407198 file=206052 + 9223372036854775807 anon=4925624 file=892892 + +The workingset reports can be used to drive proactive reclaim, by +identifying the number of cold bytes in a memcg, then writing to +``memory.reclaim``. + +Quick start +=========== +Build the kernel with the following configurations. The report relies +on Multi-gen LRU for page coldness. + +* ``CONFIG_LRU_GEN=y`` +* ``CONFIG_LRU_GEN_ENABLED=y`` +* ``CONFIG_WORKINGSET_REPORT=y`` + +Optionally, the aging kernel daemon can be enabled with the following +configuration. +* ``CONFIG_WORKINGSET_REPORT_AGING=y`` + +Sysfs interfaces +================ +``/sys/devices/system/node/nodeX/workingset_report/page_age`` provides +a per-node page age histogram, showing an aggregate of the node's lruvecs. +Reading this file causes a hierarchical aging of all lruvecs, scanning +pages and creates a new Multi-gen LRU generation in each lruvec. +For example: +:: + + 1000 anon=0 file=0 + 2000 anon=0 file=0 + 100000 anon=5533696 file=5566464 + 18446744073709551615 anon=0 file=0 + +``/sys/devices/system/node/nodeX/workingset_report/page_age_intervals`` +is a comma separated list of time in milliseconds that configures what +the page age histogram uses for aggregation. For the above histogram, +the intervals are: +:: + 1000,2000,100000 + +``/sys/devices/system/node/nodeX/workingset_report/refresh_interval`` +defines the amount of time the report is valid for in milliseconds. +When a report is still valid, reading the ``page_age`` file shows +the existing valid report, instead of generating a new one. + +``/sys/devices/system/node/nodeX/workingset_report/report_threshold`` +specifies how often the userspace agent can be notified for node +memory pressure, in milliseconds. When a node reaches its low +watermarks and wakes up kswapd, programs waiting on ``page_age`` are +woken up so they can read the histogram and make policy decisions. + +Memcg interface +=============== +While ``page_age_interval`` is defined per-node in sysfs, ``page_age``, +``refresh_interval`` and ``report_threshold`` are available per-memcg. + +``/sys/fs/cgroup/.../memory.workingset.page_age`` +The memcg equivalent of the sysfs workingset page age histogram +breaks down the workingset of this memcg and its children into +page age intervals. Each node is prefixed with a node header and +a newline. Non-proactive direct reclaim on this memcg can also +wake up userspace agents that are waiting on this file. +e.g. +:: + + N0 + 1000 anon=0 file=0 + 2000 anon=0 file=0 + 3000 anon=0 file=0 + 4000 anon=0 file=0 + 5000 anon=0 file=0 + 18446744073709551615 anon=0 file=0 + +``/sys/fs/cgroup/.../memory.workingset.refresh_interval`` +The memcg equivalent of the sysfs refresh interval. A per-node +number of how much time a page age histogram is valid for, in +milliseconds. +e.g. +:: + + echo N0=2000 > memory.workingset.refresh_interval + +``/sys/fs/cgroup/.../memory.workingset.report_threshold`` +The memcg equivalent of the sysfs report threshold. A per-node +number of how often userspace agent waiting on the page age +histogram can be woken up, in milliseconds. +e.g. +:: + + echo N0=1000 > memory.workingset.report_threshold
On 8/13/24 12:56, Yuanchu Xie wrote:
Add workingset reporting documentation for better discoverability of its sysfs and memcg interfaces. Also document the required kernel config to enable workingset reporting.
Change-Id: Ib9dfc9004473baa6ef26ca7277d220b6199517de Signed-off-by: Yuanchu Xie yuanchu@google.com
Documentation/admin-guide/mm/index.rst | 1 + .../admin-guide/mm/workingset_report.rst | 105 ++++++++++++++++++
The new memory cgroup control files need to be documented in Documentation/admin-guide/cgroup-v2.rst as well.
Cheers, Longman
2 files changed, 106 insertions(+) create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 8b35795b664b..61a2a347fc91 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -41,4 +41,5 @@ the Linux memory management. swap_numa transhuge userfaultfd
- workingset_report zswap
diff --git a/Documentation/admin-guide/mm/workingset_report.rst b/Documentation/admin-guide/mm/workingset_report.rst new file mode 100644 index 000000000000..ddcc0c33a8df --- /dev/null +++ b/Documentation/admin-guide/mm/workingset_report.rst @@ -0,0 +1,105 @@ +.. SPDX-License-Identifier: GPL-2.0
+================= +Workingset Report +================= +Workingset report provides a view of memory coldness in user-defined +time intervals, i.e. X bytes are Y milliseconds cold. It breaks down +the user pages in the system per-NUMA node, per-memcg, for both +anonymous and file pages into histograms that look like: +::
- 1000 anon=137368 file=24530
- 20000 anon=34342 file=0
- 30000 anon=353232 file=333608
- 40000 anon=407198 file=206052
- 9223372036854775807 anon=4925624 file=892892
+The workingset reports can be used to drive proactive reclaim, by +identifying the number of cold bytes in a memcg, then writing to +``memory.reclaim``.
+Quick start +=========== +Build the kernel with the following configurations. The report relies +on Multi-gen LRU for page coldness.
+* ``CONFIG_LRU_GEN=y`` +* ``CONFIG_LRU_GEN_ENABLED=y`` +* ``CONFIG_WORKINGSET_REPORT=y``
+Optionally, the aging kernel daemon can be enabled with the following +configuration. +* ``CONFIG_WORKINGSET_REPORT_AGING=y``
+Sysfs interfaces +================ +``/sys/devices/system/node/nodeX/workingset_report/page_age`` provides +a per-node page age histogram, showing an aggregate of the node's lruvecs. +Reading this file causes a hierarchical aging of all lruvecs, scanning +pages and creates a new Multi-gen LRU generation in each lruvec. +For example: +::
- 1000 anon=0 file=0
- 2000 anon=0 file=0
- 100000 anon=5533696 file=5566464
- 18446744073709551615 anon=0 file=0
+``/sys/devices/system/node/nodeX/workingset_report/page_age_intervals`` +is a comma separated list of time in milliseconds that configures what +the page age histogram uses for aggregation. For the above histogram, +the intervals are: +::
- 1000,2000,100000
+``/sys/devices/system/node/nodeX/workingset_report/refresh_interval`` +defines the amount of time the report is valid for in milliseconds. +When a report is still valid, reading the ``page_age`` file shows +the existing valid report, instead of generating a new one.
+``/sys/devices/system/node/nodeX/workingset_report/report_threshold`` +specifies how often the userspace agent can be notified for node +memory pressure, in milliseconds. When a node reaches its low +watermarks and wakes up kswapd, programs waiting on ``page_age`` are +woken up so they can read the histogram and make policy decisions.
+Memcg interface +=============== +While ``page_age_interval`` is defined per-node in sysfs, ``page_age``, +``refresh_interval`` and ``report_threshold`` are available per-memcg.
+``/sys/fs/cgroup/.../memory.workingset.page_age`` +The memcg equivalent of the sysfs workingset page age histogram +breaks down the workingset of this memcg and its children into +page age intervals. Each node is prefixed with a node header and +a newline. Non-proactive direct reclaim on this memcg can also +wake up userspace agents that are waiting on this file. +e.g. +::
- N0
- 1000 anon=0 file=0
- 2000 anon=0 file=0
- 3000 anon=0 file=0
- 4000 anon=0 file=0
- 5000 anon=0 file=0
- 18446744073709551615 anon=0 file=0
+``/sys/fs/cgroup/.../memory.workingset.refresh_interval`` +The memcg equivalent of the sysfs refresh interval. A per-node +number of how much time a page age histogram is valid for, in +milliseconds. +e.g. +::
- echo N0=2000 > memory.workingset.refresh_interval
+``/sys/fs/cgroup/.../memory.workingset.report_threshold`` +The memcg equivalent of the sysfs report threshold. A per-node +number of how often userspace agent waiting on the page age +histogram can be woken up, in milliseconds. +e.g. +::
- echo N0=1000 > memory.workingset.report_threshold
Hi,
On 8/13/24 9:56 AM, Yuanchu Xie wrote:
Add workingset reporting documentation for better discoverability of its sysfs and memcg interfaces. Also document the required kernel config to enable workingset reporting.
Change-Id: Ib9dfc9004473baa6ef26ca7277d220b6199517de Signed-off-by: Yuanchu Xie yuanchu@google.com
Documentation/admin-guide/mm/index.rst | 1 + .../admin-guide/mm/workingset_report.rst | 105 ++++++++++++++++++ 2 files changed, 106 insertions(+) create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
diff --git a/Documentation/admin-guide/mm/workingset_report.rst b/Documentation/admin-guide/mm/workingset_report.rst new file mode 100644 index 000000000000..ddcc0c33a8df --- /dev/null +++ b/Documentation/admin-guide/mm/workingset_report.rst @@ -0,0 +1,105 @@ +.. SPDX-License-Identifier: GPL-2.0
+================= +Workingset Report +================= +Workingset report provides a view of memory coldness in user-defined +time intervals, i.e. X bytes are Y milliseconds cold. It breaks down
e.g., X bytes are Y milliseconds cold.
+the user pages in the system per-NUMA node, per-memcg, for both +anonymous and file pages into histograms that look like: +::
- 1000 anon=137368 file=24530
- 20000 anon=34342 file=0
- 30000 anon=353232 file=333608
- 40000 anon=407198 file=206052
- 9223372036854775807 anon=4925624 file=892892
+The workingset reports can be used to drive proactive reclaim, by +identifying the number of cold bytes in a memcg, then writing to +``memory.reclaim``.
+Quick start +=========== +Build the kernel with the following configurations. The report relies +on Multi-gen LRU for page coldness.
+* ``CONFIG_LRU_GEN=y`` +* ``CONFIG_LRU_GEN_ENABLED=y`` +* ``CONFIG_WORKINGSET_REPORT=y``
+Optionally, the aging kernel daemon can be enabled with the following +configuration. +* ``CONFIG_WORKINGSET_REPORT_AGING=y``
+Sysfs interfaces +================ +``/sys/devices/system/node/nodeX/workingset_report/page_age`` provides +a per-node page age histogram, showing an aggregate of the node's lruvecs. +Reading this file causes a hierarchical aging of all lruvecs, scanning +pages and creates a new Multi-gen LRU generation in each lruvec. +For example: +::
- 1000 anon=0 file=0
- 2000 anon=0 file=0
- 100000 anon=5533696 file=5566464
- 18446744073709551615 anon=0 file=0
+``/sys/devices/system/node/nodeX/workingset_report/page_age_intervals`` +is a comma separated list of time in milliseconds that configures what
comma-separated
+the page age histogram uses for aggregation. For the above histogram, +the intervals are: +::
I guess just change the "are:" to "are::" and change the line that only contains "::" to a blank line. Otherwise there is a warning:
Documentation/admin-guide/mm/workingset_report.rst:54: ERROR: Unexpected indentation.
- 1000,2000,100000
+``/sys/devices/system/node/nodeX/workingset_report/refresh_interval`` +defines the amount of time the report is valid for in milliseconds. +When a report is still valid, reading the ``page_age`` file shows +the existing valid report, instead of generating a new one.
+``/sys/devices/system/node/nodeX/workingset_report/report_threshold`` +specifies how often the userspace agent can be notified for node +memory pressure, in milliseconds. When a node reaches its low +watermarks and wakes up kswapd, programs waiting on ``page_age`` are +woken up so they can read the histogram and make policy decisions.
+Memcg interface +=============== +While ``page_age_interval`` is defined per-node in sysfs, ``page_age``, +``refresh_interval`` and ``report_threshold`` are available per-memcg.
+``/sys/fs/cgroup/.../memory.workingset.page_age`` +The memcg equivalent of the sysfs workingset page age histogram +breaks down the workingset of this memcg and its children into +page age intervals. Each node is prefixed with a node header and +a newline. Non-proactive direct reclaim on this memcg can also +wake up userspace agents that are waiting on this file. +e.g.
E.g.
+::
- N0
- 1000 anon=0 file=0
- 2000 anon=0 file=0
- 3000 anon=0 file=0
- 4000 anon=0 file=0
- 5000 anon=0 file=0
- 18446744073709551615 anon=0 file=0
+``/sys/fs/cgroup/.../memory.workingset.refresh_interval`` +The memcg equivalent of the sysfs refresh interval. A per-node +number of how much time a page age histogram is valid for, in +milliseconds. +e.g.
E.g.
+::
- echo N0=2000 > memory.workingset.refresh_interval
+``/sys/fs/cgroup/.../memory.workingset.report_threshold`` +The memcg equivalent of the sysfs report threshold. A per-node +number of how often userspace agent waiting on the page age +histogram can be woken up, in milliseconds. +e.g.
E.g.
+::
- echo N0=1000 > memory.workingset.report_threshold
On Tue, 13 Aug 2024 09:56:11 -0700 Yuanchu Xie yuanchu@google.com wrote:
This patch series provides workingset reporting of user pages in lruvecs, of which coldness can be tracked by accessed bits and fd references.
Very little reviewer interest. I wonder why. Will Google be the only organization which finds this useful?
Benchmarks
Ghait Ouled Amar Ben Cheikh has implemented a simple "reclaim everything colder than 10 seconds every 40 seconds" policy and ran Linux compile and redis from the phoronix test suite. The results are in his repo: https://github.com/miloudi98/WMO
I'd suggest at least summarizing these results here in the [0/N]. The Linux kernel will probably outlive that URL!
On Tue, 13 Aug 2024, Andrew Morton wrote:
On Tue, 13 Aug 2024 09:56:11 -0700 Yuanchu Xie yuanchu@google.com wrote:
This patch series provides workingset reporting of user pages in lruvecs, of which coldness can be tracked by accessed bits and fd references.
Very little reviewer interest. I wonder why. Will Google be the only organization which finds this useful?
Although also from Google, I'm optimistic that others will find this very useful. It's implemented in a way that is intended to be generally useful for multiple use cases, including user defined policy for proactive reclaim. The cited sample userspace implementation is intended to demonstrate how this insight can be put into practice.
Insight into the working set of applications, particularly on multi-tenant systems, has derived significant memory savings for Google over the past decade. The introduction of MGLRU into the upstream kernel has allowed this information to be derived in a much more efficient manner, presented here, that should make upstreaming of this insight much more palatable.
This insight into working set will only become more critical going forward with memory tiered systems.
Nothing here is specific to Google; in fact, we apply the insight into working set in very different ways across our fleets.
Benchmarks
Ghait Ouled Amar Ben Cheikh has implemented a simple "reclaim everything colder than 10 seconds every 40 seconds" policy and ran Linux compile and redis from the phoronix test suite. The results are in his repo: https://github.com/miloudi98/WMO
I'd suggest at least summarizing these results here in the [0/N]. The Linux kernel will probably outlive that URL!
Fully agreed that this would be useful for including in the cover letter.
The results showing the impact of proactive reclaim using insight into working set is impressive for multi-tenant systems. Having very comparable performance for kernbench with a fraction of the memory usage shows the potential for proactive reclaim and without the dependency on direct reclaim or throttling of the application itself.
This is one of several benchmarks that we are running and we'll be expanding upon this with cotenancy, user defined latency senstivity per job, extensions for insight into memory re-access, and in-guest use cases.
On Tue, Aug 13, 2024 at 09:56:11AM -0700, Yuanchu Xie wrote:
This patch series provides workingset reporting of user pages in lruvecs, of which coldness can be tracked by accessed bits and fd references. However, the concept of workingset applies generically to all types of memory, which could be kernel slab caches, discardable userspace caches (databases), or CXL.mem. Therefore, data sources might come from slab shrinkers, device drivers, or the userspace. IMO, the kernel should provide a set of workingset interfaces that should be generic enough to accommodate the various use cases, and be extensible to potential future use cases. The current proposed interfaces are not sufficient in that regard, but I would like to start somewhere, solicit feedback, and iterate.
... snip ...
Use cases
Promotion/Demotion If different mechanisms are used for promition and demotion, workingset information can help connect the two and avoid pages being migrated back and forth. For example, given a promotion hot page threshold defined in reaccess distance of N seconds (promote pages accessed more often than every N seconds). The threshold N should be set so that ~80% (e.g.) of pages on the fast memory node passes the threshold. This calculation can be done with workingset reports. To be directly useful for promotion policies, the workingset report interfaces need to be extended to report hotness and gather hotness information from the devices[1].
[1] https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-...
Sysfs and Cgroup Interfaces
The interfaces are detailed in the patches that introduce them. The main idea here is we break down the workingset per-node per-memcg into time intervals (ms), e.g.
1000 anon=137368 file=24530 20000 anon=34342 file=0 30000 anon=353232 file=333608 40000 anon=407198 file=206052 9223372036854775807 anon=4925624 file=892892
I realize this does not generalize well to hotness information, but I lack the intuition for an abstraction that presents hotness in a useful way. Based on a recent proposal for move_phys_pages[2], it seems like userspace tiering software would like to move specific physical pages, instead of informing the kernel "move x number of hot pages to y device". Please advise.
[2] https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge....
Just as a note on this work, this is really a testing interface. The end-goal is not to merge such an interface that is user-facing like move_phys_pages, but instead to have something like a triggered kernel task that has a directive of "Promote X pages from Device A".
This work is more of an open collaboration for prototyping such that we don't have to plumb it through the kernel from the start and assess the usefulness of the hardware hotness collection mechanism.
---
More generally on promotion, I have been considering recently a problem with promoting unmapped pagecache pages - since they are not subject to NUMA hint faults. I started looking at PG_accessed and PG_workingset as a potential mechanism to trigger promotion - but i'm starting to see a pattern of competing priorities between reclaim (LRU/MGLRU) logic and promotion logic.
Reclaim is triggered largely under memory pressure - which means co-opting reclaim logic for promotion is at best logically confusing, and at worst likely to introduce regressions. The LRU/MGLRU logic is written largely for reclaim, not promotion. This makes hacking promotion in after the fact rather dubious - the design choices don't match.
One example: if a page moves from inactive->active (or old->young), we could treat this as a page "becoming hot" and mark it for promotion, but this potentially punishes pages on the "active/younger" lists which are themselves hotter.
I'm starting to think separate demotion/reclaim and promotion components are warranted. This could take the form of a separate kernel worker that occasionally gets scheduled to manage a promotion list, or even the addition of a PG_promote flag to decouple reclaim and promotion logic completely. Separating the structures entirely would be good to allow both demotion/reclaim and promotion to occur concurrently (although this seems problematic under memory pressure).
Would like to know your thoughts here. If we can decide to segregate promotion and demotion logic, it might go a long way to simplify the existing interfaces and formalize transactions between the two.
(also if you're going to LPC, might be worth a chat in person)
~Gregory
On Tue, Aug 20, 2024 at 6:00 AM Gregory Price gourry@gourry.net wrote:
On Tue, Aug 13, 2024 at 09:56:11AM -0700, Yuanchu Xie wrote:
This patch series provides workingset reporting of user pages in lruvecs, of which coldness can be tracked by accessed bits and fd references. However, the concept of workingset applies generically to all types of memory, which could be kernel slab caches, discardable userspace caches (databases), or CXL.mem. Therefore, data sources might come from slab shrinkers, device drivers, or the userspace. IMO, the kernel should provide a set of workingset interfaces that should be generic enough to accommodate the various use cases, and be extensible to potential future use cases. The current proposed interfaces are not sufficient in that regard, but I would like to start somewhere, solicit feedback, and iterate.
... snip ...
Use cases
Promotion/Demotion If different mechanisms are used for promition and demotion, workingset information can help connect the two and avoid pages being migrated back and forth. For example, given a promotion hot page threshold defined in reaccess distance of N seconds (promote pages accessed more often than every N seconds). The threshold N should be set so that ~80% (e.g.) of pages on the fast memory node passes the threshold. This calculation can be done with workingset reports. To be directly useful for promotion policies, the workingset report interfaces need to be extended to report hotness and gather hotness information from the devices[1].
[1] https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-...
Sysfs and Cgroup Interfaces
The interfaces are detailed in the patches that introduce them. The main idea here is we break down the workingset per-node per-memcg into time intervals (ms), e.g.
1000 anon=137368 file=24530 20000 anon=34342 file=0 30000 anon=353232 file=333608 40000 anon=407198 file=206052 9223372036854775807 anon=4925624 file=892892
I realize this does not generalize well to hotness information, but I lack the intuition for an abstraction that presents hotness in a useful way. Based on a recent proposal for move_phys_pages[2], it seems like userspace tiering software would like to move specific physical pages, instead of informing the kernel "move x number of hot pages to y device". Please advise.
[2] https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge....
Just as a note on this work, this is really a testing interface. The end-goal is not to merge such an interface that is user-facing like move_phys_pages, but instead to have something like a triggered kernel task that has a directive of "Promote X pages from Device A".
This work is more of an open collaboration for prototyping such that we don't have to plumb it through the kernel from the start and assess the usefulness of the hardware hotness collection mechanism.
Understood. I think we previously had this exchange and I forgot to remove the mentions from the cover letter.
More generally on promotion, I have been considering recently a problem with promoting unmapped pagecache pages - since they are not subject to NUMA hint faults. I started looking at PG_accessed and PG_workingset as a potential mechanism to trigger promotion - but i'm starting to see a pattern of competing priorities between reclaim (LRU/MGLRU) logic and promotion logic.
In this case, IMO hardware support would be good as it could provide the kernel with exactly what pages are hot, and it would not care whether a page is mapped or not. I recall there being some CXL proposal on this, but I'm not sure whether it has settled into a standard yet.
Reclaim is triggered largely under memory pressure - which means co-opting reclaim logic for promotion is at best logically confusing, and at worst likely to introduce regressions. The LRU/MGLRU logic is written largely for reclaim, not promotion. This makes hacking promotion in after the fact rather dubious - the design choices don't match.
One example: if a page moves from inactive->active (or old->young), we could treat this as a page "becoming hot" and mark it for promotion, but this potentially punishes pages on the "active/younger" lists which are themselves hotter.
To avoid punishing pages on the "young" list, one could insert the page into a "less young" generation, but it would be difficult to have a fixed policy for this in the kernel, so it may be best for this to be configurable via BPF. One could insert the page in the middle of the active/inactive list, but that would in effect create multiple generations.
I'm starting to think separate demotion/reclaim and promotion components are warranted. This could take the form of a separate kernel worker that occasionally gets scheduled to manage a promotion list, or even the addition of a PG_promote flag to decouple reclaim and promotion logic completely. Separating the structures entirely would be good to allow both demotion/reclaim and promotion to occur concurrently (although this seems problematic under memory pressure).
Would like to know your thoughts here. If we can decide to segregate promotion and demotion logic, it might go a long way to simplify the existing interfaces and formalize transactions between the two.
The two systems still have to interact, so separating the two would essentially create a new policy that decides whether the demotion/reclaim or the promotion policy is in effect. If promotion could figure out where to insert the page in terms of generations, wouldn't that be simpler?
(also if you're going to LPC, might be worth a chat in person)
I cannot make it to LPC. :( Sadness
Yuanchu
linux-kselftest-mirror@lists.linaro.org