This patch series provides workingset reporting of user pages in lruvecs, whose coldness can be tracked by accessed bits and fd references. However, the concept of a workingset applies generically to all types of memory, which could be kernel slab caches, discardable userspace caches (databases), or CXL.mem. Therefore, data sources might come from slab shrinkers, device drivers, or userspace. IMO, the kernel should provide a set of workingset interfaces that is generic enough to accommodate the various use cases and extensible to potential future use cases. The current proposed interfaces are not sufficient in that regard, but I would like to start somewhere, solicit feedback, and iterate.
Use cases
==========
Job scheduling
For data center machines, workingset information allows the job scheduler to right-size each job and land more jobs on the same host or NUMA node. In the case of a job with an increasing workingset, policy decisions can be made to migrate other jobs off the host/NUMA node, or to oom-kill the misbehaving job. If the job shape is very different from the machine shape, knowing the workingset per node can also help inform page allocation policies.
Proactive reclaim
Workingset information allows a container manager to proactively reclaim memory while not impacting a job's performance. While PSI may provide a reactive measure of when proactive reclaim has reclaimed too much, workingset reporting enables the policy to be more accurate and flexible.
Ballooning (similar to proactive reclaim)
While this patch series does not extend the virtio-balloon device, balloon policies can benefit from workingset reports to more precisely determine the size of the memory balloon. On desktops/laptops/mobile devices where memory is scarce and overcommitted, the balloon sizing of multiple VMs running on the same device can be orchestrated with workingset reports from each one.
Promotion/Demotion
Similar to proactive reclaim, a workingset report enables demotion to a slower tier of memory. For promotion, the workingset report interfaces need to be extended to report hotness and gather hotness information from the devices[1].
[1] https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-...
Sysfs and Cgroup Interfaces
==========
The interfaces are detailed in the patches that introduce them. The main idea here is that we break down the workingset per node and per memcg into time intervals (ms), e.g.
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
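To give a concrete picture of how a consumer would use this, here is a minimal userspace sketch that reads and parses one node's histogram. It is illustrative only: the node number is arbitrary, the path assumes the workingset_report sysfs group introduced later in the series, and the anon=/file= values are reported as nr_pages * PAGE_SIZE by the page_age read handler in this series.

#include <stdio.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/sys/devices/system/node/node0/workingset_report/page_age", "r");

    if (!f) {
        perror("page_age");
        return 1;
    }

    /* Each line is "<idle age bin in ms> anon=<bytes> file=<bytes>". */
    while (fgets(line, sizeof(line), f)) {
        unsigned long long age_ms, anon, file;

        if (sscanf(line, "%llu anon=%llu file=%llu",
                   &age_ms, &anon, &file) == 3)
            printf("<= %llu ms idle: anon=%llu file=%llu\n",
                   age_ms, anon, file);
    }

    fclose(f);
    return 0;
}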
I realize this does not generalize well to hotness information, but I lack the intuition for an abstraction that presents hotness in a useful way. Based on a recent proposal for move_phys_pages[2], it seems like userspace tiering software would like to move specific physical pages, instead of informing the kernel "move x number of hot pages to y device". Please advise.
[2] https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge....
Implementation
==========
Currently, the reporting of user pages is based on MGLRU, and therefore requires CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more fine-grained workingset report. I will make the generation count configurable in the next version. The workingset reporting mechanism is gated behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind CONFIG_WORKINGSET_REPORT_AGING.
--
Changes from RFC v2 -> RFC v3:
- Update to v6.8
- Added an aging kernel thread (gated behind config)
- Added basic selftests for sysfs interface files
- Track swapped out pages for reaccesses
- Refactoring and cleanup
- Dropped the virtio-balloon extension to make things manageable
Changes from RFC v1 -> RFC v2:
- Refactored the patches into smaller pieces
- Renamed interfaces and functions from wss to wsr (Working Set Reporting)
- Fixed build errors when CONFIG_WSR is not set
- Changed working_set_num_bins to u8 for virtio-balloon
- Added support for per-NUMA node reporting for virtio-balloon
[rfc v1] https://lore.kernel.org/linux-mm/20230509185419.1088297-1-yuanchu@google.com...
[rfc v2] https://lore.kernel.org/linux-mm/20230621180454.973862-1-yuanchu@google.com/
Yuanchu Xie (8):
  mm: multi-gen LRU: ignore non-leaf pmd_young for force_scan=true
  mm: aggregate working set information into histograms
  mm: use refresh interval to rate-limit workingset report aggregation
  mm: report workingset during memory pressure driven scanning
  mm: extend working set reporting to memcgs
  mm: add per-memcg reaccess histogram
  mm: add kernel aging thread for workingset reporting
  mm: test system-wide workingset reporting
 drivers/base/node.c                           |   3 +
 include/linux/memcontrol.h                    |   5 +
 include/linux/mmzone.h                        |   4 +
 include/linux/workingset_report.h             | 107 +++
 mm/Kconfig                                    |  15 +
 mm/Makefile                                   |   2 +
 mm/internal.h                                 |  45 ++
 mm/memcontrol.c                               | 386 ++++++++-
 mm/mmzone.c                                   |   2 +
 mm/vmscan.c                                   |  95 ++-
 mm/workingset.c                               |   9 +-
 mm/workingset_report.c                        | 757 ++++++++++++++++++
 mm/workingset_report_aging.c                  | 127 +++
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   3 +
 .../testing/selftests/mm/workingset_report.c  | 315 ++++++++
 .../testing/selftests/mm/workingset_report.h  |  37 +
 .../selftests/mm/workingset_report_test.c     | 328 ++++++++
 18 files changed, 2231 insertions(+), 10 deletions(-)
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c
 create mode 100644 mm/workingset_report_aging.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.h
 create mode 100644 tools/testing/selftests/mm/workingset_report_test.c
When non-leaf pmd accessed bits are available, MGLRU page table walks can clear the accessed bit and promptly ignore the accessed bit on the pte because it's on a different node, so the walk does not update the generation of said page. When the next scan comes around on the right node, the non-leaf pmd accessed bit might remain cleared and the pte accessed bits won't be checked. While this is sufficient for reclaim-driven aging, where the goal is to select a reasonably cold page, the access can be missed when aging proactively for measuring the working set size of a node/memcg.
Since force_scan disables various other optimizations, we check force_scan to ignore the non-leaf pmd accessed bit.
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f9c854ce6cc..1a7c7d537db6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3522,7 +3522,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
walk->mm_stats[MM_NONLEAF_TOTAL]++;
-		if (should_clear_pmd_young()) {
+		if (!walk->force_scan && should_clear_pmd_young()) {
 			if (!pmd_young(val))
 				continue;
Yuanchu Xie yuanchu@google.com writes:
When non-leaf pmd accessed bits are available, MGLRU page table walks can clear the accessed bit and promptly ignore the accessed bit on the pte because it's on a different node, so the walk does not update the generation of said page. When the next scan comes around on the right node, the non-leaf pmd accessed bit might remain cleared and the pte accessed bits won't be checked. While this is sufficient for reclaim-driven aging, where the goal is to select a reasonably cold page, the access can be missed when aging proactively for measuring the working set size of a node/memcg.
Since force_scan disables various other optimizations, we check force_scan to ignore the non-leaf pmd accessed bit.
Signed-off-by: Yuanchu Xie yuanchu@google.com
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f9c854ce6cc..1a7c7d537db6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3522,7 +3522,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 		walk->mm_stats[MM_NONLEAF_TOTAL]++;

-		if (should_clear_pmd_young()) {
+		if (!walk->force_scan && should_clear_pmd_young()) {
 			if (!pmd_young(val))
 				continue;
Sorry, I don't understand why we need this. If !pmd_young(val), we don't need to update the generation. If pmd_young(val), the bloom filter will be ignored if force_scan == true. Or do I miss something?
--
Best Regards,
Huang, Ying
On Mon, Apr 8, 2024 at 11:52 PM Huang, Ying ying.huang@intel.com wrote:
[...]
Sorry, I don't understand why we need this. If !pmd_young(val), we don't need to update the generation. If pmd_young(val), the bloom filter will be ignored if force_scan == true. Or do I miss something?
If !pmd_young(val), we still might need to update the generation.
The get_pfn_folio function returns NULL if the folio's nid != the node under scanning, so the pte accessed bit does not get cleared and the generation is not updated. Now the pmd_young flag of this pmd is cleared, and if none of the ptes are accessed before another round of scanning occurs on the folio's node, the pmd_young check fails and the pte accessed bits are skipped.
This is fine for kswapd but can introduce inaccuracies when scanning proactively for workingset estimation.
Thanks, Yuanchu
Yuanchu Xie yuanchu@google.com writes:
[...]
If !pmd_young(val), we still might need to update the generation.
The get_pfn_folio function returns NULL if the folio's nid != node under scanning, so the pte accessed bit does not get cleared and the generation is not updated. Now the pmd_young flag of this pmd is cleared, and if none of the pte's are accessed before another round of scanning occurs on the folio's node, the pmd_young check fails and the pte accessed bit is skipped.
This is fine for kswapd but can introduce inaccuracies when scanning proactively for workingset estimation.
Got it! Thanks for the detailed explanation. Can you give more details in the patch description too?
It's unfortunate because the PMD young check helps scanning performance a lot. It doesn't need to be addressed in this patchset, but I hope we can find some way to get it back at some point.
--
Best Regards,
Huang, Ying
Hierarchically aggregate all memcgs' MGLRU generations and their page counts into working set page age histograms. The histograms break down the system's working set per-node, per-anon/file.
The sysfs interfaces are as follows:
/sys/devices/system/node/nodeX/workingset_report/page_age
    A per-node page age histogram, showing an aggregate of the node's lruvecs. The information is extracted from MGLRU's per-generation page counters. Reading this file causes a hierarchical aging of all lruvecs, scanning pages and creating a new generation in each lruvec. For example:
    1000 anon=0 file=0
    2000 anon=0 file=0
    100000 anon=5533696 file=5566464
    18446744073709551615 anon=0 file=0
/sys/devices/system/node/nodeX/workingset_report/page_age_intervals
    A comma-separated list of times in milliseconds that configures the intervals the page age histogram uses for aggregation.
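As a usage illustration of the two files, the sketch below configures three finite bins and then reads back the histogram. It is a rough example only: the node number is arbitrary, error handling is minimal, and the interval list must be strictly increasing per the parsing code in this patch; the last, unbounded bin is added implicitly.

#include <stdio.h>

#define WSR_DIR "/sys/devices/system/node/node0/workingset_report/"

int main(void)
{
    char buf[4096];
    size_t n;
    FILE *f;

    /* Bins of up to 1s, 10s, and 60s of idle age; the catch-all last bin is implicit. */
    f = fopen(WSR_DIR "page_age_intervals", "w");
    if (!f) {
        perror("page_age_intervals");
        return 1;
    }
    fputs("1000,10000,60000\n", f);
    fclose(f);

    /* Reading page_age ages the lruvecs and returns the aggregated histogram. */
    f = fopen(WSR_DIR "page_age", "r");
    if (!f) {
        perror("page_age");
        return 1;
    }
    n = fread(buf, 1, sizeof(buf) - 1, f);
    buf[n] = '\0';
    fputs(buf, stdout);
    fclose(f);
    return 0;
}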
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
 drivers/base/node.c               |   3 +
 include/linux/mmzone.h            |   4 +
 include/linux/workingset_report.h |  69 +++++
 mm/Kconfig                        |   9 +
 mm/Makefile                       |   1 +
 mm/internal.h                     |   9 +
 mm/memcontrol.c                   |   2 +
 mm/mmzone.c                       |   2 +
 mm/vmscan.c                       |  34 ++-
 mm/workingset_report.c            | 413 ++++++++++++++++++++++++++++++
 10 files changed, 545 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c
diff --git a/drivers/base/node.c b/drivers/base/node.c index 1c05640461dd..4f589b8253f4 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -20,6 +20,7 @@ #include <linux/pm_runtime.h> #include <linux/swap.h> #include <linux/slab.h> +#include <linux/workingset_report.h>
static const struct bus_type node_subsys = { .name = "node", @@ -625,6 +626,7 @@ static int register_node(struct node *node, int num) } else { hugetlb_register_node(node); compaction_register_node(node); + wsr_register_node(node); }
return error; @@ -641,6 +643,7 @@ void unregister_node(struct node *node) { hugetlb_unregister_node(node); compaction_unregister_node(node); + wsr_unregister_node(node); node_remove_accesses(node); node_remove_caches(node); device_unregister(&node->dev); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a497f189d988..8839931646ee 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -24,6 +24,7 @@ #include <linux/local_lock.h> #include <linux/zswap.h> #include <asm/page.h> +#include <linux/workingset_report.h>
/* Free memory management - zoned buddy allocator. */ #ifndef CONFIG_ARCH_FORCE_MAX_ORDER @@ -625,6 +626,9 @@ struct lruvec { struct lru_gen_mm_state mm_state; #endif #endif /* CONFIG_LRU_GEN */ +#ifdef CONFIG_WORKINGSET_REPORT + struct wsr_state wsr; +#endif /* CONFIG_WORKINGSET_REPORT */ #ifdef CONFIG_MEMCG struct pglist_data *pgdat; #endif diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h new file mode 100644 index 000000000000..0de640cb1ef0 --- /dev/null +++ b/include/linux/workingset_report.h @@ -0,0 +1,69 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_WORKINGSET_REPORT_H +#define _LINUX_WORKINGSET_REPORT_H + +#include <linux/types.h> +#include <linux/mutex.h> + +struct mem_cgroup; +struct pglist_data; +struct node; +struct lruvec; + +#ifdef CONFIG_WORKINGSET_REPORT + +#define WORKINGSET_REPORT_MIN_NR_BINS 2 +#define WORKINGSET_REPORT_MAX_NR_BINS 32 + +#define WORKINGSET_INTERVAL_MAX ((unsigned long)-1) +#define ANON_AND_FILE 2 + +struct wsr_report_bin { + unsigned long idle_age; + unsigned long nr_pages[ANON_AND_FILE]; +}; + +struct wsr_report_bins { + unsigned long nr_bins; + /* last bin contains WORKINGSET_INTERVAL_MAX */ + struct wsr_report_bin bins[WORKINGSET_REPORT_MAX_NR_BINS]; +}; + +struct wsr_page_age_histo { + unsigned long timestamp; + struct wsr_report_bins bins; +}; + +struct wsr_state { + /* breakdown of workingset by page age */ + struct mutex page_age_lock; + struct wsr_page_age_histo *page_age; +}; + +void wsr_init(struct lruvec *lruvec); +void wsr_destroy(struct lruvec *lruvec); + +/* + * Returns true if the wsr is configured to be refreshed. + * The next refresh time is stored in refresh_time. + */ +bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, + struct pglist_data *pgdat); +void wsr_register_node(struct node *node); +void wsr_unregister_node(struct node *node); +#else +static inline void wsr_init(struct lruvec *lruvec) +{ +} +static inline void wsr_destroy(struct lruvec *lruvec) +{ +} +static inline void wsr_register_node(struct node *node) +{ +} +static inline void wsr_unregister_node(struct node *node) +{ +} +#endif /* CONFIG_WORKINGSET_REPORT */ + +#endif /* _LINUX_WORKINGSET_REPORT_H */ diff --git a/mm/Kconfig b/mm/Kconfig index ffc3a2ba3a8c..212f203b10b9 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1261,6 +1261,15 @@ config LOCK_MM_AND_FIND_VMA config IOMMU_MM_DATA bool
+config WORKINGSET_REPORT + bool "Working set reporting" + depends on LRU_GEN && SYSFS + help + Report system and per-memcg working set to userspace. + + This option exports stats and events giving the user more insight + into its memory working set. + source "mm/damon/Kconfig"
endmenu diff --git a/mm/Makefile b/mm/Makefile index e4b5b75aaec9..57093657030d 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -92,6 +92,7 @@ obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o obj-$(CONFIG_PAGE_COUNTER) += page_counter.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o +obj-$(CONFIG_WORKINGSET_REPORT) += workingset_report.o ifdef CONFIG_SWAP obj-$(CONFIG_MEMCG) += swap_cgroup.o endif diff --git a/mm/internal.h b/mm/internal.h index f309a010d50f..5e0caba64ee4 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -198,12 +198,21 @@ extern unsigned long highest_memmap_pfn; /* * in mm/vmscan.c: */ +struct scan_control; bool isolate_lru_page(struct page *page); bool folio_isolate_lru(struct folio *folio); void putback_lru_page(struct page *page); void folio_putback_lru(struct folio *folio); extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
+#ifdef CONFIG_WORKINGSET_REPORT +/* + * in mm/wsr.c + */ +/* Requires wsr->page_age_lock held */ +void wsr_refresh_scan(struct lruvec *lruvec); +#endif + /* * in mm/rmap.c: */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1ed40f9d3a27..2f07141de16c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -65,6 +65,7 @@ #include <linux/seq_buf.h> #include <linux/sched/isolation.h> #include <linux/kmemleak.h> +#include <linux/workingset_report.h> #include "internal.h" #include <net/sock.h> #include <net/ip.h> @@ -5457,6 +5458,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) if (!pn) return;
+ wsr_destroy(&pn->lruvec); free_percpu(pn->lruvec_stats_percpu); kfree(pn); } diff --git a/mm/mmzone.c b/mm/mmzone.c index c01896eca736..efca44c1b84b 100644 --- a/mm/mmzone.c +++ b/mm/mmzone.c @@ -90,6 +90,8 @@ void lruvec_init(struct lruvec *lruvec) */ list_del(&lruvec->lists[LRU_UNEVICTABLE]);
+ wsr_init(lruvec); + lru_gen_init_lruvec(lruvec); }
diff --git a/mm/vmscan.c b/mm/vmscan.c index 1a7c7d537db6..b694d80ab2d1 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -56,6 +56,7 @@ #include <linux/khugepaged.h> #include <linux/rculist_nulls.h> #include <linux/random.h> +#include <linux/workingset_report.h>
#include <asm/tlbflush.h> #include <asm/div64.h> @@ -3815,7 +3816,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, return success; }
-static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, +bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, struct scan_control *sc, bool can_swap, bool force_scan) { bool success; @@ -5606,6 +5607,8 @@ static int __init init_lru_gen(void) if (sysfs_create_group(mm_kobj, &lru_gen_attr_group)) pr_err("lru_gen: failed to create sysfs group\n");
+ wsr_register_node(NULL); + debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops); debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);
@@ -5613,6 +5616,35 @@ static int __init init_lru_gen(void) }; late_initcall(init_lru_gen);
+/****************************************************************************** + * workingset reporting + ******************************************************************************/ +#ifdef CONFIG_WORKINGSET_REPORT +void wsr_refresh_scan(struct lruvec *lruvec) +{ + DEFINE_MAX_SEQ(lruvec); + struct scan_control sc = { + .may_writepage = true, + .may_unmap = true, + .may_swap = true, + .proactive = true, + .reclaim_idx = MAX_NR_ZONES - 1, + .gfp_mask = GFP_KERNEL, + }; + unsigned int flags; + + set_task_reclaim_state(current, &sc.reclaim_state); + flags = memalloc_noreclaim_save(); + /* + * setting can_swap=true and force_scan=true ensures + * proper workingset stats when the system cannot swap. + */ + try_to_inc_max_seq(lruvec, max_seq, &sc, true, true); + memalloc_noreclaim_restore(flags); + set_task_reclaim_state(current, NULL); +} +#endif /* CONFIG_WORKINGSET_REPORT */ + #else /* !CONFIG_LRU_GEN */
static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) diff --git a/mm/workingset_report.c b/mm/workingset_report.c new file mode 100644 index 000000000000..98cdaffcb6b4 --- /dev/null +++ b/mm/workingset_report.c @@ -0,0 +1,413 @@ +// SPDX-License-Identifier: GPL-2.0 +// +#include <linux/export.h> +#include <linux/lockdep.h> +#include <linux/jiffies.h> +#include <linux/kernfs.h> +#include <linux/memcontrol.h> +#include <linux/rcupdate.h> +#include <linux/mutex.h> +#include <linux/err.h> +#include <linux/atomic.h> +#include <linux/node.h> +#include <linux/mmzone.h> +#include <linux/mm.h> +#include <linux/mm_inline.h> +#include <linux/workingset_report.h> + +#include "internal.h" + +void wsr_init(struct lruvec *lruvec) +{ + struct wsr_state *wsr = &lruvec->wsr; + + memset(wsr, 0, sizeof(*wsr)); + mutex_init(&wsr->page_age_lock); +} + +void wsr_destroy(struct lruvec *lruvec) +{ + struct wsr_state *wsr = &lruvec->wsr; + + mutex_destroy(&wsr->page_age_lock); + kfree(wsr->page_age); + memset(wsr, 0, sizeof(*wsr)); +} + +static int workingset_report_intervals_parse(char *src, + struct wsr_report_bins *bins) +{ + int err = 0, i = 0; + char *cur, *next = strim(src); + + if (*next == '\0') + return 0; + + while ((cur = strsep(&next, ","))) { + unsigned int interval; + + err = kstrtouint(cur, 0, &interval); + if (err) + goto out; + + bins->bins[i].idle_age = msecs_to_jiffies(interval); + if (i > 0 && bins->bins[i].idle_age <= bins->bins[i - 1].idle_age) { + err = -EINVAL; + goto out; + } + + if (++i == WORKINGSET_REPORT_MAX_NR_BINS) { + err = -ERANGE; + goto out; + } + } + + if (i && i < WORKINGSET_REPORT_MIN_NR_BINS - 1) { + err = -ERANGE; + goto out; + } + + bins->nr_bins = i; + bins->bins[i].idle_age = WORKINGSET_INTERVAL_MAX; +out: + return err ?: i; +} + +static unsigned long get_gen_start_time(const struct lru_gen_folio *lrugen, + unsigned long seq, + unsigned long max_seq, + unsigned long curr_timestamp) +{ + int younger_gen; + + if (seq == max_seq) + return curr_timestamp; + younger_gen = lru_gen_from_seq(seq + 1); + return READ_ONCE(lrugen->timestamps[younger_gen]); +} + +static void collect_page_age_type(const struct lru_gen_folio *lrugen, + struct wsr_report_bin *bin, + unsigned long max_seq, unsigned long min_seq, + unsigned long curr_timestamp, int type) +{ + unsigned long seq; + + for (seq = max_seq; seq + 1 > min_seq; seq--) { + int gen, zone; + unsigned long gen_end, gen_start, size = 0; + + gen = lru_gen_from_seq(seq); + + for (zone = 0; zone < MAX_NR_ZONES; zone++) + size += max( + READ_ONCE(lrugen->nr_pages[gen][type][zone]), + 0L); + + gen_start = get_gen_start_time(lrugen, seq, max_seq, + curr_timestamp); + gen_end = READ_ONCE(lrugen->timestamps[gen]); + + while (bin->idle_age != WORKINGSET_INTERVAL_MAX && + time_before(gen_end + bin->idle_age, curr_timestamp)) { + unsigned long gen_in_bin = (long)gen_start - + (long)curr_timestamp + + (long)bin->idle_age; + unsigned long gen_len = (long)gen_start - (long)gen_end; + + if (!gen_len) + break; + if (gen_in_bin) { + unsigned long split_bin = + size / gen_len * gen_in_bin; + + bin->nr_pages[type] += split_bin; + size -= split_bin; + } + gen_start = curr_timestamp - bin->idle_age; + bin++; + } + bin->nr_pages[type] += size; + } +} + +/* + * proportionally aggregate Multi-gen LRU bins into a working set report + * MGLRU generations: + * current time + * | max_seq timestamp + * | | max_seq - 1 timestamp + * | | | unbounded + * | | | | + * -------------------------------- + * | max_seq | ... | ... 
| min_seq + * -------------------------------- + * + * Bins: + * + * current time + * | current - idle_age[0] + * | | current - idle_age[1] + * | | | unbounded + * | | | | + * ------------------------------ + * | bin 0 | ... | ... | bin n-1 + * ------------------------------ + * + * Assume the heuristic that pages are in the MGLRU generation + * through uniform accesses, so we can aggregate them + * proportionally into bins. + */ +static void collect_page_age(struct wsr_page_age_histo *page_age, + const struct lruvec *lruvec) +{ + int type; + const struct lru_gen_folio *lrugen = &lruvec->lrugen; + unsigned long curr_timestamp = jiffies; + unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq); + unsigned long min_seq[ANON_AND_FILE] = { + READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_ANON]), + READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_FILE]), + }; + struct wsr_report_bins *bins = &page_age->bins; + + for (type = 0; type < ANON_AND_FILE; type++) { + struct wsr_report_bin *bin = &bins->bins[0]; + + collect_page_age_type(lrugen, bin, max_seq, min_seq[type], + curr_timestamp, type); + } +} + +/* First step: hierarchically scan child memcgs. */ +static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root, + struct pglist_data *pgdat) +{ + struct mem_cgroup *memcg; + + memcg = mem_cgroup_iter(root, NULL, NULL); + do { + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + + wsr_refresh_scan(lruvec); + cond_resched(); + } while ((memcg = mem_cgroup_iter(root, memcg, NULL))); +} + +/* Second step: aggregate child memcgs into the page age histogram. */ +static void refresh_aggregate(struct wsr_page_age_histo *page_age, + struct mem_cgroup *root, + struct pglist_data *pgdat) +{ + struct mem_cgroup *memcg; + struct wsr_report_bin *bin; + + /* + * page_age_intervals should free the page_age struct + * if no intervals are provided. + */ + VM_WARN_ON_ONCE(page_age->bins.bins[0].idle_age == + WORKINGSET_INTERVAL_MAX); + + for (bin = page_age->bins.bins; + bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) { + bin->nr_pages[0] = 0; + bin->nr_pages[1] = 0; + } + /* the last used bin has idle_age == WORKINGSET_INTERVAL_MAX. */ + bin->nr_pages[0] = 0; + bin->nr_pages[1] = 0; + + memcg = mem_cgroup_iter(root, NULL, NULL); + do { + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + + collect_page_age(page_age, lruvec); + cond_resched(); + } while ((memcg = mem_cgroup_iter(root, memcg, NULL))); + WRITE_ONCE(page_age->timestamp, jiffies); +} + +bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, + struct pglist_data *pgdat) +{ + struct wsr_page_age_histo *page_age; + + if (!READ_ONCE(wsr->page_age)) + return false; + + refresh_scan(wsr, root, pgdat); + mutex_lock(&wsr->page_age_lock); + page_age = READ_ONCE(wsr->page_age); + if (page_age) + refresh_aggregate(page_age, root, pgdat); + mutex_unlock(&wsr->page_age_lock); + return !!page_age; +} +EXPORT_SYMBOL_GPL(wsr_refresh_report); + +static struct pglist_data *kobj_to_pgdat(struct kobject *kobj) +{ + int nid = IS_ENABLED(CONFIG_NUMA) ? 
kobj_to_dev(kobj)->id : + first_memory_node; + + return NODE_DATA(nid); +} + +static struct wsr_state *kobj_to_wsr(struct kobject *kobj) +{ + return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr; +} + +static ssize_t page_age_intervals_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + int len = 0; + struct wsr_state *wsr = kobj_to_wsr(kobj); + + mutex_lock(&wsr->page_age_lock); + + if (!!wsr->page_age) { + int i; + int nr_bins = wsr->page_age->bins.nr_bins; + + for (i = 0; i < nr_bins; ++i) { + struct wsr_report_bin *bin = + &wsr->page_age->bins.bins[i]; + + len += sysfs_emit_at(buf, len, "%u", + jiffies_to_msecs(bin->idle_age)); + if (i + 1 < nr_bins) + len += sysfs_emit_at(buf, len, ","); + } + } + len += sysfs_emit_at(buf, len, "\n"); + + mutex_unlock(&wsr->page_age_lock); + return len; +} + +static ssize_t page_age_intervals_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *src, size_t len) +{ + struct wsr_page_age_histo *page_age = NULL, *old; + char *buf = NULL; + int err = 0; + struct wsr_state *wsr = kobj_to_wsr(kobj); + + buf = kstrdup(src, GFP_KERNEL); + if (!buf) { + err = -ENOMEM; + goto failed; + } + + page_age = + kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL_ACCOUNT); + + if (!page_age) { + err = -ENOMEM; + goto failed; + } + + err = workingset_report_intervals_parse(buf, &page_age->bins); + if (err < 0) + goto failed; + + if (err == 0) { + kfree(page_age); + page_age = NULL; + } + + mutex_lock(&wsr->page_age_lock); + old = xchg(&wsr->page_age, page_age); + mutex_unlock(&wsr->page_age_lock); + kfree(old); + kfree(buf); + return len; +failed: + kfree(page_age); + kfree(buf); + + return err; +} + +static struct kobj_attribute page_age_intervals_attr = + __ATTR_RW(page_age_intervals); + +static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr, + char *buf) +{ + struct wsr_report_bin *bin; + int ret = 0; + struct wsr_state *wsr = kobj_to_wsr(kobj); + + if (!READ_ONCE(wsr->page_age)) + return -EINVAL; + + wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj)); + + mutex_lock(&wsr->page_age_lock); + if (!wsr->page_age) { + ret = -EINVAL; + goto unlock; + } + + for (bin = wsr->page_age->bins.bins; + bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) + ret += sysfs_emit_at(buf, ret, "%u anon=%lu file=%lu\n", + jiffies_to_msecs(bin->idle_age), + bin->nr_pages[0] * PAGE_SIZE, + bin->nr_pages[1] * PAGE_SIZE); + + ret += sysfs_emit_at(buf, ret, "%lu anon=%lu file=%lu\n", + WORKINGSET_INTERVAL_MAX, + bin->nr_pages[0] * PAGE_SIZE, + bin->nr_pages[1] * PAGE_SIZE); + +unlock: + mutex_unlock(&wsr->page_age_lock); + return ret; +} + +static struct kobj_attribute page_age_attr = __ATTR_RO(page_age); + +static struct attribute *workingset_report_attrs[] = { + &page_age_intervals_attr.attr, &page_age_attr.attr, NULL +}; + +static const struct attribute_group workingset_report_attr_group = { + .name = "workingset_report", + .attrs = workingset_report_attrs, +}; + +void wsr_register_node(struct node *node) +{ + struct kobject *kobj = node ? 
&node->dev.kobj : mm_kobj; + struct wsr_state *wsr; + + if (IS_ENABLED(CONFIG_NUMA) && !node) + return; + + wsr = kobj_to_wsr(kobj); + + if (sysfs_create_group(kobj, &workingset_report_attr_group)) { + pr_warn("WSR failed to created group"); + return; + } +} +EXPORT_SYMBOL_GPL(wsr_register_node); + +void wsr_unregister_node(struct node *node) +{ + struct kobject *kobj = &node->dev.kobj; + struct wsr_state *wsr; + + if (IS_ENABLED(CONFIG_NUMA) && !node) + return; + + wsr = kobj_to_wsr(kobj); + sysfs_remove_group(kobj, &workingset_report_attr_group); + wsr_destroy(mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))); +} +EXPORT_SYMBOL_GPL(wsr_unregister_node);
Yuanchu Xie yuanchu@google.com writes:
[...]

@@ -3815,7 +3816,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
 	return success;
 }

-static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
 			       struct scan_control *sc, bool can_swap, bool force_scan)
It appears that this change isn't necessary.
[...]
--
Best Regards,
Huang, Ying
The refresh interval is a rate-limiting factor for workingset page age histogram reads. When a workingset report is generated, a timestamp is noted, and the same report is returned until it expires beyond the refresh interval, at which point a new report is generated.
Sysfs interface:
/sys/devices/system/node/nodeX/workingset_report/refresh_interval
    Time in milliseconds specifying how long the report is valid for.
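For example, a collector that samples once per minute could set the refresh interval to one minute, so each of its reads re-aggregates the report while more frequent readers within the window get the cached histogram. A small, purely illustrative sketch (node number and value are arbitrary):

#include <stdio.h>

int main(void)
{
    /* Refresh the page age report at most once every 60 seconds. */
    FILE *f = fopen("/sys/devices/system/node/node0/workingset_report/refresh_interval", "w");

    if (!f) {
        perror("refresh_interval");
        return 1;
    }
    fputs("60000\n", f);
    fclose(f);
    return 0;
}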
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
 include/linux/workingset_report.h |  1 +
 mm/internal.h                     |  2 +-
 mm/vmscan.c                       | 27 ++++++++------
 mm/workingset_report.c            | 58 ++++++++++++++++++++++++++-----
 4 files changed, 69 insertions(+), 19 deletions(-)
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index 0de640cb1ef0..23d2ae747a31 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -35,6 +35,7 @@ struct wsr_page_age_histo { };
struct wsr_state { + unsigned long refresh_interval; /* breakdown of workingset by page age */ struct mutex page_age_lock; struct wsr_page_age_histo *page_age; diff --git a/mm/internal.h b/mm/internal.h index 5e0caba64ee4..151f09c6983e 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -210,7 +210,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason * in mm/wsr.c */ /* Requires wsr->page_age_lock held */ -void wsr_refresh_scan(struct lruvec *lruvec); +void wsr_refresh_scan(struct lruvec *lruvec, unsigned long refresh_interval); #endif
/* diff --git a/mm/vmscan.c b/mm/vmscan.c index b694d80ab2d1..5f04a04f5261 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -5620,7 +5620,7 @@ late_initcall(init_lru_gen); * workingset reporting ******************************************************************************/ #ifdef CONFIG_WORKINGSET_REPORT -void wsr_refresh_scan(struct lruvec *lruvec) +void wsr_refresh_scan(struct lruvec *lruvec, unsigned long refresh_interval) { DEFINE_MAX_SEQ(lruvec); struct scan_control sc = { @@ -5633,15 +5633,22 @@ void wsr_refresh_scan(struct lruvec *lruvec) }; unsigned int flags;
- set_task_reclaim_state(current, &sc.reclaim_state); - flags = memalloc_noreclaim_save(); - /* - * setting can_swap=true and force_scan=true ensures - * proper workingset stats when the system cannot swap. - */ - try_to_inc_max_seq(lruvec, max_seq, &sc, true, true); - memalloc_noreclaim_restore(flags); - set_task_reclaim_state(current, NULL); + if (refresh_interval) { + int gen = lru_gen_from_seq(max_seq); + unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]); + + if (time_is_before_jiffies(birth + refresh_interval)) { + set_task_reclaim_state(current, &sc.reclaim_state); + flags = memalloc_noreclaim_save(); + /* + * setting can_swap=true and force_scan=true ensures + * proper workingset stats when the system cannot swap. + */ + try_to_inc_max_seq(lruvec, max_seq, &sc, true, true); + memalloc_noreclaim_restore(flags); + set_task_reclaim_state(current, NULL); + } + } } #endif /* CONFIG_WORKINGSET_REPORT */
diff --git a/mm/workingset_report.c b/mm/workingset_report.c index 98cdaffcb6b4..370e7d355604 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -181,7 +181,8 @@ static void collect_page_age(struct wsr_page_age_histo *page_age,
/* First step: hierarchically scan child memcgs. */ static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root, - struct pglist_data *pgdat) + struct pglist_data *pgdat, + unsigned long refresh_interval) { struct mem_cgroup *memcg;
@@ -189,7 +190,7 @@ static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root, do { struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
- wsr_refresh_scan(lruvec); + wsr_refresh_scan(lruvec, refresh_interval); cond_resched(); } while ((memcg = mem_cgroup_iter(root, memcg, NULL))); } @@ -231,16 +232,25 @@ static void refresh_aggregate(struct wsr_page_age_histo *page_age, bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, struct pglist_data *pgdat) { - struct wsr_page_age_histo *page_age; + struct wsr_page_age_histo *page_age = NULL; + unsigned long refresh_interval = READ_ONCE(wsr->refresh_interval);
if (!READ_ONCE(wsr->page_age)) return false;
- refresh_scan(wsr, root, pgdat); + if (!refresh_interval) + return false; + mutex_lock(&wsr->page_age_lock); page_age = READ_ONCE(wsr->page_age); - if (page_age) - refresh_aggregate(page_age, root, pgdat); + if (!page_age) + goto unlock; + if (time_is_after_jiffies(page_age->timestamp + refresh_interval)) + goto unlock; + refresh_scan(wsr, root, pgdat, refresh_interval); + refresh_aggregate(page_age, root, pgdat); + +unlock: mutex_unlock(&wsr->page_age_lock); return !!page_age; } @@ -259,6 +269,35 @@ static struct wsr_state *kobj_to_wsr(struct kobject *kobj) return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr; }
+static ssize_t refresh_interval_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct wsr_state *wsr = kobj_to_wsr(kobj); + unsigned int interval = READ_ONCE(wsr->refresh_interval); + + return sysfs_emit(buf, "%u\n", jiffies_to_msecs(interval)); +} + +static ssize_t refresh_interval_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t len) +{ + unsigned int interval; + int err; + struct wsr_state *wsr = kobj_to_wsr(kobj); + + err = kstrtouint(buf, 0, &interval); + if (err) + return err; + + WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(interval)); + + return len; +} + +static struct kobj_attribute refresh_interval_attr = + __ATTR_RW(refresh_interval); + static ssize_t page_age_intervals_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -267,7 +306,7 @@ static ssize_t page_age_intervals_show(struct kobject *kobj,
mutex_lock(&wsr->page_age_lock);
- if (!!wsr->page_age) { + if (wsr->page_age) { int i; int nr_bins = wsr->page_age->bins.nr_bins;
@@ -373,7 +412,10 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr, static struct kobj_attribute page_age_attr = __ATTR_RO(page_age);
static struct attribute *workingset_report_attrs[] = { - &page_age_intervals_attr.attr, &page_age_attr.attr, NULL + &refresh_interval_attr.attr, + &page_age_intervals_attr.attr, + &page_age_attr.attr, + NULL };
static const struct attribute_group workingset_report_attr_group = {
When a node reaches its low watermarks and wakes up kswapd, notify all userspace programs waiting on the workingset page age histogram of the memory pressure, so that a userspace agent can read the workingset report in time and make policy decisions such as logging, OOM-killing, or migration.
Sysfs interface:
/sys/devices/system/node/nodeX/workingset_report/report_threshold
Time in milliseconds that limits how often userspace agents waiting on the page age histogram can be notified of node memory pressure.
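For illustration, below is a minimal sketch of a userspace agent waiting on these notifications. It assumes node 0 and an arbitrary buffer size; the re-read-after-POLLPRI/POLLERR pattern is the usual convention for sysfs files signalled via kernfs_notify().

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	ssize_t len;
	struct pollfd pfd = { .events = POLLPRI };

	pfd.fd = open("/sys/devices/system/node/node0/workingset_report/page_age",
		      O_RDONLY);
	if (pfd.fd < 0)
		return 1;

	for (;;) {
		/* (re-)read the histogram from the beginning */
		if (lseek(pfd.fd, 0, SEEK_SET) < 0)
			break;
		len = read(pfd.fd, buf, sizeof(buf) - 1);
		if (len < 0)
			break;
		buf[len] = '\0';
		fputs(buf, stdout);

		/* block until the next kswapd-triggered notification */
		if (poll(&pfd, 1, -1) < 0)
			break;
	}
	close(pfd.fd);
	return 0;
}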
Signed-off-by: Yuanchu Xie yuanchu@google.com --- include/linux/workingset_report.h | 4 +++ mm/internal.h | 6 +++++ mm/vmscan.c | 44 +++++++++++++++++++++++++++++++ mm/workingset_report.c | 39 +++++++++++++++++++++++++++ 4 files changed, 93 insertions(+)
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index 23d2ae747a31..589d240d6251 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -35,7 +35,11 @@ struct wsr_page_age_histo { };
struct wsr_state { + unsigned long report_threshold; unsigned long refresh_interval; + + struct kernfs_node *page_age_sys_file; + /* breakdown of workingset by page age */ struct mutex page_age_lock; struct wsr_page_age_histo *page_age; diff --git a/mm/internal.h b/mm/internal.h index 151f09c6983e..36480c7ac0dd 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -209,8 +209,14 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason /* * in mm/wsr.c */ +void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat); /* Requires wsr->page_age_lock held */ void wsr_refresh_scan(struct lruvec *lruvec, unsigned long refresh_interval); +#else +static inline void notify_workingset(struct mem_cgroup *memcg, + struct pglist_data *pgdat) +{ +} #endif
/* diff --git a/mm/vmscan.c b/mm/vmscan.c index 5f04a04f5261..c6acd5265b3f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2535,6 +2535,15 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, return can_demote(pgdat->node_id, sc); }
+#ifdef CONFIG_WORKINGSET_REPORT +static void try_to_report_workingset(struct pglist_data *pgdat, struct scan_control *sc); +#else +static inline void try_to_report_workingset(struct pglist_data *pgdat, + struct scan_control *sc) +{ +} +#endif + #ifdef CONFIG_LRU_GEN
#ifdef CONFIG_LRU_GEN_ENABLED @@ -3936,6 +3945,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) if (!min_ttl || sc->order || sc->priority == DEF_PRIORITY) return;
+ try_to_report_workingset(pgdat, sc); + memcg = mem_cgroup_iter(NULL, NULL, NULL); do { struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); @@ -5650,6 +5661,36 @@ void wsr_refresh_scan(struct lruvec *lruvec, unsigned long refresh_interval) } } } + +static void try_to_report_workingset(struct pglist_data *pgdat, + struct scan_control *sc) +{ + struct mem_cgroup *memcg = sc->target_mem_cgroup; + struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr; + unsigned long threshold = READ_ONCE(wsr->report_threshold); + + if (sc->priority == DEF_PRIORITY) + return; + + if (!threshold) + return; + + if (!mutex_trylock(&wsr->page_age_lock)) + return; + + if (!wsr->page_age) { + mutex_unlock(&wsr->page_age_lock); + return; + } + + if (time_is_after_jiffies(wsr->page_age->timestamp + threshold)) { + mutex_unlock(&wsr->page_age_lock); + return; + } + + mutex_unlock(&wsr->page_age_lock); + notify_workingset(memcg, pgdat); +} #endif /* CONFIG_WORKINGSET_REPORT */
#else /* !CONFIG_LRU_GEN */ @@ -6177,6 +6218,9 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) if (zone->zone_pgdat == last_pgdat) continue; last_pgdat = zone->zone_pgdat; + + if (!sc->proactive) + try_to_report_workingset(zone->zone_pgdat, sc); shrink_node(zone->zone_pgdat, sc); }
diff --git a/mm/workingset_report.c b/mm/workingset_report.c index 370e7d355604..3ed3b0e8f8ad 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -269,6 +269,33 @@ static struct wsr_state *kobj_to_wsr(struct kobject *kobj) return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr; }
+static ssize_t report_threshold_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct wsr_state *wsr = kobj_to_wsr(kobj); + unsigned int threshold = READ_ONCE(wsr->report_threshold); + + return sysfs_emit(buf, "%u\n", jiffies_to_msecs(threshold)); +} + +static ssize_t report_threshold_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t len) +{ + unsigned int threshold; + struct wsr_state *wsr = kobj_to_wsr(kobj); + + if (kstrtouint(buf, 0, &threshold)) + return -EINVAL; + + WRITE_ONCE(wsr->report_threshold, msecs_to_jiffies(threshold)); + + return len; +} + +static struct kobj_attribute report_threshold_attr = + __ATTR_RW(report_threshold); + static ssize_t refresh_interval_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -412,6 +439,7 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr, static struct kobj_attribute page_age_attr = __ATTR_RO(page_age);
static struct attribute *workingset_report_attrs[] = { + &report_threshold_attr.attr, &refresh_interval_attr.attr, &page_age_intervals_attr.attr, &page_age_attr.attr, @@ -437,6 +465,9 @@ void wsr_register_node(struct node *node) pr_warn("WSR failed to created group"); return; } + + wsr->page_age_sys_file = + kernfs_walk_and_get(kobj->sd, "workingset_report/page_age"); } EXPORT_SYMBOL_GPL(wsr_register_node);
@@ -450,6 +481,14 @@ void wsr_unregister_node(struct node *node)
wsr = kobj_to_wsr(kobj); sysfs_remove_group(kobj, &workingset_report_attr_group); + kernfs_put(wsr->page_age_sys_file); wsr_destroy(mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))); } EXPORT_SYMBOL_GPL(wsr_unregister_node); + +void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat) +{ + struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr; + + kernfs_notify(wsr->page_age_sys_file); +}
Break down the system-wide working set reporting into per-memcg reports, each of which aggregates its children hierarchically. The per-node working set histograms and the refresh/report threshold files are presented as memcg files, with each report covering all the nodes.
Memcg interface:
/sys/fs/cgroup/.../memory.workingset.page_age
The memcg equivalent of the sysfs workingset page age histogram; it breaks down the workingset of this memcg and its children into page age intervals. Each node is prefixed with a node header and a newline. Non-proactive direct reclaim on this memcg can also wake up userspace agents that are waiting on this file. e.g.
N0
1000 anon=0 file=0
2000 anon=0 file=0
3000 anon=0 file=0
4000 anon=0 file=0
5000 anon=0 file=0
18446744073709551615 anon=0 file=0
/sys/fs/cgroup/.../memory.workingset.page_age_intervals
Configures the intervals for the page age histogram. This file operates on a per-node basis, allowing for different intervals for each node. e.g.
echo N0=1000,2000,3000,4000,5000 > memory.workingset.page_age_intervals
/sys/fs/cgroup/.../memory.workingset.refresh_interval
The memcg equivalent of the sysfs refresh interval. A per-node value specifying how long a page age histogram remains valid, in milliseconds. e.g.
echo N0=2000 > memory.workingset.refresh_interval
/sys/fs/cgroup/.../memory.workingset.report_threshold
The memcg equivalent of the sysfs report threshold. A per-node value limiting how often a userspace agent waiting on the page age histogram can be woken up, in milliseconds. e.g.
echo N0=1000 > memory.workingset.report_threshold
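As a usage illustration, here is a minimal sketch that configures the per-node memcg files described above and dumps the resulting histogram. The cgroup path "example" and the interval values are placeholders, and error handling is mostly omitted.

#include <stdio.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return;
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	char line[256];
	FILE *f;

	write_str("/sys/fs/cgroup/example/memory.workingset.page_age_intervals",
		  "N0=1000,2000,3000,4000,5000");
	write_str("/sys/fs/cgroup/example/memory.workingset.refresh_interval",
		  "N0=2000");
	write_str("/sys/fs/cgroup/example/memory.workingset.report_threshold",
		  "N0=1000");

	/* read back the per-node page age histogram */
	f = fopen("/sys/fs/cgroup/example/memory.workingset.page_age", "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}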
Signed-off-by: Yuanchu Xie yuanchu@google.com --- include/linux/memcontrol.h | 5 + include/linux/workingset_report.h | 6 +- mm/internal.h | 2 + mm/memcontrol.c | 267 +++++++++++++++++++++++++++++- mm/workingset_report.c | 10 +- 5 files changed, 286 insertions(+), 4 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 20ff87f8e001..7d7bc0928961 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -335,6 +335,11 @@ struct mem_cgroup { struct lru_gen_mm_list mm_list; #endif
+#ifdef CONFIG_WORKINGSET_REPORT + /* memory.workingset.page_age file */ + struct cgroup_file workingset_page_age_file; +#endif + struct mem_cgroup_per_node *nodeinfo[]; };
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index 589d240d6251..502542c812b3 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -9,6 +9,7 @@ struct mem_cgroup; struct pglist_data; struct node; struct lruvec; +struct cgroup_file;
#ifdef CONFIG_WORKINGSET_REPORT
@@ -38,7 +39,10 @@ struct wsr_state { unsigned long report_threshold; unsigned long refresh_interval;
- struct kernfs_node *page_age_sys_file; + union { + struct kernfs_node *page_age_sys_file; + struct cgroup_file *page_age_cgroup_file; + };
/* breakdown of workingset by page age */ struct mutex page_age_lock; diff --git a/mm/internal.h b/mm/internal.h index 36480c7ac0dd..3730c8399ad4 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -212,6 +212,8 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat); /* Requires wsr->page_age_lock held */ void wsr_refresh_scan(struct lruvec *lruvec, unsigned long refresh_interval); +int workingset_report_intervals_parse(char *src, + struct wsr_report_bins *bins); #else static inline void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2f07141de16c..75bda5f7994d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -7005,6 +7005,245 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, return nbytes; }
+#ifdef CONFIG_WORKINGSET_REPORT +static int memory_ws_page_age_intervals_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr; + struct wsr_page_age_histo *page_age; + int i, nr_bins; + + wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + mutex_lock(&wsr->page_age_lock); + page_age = wsr->page_age; + if (!page_age) + goto no_page_age; + seq_printf(m, "N%d=", nid); + nr_bins = page_age->bins.nr_bins; + for (i = 0; i < nr_bins; ++i) { + struct wsr_report_bin *bin = + &page_age->bins.bins[i]; + + seq_printf(m, "%u", jiffies_to_msecs(bin->idle_age)); + if (i + 1 < nr_bins) + seq_putc(m, ','); + } + seq_putc(m, ' '); +no_page_age: + mutex_unlock(&wsr->page_age_lock); + } + seq_putc(m, '\n'); + + return 0; +} + +static ssize_t memory_wsr_interval_parse(struct kernfs_open_file *of, char *buf, + size_t nbytes, unsigned int *nid_out, + struct wsr_report_bins *bins) +{ + char *node, *intervals; + unsigned int nid; + int err; + + buf = strstrip(buf); + intervals = buf; + node = strsep(&intervals, "="); + + if (*node != 'N') + return -EINVAL; + + err = kstrtouint(node + 1, 0, &nid); + if (err) + return err; + + if (nid >= nr_node_ids || !node_state(nid, N_MEMORY)) + return -EINVAL; + + err = workingset_report_intervals_parse(intervals, bins); + if (err < 0) + return err; + + *nid_out = nid; + return err; +} + +static ssize_t memory_ws_page_age_intervals_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + unsigned int nid; + int err; + struct wsr_page_age_histo *page_age = NULL, *old; + struct wsr_state *wsr; + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + + page_age = + kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL_ACCOUNT); + + if (!page_age) { + err = -ENOMEM; + goto failed; + } + + err = memory_wsr_interval_parse(of, buf, nbytes, &nid, &page_age->bins); + if (err < 0) + goto failed; + + if (err == 0) { + kfree(page_age); + page_age = NULL; + } + + wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + mutex_lock(&wsr->page_age_lock); + old = xchg(&wsr->page_age, page_age); + mutex_unlock(&wsr->page_age_lock); + kfree(old); + return nbytes; +failed: + kfree(page_age); + return err; +} + +static int memory_ws_refresh_interval_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr = + &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + + seq_printf(m, "N%d=%u ", nid, + jiffies_to_msecs(READ_ONCE(wsr->refresh_interval))); + } + seq_putc(m, '\n'); + + return 0; +} + +static ssize_t memory_wsr_threshold_parse(char *buf, size_t nbytes, + unsigned int *nid_out, + unsigned int *msecs) +{ + char *node, *threshold; + unsigned int nid; + int err; + + buf = strstrip(buf); + threshold = buf; + node = strsep(&threshold, "="); + + if (*node != 'N') + return -EINVAL; + + err = kstrtouint(node + 1, 0, &nid); + if (err) + return err; + + if (nid >= nr_node_ids || !node_state(nid, N_MEMORY)) + return -EINVAL; + + err = kstrtouint(threshold, 0, msecs); + if (err) + return err; + + *nid_out = nid; + + return nbytes; +} + +static ssize_t memory_ws_refresh_interval_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + unsigned int nid, msecs; + struct wsr_state *wsr; + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + ssize_t ret = memory_wsr_threshold_parse(buf, nbytes, &nid, &msecs); + + if (ret < 0) + 
return ret; + + wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(msecs)); + return ret; +} + +static int memory_ws_report_threshold_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr = + &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + + seq_printf(m, "N%d=%u ", nid, + jiffies_to_msecs(READ_ONCE(wsr->report_threshold))); + } + seq_putc(m, '\n'); + + return 0; +} + +static ssize_t memory_ws_report_threshold_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + unsigned int nid, msecs; + struct wsr_state *wsr; + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + ssize_t ret = memory_wsr_threshold_parse(buf, nbytes, &nid, &msecs); + + if (ret < 0) + return ret; + + wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + WRITE_ONCE(wsr->report_threshold, msecs_to_jiffies(msecs)); + return ret; +} + +static int memory_ws_page_age_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr = + &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + struct wsr_report_bin *bin; + + if (!READ_ONCE(wsr->page_age)) + continue; + + wsr_refresh_report(wsr, memcg, NODE_DATA(nid)); + mutex_lock(&wsr->page_age_lock); + if (!wsr->page_age) + goto unlock; + seq_printf(m, "N%d\n", nid); + for (bin = wsr->page_age->bins.bins; + bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) + seq_printf(m, "%u anon=%lu file=%lu\n", + jiffies_to_msecs(bin->idle_age), + bin->nr_pages[0] * PAGE_SIZE, + bin->nr_pages[1] * PAGE_SIZE); + + seq_printf(m, "%lu anon=%lu file=%lu\n", WORKINGSET_INTERVAL_MAX, + bin->nr_pages[0] * PAGE_SIZE, + bin->nr_pages[1] * PAGE_SIZE); + +unlock: + mutex_unlock(&wsr->page_age_lock); + } + + return 0; +} +#endif + static struct cftype memory_files[] = { { .name = "current", @@ -7073,7 +7312,33 @@ static struct cftype memory_files[] = { .flags = CFTYPE_NS_DELEGATABLE, .write = memory_reclaim, }, - { } /* terminate */ +#ifdef CONFIG_WORKINGSET_REPORT + { + .name = "workingset.page_age_intervals", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_ws_page_age_intervals_show, + .write = memory_ws_page_age_intervals_write, + }, + { + .name = "workingset.refresh_interval", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_ws_refresh_interval_show, + .write = memory_ws_refresh_interval_write, + }, + { + .name = "workingset.report_threshold", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_ws_report_threshold_show, + .write = memory_ws_report_threshold_write, + }, + { + .name = "workingset.page_age", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .file_offset = offsetof(struct mem_cgroup, workingset_page_age_file), + .seq_show = memory_ws_page_age_show, + }, +#endif + {} /* terminate */ };
struct cgroup_subsys memory_cgrp_subsys = { diff --git a/mm/workingset_report.c b/mm/workingset_report.c index 3ed3b0e8f8ad..b00ffbfebcab 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -20,9 +20,12 @@ void wsr_init(struct lruvec *lruvec) { struct wsr_state *wsr = &lruvec->wsr; + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
memset(wsr, 0, sizeof(*wsr)); mutex_init(&wsr->page_age_lock); + if (memcg && !mem_cgroup_is_root(memcg)) + wsr->page_age_cgroup_file = &memcg->workingset_page_age_file; }
void wsr_destroy(struct lruvec *lruvec) @@ -34,7 +37,7 @@ void wsr_destroy(struct lruvec *lruvec) memset(wsr, 0, sizeof(*wsr)); }
-static int workingset_report_intervals_parse(char *src, +int workingset_report_intervals_parse(char *src, struct wsr_report_bins *bins) { int err = 0, i = 0; @@ -490,5 +493,8 @@ void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat) { struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr;
- kernfs_notify(wsr->page_age_sys_file); + if (mem_cgroup_is_root(memcg)) + kernfs_notify(wsr->page_age_sys_file); + else + cgroup_file_notify(wsr->page_age_cgroup_file); }
A reaccess is an access to a page after its initial access, detected via a refault or accessed-bit harvesting. Similar to the working set histogram, the reaccess histogram breaks down reaccesses into user-defined bins.
Reaccesses are tracked from MGLRU walks, where a move from an older generation to the young generation counts as a reaccess. Swapped-out pages are tracked via the generation number encoded in mm/workingset.c, and for memory cgroups with this enabled, an additional 4 swapped-out generations are tracked.
Memcg interfaces:
/sys/fs/cgroup/.../memory.workingset.reaccess
The format is identical to memory.workingset.page_age, but the content breaks down reaccesses into the user-defined intervals. e.g.
N0
1000 anon=6330 file=0
2000 anon=72 file=0
4000 anon=0 file=0
18446744073709551615 anon=0 file=0
N1
18446744073709551615 anon=0 file=0
/sys/fs/cgroup/.../memory.workingset.reaccess_intervals
Defines the per-node intervals for memory.workingset.reaccess. e.g.
echo N0=120000,240000,480000 > memory.workingset.reaccess_intervals
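As an illustration of consuming this output, a minimal parsing sketch follows; the cgroup path is a placeholder and the values are printed as reported, without assuming a unit.

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/example/memory.workingset.reaccess", "r");
	char line[256];
	int nid = -1;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		unsigned long interval, anon, file;

		/* "N<nid>" header lines switch the current node */
		if (line[0] == 'N' && sscanf(line, "N%d", &nid) == 1)
			continue;
		if (sscanf(line, "%lu anon=%lu file=%lu",
			   &interval, &anon, &file) == 3)
			printf("node %d interval %lu: anon=%lu file=%lu\n",
			       nid, interval, anon, file);
	}
	fclose(f);
	return 0;
}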
Signed-off-by: Yuanchu Xie yuanchu@google.com --- include/linux/workingset_report.h | 20 +++ mm/internal.h | 28 ++++ mm/memcontrol.c | 112 ++++++++++++++ mm/vmscan.c | 8 +- mm/workingset.c | 9 +- mm/workingset_report.c | 249 ++++++++++++++++++++++++++++++ 6 files changed, 419 insertions(+), 7 deletions(-)
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index 502542c812b3..e908c5678b1e 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -4,6 +4,7 @@
#include <linux/types.h> #include <linux/mutex.h> +#include <linux/rcutree.h>
struct mem_cgroup; struct pglist_data; @@ -19,6 +20,12 @@ struct cgroup_file; #define WORKINGSET_INTERVAL_MAX ((unsigned long)-1) #define ANON_AND_FILE 2
+/* + * MAX_NR_EVICTED_GENS is set to 4 so we can track the same number of + * generations as MGLRU has resident. + */ +#define MAX_NR_EVICTED_GENS 4 + struct wsr_report_bin { unsigned long idle_age; unsigned long nr_pages[ANON_AND_FILE]; @@ -35,6 +42,18 @@ struct wsr_page_age_histo { struct wsr_report_bins bins; };
+struct wsr_evicted_gen { + unsigned long timestamp; + int seq; +}; + +struct wsr_reaccess_histo { + struct rcu_head rcu; + /* evicted gens start from min_seq[LRU_GEN_ANON] - 1 */ + struct wsr_evicted_gen gens[MAX_NR_EVICTED_GENS]; + struct wsr_report_bins bins; +}; + struct wsr_state { unsigned long report_threshold; unsigned long refresh_interval; @@ -47,6 +66,7 @@ struct wsr_state { /* breakdown of workingset by page age */ struct mutex page_age_lock; struct wsr_page_age_histo *page_age; + struct wsr_reaccess_histo __rcu *reaccess; };
void wsr_init(struct lruvec *lruvec); diff --git a/mm/internal.h b/mm/internal.h index 3730c8399ad4..077340b526e8 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -205,16 +205,44 @@ void putback_lru_page(struct page *page); void folio_putback_lru(struct folio *folio); extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
+/* + * in mm/workingset.c + */ +#define WORKINGSET_SHIFT 1 +#define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \ + WORKINGSET_SHIFT + NODES_SHIFT + \ + MEM_CGROUP_ID_SHIFT) +#define EVICTION_MASK (~0UL >> EVICTION_SHIFT) + #ifdef CONFIG_WORKINGSET_REPORT /* * in mm/wsr.c */ +void report_lru_gen_eviction(struct lruvec *lruvec, int type, int min_seq); +void lru_gen_report_reaccess(struct lruvec *lruvec, + struct lru_gen_mm_walk *walk); +void report_reaccess_refault(struct lruvec *lruvec, unsigned long token, + int type, int nr_pages); void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat); /* Requires wsr->page_age_lock held */ void wsr_refresh_scan(struct lruvec *lruvec, unsigned long refresh_interval); int workingset_report_intervals_parse(char *src, struct wsr_report_bins *bins); #else +struct lru_gen_mm_walk; +static inline void report_lru_gen_eviction(struct lruvec *lruvec, int type, + int min_seq) +{ +} +static inline void lru_gen_report_reaccess(struct lruvec *lruvec, + struct lru_gen_mm_walk *walk) +{ +} +static inline void report_reaccess_refault(struct lruvec *lruvec, + unsigned long token, int type, + int nr_pages) +{ +} static inline void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 75bda5f7994d..2a39a4445bb7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -7108,6 +7108,71 @@ static ssize_t memory_ws_page_age_intervals_write(struct kernfs_open_file *of, return err; }
+static int memory_ws_reaccess_intervals_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr; + struct wsr_reaccess_histo *reaccess; + int i, nr_bins; + + wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + rcu_read_lock(); + reaccess = rcu_dereference(wsr->reaccess); + if (!reaccess) + goto unlock; + seq_printf(m, "N%d=", nid); + nr_bins = reaccess->bins.nr_bins; + for (i = 0; i < nr_bins; ++i) { + struct wsr_report_bin *bin = &reaccess->bins.bins[i]; + + seq_printf(m, "%u", jiffies_to_msecs(bin->idle_age)); + if (i + 1 < nr_bins) + seq_putc(m, ','); + } + seq_putc(m, ' '); +unlock: + rcu_read_unlock(); + } + seq_putc(m, '\n'); + + return 0; +} + +static ssize_t memory_ws_reaccess_intervals_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + unsigned int nid; + int err; + struct wsr_state *wsr; + struct wsr_reaccess_histo *reaccess = NULL, *old; + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + + reaccess = kzalloc(sizeof(struct wsr_reaccess_histo), GFP_KERNEL); + if (!reaccess) + return -ENOMEM; + + err = memory_wsr_interval_parse(of, buf, nbytes, &nid, &reaccess->bins); + if (err < 0) + goto failed; + + if (err == 0) { + kfree(reaccess); + reaccess = NULL; + } + + wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + old = xchg(&wsr->reaccess, reaccess); + kfree_rcu(old, rcu); + return nbytes; +failed: + kfree(reaccess); + return err; +} + static int memory_ws_refresh_interval_show(struct seq_file *m, void *v) { int nid; @@ -7242,6 +7307,42 @@ static int memory_ws_page_age_show(struct seq_file *m, void *v)
return 0; } + +static int memory_ws_reaccess_histogram_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr = + &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + struct wsr_reaccess_histo *reaccess; + struct wsr_report_bin *bin; + + rcu_read_lock(); + reaccess = rcu_dereference(wsr->reaccess); + + if (!reaccess) + goto unlock; + + wsr_refresh_report(wsr, memcg, NODE_DATA(nid)); + + seq_printf(m, "N%d\n", nid); + for (bin = reaccess->bins.bins; + bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) + seq_printf(m, "%u anon=%lu file=%lu\n", + jiffies_to_msecs(bin->idle_age), + bin->nr_pages[0], bin->nr_pages[1]); + + seq_printf(m, "%lu anon=%lu file=%lu\n", WORKINGSET_INTERVAL_MAX, + bin->nr_pages[0], bin->nr_pages[1]); + +unlock: + rcu_read_unlock(); + } + + return 0; +} #endif
static struct cftype memory_files[] = { @@ -7337,6 +7438,17 @@ static struct cftype memory_files[] = { .file_offset = offsetof(struct mem_cgroup, workingset_page_age_file), .seq_show = memory_ws_page_age_show, }, + { + .name = "workingset.reaccess_intervals", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_ws_reaccess_intervals_show, + .write = memory_ws_reaccess_intervals_write, + }, + { + .name = "workingset.reaccess", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_ws_reaccess_histogram_show, + }, #endif {} /* terminate */ }; diff --git a/mm/vmscan.c b/mm/vmscan.c index c6acd5265b3f..4d9245e2c0d1 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3637,6 +3637,7 @@ static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_ mem_cgroup_unlock_pages();
if (walk->batched) { + lru_gen_report_reaccess(lruvec, walk); spin_lock_irq(&lruvec->lru_lock); reset_batch_size(lruvec, walk); spin_unlock_irq(&lruvec->lru_lock); @@ -3709,6 +3710,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) } done: reset_ctrl_pos(lruvec, type, true); + report_lru_gen_eviction(lruvec, type, lrugen->min_seq[type] + 1); WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
return true; @@ -3750,6 +3752,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) continue;
reset_ctrl_pos(lruvec, type, true); + report_lru_gen_eviction(lruvec, type, min_seq[type]); WRITE_ONCE(lrugen->min_seq[type], min_seq[type]); success = true; } @@ -4565,11 +4568,14 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap sc->nr_scanned -= folio_nr_pages(folio); }
+ walk = current->reclaim_state->mm_walk; + if (walk && walk->batched) + lru_gen_report_reaccess(lruvec, walk); + spin_lock_irq(&lruvec->lru_lock);
move_folios_to_lru(lruvec, &list);
- walk = current->reclaim_state->mm_walk; if (walk && walk->batched) reset_batch_size(lruvec, walk);
diff --git a/mm/workingset.c b/mm/workingset.c index 226012974328..057fbedd91ea 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -17,6 +17,8 @@ #include <linux/fs.h> #include <linux/mm.h>
+#include "internal.h" + /* * Double CLOCK lists * @@ -179,12 +181,6 @@ * refault distance will immediately activate the refaulting page. */
-#define WORKINGSET_SHIFT 1 -#define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \ - WORKINGSET_SHIFT + NODES_SHIFT + \ - MEM_CGROUP_ID_SHIFT) -#define EVICTION_MASK (~0UL >> EVICTION_SHIFT) - /* * Eviction timestamps need to be able to cover the full range of * actionable refaults. However, bits are tight in the xarray @@ -294,6 +290,7 @@ static void lru_gen_refault(struct folio *folio, void *shadow) goto unlock;
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta); + report_reaccess_refault(lruvec, token, type, delta);
if (!recent) goto unlock; diff --git a/mm/workingset_report.c b/mm/workingset_report.c index b00ffbfebcab..504d840bbe6a 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -34,6 +34,7 @@ void wsr_destroy(struct lruvec *lruvec)
mutex_destroy(&wsr->page_age_lock); kfree(wsr->page_age); + kfree_rcu(wsr->reaccess, rcu); memset(wsr, 0, sizeof(*wsr)); }
@@ -259,6 +260,254 @@ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, } EXPORT_SYMBOL_GPL(wsr_refresh_report);
+static void lru_gen_collect_reaccess_refault(struct wsr_report_bins *bins, + unsigned long timestamp, int type, + int nr_pages) +{ + unsigned long curr_timestamp = jiffies; + struct wsr_report_bin *bin = &bins->bins[0]; + + while (bin->idle_age != WORKINGSET_INTERVAL_MAX && + time_before(timestamp + bin->idle_age, curr_timestamp)) + bin++; + + bin->nr_pages[type] += nr_pages; +} + +static void collect_reaccess_type(struct lru_gen_mm_walk *walk, + const struct lru_gen_folio *lrugen, + struct wsr_report_bin *bin, + unsigned long max_seq, unsigned long min_seq, + unsigned long curr_timestamp, int type) +{ + unsigned long seq; + + /* Skip max_seq because a reaccess moves a page from another seq + * to max_seq. We use the negative change in page count from + * other seqs to track the number of reaccesses. + */ + for (seq = max_seq - 1; seq + 1 > min_seq; seq--) { + int younger_gen, gen, zone; + unsigned long gen_end, gen_start; + long delta = 0; + + gen = lru_gen_from_seq(seq); + + for (zone = 0; zone < MAX_NR_ZONES; zone++) { + long nr_pages = walk->nr_pages[gen][type][zone]; + + if (nr_pages < 0) + delta += -nr_pages; + } + + gen_end = READ_ONCE(lrugen->timestamps[gen]); + younger_gen = lru_gen_from_seq(seq + 1); + gen_start = READ_ONCE(lrugen->timestamps[younger_gen]); + + /* ensure gen_start is within idle_age of bin */ + while (bin->idle_age != WORKINGSET_INTERVAL_MAX && + time_before(gen_start + bin->idle_age, curr_timestamp)) + bin++; + + while (bin->idle_age != WORKINGSET_INTERVAL_MAX && + time_before(gen_end + bin->idle_age, curr_timestamp)) { + unsigned long proportion = (long)gen_start - + (long)curr_timestamp + + (long)bin->idle_age; + unsigned long gen_len = (long)gen_start - (long)gen_end; + + if (!gen_len) + break; + if (proportion) { + unsigned long split_bin = + delta / gen_len * proportion; + bin->nr_pages[type] += split_bin; + delta -= split_bin; + } + gen_start = curr_timestamp - bin->idle_age; + bin++; + } + bin->nr_pages[type] += delta; + } +} + +/* + * Reaccesses are propagated up the memcg hierarchy during scanning/refault. + * Collect the reaccess information from a multi-gen LRU walk. 
+ */ +static void lru_gen_collect_reaccess(struct wsr_report_bins *bins, + struct lru_gen_folio *lrugen, + struct lru_gen_mm_walk *walk) +{ + int type; + unsigned long curr_timestamp = jiffies; + unsigned long max_seq = READ_ONCE(walk->max_seq); + unsigned long min_seq[ANON_AND_FILE] = { + READ_ONCE(lrugen->min_seq[LRU_GEN_ANON]), + READ_ONCE(lrugen->min_seq[LRU_GEN_FILE]), + }; + + for (type = 0; type < ANON_AND_FILE; type++) { + struct wsr_report_bin *bin = &bins->bins[0]; + + collect_reaccess_type(walk, lrugen, bin, max_seq, + min_seq[type], curr_timestamp, type); + } +} + +void lru_gen_report_reaccess(struct lruvec *lruvec, struct lru_gen_mm_walk *walk) +{ + struct lru_gen_folio *lrugen = &lruvec->lrugen; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + + for (memcg = lruvec_memcg(lruvec); memcg; + memcg = parent_mem_cgroup(memcg)) { + struct lruvec *memcg_lruvec = + mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec)); + struct wsr_state *wsr = &memcg_lruvec->wsr; + struct wsr_reaccess_histo *reaccess; + + rcu_read_lock(); + reaccess = rcu_dereference(wsr->reaccess); + if (!reaccess) { + rcu_read_unlock(); + continue; + } + lru_gen_collect_reaccess(&reaccess->bins, lrugen, walk); + rcu_read_unlock(); + } +} + +static inline int evicted_gen_from_seq(unsigned long seq) +{ + return seq % MAX_NR_EVICTED_GENS; +} + +void report_lru_gen_eviction(struct lruvec *lruvec, int type, int min_seq) +{ + int seq; + struct wsr_reaccess_histo *reaccess = NULL; + struct lru_gen_folio *lrugen = &lruvec->lrugen; + struct wsr_state *wsr = &lruvec->wsr; + + /* + * Since file can go ahead of anon, min_seq[file] >= min_seq[anon] + * only record evictions when anon moves forward. + */ + if (type != LRU_GEN_ANON) + return; + + /* + * lru_lock is held during eviction, so reaccess accounting + * can be serialized. + */ + lockdep_assert_held(&lruvec->lru_lock); + + rcu_read_lock(); + reaccess = rcu_dereference(wsr->reaccess); + if (!reaccess) + goto unlock; + + for (seq = READ_ONCE(lrugen->min_seq[LRU_GEN_ANON]); seq < min_seq; + ++seq) { + int evicted_gen = evicted_gen_from_seq(seq); + int gen = lru_gen_from_seq(seq); + + WRITE_ONCE(reaccess->gens[evicted_gen].seq, seq); + WRITE_ONCE(reaccess->gens[evicted_gen].timestamp, + READ_ONCE(lrugen->timestamps[gen])); + } + +unlock: + rcu_read_unlock(); +} + +/* + * May yield an incorrect timestamp if the token collides with + * a recently evicted generation. 
+ */ +static int timestamp_from_workingset_token(struct lruvec *lruvec, + unsigned long token, + unsigned long *timestamp) +{ + int type, err = -EEXIST; + unsigned long seq, evicted_min_seq; + struct wsr_reaccess_histo *reaccess = NULL; + struct lru_gen_folio *lrugen = &lruvec->lrugen; + struct wsr_state *wsr = &lruvec->wsr; + unsigned long min_seq[ANON_AND_FILE] = { + READ_ONCE(lrugen->min_seq[LRU_GEN_ANON]), + READ_ONCE(lrugen->min_seq[LRU_GEN_FILE]) + }; + + token >>= LRU_REFS_WIDTH; + + /* recent eviction */ + for (type = 0; type < ANON_AND_FILE; ++type) { + if (token == + (min_seq[type] & (EVICTION_MASK >> LRU_REFS_WIDTH))) { + int gen = lru_gen_from_seq(min_seq[type]); + + *timestamp = READ_ONCE(lrugen->timestamps[gen]); + return 0; + } + } + + rcu_read_lock(); + reaccess = rcu_dereference(wsr->reaccess); + if (!reaccess) + goto unlock; + + /* look up in evicted gen buffer */ + evicted_min_seq = min_seq[LRU_GEN_ANON] - MAX_NR_EVICTED_GENS; + if (min_seq[LRU_GEN_ANON] < MAX_NR_EVICTED_GENS) + evicted_min_seq = 0; + for (seq = min_seq[LRU_GEN_ANON]; seq > evicted_min_seq; --seq) { + int gen = evicted_gen_from_seq(seq - 1); + + if (token == (reaccess->gens[gen].seq & + (EVICTION_MASK >> LRU_REFS_WIDTH))) { + *timestamp = reaccess->gens[gen].timestamp; + + goto unlock; + } + } + +unlock: + rcu_read_unlock(); + return err; +} + +void report_reaccess_refault(struct lruvec *lruvec, unsigned long token, + int type, int nr_pages) +{ + unsigned long timestamp; + int err; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + + err = timestamp_from_workingset_token(lruvec, token, ×tamp); + if (err) + return; + + for (memcg = lruvec_memcg(lruvec); memcg; + memcg = parent_mem_cgroup(memcg)) { + struct lruvec *memcg_lruvec = + mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec)); + struct wsr_state *wsr = &memcg_lruvec->wsr; + struct wsr_reaccess_histo *reaccess = NULL; + + rcu_read_lock(); + reaccess = rcu_dereference(wsr->reaccess); + if (!reaccess) { + rcu_read_unlock(); + continue; + } + lru_gen_collect_reaccess_refault(&reaccess->bins, timestamp, + type, nr_pages); + rcu_read_unlock(); + } +} + static struct pglist_data *kobj_to_pgdat(struct kobject *kobj) { int nid = IS_ENABLED(CONFIG_NUMA) ? kobj_to_dev(kobj)->id :
For reliable and timely aging of memcgs, the page age histograms have to be refreshed on time. A kernel thread makes this easier by aging memcgs that have valid page_age_intervals and refresh_interval whenever they can be refreshed, and it also reduces the latency seen by userspace consumers of the page age histogram.
The kernel aging thread is gated behind CONFIG_WORKINGSET_REPORT_AGING. Debugging stats may be added in the future for when aging cannot keep up with the configured refresh_interval.
Signed-off-by: Yuanchu Xie yuanchu@google.com --- include/linux/workingset_report.h | 11 ++- mm/Kconfig | 6 ++ mm/Makefile | 1 + mm/memcontrol.c | 11 ++- mm/workingset_report.c | 14 +++- mm/workingset_report_aging.c | 127 ++++++++++++++++++++++++++++++ 6 files changed, 163 insertions(+), 7 deletions(-) create mode 100644 mm/workingset_report_aging.c
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index e908c5678b1e..759486a3a285 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -77,9 +77,18 @@ void wsr_destroy(struct lruvec *lruvec); * The next refresh time is stored in refresh_time. */ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, - struct pglist_data *pgdat); + struct pglist_data *pgdat, unsigned long *refresh_time); void wsr_register_node(struct node *node); void wsr_unregister_node(struct node *node); + +#ifdef CONFIG_WORKINGSET_REPORT_AGING +void wsr_wakeup_aging_thread(void); +#else /* CONFIG_WORKINGSET_REPORT_AGING */ +static inline void wsr_wakeup_aging_thread(void) +{ +} +#endif /* CONFIG_WORKINGSET_REPORT_AGING */ + #else static inline void wsr_init(struct lruvec *lruvec) { diff --git a/mm/Kconfig b/mm/Kconfig index 212f203b10b9..1e6aa1bd63f2 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1270,6 +1270,12 @@ config WORKINGSET_REPORT This option exports stats and events giving the user more insight into its memory working set.
+config WORKINGSET_REPORT_AGING + bool "Workingset report kernel aging thread" + depends on WORKINGSET_REPORT + help + Performs aging on memcgs with their configured refresh intervals. + source "mm/damon/Kconfig"
endmenu diff --git a/mm/Makefile b/mm/Makefile index 57093657030d..7caae7f2d6cf 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -93,6 +93,7 @@ obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o obj-$(CONFIG_PAGE_COUNTER) += page_counter.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o obj-$(CONFIG_WORKINGSET_REPORT) += workingset_report.o +obj-$(CONFIG_WORKINGSET_REPORT_AGING) += workingset_report_aging.o ifdef CONFIG_SWAP obj-$(CONFIG_MEMCG) += swap_cgroup.o endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2a39a4445bb7..86e15b9fc8e2 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -7102,6 +7102,8 @@ static ssize_t memory_ws_page_age_intervals_write(struct kernfs_open_file *of, old = xchg(&wsr->page_age, page_age); mutex_unlock(&wsr->page_age_lock); kfree(old); + if (err && READ_ONCE(wsr->refresh_interval)) + wsr_wakeup_aging_thread(); return nbytes; failed: kfree(page_age); @@ -7227,14 +7229,17 @@ static ssize_t memory_ws_refresh_interval_write(struct kernfs_open_file *of, { unsigned int nid, msecs; struct wsr_state *wsr; + unsigned long old_interval; struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); ssize_t ret = memory_wsr_threshold_parse(buf, nbytes, &nid, &msecs);
if (ret < 0) return ret; - wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + old_interval = jiffies_to_msecs(READ_ONCE(wsr->refresh_interval)); WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(msecs)); + if (msecs && (!old_interval || jiffies_to_msecs(old_interval) > msecs)) + wsr_wakeup_aging_thread(); return ret; }
@@ -7285,7 +7290,7 @@ static int memory_ws_page_age_show(struct seq_file *m, void *v) if (!READ_ONCE(wsr->page_age)) continue;
- wsr_refresh_report(wsr, memcg, NODE_DATA(nid)); + wsr_refresh_report(wsr, memcg, NODE_DATA(nid), NULL); mutex_lock(&wsr->page_age_lock); if (!wsr->page_age) goto unlock; @@ -7325,7 +7330,7 @@ static int memory_ws_reaccess_histogram_show(struct seq_file *m, void *v) if (!reaccess) goto unlock;
- wsr_refresh_report(wsr, memcg, NODE_DATA(nid)); + wsr_refresh_report(wsr, memcg, NODE_DATA(nid), NULL);
seq_printf(m, "N%d\n", nid); for (bin = reaccess->bins.bins; diff --git a/mm/workingset_report.c b/mm/workingset_report.c index 504d840bbe6a..da658967eac2 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -234,7 +234,7 @@ static void refresh_aggregate(struct wsr_page_age_histo *page_age, }
bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, - struct pglist_data *pgdat) + struct pglist_data *pgdat, unsigned long *refresh_time) { struct wsr_page_age_histo *page_age = NULL; unsigned long refresh_interval = READ_ONCE(wsr->refresh_interval); @@ -253,6 +253,8 @@ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, goto unlock; refresh_scan(wsr, root, pgdat, refresh_interval); refresh_aggregate(page_age, root, pgdat); + if (refresh_time) + *refresh_time = page_age->timestamp + refresh_interval;
unlock: mutex_unlock(&wsr->page_age_lock); @@ -564,12 +566,16 @@ static ssize_t refresh_interval_store(struct kobject *kobj, unsigned int interval; int err; struct wsr_state *wsr = kobj_to_wsr(kobj); + unsigned long old_interval;
err = kstrtouint(buf, 0, &interval); if (err) return err;
- WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(interval)); + old_interval = xchg(&wsr->refresh_interval, msecs_to_jiffies(interval)); + if (interval && + (!old_interval || jiffies_to_msecs(old_interval) > interval)) + wsr_wakeup_aging_thread();
return len; } @@ -642,6 +648,8 @@ static ssize_t page_age_intervals_store(struct kobject *kobj, mutex_unlock(&wsr->page_age_lock); kfree(old); kfree(buf); + if (err && READ_ONCE(wsr->refresh_interval)) + wsr_wakeup_aging_thread(); return len; failed: kfree(page_age); @@ -663,7 +671,7 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr, if (!READ_ONCE(wsr->page_age)) return -EINVAL;
- wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj)); + wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj), NULL);
mutex_lock(&wsr->page_age_lock); if (!wsr->page_age) { diff --git a/mm/workingset_report_aging.c b/mm/workingset_report_aging.c new file mode 100644 index 000000000000..91ad5020778a --- /dev/null +++ b/mm/workingset_report_aging.c @@ -0,0 +1,127 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Workingset report kernel aging thread + * + * Performs aging on behalf of memcgs with their configured refresh interval. + * While a userspace program can periodically read the page age breakdown + * per-memcg and trigger aging, the kernel performing aging is less overhead, + * more consistent, and more reliable for the use case where every memcg should + * be aged according to their refresh interval. + */ +#define pr_fmt(fmt) "workingset report aging: " fmt + +#include <linux/jiffies.h> +#include <linux/module.h> +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/kthread.h> +#include <linux/memcontrol.h> +#include <linux/swap.h> +#include <linux/wait.h> +#include <linux/mmzone.h> +#include <linux/workingset_report.h> + +static DECLARE_WAIT_QUEUE_HEAD(aging_wait); +static bool refresh_pending; + +static bool do_aging_node(int nid, unsigned long *next_wake_time) +{ + struct mem_cgroup *memcg; + bool should_wait = true; + struct pglist_data *pgdat = NODE_DATA(nid); + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + struct wsr_state *wsr = &lruvec->wsr; + unsigned long refresh_time; + + /* use returned time to decide when to wake up next */ + if (wsr_refresh_report(wsr, memcg, pgdat, &refresh_time)) { + if (should_wait) { + should_wait = false; + *next_wake_time = refresh_time; + } else if (time_before(refresh_time, *next_wake_time)) { + *next_wake_time = refresh_time; + } + } + + cond_resched(); + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); + + return should_wait; +} + +static int do_aging(void *unused) +{ + while (!kthread_should_stop()) { + int nid; + long timeout_ticks; + unsigned long next_wake_time; + bool should_wait = true; + + WRITE_ONCE(refresh_pending, false); + for_each_node_state(nid, N_MEMORY) { + unsigned long node_next_wake_time; + + if (do_aging_node(nid, &node_next_wake_time)) + continue; + if (should_wait) { + should_wait = false; + next_wake_time = node_next_wake_time; + } else if (time_before(node_next_wake_time, + next_wake_time)) { + next_wake_time = node_next_wake_time; + } + } + + if (should_wait) { + wait_event_interruptible(aging_wait, refresh_pending); + continue; + } + + /* sleep until next aging */ + timeout_ticks = next_wake_time - jiffies; + if (timeout_ticks > 0 && + timeout_ticks != MAX_SCHEDULE_TIMEOUT) { + schedule_timeout_idle(timeout_ticks); + continue; + } + } + return 0; +} + +/* Invoked when refresh_interval shortens or changes to a non-zero value. */ +void wsr_wakeup_aging_thread(void) +{ + WRITE_ONCE(refresh_pending, true); + wake_up_interruptible(&aging_wait); +} + +static struct task_struct *aging_thread; + +static int aging_init(void) +{ + struct task_struct *task; + + task = kthread_run(do_aging, NULL, "kagingd"); + + if (IS_ERR(task)) { + pr_err("Failed to create aging kthread\n"); + return PTR_ERR(task); + } + + aging_thread = task; + pr_info("module loaded\n"); + return 0; +} + +static void aging_exit(void) +{ + kthread_stop(aging_thread); + aging_thread = NULL; + pr_info("module unloaded\n"); +} + +module_init(aging_init); +module_exit(aging_exit);
A basic test that verifies the working set size of a simple memory accessor. It should work with or without the aging thread.
Question: I don't know how best to test file-backed memory in selftests. Is there a standard place to put the temporary file? /tmp can be tmpfs-mounted in many distros.
Signed-off-by: Yuanchu Xie yuanchu@google.com --- tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 3 + .../testing/selftests/mm/workingset_report.c | 315 +++++++++++++++++ .../testing/selftests/mm/workingset_report.h | 37 ++ .../selftests/mm/workingset_report_test.c | 328 ++++++++++++++++++ 5 files changed, 684 insertions(+) create mode 100644 tools/testing/selftests/mm/workingset_report.c create mode 100644 tools/testing/selftests/mm/workingset_report.h create mode 100644 tools/testing/selftests/mm/workingset_report_test.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 4ff10ea61461..14a2412c8257 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -46,3 +46,4 @@ gup_longterm mkdirty va_high_addr_switch hugetlb_fault_after_madv +workingset_report_test diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 2453add65d12..c0869bf07e99 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -70,6 +70,7 @@ TEST_GEN_FILES += ksm_tests TEST_GEN_FILES += ksm_functional_tests TEST_GEN_FILES += mdwe_test TEST_GEN_FILES += hugetlb_fault_after_madv +TEST_GEN_FILES += workingset_report_test
ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty @@ -123,6 +124,8 @@ $(TEST_GEN_FILES): vm_util.c thp_settings.c $(OUTPUT)/uffd-stress: uffd-common.c $(OUTPUT)/uffd-unit-tests: uffd-common.c
+$(OUTPUT)/workingset_report_test: workingset_report.c + ifeq ($(ARCH),x86_64) BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32)) BINARIES_64 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_64)) diff --git a/tools/testing/selftests/mm/workingset_report.c b/tools/testing/selftests/mm/workingset_report.c new file mode 100644 index 000000000000..93387f0f30ee --- /dev/null +++ b/tools/testing/selftests/mm/workingset_report.c @@ -0,0 +1,315 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "workingset_report.h" + +#include <stddef.h> +#include <stdlib.h> +#include <stdio.h> +#include <stdbool.h> +#include <unistd.h> +#include <string.h> +#include <sys/mman.h> +#include <sys/wait.h> + +#define SYSFS_NODE_ONLINE "/sys/devices/system/node/online" +#define PROC_DROP_CACHES "/proc/sys/vm/drop_caches" + +/* Returns read len on success, or -errno on failure. */ +static ssize_t read_text(const char *path, char *buf, size_t max_len) +{ + ssize_t len; + int fd, err; + size_t bytes_read = 0; + + if (!max_len) + return -EINVAL; + + fd = open(path, O_RDONLY); + if (fd < 0) + return -errno; + + while (bytes_read < max_len - 1) { + len = read(fd, buf + bytes_read, max_len - 1 - bytes_read); + + if (len <= 0) + break; + bytes_read += len; + } + + buf[bytes_read] = '\0'; + + err = -errno; + close(fd); + return len < 0 ? err : bytes_read; +} + +/* Returns written len on success, or -errno on failure. */ +static ssize_t write_text(const char *path, const char *buf, ssize_t max_len) +{ + int fd, len, err; + size_t bytes_written = 0; + + fd = open(path, O_WRONLY | O_APPEND); + if (fd < 0) + return -errno; + + while (bytes_written < max_len) { + len = write(fd, buf + bytes_written, max_len - bytes_written); + + if (len < 0) + break; + bytes_written += len; + } + + err = -errno; + close(fd); + return len < 0 ? 
err : bytes_written; +} + +static long read_num(const char *path) +{ + char buf[21]; + + if (read_text(path, buf, sizeof(buf)) <= 0) + return -1; + return (long)strtoul(buf, NULL, 10); +} + +static int write_num(const char *path, unsigned long n) +{ + char buf[21]; + + sprintf(buf, "%lu", n); + if (write_text(path, buf, strlen(buf)) < 0) + return -1; + return 0; +} + +long sysfs_get_refresh_interval(int nid) +{ + char file[128]; + + snprintf( + file, + sizeof(file), + "/sys/devices/system/node/node%d/workingset_report/refresh_interval", + nid); + return read_num(file); +} + +int sysfs_set_refresh_interval(int nid, long interval) +{ + char file[128]; + + snprintf( + file, + sizeof(file), + "/sys/devices/system/node/node%d/workingset_report/refresh_interval", + nid); + return write_num(file, interval); +} + +int sysfs_get_page_age_intervals_str(int nid, char *buf, int len) +{ + char path[128]; + + snprintf( + path, + sizeof(path), + "/sys/devices/system/node/node%d/workingset_report/page_age_intervals", + nid); + return read_text(path, buf, len); + +} + +int sysfs_set_page_age_intervals_str(int nid, const char *buf, int len) +{ + char path[128]; + + snprintf( + path, + sizeof(path), + "/sys/devices/system/node/node%d/workingset_report/page_age_intervals", + nid); + return write_text(path, buf, len); +} + +int sysfs_set_page_age_intervals(int nid, const char *intervals[], + int nr_intervals) +{ + char file[128]; + char buf[1024]; + int i; + int err, len = 0; + + for (i = 0; i < nr_intervals; ++i) { + err = snprintf(buf + len, sizeof(buf) - len, "%s", intervals[i]); + + if (err < 0) + return err; + len += err; + + if (i < nr_intervals - 1) { + err = snprintf(buf + len, sizeof(buf) - len, ","); + if (err < 0) + return err; + len += err; + } + } + + snprintf( + file, + sizeof(file), + "/sys/devices/system/node/node%d/workingset_report/page_age_intervals", + nid); + return write_text(file, buf, len); +} + +int get_nr_nodes(void) +{ + char buf[22]; + char *found; + + if (read_text(SYSFS_NODE_ONLINE, buf, sizeof(buf)) <= 0) + return -1; + found = strstr(buf, "-"); + if (found) + return (int)strtoul(found + 1, NULL, 10) + 1; + return (long)strtoul(buf, NULL, 10) + 1; +} + +int drop_pagecache(void) +{ + return write_num(PROC_DROP_CACHES, 1); +} + +ssize_t sysfs_page_age_read(int nid, char *buf, size_t len) + +{ + char file[128]; + + snprintf(file, + sizeof(file), + "/sys/devices/system/node/node%d/workingset_report/page_age", + nid); + return read_text(file, buf, len); +} + +/* + * Finds the first occurrence of "N<nid>\n" + * Modifies buf to terminate before the next occurrence of "N". 
+ * Returns a substring of buf starting after "N<nid>\n" + */ +char *page_age_split_node(char *buf, int nid, char **next) +{ + char node_str[5]; + char *found; + int node_str_len; + + node_str_len = snprintf(node_str, sizeof(node_str), "N%u\n", nid); + + /* find the node prefix first */ + found = strstr(buf, node_str); + if (!found) { + fprintf(stderr, "cannot find '%s' in page_idle_age", node_str); + return NULL; + } + found += node_str_len; + + *next = strchr(found, 'N'); + if (*next) + *(*next - 1) = '\0'; + + return found; +} + +ssize_t page_age_read(const char *buf, const char *interval, int pagetype) +{ + static const char * const type[ANON_AND_FILE] = { "anon=", "file=" }; + char *found; + + found = strstr(buf, interval); + if (!found) { + fprintf(stderr, "cannot find %s in page_age", interval); + return -1; + } + found = strstr(found, type[pagetype]); + if (!found) { + fprintf(stderr, "cannot find %s in page_age", type[pagetype]); + return -1; + } + found += strlen(type[pagetype]); + return (long)strtoul(found, NULL, 10); +} + +static const char *TEMP_FILE = "/tmp/workingset_selftest"; +void cleanup_file_workingset(void) +{ + remove(TEMP_FILE); +} + +int alloc_file_workingset(void *arg) +{ + int err = 0; + char *ptr; + int fd; + int ppid; + char *mapped; + size_t size = (size_t)arg; + size_t page_size = getpagesize(); + + ppid = getppid(); + + fd = open(TEMP_FILE, O_RDWR | O_CREAT); + if (fd < 0) { + err = -errno; + perror("failed to open temp file\n"); + goto cleanup; + } + + if (fallocate(fd, 0, 0, size) < 0) { + err = -errno; + perror("fallocate"); + goto cleanup; + } + + mapped = (char *)mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + fd, 0); + if (mapped == NULL) { + err = -errno; + perror("mmap"); + goto cleanup; + } + + while (getppid() == ppid) { + sync(); + for (ptr = mapped; ptr < mapped + size; ptr += page_size) + *ptr = *ptr ^ 0xFF; + } + +cleanup: + cleanup_file_workingset(); + return err; +} + +int alloc_anon_workingset(void *arg) +{ + char *buf, *ptr; + int ppid = getppid(); + size_t size = (size_t)arg; + size_t page_size = getpagesize(); + + buf = malloc(size); + + if (!buf) { + fprintf(stderr, "cannot allocate anon workingset"); + exit(1); + } + + while (getppid() == ppid) { + for (ptr = buf; ptr < buf + size; ptr += page_size) + *ptr = *ptr ^ 0xFF; + } + + free(buf); + return 0; +} diff --git a/tools/testing/selftests/mm/workingset_report.h b/tools/testing/selftests/mm/workingset_report.h new file mode 100644 index 000000000000..f72a931298e0 --- /dev/null +++ b/tools/testing/selftests/mm/workingset_report.h @@ -0,0 +1,37 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef WORKINGSET_REPORT_H_ +#define WORKINGSET_REPORT_H_ + +#define _GNU_SOURCE + +#include <fcntl.h> +#include <sys/stat.h> +#include <errno.h> +#include <stdint.h> +#include <sys/types.h> + +#define PAGETYPE_ANON 0 +#define PAGETYPE_FILE 1 +#define ANON_AND_FILE 2 + +int get_nr_nodes(void); +int drop_pagecache(void); + +long sysfs_get_refresh_interval(int nid); +int sysfs_set_refresh_interval(int nid, long interval); + +int sysfs_get_page_age_intervals_str(int nid, char *buf, int len); +int sysfs_set_page_age_intervals_str(int nid, const char *buf, int len); + +int sysfs_set_page_age_intervals(int nid, const char *intervals[], + int nr_intervals); + +char *page_age_split_node(char *buf, int nid, char **next); +ssize_t sysfs_page_age_read(int nid, char *buf, size_t len); +ssize_t page_age_read(const char *buf, const char *interval, int pagetype); + +int alloc_file_workingset(void *arg); +void 
cleanup_file_workingset(void); +int alloc_anon_workingset(void *arg); + +#endif /* WORKINGSET_REPORT_H_ */ diff --git a/tools/testing/selftests/mm/workingset_report_test.c b/tools/testing/selftests/mm/workingset_report_test.c new file mode 100644 index 000000000000..e6e857d8fe35 --- /dev/null +++ b/tools/testing/selftests/mm/workingset_report_test.c @@ -0,0 +1,328 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "workingset_report.h" + +#include <stdlib.h> +#include <stdio.h> +#include <signal.h> +#include <time.h> + +#include "../clone3/clone3_selftests.h" + +#define REFRESH_INTERVAL 5000 +#define MB(x) (x << 20) + +static void sleep_ms(int milliseconds) +{ + struct timespec ts; + + ts.tv_sec = milliseconds / 1000; + ts.tv_nsec = (milliseconds % 1000) * 1000000; + nanosleep(&ts, NULL); +} + +/* + * Checks if two given values differ by less than err% of their sum. + */ +static inline int values_close(long a, long b, int err) +{ + return abs(a - b) <= (a + b) / 100 * err; +} + +static const char * const PAGE_AGE_INTERVALS[] = { + "6000", "10000", "15000", "18446744073709551615", +}; +#define NR_PAGE_AGE_INTERVALS (ARRAY_SIZE(PAGE_AGE_INTERVALS)) +/* add one for the catch all last interval */ + +static int set_page_age_intervals_all_nodes(const char *intervals, int nr_nodes) +{ + int i; + + for (i = 0; i < nr_nodes; ++i) { + int err = sysfs_set_page_age_intervals_str( + i, &intervals[i * 1024], strlen(&intervals[i * 1024])); + + if (err < 0) + return err; + } + return 0; +} + +static int get_page_age_intervals_all_nodes(char *intervals, int nr_nodes) +{ + int i; + + for (i = 0; i < nr_nodes; ++i) { + int err = sysfs_get_page_age_intervals_str( + i, &intervals[i * 1024], 1024); + + if (err < 0) + return err; + } + return 0; +} + +static int set_refresh_interval_all_nodes(const long *interval, int nr_nodes) +{ + int i; + + for (i = 0; i < nr_nodes; ++i) { + int err = sysfs_set_refresh_interval(i, interval[i]); + + if (err < 0) + return err; + } + return 0; +} + +static int get_refresh_interval_all_nodes(long *interval, int nr_nodes) +{ + int i; + + for (i = 0; i < nr_nodes; ++i) { + long val = sysfs_get_refresh_interval(i); + + if (val < 0) + return val; + interval[i] = val; + } + return 0; +} + +static pid_t clone_and_run(int fn(void *arg), void *arg) +{ + pid_t pid; + + struct __clone_args args = { + .exit_signal = SIGCHLD, + }; + + pid = sys_clone3(&args, sizeof(struct __clone_args)); + + if (pid == 0) + exit(fn(arg)); + + return pid; +} + +static int read_workingset(int pagetype, int nid, + unsigned long page_age[NR_PAGE_AGE_INTERVALS]) +{ + int i, err; + char buf[4096]; + + err = sysfs_page_age_read(nid, buf, sizeof(buf)); + if (err < 0) + return err; + + for (i = 0; i < NR_PAGE_AGE_INTERVALS; ++i) { + err = page_age_read(buf, PAGE_AGE_INTERVALS[i], pagetype); + if (err < 0) + return err; + page_age[i] = err; + } + + return 0; +} + +static ssize_t read_interval_all_nodes(int pagetype, int interval) +{ + int i, err; + unsigned long page_age[NR_PAGE_AGE_INTERVALS]; + ssize_t ret = 0; + int nr_nodes = get_nr_nodes(); + + for (i = 0; i < nr_nodes; ++i) { + err = read_workingset(pagetype, i, page_age); + if (err < 0) + return err; + + ret += page_age[interval]; + } + + return ret; +} + +#define TEST_SIZE MB(500l) + +static int run_test(int f(void)) +{ + int i, err, test_result; + long *old_refresh_intervals; + long *new_refresh_intervals; + char *old_page_age_intervals; + int nr_nodes = get_nr_nodes(); + + if (nr_nodes <= 0) { + fprintf(stderr, "failed to get nr_nodes\n"); + return KSFT_FAIL; 
+ } + + old_refresh_intervals = calloc(nr_nodes, sizeof(long)); + new_refresh_intervals = calloc(nr_nodes, sizeof(long)); + old_page_age_intervals = calloc(nr_nodes, 1024); + + if (!(old_refresh_intervals && new_refresh_intervals && + old_page_age_intervals)) { + fprintf(stderr, "failed to allocate memory for intervals\n"); + return KSFT_FAIL; + } + + err = get_refresh_interval_all_nodes(old_refresh_intervals, nr_nodes); + if (err < 0) { + fprintf(stderr, "failed to read refresh interval\n"); + return KSFT_FAIL; + } + + err = get_page_age_intervals_all_nodes(old_page_age_intervals, nr_nodes); + if (err < 0) { + fprintf(stderr, "failed to read page age interval\n"); + return KSFT_FAIL; + } + + for (i = 0; i < nr_nodes; ++i) + new_refresh_intervals[i] = REFRESH_INTERVAL; + err = set_refresh_interval_all_nodes(new_refresh_intervals, nr_nodes); + if (err < 0) { + fprintf(stderr, "failed to set refresh interval\n"); + test_result = KSFT_FAIL; + goto fail; + } + + for (i = 0; i < nr_nodes; ++i) { + err = sysfs_set_page_age_intervals(i, PAGE_AGE_INTERVALS, + NR_PAGE_AGE_INTERVALS - 1); + if (err < 0) { + fprintf(stderr, "failed to set page age interval\n"); + test_result = KSFT_FAIL; + goto fail; + } + } + + sync(); + drop_pagecache(); + + test_result = f(); + +fail: + err = set_refresh_interval_all_nodes(old_refresh_intervals, nr_nodes); + if (err < 0) { + fprintf(stderr, "failed to restore refresh interval\n"); + test_result = KSFT_FAIL; + } + err = set_page_age_intervals_all_nodes(old_page_age_intervals, nr_nodes); + if (err < 0) { + fprintf(stderr, "failed to restore page age interval\n"); + test_result = KSFT_FAIL; + } + return test_result; +} + +static int test_file(void) +{ + ssize_t ws_size_ref, ws_size_test; + int ret = KSFT_FAIL, i; + pid_t pid = 0; + + ws_size_ref = read_interval_all_nodes(PAGETYPE_FILE, 0); + if (ws_size_ref < 0) + goto cleanup; + + pid = clone_and_run(alloc_file_workingset, (void *)TEST_SIZE); + if (pid < 0) + goto cleanup; + + read_interval_all_nodes(PAGETYPE_FILE, 0); + sleep_ms(REFRESH_INTERVAL); + + for (i = 0; i < 3; ++i) { + sleep_ms(REFRESH_INTERVAL); + ws_size_test = read_interval_all_nodes(PAGETYPE_FILE, 0); + + if (!values_close(ws_size_test - ws_size_ref, TEST_SIZE, 10)) { + fprintf(stderr, + "file working set size difference too large: actual=%ld, expected=%ld\n", + ws_size_test - ws_size_ref, TEST_SIZE); + goto cleanup; + } + } + ret = KSFT_PASS; + +cleanup: + if (pid > 0) + kill(pid, SIGKILL); + cleanup_file_workingset(); + return ret; +} + +static int test_anon(void) +{ + ssize_t ws_size_ref, ws_size_test; + pid_t pid = 0; + int ret = KSFT_FAIL, i; + + ws_size_ref = read_interval_all_nodes(PAGETYPE_ANON, 0); + if (ws_size_ref < 0) + goto cleanup; + + pid = clone_and_run(alloc_anon_workingset, (void *)TEST_SIZE); + if (pid < 0) + goto cleanup; + + sleep_ms(REFRESH_INTERVAL); + read_interval_all_nodes(PAGETYPE_ANON, 0); + + for (i = 0; i < 5; ++i) { + sleep_ms(REFRESH_INTERVAL); + ws_size_test = read_interval_all_nodes(PAGETYPE_ANON, 0); + if (ws_size_test < 0) + goto cleanup; + + if (!values_close(ws_size_test - ws_size_ref, TEST_SIZE, 10)) { + fprintf(stderr, + "anon working set size difference too large: actual=%ld, expected=%ld\n", + ws_size_test - ws_size_ref, TEST_SIZE); + /* goto cleanup; */ + } + } + ret = KSFT_PASS; + +cleanup: + if (pid > 0) + kill(pid, SIGKILL); + return ret; +} + + +#define T(x) { x, #x } +struct workingset_test { + int (*fn)(void); + const char *name; +} tests[] = { + T(test_anon), + T(test_file), +}; +#undef T + +int 
main(int argc, char **argv) +{ + int ret = EXIT_SUCCESS, i, err; + + for (i = 0; i < ARRAY_SIZE(tests); i++) { + err = run_test(tests[i].fn); + switch (err) { + case KSFT_PASS: + ksft_test_result_pass("%s\n", tests[i].name); + break; + case KSFT_SKIP: + ksft_test_result_skip("%s\n", tests[i].name); + break; + default: + ret = EXIT_FAILURE; + ksft_test_result_fail("%s with error %d\n", + tests[i].name, err); + break; + } + } + return ret; +}
Please add the selftest tag to the subject of selftest patches.
On 3/28/24 2:31 AM, Yuanchu Xie wrote:
A basic test that verifies the working set size of a simple memory accessor. It should work with or without the aging thread.
Question: I don't know how best to test file-backed memory in selftests. Is there a standard place where I should put the temporary file? /tmp may be tmpfs-mounted in many distros.
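(As an aside, one minimal way for the test itself to detect the tmpfs case is to check the filesystem magic with statfs(2); dir_is_tmpfs() below is only a sketch, not part of the patch:

	#include <errno.h>
	#include <sys/vfs.h>
	#include <linux/magic.h>

	static int dir_is_tmpfs(const char *dir)
	{
		struct statfs fs;

		if (statfs(dir, &fs) < 0)
			return -errno;
		return fs.f_type == TMPFS_MAGIC;
	}

The test could then skip, or fall back to another directory, when this returns 1.)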
Signed-off-by: Yuanchu Xie yuanchu@google.com
Thanks for writing most of the test in TAP compliant format. Just replace the direct prints to stderr with ksft_exit_fail_msg() instead.
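For example, the early failure paths in run_test() could become something like the following; this is only a sketch of the suggested change, and kselftest.h is already available since the test uses ksft_test_result_*():

	if (nr_nodes <= 0)
		ksft_exit_fail_msg("failed to get nr_nodes\n");

ksft_exit_fail_msg() prints the message and exits with KSFT_FAIL; for the restore paths where the test should keep running, ksft_print_msg() is the non-exiting alternative.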
On Wed, Mar 27, 2024 at 02:30:59PM -0700, Yuanchu Xie wrote:
Promotion/Demotion Similar to proactive reclaim, a workingset report enables demotion to a slower tier of memory. For promotion, the workingset report interfaces need to be extended to report hotness and gather hotness information from the devices[1].
[1] https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-...
Sysfs and Cgroup Interfaces
The interfaces are detailed in the patches that introduce them. The main idea here is we break down the workingset per-node per-memcg into time intervals (ms), e.g.
1000 anon=137368 file=24530 20000 anon=34342 file=0 30000 anon=353232 file=333608 40000 anon=407198 file=206052 9223372036854775807 anon=4925624 file=892892
I realize this does not generalize well to hotness information, but I lack the intuition for an abstraction that presents hotness in a useful way. Based on a recent proposal for move_phys_pages[2], it seems like userspace tiering software would like to move specific physical pages, instead of informing the kernel "move x number of hot pages to y device". Please advise.
[2] https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge....
Please note that this proposed interface (move_phys_pages) is very unlikely to be received upstream due to side channel concerns. Instead, it's more likely that the tiering component will expose a "promote X pages from tier A to tier B", and the kernel component would then use/consume hotness information to determine which pages to promote.
(Just as one example, there are many more realistic designs)
So if there is a way to expose workingset data to the mm/memory-tiers.c component instead of via sysfs/cgroup - that is preferable.
The 'move_phys_pages' interface is more of an experimental interface to test the effectiveness of this approach without having to plumb out the entire system. Any userland interface should definitely not be designed to generate physical address information for consumption unless it is hard-locked behind admin caps.
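(For illustration: "hard-locked behind admin caps" in practice means gating any query that hands out physical-address-bearing data with capable(CAP_SYS_ADMIN) before doing anything else. A minimal sketch, with hot_page_query() being a made-up entry point:

	static long hot_page_query(void __user *uarg)
	{
		/* Never hand physical addresses to unprivileged callers. */
		if (!capable(CAP_SYS_ADMIN))
			return -EPERM;

		/* ... gather and copy the hot-page report to uarg ... */
		return 0;
	}
)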
Regards, Gregory
On Wed, Mar 27, 2024 at 2:44 PM Gregory Price gregory.price@memverge.com wrote:
On Wed, Mar 27, 2024 at 02:30:59PM -0700, Yuanchu Xie wrote:
I realize this does not generalize well to hotness information, but I lack the intuition for an abstraction that presents hotness in a useful way. Based on a recent proposal for move_phys_pages[2], it seems like userspace tiering software would like to move specific physical pages, instead of informing the kernel "move x number of hot pages to y device". Please advise.
[2] https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge....
Please note that this proposed interface (move_phys_pages) is very unlikely to be received upstream due to side channel concerns. Instead, it's more likely that the tiering component will expose a "promote X pages from tier A to tier B", and the kernel component would then use/consume hotness information to determine which pages to promote.
I see that mm/memory-tiers.c only has support for demotion. What kind of hotness information do devices typically provide? The OCP proposal is not very specific about this. A list of hot pages with a configurable threshold? Access frequency for all pages at a configured granularity? Is there a way to tell which NUMA node is accessing them, for page promotion?
(Just as one example, there are many more realistic designs)
So if there is a way to expose workingset data to the mm/memory-tiers.c component instead of via sysfs/cgroup - that is preferable.
Appreciate the feedback. The data in its current form might be useful to inform demotion decisions, but for promotion, are you aware of any recent developments? I would like to encode hotness as workingset data as well.
The 'move_phys_pages' interface is more of an experimental interface to test the effectiveness of this approach without having to plumb out the entire system. Any userland interface should definitely not be designed to generate physical address information for consumption unless it is hard-locked behind admin caps.
Regards, Gregory
On Wed, Mar 27, 2024 at 03:53:39PM -0700, Yuanchu Xie wrote:
On Wed, Mar 27, 2024 at 2:44 PM Gregory Price gregory.price@memverge.com wrote:
Please note that this proposed interface (move_phys_pages) is very unlikely to be received upstream due to side channel concerns. Instead, it's more likely that the tiering component will expose a "promote X pages from tier A to tier B", and the kernel component would then use/consume hotness information to determine which pages to promote.
I see that mm/memory-tiers.c only has support for demotion. What kind of hotness information do devices typically provide? The OCP proposal is not very specific about this. A list of hot pages with a configurable threshold? Access frequency for all pages at a configured granularity? Is there a way to tell which NUMA node is accessing them, for page promotion?
(caveat: I'm not a memory-tiers maintainer, you may want to poke at them directly for more information; this is simply spitballing an idea)
I don't know of any public proposals for explicit hotness information provided by hardware yet, just the general proposal.
For the sake of simplicity, I would make the assumption that you have the least information possible - a simple list of "hot addresses" in Host Physical Address format.
I.e. there's some driver function that amounts to:
uint32_t device_get_hot_addresses(uint64_t *addresses, uint32_t buf_max);
Where the return value is the number of addresses the device returned, and buf_max is the number of addresses the buffer can hold.
Drivers providing this functionality would then register it as a callback when the device's memory becomes part of some NUMA node.
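A rough sketch of how that registration and its in-kernel consumer could fit together - every identifier below is hypothetical, invented only to illustrate the flow described above:

	struct hotness_source {
		int nid;	/* NUMA node backed by the device's memory */
		uint32_t (*get_hot_addresses)(uint64_t *addresses, uint32_t buf_max);
	};

	/* Driver side: register when the device memory is onlined into a node. */
	int register_hotness_source(struct hotness_source *src);

	/* Tiering side: drain a batch of hot HPAs and queue them for promotion. */
	static void promote_from_source(struct hotness_source *src)
	{
		uint64_t hpa[128];
		uint32_t i, nr = src->get_hot_addresses(hpa, ARRAY_SIZE(hpa));

		for (i = 0; i < nr; i++)
			queue_hpa_for_promotion(hpa[i], src->nid); /* hypothetical */
	}

The point being that raw physical addresses never leave the kernel: the tiering component maps them back to folios and issues the migration, while userspace only sees aggregate workingset/hotness data.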
Re: source node - Devices have no real way of determining upstream source information.
(Just as one example, there are many more realistic designs)
So if there is a way to expose workingset data to the mm/memory_tiers.c component instead of via sysfs/cgroup - that is preferable.
Appreciate the feedback. The data in its current form might be useful to inform demotion decisions, but for promotion, are you aware of any recent developments? I would like to encode hotness as workingset data as well.
There were some recent patches to DAMON about promotion/demotion. You might look there.
~Gregory