The patch titled
Subject: mm/vmscan: restore zone_reclaim_mode ABI
has been added to the -mm tree. Its filename is
mm-vmscan-restore-zone_reclaim_mode-abi.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-vmscan-restore-zone_reclaim_mo…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-vmscan-restore-zone_reclaim_mo…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included in linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Dave Hansen <dave.hansen(a)linux.intel.com>
Subject: mm/vmscan: restore zone_reclaim_mode ABI
I went to add a new RECLAIM_* mode for the zone_reclaim_mode sysctl.
Like a good kernel developer, I also went to update the documentation.
I noticed that the bits in the documentation didn't match the bits in
the #defines.
The VM never explicitly checks the RECLAIM_ZONE bit. The bit is,
however, implicitly checked when checking 'node_reclaim_mode == 0'.
The RECLAIM_ZONE #define was removed in a cleanup. That, by itself,
is fine.
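For reference, the implicit check looks roughly like this (a sketch
modeled on the page allocator's check; names approximate):

	/* RECLAIM_ZONE is never tested by name: setting any bit just
	 * makes node_reclaim_mode nonzero, which is all the VM looks
	 * at before attempting node reclaim. */
	if (node_reclaim_mode == 0 ||
	    !zone_allows_reclaim(preferred_zone, zone))
		continue;	/* skip node reclaim, allocate off-node */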
But when the bit (bit 0) was removed, the _other_ bit locations also
got changed. That's not OK, because the bit values are documented to
mean one specific thing. Users surely do not expect the meaning to
change from kernel to kernel.
The end result is that if someone had a script that did:
sysctl vm.zone_reclaim_mode=1
it would have gone from enabling node reclaim for clean unmapped pages to
writing out pages during node reclaim after the commit in question.
That's not great.
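To illustrate the shift (bit values taken from the #defines in the
diff below; the pre-cleanup layout is the documented ABI):

	/* Documented ABI, restored by this patch: */
	#define RECLAIM_ZONE	(1<<0)	/* basic node reclaim */
	#define RECLAIM_WRITE	(1<<1)	/* write out pages during reclaim */
	#define RECLAIM_UNMAP	(1<<2)	/* unmap pages during reclaim */

	/* After 648b5cf368e0, the same sysctl values silently meant: */
	#define RECLAIM_WRITE	(1<<0)	/* bit 0 became "writeout" */
	#define RECLAIM_UNMAP	(1<<1)	/* bit 1 became "unmap" */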
Put the bits back the way they were and add a comment so something like
this is a bit harder to do again. Update the documentation to make it
clear that the first bit is ignored.
Link: https://lkml.kernel.org/r/20210219172555.FF0CDF23@viggo.jf.intel.com
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Fixes: 648b5cf368e0 ("mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE")
Reviewed-by: Ben Widawsky <ben.widawsky(a)intel.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Acked-by: David Rientjes <rientjes(a)google.com>
Acked-by: Christoph Lameter <cl(a)linux.com>
Cc: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Daniel Wagner <dwagner(a)suse.de>
Cc: "Tobin C. Harding" <tobin(a)kernel.org>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Huang Ying <ying.huang(a)intel.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Qian Cai <cai(a)lca.pw>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
Documentation/admin-guide/sysctl/vm.rst | 10 +++++-----
mm/vmscan.c | 9 +++++++--
2 files changed, 12 insertions(+), 7 deletions(-)
--- a/Documentation/admin-guide/sysctl/vm.rst~mm-vmscan-restore-zone_reclaim_mode-abi
+++ a/Documentation/admin-guide/sysctl/vm.rst
@@ -983,11 +983,11 @@ that benefit from having their data cach
left disabled as the caching effect is likely to be more important than
data locality.
-zone_reclaim may be enabled if it's known that the workload is partitioned
-such that each partition fits within a NUMA node and that accessing remote
-memory would cause a measurable performance reduction. The page allocator
-will then reclaim easily reusable pages (those page cache pages that are
-currently not used) before allocating off node pages.
+Consider enabling one or more zone_reclaim mode bits if it's known that the
+workload is partitioned such that each partition fits within a NUMA node
+and that accessing remote memory would cause a measurable performance
+reduction. The page allocator will take additional actions before
+allocating off node pages.
Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
--- a/mm/vmscan.c~mm-vmscan-restore-zone_reclaim_mode-abi
+++ a/mm/vmscan.c
@@ -4085,8 +4085,13 @@ module_init(kswapd_init)
*/
int node_reclaim_mode __read_mostly;
-#define RECLAIM_WRITE (1<<0) /* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<1) /* Unmap pages during reclaim */
+/*
+ * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
+ * ABI. New bits are OK, but existing bits can never change.
+ */
+#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */
+#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
+#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
/*
* Priority for NODE_RECLAIM. This determines the fraction of pages
_
Patches currently in -mm which might be from dave.hansen(a)linux.intel.com are
mm-vmscan-restore-zone_reclaim_mode-abi.patch
From: David Sterba <dsterba(a)suse.cz>
There's a mistake in the 5.10.13 backport of upstream commit
2175bf57dc95 ("btrfs: fix possible free space tree corruption with
online conversion").
The enum value BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED has been added to
the wrong enum set, colliding with the value of BTRFS_FS_QUOTA_ENABLE.
This could cause problems during the tree conversion, where the quotas
wouldn't be set up properly but the related code would execute anyway
because the bit was set.
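An illustrative sketch of the failure mode, with simplified and
abridged enums (the real definitions are in fs/btrfs/ctree.h):

	/* Each enum independently numbers bits for a different flag
	 * word.  Placing the new value in the wrong enum gave it the
	 * same bit number as an unrelated flag from the other set: */
	enum {				/* fs_state bit numbers */
		BTRFS_FS_STATE_ERROR,
		/* ... */
		BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,	/* wrong set */
	};

	enum {				/* fs_info->flags bit numbers */
		/* ... */
		BTRFS_FS_QUOTA_ENABLE,	/* same bit number */
	};

	/* So test_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
	 * &fs_info->flags) actually observes the quota bit. */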
Link: https://lore.kernel.org/linux-btrfs/20210219111741.95DD.409509F4@e16-tech.c…
Reported-by: Wang Yugui <wangyugui(a)e16-tech.com>
CC: stable(a)vger.kernel.org # 5.10.13+
Signed-off-by: David Sterba <dsterba(a)suse.com>
---
fs/btrfs/ctree.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 30ea9780725f..b6884eda9ff6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -146,9 +146,6 @@ enum {
BTRFS_FS_STATE_DEV_REPLACING,
/* The btrfs_fs_info created for self-tests */
BTRFS_FS_STATE_DUMMY_FS_INFO,
-
- /* Indicate that we can't trust the free space tree for caching yet */
- BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
};
#define BTRFS_BACKREF_REV_MAX 256
@@ -562,6 +559,9 @@ enum {
/* Indicate that the discard workqueue can service discards. */
BTRFS_FS_DISCARD_RUNNING,
+
+ /* Indicate that we can't trust the free space tree for caching yet */
+ BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
};
/*
--
2.29.2
From: Dave Hansen <dave.hansen(a)linux.intel.com>
I went to add a new RECLAIM_* mode for the zone_reclaim_mode
sysctl. Like a good kernel developer, I also went to update the
documentation. I noticed that the bits in the documentation didn't
match the bits in the #defines.
The VM never explicitly checks the RECLAIM_ZONE bit. The bit is,
however, implicitly checked when checking 'node_reclaim_mode == 0'.
The RECLAIM_ZONE #define was removed in a cleanup. That, by itself,
is fine.
But when the bit (bit 0) was removed, the _other_ bit locations
also got changed. That's not OK, because the bit values are
documented to mean one specific thing. Users surely do not expect
the meaning to change from kernel to kernel.
The end result is that if someone had a script that did:
sysctl vm.zone_reclaim_mode=1
it would have gone from enabling node reclaim for clean unmapped
pages to writing out pages during node reclaim after the commit in
question. That's not great.
Put the bits back the way they were and add a comment so something
like this is a bit harder to do again. Update the documentation to
make it clear that the first bit is ignored.
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Fixes: 648b5cf368e0 ("mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE")
Reviewed-by: Ben Widawsky <ben.widawsky(a)intel.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Acked-by: David Rientjes <rientjes(a)google.com>
Acked-by: Christoph Lameter <cl(a)linux.com>
Cc: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Daniel Wagner <dwagner(a)suse.de>
Cc: "Tobin C. Harding" <tobin(a)kernel.org>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Huang Ying <ying.huang(a)intel.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Qian Cai <cai(a)lca.pw>
Cc: stable(a)vger.kernel.org
--
Changes from v2:
* Update description to indicate that bit 0 was used for clean
  unmapped page node reclaim.
---
b/Documentation/admin-guide/sysctl/vm.rst | 10 +++++-----
b/mm/vmscan.c | 9 +++++++--
2 files changed, 12 insertions(+), 7 deletions(-)
diff -puN Documentation/admin-guide/sysctl/vm.rst~mm-vmscan-restore-old-zone_reclaim_mode-abi Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~mm-vmscan-restore-old-zone_reclaim_mode-abi 2021-02-19 09:25:26.656663105 -0800
+++ b/Documentation/admin-guide/sysctl/vm.rst 2021-02-19 09:25:26.662663105 -0800
@@ -983,11 +983,11 @@ that benefit from having their data cach
left disabled as the caching effect is likely to be more important than
data locality.
-zone_reclaim may be enabled if it's known that the workload is partitioned
-such that each partition fits within a NUMA node and that accessing remote
-memory would cause a measurable performance reduction. The page allocator
-will then reclaim easily reusable pages (those page cache pages that are
-currently not used) before allocating off node pages.
+Consider enabling one or more zone_reclaim mode bits if it's known that the
+workload is partitioned such that each partition fits within a NUMA node
+and that accessing remote memory would cause a measurable performance
+reduction. The page allocator will take additional actions before
+allocating off node pages.
Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
diff -puN mm/vmscan.c~mm-vmscan-restore-old-zone_reclaim_mode-abi mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-restore-old-zone_reclaim_mode-abi 2021-02-19 09:25:26.658663105 -0800
+++ b/mm/vmscan.c 2021-02-19 09:25:26.665663105 -0800
@@ -4095,8 +4095,13 @@ module_init(kswapd_init)
*/
int node_reclaim_mode __read_mostly;
-#define RECLAIM_WRITE (1<<0) /* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<1) /* Unmap pages during reclaim */
+/*
+ * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
+ * ABI. New bits are OK, but existing bits can never change.
+ */
+#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */
+#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
+#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
/*
* Priority for NODE_RECLAIM. This determines the fraction of pages
_
> On Fri, Feb 19, 2021 at 02:43:34PM +0800, Wen Yang wrote:
>> From: Peter Zijlstra <peterz(a)infradead.org>
>>
>> commit 19dbdcb8039cff16669a05136a29180778d16d0a upstream.
>>
>> It's clearly documented that smp function calls cannot be invoked from
>> softirq handling context. Unfortunately nothing enforces that or emits a
>> warning.
>>
>> A single function call can be invoked from softirq context only via
>> smp_call_function_single_async().
>>
>> The only legit context is task context, so add a warning to that effect.
>>
>> Reported-by: luferry <luferry(a)163.com>
>> Signed-off-by: Peter Zijlstra <peterz(a)infradead.org>
>> Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
>> Link: https://lkml.kernel.org/r/20190718160601.GP3402@hirez.programming.kicks-ass…
>> Cc: stable <stable(a)vger.kernel.org> # 4.9.x
>> Signed-off-by: Wen Yang <simon.wy(a)alibaba-inc.com>
>> ---
>> kernel/smp.c | 16 ++++++++++++++++
>> 1 file changed, 16 insertions(+)
>>
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index 399905f..f2b29c4 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -276,6 +276,14 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
>> WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
>> && !oops_in_progress);
>>
>> + /*
>> + * When @wait we can deadlock when we interrupt between llist_add() and
>> + * arch_send_call_function_ipi*(); when !@wait we can deadlock due to
>> + * csd_lock() on because the interrupt context uses the same csd
>> + * storage.
>> + */
>> + WARN_ON_ONCE(!in_task());
>> +
>> csd = &csd_stack;
>> if (!wait) {
>> csd = this_cpu_ptr(&csd_data);
>> @@ -401,6 +409,14 @@ void smp_call_function_many(const struct cpumask *mask,
>> WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
>> && !oops_in_progress && !early_boot_irqs_disabled);
>>
>> + /*
>> + * When @wait we can deadlock when we interrupt between llist_add() and
>> + * arch_send_call_function_ipi*(); when !@wait we can deadlock due to
>> + * csd_lock() on because the interrupt context uses the same csd
>> + * storage.
>> + */
>> + WARN_ON_ONCE(!in_task());
>> +
>> /* Try to fastpath. So, what's a CPU they want? Ignoring this one. */
>> cpu = cpumask_first_and(mask, cpu_online_mask);
>> if (cpu == this_cpu)
>> --
>> 1.8.3.1
>>
>
> Why do you want this in the 4.9.y kernel tree only, and not all others?
> What bug/problem does this fix? It seems that it will only report
> problems that other code has, not fix existing code. If anything, it's
> going to start causing machines to reboot that have "panic on warn" set,
> is that a good thing to do?
4.9, 4.14 and 4.19 all need it.
We have found that some third-party kernel modules occasionally cause
kernel panics (such as watchdog resets). Further analysis showed that
functions such as smp_call_function()/on_each_cpu() were being called
from interrupt or softirq context.
Since these usages are illegal and cannot be prevented outright, we
should add a warning to enhance the robustness of the stable kernels
and to make such problems easier to diagnose.
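For reference, a minimal sketch of the kind of (hypothetical) module
code the new warning catches -- an smp cross call issued from softirq
context, here via a tasklet:

	#include <linux/smp.h>
	#include <linux/interrupt.h>

	static void flush_func(void *info)
	{
		/* per-cpu work */
	}

	/* Tasklets run in softirq context, so in_task() is false here:
	 * the new WARN_ON_ONCE(!in_task()) fires, and without it this
	 * usage can deadlock on the shared csd storage. */
	static void bad_tasklet_fn(unsigned long data)
	{
		on_each_cpu(flush_func, NULL, 1);
	}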
thanks,
Wen
From: Joerg Roedel <jroedel(a)suse.de>
The code in the NMI handler to adjust the #VC handler IST stack is
needed in case an NMI hits when the #VC handler is still using its IST
stack.
But the check for this condition also needs to verify that the
regs->sp value is trusted, meaning it was not set by user space.
Extend the check so that regs->sp is not used when the NMI
interrupted user-space code or the SYSCALL gap.
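For clarity, the fixed check from the diff below, annotated (a sketch;
see __sev_es_ist_enter() for the real context):

	if (on_vc_stack(regs->sp) &&	/* sp may be user-chosen ... */
	    !user_mode(regs) &&		/* ... unless the NMI hit    */
	    !from_syscall_gap(regs))	/* kernel code past the gap  */
		new_ist = ALIGN_DOWN(regs->sp, 8) - sizeof(old_ist);
	else
		new_ist = old_ist - sizeof(old_ist);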
Reported-by: Andy Lutomirski <luto(a)kernel.org>
Fixes: 315562c9af3d5 ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
Cc: stable(a)vger.kernel.org # 5.10+
Signed-off-by: Joerg Roedel <jroedel(a)suse.de>
---
arch/x86/kernel/sev-es.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 84c1821819af..0df38b185d53 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -144,7 +144,9 @@ void noinstr __sev_es_ist_enter(struct pt_regs *regs)
old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
/* Make room on the IST stack */
- if (on_vc_stack(regs->sp))
+ if (on_vc_stack(regs->sp) &&
+ !user_mode(regs) &&
+ !from_syscall_gap(regs))
new_ist = ALIGN_DOWN(regs->sp, 8) - sizeof(old_ist);
else
new_ist = old_ist - sizeof(old_ist);
--
2.30.0
From: Peter Zijlstra <peterz(a)infradead.org>
commit 19dbdcb8039cff16669a05136a29180778d16d0a upstream.
It's clearly documented that smp function calls cannot be invoked from
softirq handling context. Unfortunately nothing enforces that or emits a
warning.
A single function call can be invoked from softirq context only via
smp_call_function_single_async().
The only legit context is task context, so add a warning to that effect.
Reported-by: luferry <luferry(a)163.com>
Signed-off-by: Peter Zijlstra <peterz(a)infradead.org>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Link: https://lkml.kernel.org/r/20190718160601.GP3402@hirez.programming.kicks-ass…
Cc: stable <stable(a)vger.kernel.org> # 4.9.x
Signed-off-by: Wen Yang <simon.wy(a)alibaba-inc.com>
---
kernel/smp.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/kernel/smp.c b/kernel/smp.c
index 399905f..f2b29c4 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -276,6 +276,14 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
&& !oops_in_progress);
+ /*
+ * When @wait we can deadlock when we interrupt between llist_add() and
+ * arch_send_call_function_ipi*(); when !@wait we can deadlock due to
+ * csd_lock() on because the interrupt context uses the same csd
+ * storage.
+ */
+ WARN_ON_ONCE(!in_task());
+
csd = &csd_stack;
if (!wait) {
csd = this_cpu_ptr(&csd_data);
@@ -401,6 +409,14 @@ void smp_call_function_many(const struct cpumask *mask,
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
&& !oops_in_progress && !early_boot_irqs_disabled);
+ /*
+ * When @wait we can deadlock when we interrupt between llist_add() and
+ * arch_send_call_function_ipi*(); when !@wait we can deadlock due to
+ * csd_lock() on because the interrupt context uses the same csd
+ * storage.
+ */
+ WARN_ON_ONCE(!in_task());
+
/* Try to fastpath. So, what's a CPU they want? Ignoring this one. */
cpu = cpumask_first_and(mask, cpu_online_mask);
if (cpu == this_cpu)
--
1.8.3.1
From: Peter Zijlstra <peterz(a)infradead.org>
commit 19dbdcb8039cff16669a05136a29180778d16d0a upstream.
It's clearly documented that smp function calls cannot be invoked from
softirq handling context. Unfortunately nothing enforces that or emits a
warning.
A single function call can be invoked from softirq context only via
smp_call_function_single_async().
The only legit context is task context, so add a warning to that effect.
Reported-by: luferry <luferry(a)163.com>
Signed-off-by: Peter Zijlstra <peterz(a)infradead.org>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Link: https://lkml.kernel.org/r/20190718160601.GP3402@hirez.programming.kicks-ass…
Cc: stable <stable(a)vger.kernel.org> # 4.19.x
Signed-off-by: Wen Yang <simon.wy(a)alibaba-inc.com>
---
kernel/smp.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/kernel/smp.c b/kernel/smp.c
index 084c8b3..9afcbb4 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -290,6 +290,14 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
&& !oops_in_progress);
+ /*
+ * When @wait we can deadlock when we interrupt between llist_add() and
+ * arch_send_call_function_ipi*(); when !@wait we can deadlock due to
+ * csd_lock() on because the interrupt context uses the same csd
+ * storage.
+ */
+ WARN_ON_ONCE(!in_task());
+
csd = &csd_stack;
if (!wait) {
csd = this_cpu_ptr(&csd_data);
@@ -415,6 +423,14 @@ void smp_call_function_many(const struct cpumask *mask,
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
&& !oops_in_progress && !early_boot_irqs_disabled);
+ /*
+ * When @wait we can deadlock when we interrupt between llist_add() and
+ * arch_send_call_function_ipi*(); when !@wait we can deadlock due to
+ * csd_lock() on because the interrupt context uses the same csd
+ * storage.
+ */
+ WARN_ON_ONCE(!in_task());
+
/* Try to fastpath. So, what's a CPU they want? Ignoring this one. */
cpu = cpumask_first_and(mask, cpu_online_mask);
if (cpu == this_cpu)
--
1.8.3.1
From: Peter Zijlstra <peterz(a)infradead.org>
commit 19dbdcb8039cff16669a05136a29180778d16d0a upstream.
It's clearly documented that smp function calls cannot be invoked from
softirq handling context. Unfortunately nothing enforces that or emits a
warning.
A single function call can be invoked from softirq context only via
smp_call_function_single_async().
The only legit context is task context, so add a warning to that effect.
Reported-by: luferry <luferry(a)163.com>
Signed-off-by: Peter Zijlstra <peterz(a)infradead.org>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Link: https://lkml.kernel.org/r/20190718160601.GP3402@hirez.programming.kicks-ass…
Cc: stable <stable(a)vger.kernel.org> # 4.14.x
Signed-off-by: Wen Yang <simon.wy(a)alibaba-inc.com>
---
kernel/smp.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/kernel/smp.c b/kernel/smp.c
index c94dd85..7d00d3e 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -290,6 +290,14 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
&& !oops_in_progress);
+ /*
+ * When @wait we can deadlock when we interrupt between llist_add() and
+ * arch_send_call_function_ipi*(); when !@wait we can deadlock due to
+ * csd_lock() on because the interrupt context uses the same csd
+ * storage.
+ */
+ WARN_ON_ONCE(!in_task());
+
csd = &csd_stack;
if (!wait) {
csd = this_cpu_ptr(&csd_data);
@@ -415,6 +423,14 @@ void smp_call_function_many(const struct cpumask *mask,
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
&& !oops_in_progress && !early_boot_irqs_disabled);
+ /*
+ * When @wait we can deadlock when we interrupt between llist_add() and
+ * arch_send_call_function_ipi*(); when !@wait we can deadlock due to
+ * csd_lock() on because the interrupt context uses the same csd
+ * storage.
+ */
+ WARN_ON_ONCE(!in_task());
+
/* Try to fastpath. So, what's a CPU they want? Ignoring this one. */
cpu = cpumask_first_and(mask, cpu_online_mask);
if (cpu == this_cpu)
--
1.8.3.1