From: Michael Schaller <misch(a)google.com>
commit 336af6a4686d885a067ecea8c3c3dd129ba4fc75 upstream
Without this patch efivarfs_alloc_dentry creates dentries with slashes in
their name if the respective EFI variable has slashes in its name. This in
turn causes EIO on getdents64, which prevents a complete directory listing
of /sys/firmware/efi/efivars/.
This patch replaces the invalid slashes with exclamation marks like
kobject_set_name_vargs does for /sys/firmware/efi/vars/ to have consistently
named dentries under /sys/firmware/efi/vars/ and /sys/firmware/efi/efivars/.
Signed-off-by: Michael Schaller <misch(a)google.com>
Link: https://lore.kernel.org/r/20200925074502.150448-1-misch@google.com
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
---
This addresses an issue that breaks Ubuntu installs on platforms that have
variable names as described above. One of our installers needs to sort
the BootOrder to keep BootCurrent at the front, but this fails when the
variable that BootCurrent points at appears not to exist due to this issue.
Ref: https://bugs.launchpad.net/bugs/1899993
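For illustration, a minimal userspace sketch (not part of the patch) of what
the kernel helper strreplace() does to such a variable name; the name below is
made up:

    #include <stdio.h>

    /* userspace stand-in for the kernel's strreplace(s, old, new) */
    static char *strreplace(char *s, char old, char new)
    {
            for (; *s; s++)
                    if (*s == old)
                            *s = new;
            return s;
    }

    int main(void)
    {
            /* made-up variable name containing a slash */
            char name[] = "dp/out-8be4df61-93ca-11d2-aa0d-00e098032b8c";

            strreplace(name, '/', '!');
            printf("%s\n", name); /* dp!out-...: now a valid dentry name */
            return 0;
    }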
fs/efivarfs/super.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
index 28bb5689333a..15880a68faad 100644
--- a/fs/efivarfs/super.c
+++ b/fs/efivarfs/super.c
@@ -141,6 +141,9 @@ static int efivarfs_callback(efi_char16_t *name16, efi_guid_t vendor,
name[len + EFI_VARIABLE_GUID_LEN+1] = '\0';
+ /* replace invalid slashes like kobject_set_name_vargs does for /sys/firmware/efi/vars. */
+ strreplace(name, '/', '!');
+
inode = efivarfs_get_inode(sb, d_inode(root), S_IFREG | 0644, 0,
is_removable);
if (!inode)
--
2.29.1
Hi-
This commit from v4.4.241 breaks the v4.4.y build for powerpc:
217f139551c0 powerpc/powernv/dump: Fix race while processing OPAL dump
Like this:
.../arch/powerpc/platforms/powernv/opal-dump.c:409:7: error: void value not ignored as it ought to be
dump = create_dump_obj(dump_id, dump_size, dump_type);
^
The commit description says:
"... the return value of create_dump_obj() function isn't being used today ..."
But that's only true for kernels >= v4.19, because they carry this commit:
b29336c0e178 powerpc/powernv/opal-dump : Use IRQ_HANDLED instead of numbers in interrupt handler
In v4.4 process_dump(), the only caller of create_dump_obj(), still tries to
use the return value (see the error above).
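In other words, the mismatch looks roughly like this (a paraphrase for
illustration, not the actual v4.4 source and not compilable as-is):

    /* after the backported patch the function becomes: */
    static void create_dump_obj(uint32_t id, uint32_t size, uint32_t type);

    /* while the v4.4 process_dump() still does: */
    dump = create_dump_obj(dump_id, dump_size, dump_type);
    /* -> error: void value not ignored as it ought to be */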
Applying "b29336c0e178 powerpc/powernv/opal-dump : Use IRQ_HANDLED ..." to
v4.4.y fixes the problem.
-Kamal
This is a note to let you know that I've just added the patch titled
coresight: cti: Initialize dynamic sysfs attributes
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From 80624263fa289b3416f7ca309491f1b75e579477 Mon Sep 17 00:00:00 2001
From: Suzuki K Poulose <suzuki.poulose(a)arm.com>
Date: Thu, 29 Oct 2020 10:45:58 -0600
Subject: coresight: cti: Initialize dynamic sysfs attributes
With LOCKDEP enabled, the CTI driver triggers the following splat due
to an uninitialized lock class for dynamically allocated attribute
objects.
[ 5.372901] coresight etm0: CPU0: ETM v4.0 initialized
[ 5.376694] coresight etm1: CPU1: ETM v4.0 initialized
[ 5.380785] coresight etm2: CPU2: ETM v4.0 initialized
[ 5.385851] coresight etm3: CPU3: ETM v4.0 initialized
[ 5.389808] BUG: key ffff00000564a798 has not been registered!
[ 5.392456] ------------[ cut here ]------------
[ 5.398195] DEBUG_LOCKS_WARN_ON(1)
[ 5.398233] WARNING: CPU: 1 PID: 32 at kernel/locking/lockdep.c:4623 lockdep_init_map_waits+0x14c/0x260
[ 5.406149] Modules linked in:
[ 5.415411] CPU: 1 PID: 32 Comm: kworker/1:1 Not tainted 5.9.0-12034-gbbe85027ce80 #51
[ 5.418553] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
[ 5.426453] Workqueue: events amba_deferred_retry_func
[ 5.433299] pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
[ 5.438252] pc : lockdep_init_map_waits+0x14c/0x260
[ 5.444410] lr : lockdep_init_map_waits+0x14c/0x260
[ 5.449007] sp : ffff800012bbb720
...
[ 5.531561] Call trace:
[ 5.536847] lockdep_init_map_waits+0x14c/0x260
[ 5.539027] __kernfs_create_file+0xa8/0x1c8
[ 5.543539] sysfs_add_file_mode_ns+0xd0/0x208
[ 5.548054] internal_create_group+0x118/0x3c8
[ 5.552307] internal_create_groups+0x58/0xb8
[ 5.556733] sysfs_create_groups+0x2c/0x38
[ 5.561160] device_add+0x2d8/0x768
[ 5.565148] device_register+0x28/0x38
[ 5.568537] coresight_register+0xf8/0x320
[ 5.572358] cti_probe+0x1b0/0x3f0
...
Fix this by initializing the attributes when they are allocated.
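The general pattern for dynamically allocated sysfs attributes looks roughly
like this (a sketch under simplified assumptions; cti_con_attr_show is a
hypothetical show routine, not the driver's actual one):

    struct dev_ext_attribute *eattr;

    eattr = devm_kzalloc(dev, sizeof(*eattr), GFP_KERNEL);
    if (!eattr)
            return -ENOMEM;

    eattr->attr.attr.name = devm_kstrdup(dev, name, GFP_KERNEL);
    eattr->attr.attr.mode = 0444;
    eattr->attr.show = cti_con_attr_show;   /* hypothetical */
    eattr->var = con;

    /* registers the lockdep class that __kernfs_create_file() checks;
     * without this LOCKDEP reports "key ... has not been registered!" */
    sysfs_attr_init(&eattr->attr.attr);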
Fixes: 3c5597e39812 ("coresight: cti: Add connection information to sysfs")
Reported-by: Leo Yan <leo.yan(a)linaro.org>
Tested-by: Leo Yan <leo.yan(a)linaro.org>
Cc: Mike Leach <mike.leach(a)linaro.org>
Cc: Mathieu Poirier <mathieu.poirier(a)linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose(a)arm.com>
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Mathieu Poirier <mathieu.poirier(a)linaro.org>
Link: https://lore.kernel.org/r/20201029164559.1268531-2-mathieu.poirier@linaro.o…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/hwtracing/coresight/coresight-cti-sysfs.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/hwtracing/coresight/coresight-cti-sysfs.c b/drivers/hwtracing/coresight/coresight-cti-sysfs.c
index 392757f3a019..7ff7e7780bbf 100644
--- a/drivers/hwtracing/coresight/coresight-cti-sysfs.c
+++ b/drivers/hwtracing/coresight/coresight-cti-sysfs.c
@@ -1065,6 +1065,13 @@ static int cti_create_con_sysfs_attr(struct device *dev,
}
eattr->var = con;
con->con_attrs[attr_idx] = &eattr->attr.attr;
+ /*
+ * Initialize the dynamically allocated attribute
+ * to avoid LOCKDEP splat. See include/linux/sysfs.h
+ * for more details.
+ */
+ sysfs_attr_init(con->con_attrs[attr_idx]);
+
return 0;
}
--
2.29.1
Hi.
On Mon, Oct 26, 2020 at 04:13:26PM -0700, Roman Gushchin <guro(a)fb.com> wrote:
> Please note, that in the non-hierarchical mode all objcgs are always
> reparented to the root memory cgroup, even if the hierarchy has more
> than 1 level. This patch doesn't change it.
>
> The patch also doesn't affect how the hierarchical mode is working,
> which is the only sane and truly supported mode now.
I agree with the patch and you can add
Reviewed-by: Michal Koutný <mkoutny(a)suse.com>
However, it effectively switches any users of root.use_hierarchy=0 (if there
are any who watch the counters of the root memcg) over to root.use_hierarchy=1
behaviour. So I'd show them the warning even with a single level of cgroups,
i.e. add this hunk:
@@ -5356,12 +5356,11 @@
page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
/*
- * Deeper hierachy with use_hierarchy == false doesn't make
+ * Hierachy with use_hierarchy == false doesn't make
* much sense so let cgroup subsystem know about this
* unfortunate state in our controller.
*/
- if (parent != root_mem_cgroup)
- memory_cgrp_subsys.broken_hierarchy = true;
+ memory_cgrp_subsys.broken_hierarchy = true;
}
/* The following stuff does not apply to the root */
What do you think?
Michal
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5da8e4a658109e3b7e1f45ae672b7c06ac3e7158 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams(a)intel.com>
Date: Mon, 5 Oct 2020 20:40:25 -0700
Subject: [PATCH] x86/copy_mc: Introduce copy_mc_enhanced_fast_string()
The motivations to go rework memcpy_mcsafe() are that the benefit of
doing slow and careful copies is obviated on newer CPUs, and that the
current opt-in list of CPUs to instrument recovery is broken relative to
those CPUs. There is no need to keep an opt-in list up to date on an
ongoing basis if pmem/dax operations are instrumented for recovery by
default. With recovery enabled by default the old "mcsafe_key" opt-in to
careful copying can be made a "fragile" opt-out. Where the "fragile"
list takes steps to not consume poison across cachelines.
The discussion with Linus made clear that the current "_mcsafe" suffix
was imprecise to a fault. The operations that are needed by pmem/dax are
to copy from a source address that might throw #MC to a destination that
may write-fault, if it is a user page.
So copy_to_user_mcsafe() becomes copy_mc_to_user() to indicate
the separate precautions taken on source and destination.
copy_mc_to_kernel() is introduced as a non-SMAP version that does not
expect write-faults on the destination, but is still prepared to abort
with an error code upon taking #MC.
The original copy_mc_fragile() implementation had negative performance
implications since it did not use the fast-string instruction sequence
to perform copies. For this reason copy_mc_to_kernel() fell back to
plain memcpy() to preserve performance on platforms that did not indicate
the capability to recover from machine check exceptions. However, that
capability detection was not architectural and now that some platforms
can recover from fast-string consumption of memory errors the memcpy()
fallback now causes these more capable platforms to fail.
Introduce copy_mc_enhanced_fast_string() as the fast default
implementation of copy_mc_to_kernel() and finalize the transition of
copy_mc_fragile() to be a platform quirk to indicate 'copy-carefully'.
With this in place, copy_mc_to_kernel() is fast and recovery-ready by
default regardless of hardware capability.
Thanks to Vivek for identifying that copy_user_generic() is not suitable
as the copy_mc_to_user() backend since the #MC handler explicitly checks
ex_has_fault_handler(). Thanks to the 0day robot for catching a
performance bug in the x86/copy_mc_to_user implementation.
[ bp: Add the "why" for this change from the 0/2th message, massage. ]
Fixes: 92b0729c34ca ("x86/mm, x86/mce: Add memcpy_mcsafe()")
Reported-by: Erwin Tsaur <erwin.tsaur(a)intel.com>
Reported-by: 0day robot <lkp(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Reviewed-by: Tony Luck <tony.luck(a)intel.com>
Tested-by: Erwin Tsaur <erwin.tsaur(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/160195562556.2163339.18063423034951948973.stgit@d…
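For callers the contract stays the same; a hedged usage sketch (buffer names
made up) of how the return value is interpreted:

    unsigned long rem;

    /* returns the number of bytes NOT copied if a machine check was
     * taken while reading the source */
    rem = copy_mc_to_kernel(dst, src, len);
    if (rem)
            return -EIO;    /* only the first len - rem bytes are valid */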
diff --git a/arch/x86/lib/copy_mc.c b/arch/x86/lib/copy_mc.c
index 2633635530b7..c13e8c9ee926 100644
--- a/arch/x86/lib/copy_mc.c
+++ b/arch/x86/lib/copy_mc.c
@@ -45,6 +45,8 @@ void enable_copy_mc_fragile(void)
#define copy_mc_fragile_enabled (0)
#endif
+unsigned long copy_mc_enhanced_fast_string(void *dst, const void *src, unsigned len);
+
/**
* copy_mc_to_kernel - memory copy that handles source exceptions
*
@@ -52,9 +54,11 @@ void enable_copy_mc_fragile(void)
* @src: source address
* @len: number of bytes to copy
*
- * Call into the 'fragile' version on systems that have trouble
- * actually do machine check recovery. Everyone else can just
- * use memcpy().
+ * Call into the 'fragile' version on systems that benefit from avoiding
+ * corner case poison consumption scenarios, For example, accessing
+ * poison across 2 cachelines with a single instruction. Almost all
+ * other uses case can use copy_mc_enhanced_fast_string() for a fast
+ * recoverable copy, or fallback to plain memcpy.
*
* Return 0 for success, or number of bytes not copied if there was an
* exception.
@@ -63,6 +67,8 @@ unsigned long __must_check copy_mc_to_kernel(void *dst, const void *src, unsigne
{
if (copy_mc_fragile_enabled)
return copy_mc_fragile(dst, src, len);
+ if (static_cpu_has(X86_FEATURE_ERMS))
+ return copy_mc_enhanced_fast_string(dst, src, len);
memcpy(dst, src, len);
return 0;
}
@@ -72,11 +78,19 @@ unsigned long __must_check copy_mc_to_user(void *dst, const void *src, unsigned
{
unsigned long ret;
- if (!copy_mc_fragile_enabled)
- return copy_user_generic(dst, src, len);
+ if (copy_mc_fragile_enabled) {
+ __uaccess_begin();
+ ret = copy_mc_fragile(dst, src, len);
+ __uaccess_end();
+ return ret;
+ }
+
+ if (static_cpu_has(X86_FEATURE_ERMS)) {
+ __uaccess_begin();
+ ret = copy_mc_enhanced_fast_string(dst, src, len);
+ __uaccess_end();
+ return ret;
+ }
- __uaccess_begin();
- ret = copy_mc_fragile(dst, src, len);
- __uaccess_end();
- return ret;
+ return copy_user_generic(dst, src, len);
}
diff --git a/arch/x86/lib/copy_mc_64.S b/arch/x86/lib/copy_mc_64.S
index c3b613c4544a..892d8915f609 100644
--- a/arch/x86/lib/copy_mc_64.S
+++ b/arch/x86/lib/copy_mc_64.S
@@ -124,4 +124,40 @@ EXPORT_SYMBOL_GPL(copy_mc_fragile)
_ASM_EXTABLE(.L_write_words, .E_write_words)
_ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes)
#endif /* CONFIG_X86_MCE */
+
+/*
+ * copy_mc_enhanced_fast_string - memory copy with exception handling
+ *
+ * Fast string copy + fault / exception handling. If the CPU does
+ * support machine check exception recovery, but does not support
+ * recovering from fast-string exceptions then this CPU needs to be
+ * added to the copy_mc_fragile_key set of quirks. Otherwise, absent any
+ * machine check recovery support this version should be no slower than
+ * standard memcpy.
+ */
+SYM_FUNC_START(copy_mc_enhanced_fast_string)
+ movq %rdi, %rax
+ movq %rdx, %rcx
+.L_copy:
+ rep movsb
+ /* Copy successful. Return zero */
+ xorl %eax, %eax
+ ret
+SYM_FUNC_END(copy_mc_enhanced_fast_string)
+
+ .section .fixup, "ax"
+.E_copy:
+ /*
+ * On fault %rcx is updated such that the copy instruction could
+ * optionally be restarted at the fault position, i.e. it
+ * contains 'bytes remaining'. A non-zero return indicates error
+ * to copy_mc_generic() users, or indicate short transfers to
+ * user-copy routines.
+ */
+ movq %rcx, %rax
+ ret
+
+ .previous
+
+ _ASM_EXTABLE_FAULT(.L_copy, .E_copy)
#endif /* !CONFIG_UML */
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 893f021fec63..b3e4efcf7ca6 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -550,6 +550,7 @@ static const char *uaccess_safe_builtin[] = {
"csum_partial_copy_generic",
"copy_mc_fragile",
"copy_mc_fragile_handle_tail",
+ "copy_mc_enhanced_fast_string",
"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
NULL
};
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5da8e4a658109e3b7e1f45ae672b7c06ac3e7158 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams(a)intel.com>
Date: Mon, 5 Oct 2020 20:40:25 -0700
Subject: [PATCH] x86/copy_mc: Introduce copy_mc_enhanced_fast_string()
The motivations to go rework memcpy_mcsafe() are that the benefit of
doing slow and careful copies is obviated on newer CPUs, and that the
current opt-in list of CPUs to instrument recovery is broken relative to
those CPUs. There is no need to keep an opt-in list up to date on an
ongoing basis if pmem/dax operations are instrumented for recovery by
default. With recovery enabled by default the old "mcsafe_key" opt-in to
careful copying can be made a "fragile" opt-out. Where the "fragile"
list takes steps to not consume poison across cachelines.
The discussion with Linus made clear that the current "_mcsafe" suffix
was imprecise to a fault. The operations that are needed by pmem/dax are
to copy from a source address that might throw #MC to a destination that
may write-fault, if it is a user page.
So copy_to_user_mcsafe() becomes copy_mc_to_user() to indicate
the separate precautions taken on source and destination.
copy_mc_to_kernel() is introduced as a non-SMAP version that does not
expect write-faults on the destination, but is still prepared to abort
with an error code upon taking #MC.
The original copy_mc_fragile() implementation had negative performance
implications since it did not use the fast-string instruction sequence
to perform copies. For this reason copy_mc_to_kernel() fell back to
plain memcpy() to preserve performance on platforms that did not indicate
the capability to recover from machine check exceptions. However, that
capability detection was not architectural and now that some platforms
can recover from fast-string consumption of memory errors the memcpy()
fallback now causes these more capable platforms to fail.
Introduce copy_mc_enhanced_fast_string() as the fast default
implementation of copy_mc_to_kernel() and finalize the transition of
copy_mc_fragile() to be a platform quirk to indicate 'copy-carefully'.
With this in place, copy_mc_to_kernel() is fast and recovery-ready by
default regardless of hardware capability.
Thanks to Vivek for identifying that copy_user_generic() is not suitable
as the copy_mc_to_user() backend since the #MC handler explicitly checks
ex_has_fault_handler(). Thanks to the 0day robot for catching a
performance bug in the x86/copy_mc_to_user implementation.
[ bp: Add the "why" for this change from the 0/2th message, massage. ]
Fixes: 92b0729c34ca ("x86/mm, x86/mce: Add memcpy_mcsafe()")
Reported-by: Erwin Tsaur <erwin.tsaur(a)intel.com>
Reported-by: 0day robot <lkp(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Reviewed-by: Tony Luck <tony.luck(a)intel.com>
Tested-by: Erwin Tsaur <erwin.tsaur(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/160195562556.2163339.18063423034951948973.stgit@d…
diff --git a/arch/x86/lib/copy_mc.c b/arch/x86/lib/copy_mc.c
index 2633635530b7..c13e8c9ee926 100644
--- a/arch/x86/lib/copy_mc.c
+++ b/arch/x86/lib/copy_mc.c
@@ -45,6 +45,8 @@ void enable_copy_mc_fragile(void)
#define copy_mc_fragile_enabled (0)
#endif
+unsigned long copy_mc_enhanced_fast_string(void *dst, const void *src, unsigned len);
+
/**
* copy_mc_to_kernel - memory copy that handles source exceptions
*
@@ -52,9 +54,11 @@ void enable_copy_mc_fragile(void)
* @src: source address
* @len: number of bytes to copy
*
- * Call into the 'fragile' version on systems that have trouble
- * actually do machine check recovery. Everyone else can just
- * use memcpy().
+ * Call into the 'fragile' version on systems that benefit from avoiding
+ * corner case poison consumption scenarios, For example, accessing
+ * poison across 2 cachelines with a single instruction. Almost all
+ * other uses case can use copy_mc_enhanced_fast_string() for a fast
+ * recoverable copy, or fallback to plain memcpy.
*
* Return 0 for success, or number of bytes not copied if there was an
* exception.
@@ -63,6 +67,8 @@ unsigned long __must_check copy_mc_to_kernel(void *dst, const void *src, unsigne
{
if (copy_mc_fragile_enabled)
return copy_mc_fragile(dst, src, len);
+ if (static_cpu_has(X86_FEATURE_ERMS))
+ return copy_mc_enhanced_fast_string(dst, src, len);
memcpy(dst, src, len);
return 0;
}
@@ -72,11 +78,19 @@ unsigned long __must_check copy_mc_to_user(void *dst, const void *src, unsigned
{
unsigned long ret;
- if (!copy_mc_fragile_enabled)
- return copy_user_generic(dst, src, len);
+ if (copy_mc_fragile_enabled) {
+ __uaccess_begin();
+ ret = copy_mc_fragile(dst, src, len);
+ __uaccess_end();
+ return ret;
+ }
+
+ if (static_cpu_has(X86_FEATURE_ERMS)) {
+ __uaccess_begin();
+ ret = copy_mc_enhanced_fast_string(dst, src, len);
+ __uaccess_end();
+ return ret;
+ }
- __uaccess_begin();
- ret = copy_mc_fragile(dst, src, len);
- __uaccess_end();
- return ret;
+ return copy_user_generic(dst, src, len);
}
diff --git a/arch/x86/lib/copy_mc_64.S b/arch/x86/lib/copy_mc_64.S
index c3b613c4544a..892d8915f609 100644
--- a/arch/x86/lib/copy_mc_64.S
+++ b/arch/x86/lib/copy_mc_64.S
@@ -124,4 +124,40 @@ EXPORT_SYMBOL_GPL(copy_mc_fragile)
_ASM_EXTABLE(.L_write_words, .E_write_words)
_ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes)
#endif /* CONFIG_X86_MCE */
+
+/*
+ * copy_mc_enhanced_fast_string - memory copy with exception handling
+ *
+ * Fast string copy + fault / exception handling. If the CPU does
+ * support machine check exception recovery, but does not support
+ * recovering from fast-string exceptions then this CPU needs to be
+ * added to the copy_mc_fragile_key set of quirks. Otherwise, absent any
+ * machine check recovery support this version should be no slower than
+ * standard memcpy.
+ */
+SYM_FUNC_START(copy_mc_enhanced_fast_string)
+ movq %rdi, %rax
+ movq %rdx, %rcx
+.L_copy:
+ rep movsb
+ /* Copy successful. Return zero */
+ xorl %eax, %eax
+ ret
+SYM_FUNC_END(copy_mc_enhanced_fast_string)
+
+ .section .fixup, "ax"
+.E_copy:
+ /*
+ * On fault %rcx is updated such that the copy instruction could
+ * optionally be restarted at the fault position, i.e. it
+ * contains 'bytes remaining'. A non-zero return indicates error
+ * to copy_mc_generic() users, or indicate short transfers to
+ * user-copy routines.
+ */
+ movq %rcx, %rax
+ ret
+
+ .previous
+
+ _ASM_EXTABLE_FAULT(.L_copy, .E_copy)
#endif /* !CONFIG_UML */
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 893f021fec63..b3e4efcf7ca6 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -550,6 +550,7 @@ static const char *uaccess_safe_builtin[] = {
"csum_partial_copy_generic",
"copy_mc_fragile",
"copy_mc_fragile_handle_tail",
+ "copy_mc_enhanced_fast_string",
"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
NULL
};
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5da8e4a658109e3b7e1f45ae672b7c06ac3e7158 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams(a)intel.com>
Date: Mon, 5 Oct 2020 20:40:25 -0700
Subject: [PATCH] x86/copy_mc: Introduce copy_mc_enhanced_fast_string()
The motivations to go rework memcpy_mcsafe() are that the benefit of
doing slow and careful copies is obviated on newer CPUs, and that the
current opt-in list of CPUs to instrument recovery is broken relative to
those CPUs. There is no need to keep an opt-in list up to date on an
ongoing basis if pmem/dax operations are instrumented for recovery by
default. With recovery enabled by default the old "mcsafe_key" opt-in to
careful copying can be made a "fragile" opt-out. Where the "fragile"
list takes steps to not consume poison across cachelines.
The discussion with Linus made clear that the current "_mcsafe" suffix
was imprecise to a fault. The operations that are needed by pmem/dax are
to copy from a source address that might throw #MC to a destination that
may write-fault, if it is a user page.
So copy_to_user_mcsafe() becomes copy_mc_to_user() to indicate
the separate precautions taken on source and destination.
copy_mc_to_kernel() is introduced as a non-SMAP version that does not
expect write-faults on the destination, but is still prepared to abort
with an error code upon taking #MC.
The original copy_mc_fragile() implementation had negative performance
implications since it did not use the fast-string instruction sequence
to perform copies. For this reason copy_mc_to_kernel() fell back to
plain memcpy() to preserve performance on platforms that did not indicate
the capability to recover from machine check exceptions. However, that
capability detection was not architectural and now that some platforms
can recover from fast-string consumption of memory errors the memcpy()
fallback now causes these more capable platforms to fail.
Introduce copy_mc_enhanced_fast_string() as the fast default
implementation of copy_mc_to_kernel() and finalize the transition of
copy_mc_fragile() to be a platform quirk to indicate 'copy-carefully'.
With this in place, copy_mc_to_kernel() is fast and recovery-ready by
default regardless of hardware capability.
Thanks to Vivek for identifying that copy_user_generic() is not suitable
as the copy_mc_to_user() backend since the #MC handler explicitly checks
ex_has_fault_handler(). Thanks to the 0day robot for catching a
performance bug in the x86/copy_mc_to_user implementation.
[ bp: Add the "why" for this change from the 0/2th message, massage. ]
Fixes: 92b0729c34ca ("x86/mm, x86/mce: Add memcpy_mcsafe()")
Reported-by: Erwin Tsaur <erwin.tsaur(a)intel.com>
Reported-by: 0day robot <lkp(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Reviewed-by: Tony Luck <tony.luck(a)intel.com>
Tested-by: Erwin Tsaur <erwin.tsaur(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/160195562556.2163339.18063423034951948973.stgit@d…
diff --git a/arch/x86/lib/copy_mc.c b/arch/x86/lib/copy_mc.c
index 2633635530b7..c13e8c9ee926 100644
--- a/arch/x86/lib/copy_mc.c
+++ b/arch/x86/lib/copy_mc.c
@@ -45,6 +45,8 @@ void enable_copy_mc_fragile(void)
#define copy_mc_fragile_enabled (0)
#endif
+unsigned long copy_mc_enhanced_fast_string(void *dst, const void *src, unsigned len);
+
/**
* copy_mc_to_kernel - memory copy that handles source exceptions
*
@@ -52,9 +54,11 @@ void enable_copy_mc_fragile(void)
* @src: source address
* @len: number of bytes to copy
*
- * Call into the 'fragile' version on systems that have trouble
- * actually do machine check recovery. Everyone else can just
- * use memcpy().
+ * Call into the 'fragile' version on systems that benefit from avoiding
+ * corner case poison consumption scenarios, For example, accessing
+ * poison across 2 cachelines with a single instruction. Almost all
+ * other uses case can use copy_mc_enhanced_fast_string() for a fast
+ * recoverable copy, or fallback to plain memcpy.
*
* Return 0 for success, or number of bytes not copied if there was an
* exception.
@@ -63,6 +67,8 @@ unsigned long __must_check copy_mc_to_kernel(void *dst, const void *src, unsigne
{
if (copy_mc_fragile_enabled)
return copy_mc_fragile(dst, src, len);
+ if (static_cpu_has(X86_FEATURE_ERMS))
+ return copy_mc_enhanced_fast_string(dst, src, len);
memcpy(dst, src, len);
return 0;
}
@@ -72,11 +78,19 @@ unsigned long __must_check copy_mc_to_user(void *dst, const void *src, unsigned
{
unsigned long ret;
- if (!copy_mc_fragile_enabled)
- return copy_user_generic(dst, src, len);
+ if (copy_mc_fragile_enabled) {
+ __uaccess_begin();
+ ret = copy_mc_fragile(dst, src, len);
+ __uaccess_end();
+ return ret;
+ }
+
+ if (static_cpu_has(X86_FEATURE_ERMS)) {
+ __uaccess_begin();
+ ret = copy_mc_enhanced_fast_string(dst, src, len);
+ __uaccess_end();
+ return ret;
+ }
- __uaccess_begin();
- ret = copy_mc_fragile(dst, src, len);
- __uaccess_end();
- return ret;
+ return copy_user_generic(dst, src, len);
}
diff --git a/arch/x86/lib/copy_mc_64.S b/arch/x86/lib/copy_mc_64.S
index c3b613c4544a..892d8915f609 100644
--- a/arch/x86/lib/copy_mc_64.S
+++ b/arch/x86/lib/copy_mc_64.S
@@ -124,4 +124,40 @@ EXPORT_SYMBOL_GPL(copy_mc_fragile)
_ASM_EXTABLE(.L_write_words, .E_write_words)
_ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes)
#endif /* CONFIG_X86_MCE */
+
+/*
+ * copy_mc_enhanced_fast_string - memory copy with exception handling
+ *
+ * Fast string copy + fault / exception handling. If the CPU does
+ * support machine check exception recovery, but does not support
+ * recovering from fast-string exceptions then this CPU needs to be
+ * added to the copy_mc_fragile_key set of quirks. Otherwise, absent any
+ * machine check recovery support this version should be no slower than
+ * standard memcpy.
+ */
+SYM_FUNC_START(copy_mc_enhanced_fast_string)
+ movq %rdi, %rax
+ movq %rdx, %rcx
+.L_copy:
+ rep movsb
+ /* Copy successful. Return zero */
+ xorl %eax, %eax
+ ret
+SYM_FUNC_END(copy_mc_enhanced_fast_string)
+
+ .section .fixup, "ax"
+.E_copy:
+ /*
+ * On fault %rcx is updated such that the copy instruction could
+ * optionally be restarted at the fault position, i.e. it
+ * contains 'bytes remaining'. A non-zero return indicates error
+ * to copy_mc_generic() users, or indicate short transfers to
+ * user-copy routines.
+ */
+ movq %rcx, %rax
+ ret
+
+ .previous
+
+ _ASM_EXTABLE_FAULT(.L_copy, .E_copy)
#endif /* !CONFIG_UML */
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 893f021fec63..b3e4efcf7ca6 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -550,6 +550,7 @@ static const char *uaccess_safe_builtin[] = {
"csum_partial_copy_generic",
"copy_mc_fragile",
"copy_mc_fragile_handle_tail",
+ "copy_mc_enhanced_fast_string",
"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
NULL
};
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5da8e4a658109e3b7e1f45ae672b7c06ac3e7158 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams(a)intel.com>
Date: Mon, 5 Oct 2020 20:40:25 -0700
Subject: [PATCH] x86/copy_mc: Introduce copy_mc_enhanced_fast_string()
The motivations to go rework memcpy_mcsafe() are that the benefit of
doing slow and careful copies is obviated on newer CPUs, and that the
current opt-in list of CPUs to instrument recovery is broken relative to
those CPUs. There is no need to keep an opt-in list up to date on an
ongoing basis if pmem/dax operations are instrumented for recovery by
default. With recovery enabled by default the old "mcsafe_key" opt-in to
careful copying can be made a "fragile" opt-out. Where the "fragile"
list takes steps to not consume poison across cachelines.
The discussion with Linus made clear that the current "_mcsafe" suffix
was imprecise to a fault. The operations that are needed by pmem/dax are
to copy from a source address that might throw #MC to a destination that
may write-fault, if it is a user page.
So copy_to_user_mcsafe() becomes copy_mc_to_user() to indicate
the separate precautions taken on source and destination.
copy_mc_to_kernel() is introduced as a non-SMAP version that does not
expect write-faults on the destination, but is still prepared to abort
with an error code upon taking #MC.
The original copy_mc_fragile() implementation had negative performance
implications since it did not use the fast-string instruction sequence
to perform copies. For this reason copy_mc_to_kernel() fell back to
plain memcpy() to preserve performance on platforms that did not indicate
the capability to recover from machine check exceptions. However, that
capability detection was not architectural and now that some platforms
can recover from fast-string consumption of memory errors the memcpy()
fallback now causes these more capable platforms to fail.
Introduce copy_mc_enhanced_fast_string() as the fast default
implementation of copy_mc_to_kernel() and finalize the transition of
copy_mc_fragile() to be a platform quirk to indicate 'copy-carefully'.
With this in place, copy_mc_to_kernel() is fast and recovery-ready by
default regardless of hardware capability.
Thanks to Vivek for identifying that copy_user_generic() is not suitable
as the copy_mc_to_user() backend since the #MC handler explicitly checks
ex_has_fault_handler(). Thanks to the 0day robot for catching a
performance bug in the x86/copy_mc_to_user implementation.
[ bp: Add the "why" for this change from the 0/2th message, massage. ]
Fixes: 92b0729c34ca ("x86/mm, x86/mce: Add memcpy_mcsafe()")
Reported-by: Erwin Tsaur <erwin.tsaur(a)intel.com>
Reported-by: 0day robot <lkp(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Reviewed-by: Tony Luck <tony.luck(a)intel.com>
Tested-by: Erwin Tsaur <erwin.tsaur(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/160195562556.2163339.18063423034951948973.stgit@d…
diff --git a/arch/x86/lib/copy_mc.c b/arch/x86/lib/copy_mc.c
index 2633635530b7..c13e8c9ee926 100644
--- a/arch/x86/lib/copy_mc.c
+++ b/arch/x86/lib/copy_mc.c
@@ -45,6 +45,8 @@ void enable_copy_mc_fragile(void)
#define copy_mc_fragile_enabled (0)
#endif
+unsigned long copy_mc_enhanced_fast_string(void *dst, const void *src, unsigned len);
+
/**
* copy_mc_to_kernel - memory copy that handles source exceptions
*
@@ -52,9 +54,11 @@ void enable_copy_mc_fragile(void)
* @src: source address
* @len: number of bytes to copy
*
- * Call into the 'fragile' version on systems that have trouble
- * actually do machine check recovery. Everyone else can just
- * use memcpy().
+ * Call into the 'fragile' version on systems that benefit from avoiding
+ * corner case poison consumption scenarios, For example, accessing
+ * poison across 2 cachelines with a single instruction. Almost all
+ * other uses case can use copy_mc_enhanced_fast_string() for a fast
+ * recoverable copy, or fallback to plain memcpy.
*
* Return 0 for success, or number of bytes not copied if there was an
* exception.
@@ -63,6 +67,8 @@ unsigned long __must_check copy_mc_to_kernel(void *dst, const void *src, unsigne
{
if (copy_mc_fragile_enabled)
return copy_mc_fragile(dst, src, len);
+ if (static_cpu_has(X86_FEATURE_ERMS))
+ return copy_mc_enhanced_fast_string(dst, src, len);
memcpy(dst, src, len);
return 0;
}
@@ -72,11 +78,19 @@ unsigned long __must_check copy_mc_to_user(void *dst, const void *src, unsigned
{
unsigned long ret;
- if (!copy_mc_fragile_enabled)
- return copy_user_generic(dst, src, len);
+ if (copy_mc_fragile_enabled) {
+ __uaccess_begin();
+ ret = copy_mc_fragile(dst, src, len);
+ __uaccess_end();
+ return ret;
+ }
+
+ if (static_cpu_has(X86_FEATURE_ERMS)) {
+ __uaccess_begin();
+ ret = copy_mc_enhanced_fast_string(dst, src, len);
+ __uaccess_end();
+ return ret;
+ }
- __uaccess_begin();
- ret = copy_mc_fragile(dst, src, len);
- __uaccess_end();
- return ret;
+ return copy_user_generic(dst, src, len);
}
diff --git a/arch/x86/lib/copy_mc_64.S b/arch/x86/lib/copy_mc_64.S
index c3b613c4544a..892d8915f609 100644
--- a/arch/x86/lib/copy_mc_64.S
+++ b/arch/x86/lib/copy_mc_64.S
@@ -124,4 +124,40 @@ EXPORT_SYMBOL_GPL(copy_mc_fragile)
_ASM_EXTABLE(.L_write_words, .E_write_words)
_ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes)
#endif /* CONFIG_X86_MCE */
+
+/*
+ * copy_mc_enhanced_fast_string - memory copy with exception handling
+ *
+ * Fast string copy + fault / exception handling. If the CPU does
+ * support machine check exception recovery, but does not support
+ * recovering from fast-string exceptions then this CPU needs to be
+ * added to the copy_mc_fragile_key set of quirks. Otherwise, absent any
+ * machine check recovery support this version should be no slower than
+ * standard memcpy.
+ */
+SYM_FUNC_START(copy_mc_enhanced_fast_string)
+ movq %rdi, %rax
+ movq %rdx, %rcx
+.L_copy:
+ rep movsb
+ /* Copy successful. Return zero */
+ xorl %eax, %eax
+ ret
+SYM_FUNC_END(copy_mc_enhanced_fast_string)
+
+ .section .fixup, "ax"
+.E_copy:
+ /*
+ * On fault %rcx is updated such that the copy instruction could
+ * optionally be restarted at the fault position, i.e. it
+ * contains 'bytes remaining'. A non-zero return indicates error
+ * to copy_mc_generic() users, or indicate short transfers to
+ * user-copy routines.
+ */
+ movq %rcx, %rax
+ ret
+
+ .previous
+
+ _ASM_EXTABLE_FAULT(.L_copy, .E_copy)
#endif /* !CONFIG_UML */
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 893f021fec63..b3e4efcf7ca6 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -550,6 +550,7 @@ static const char *uaccess_safe_builtin[] = {
"csum_partial_copy_generic",
"copy_mc_fragile",
"copy_mc_fragile_handle_tail",
+ "copy_mc_enhanced_fast_string",
"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
NULL
};
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 18fce56134c987e5b4eceddafdbe4b00c07e2ae1 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Thu, 16 Jul 2020 17:11:09 +0100
Subject: [PATCH] arm64: Run ARCH_WORKAROUND_1 enabling code on all CPUs
Commit 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or lack
thereof") changed the way we deal with ARCH_WORKAROUND_1, by moving most
of the enabling code to the .matches() callback.
This has the unfortunate effect that the workaround gets only enabled on
the first affected CPU, and no other.
In order to address this, forcefully call the .matches() callback from a
.cpu_enable() callback, which brings us back to the original behaviour.
Fixes: 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or lack thereof")
Cc: <stable(a)vger.kernel.org>
Reviewed-by: Suzuki K Poulose <suzuki.poulose(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Signed-off-by: Will Deacon <will(a)kernel.org>
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 88966496806a..3fe64bf5a58d 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -599,6 +599,12 @@ check_branch_predictor(const struct arm64_cpu_capabilities *entry, int scope)
return (need_wa > 0);
}
+static void
+cpu_enable_branch_predictor_hardening(const struct arm64_cpu_capabilities *cap)
+{
+ cap->matches(cap, SCOPE_LOCAL_CPU);
+}
+
static const __maybe_unused struct midr_range tx2_family_cpus[] = {
MIDR_ALL_VERSIONS(MIDR_BRCM_VULCAN),
MIDR_ALL_VERSIONS(MIDR_CAVIUM_THUNDERX2),
@@ -890,9 +896,11 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
},
#endif
{
+ .desc = "Branch predictor hardening",
.capability = ARM64_HARDEN_BRANCH_PREDICTOR,
.type = ARM64_CPUCAP_LOCAL_CPU_ERRATUM,
.matches = check_branch_predictor,
+ .cpu_enable = cpu_enable_branch_predictor_hardening,
},
#ifdef CONFIG_RANDOMIZE_BASE
{
On Tue, Oct 27, 2020 at 10:12:49AM -0700, Saeed Mirzamohammadi wrote:
> Hi Greg,
>
> Sorry for the confusion. I’m requesting stable maintainers to cherry-pick this patch into stable 5.4 and 5.8.
> commit cc07057c7c88fb8eff3b1991131ded0f0bcfa7e3
> Author: Saeed Mirzamohammadi <saeed.mirzamohammadi(a)oracle.com>
> Date: Wed Oct 21 16:57:58 2020 -0700
>
> video: fbdev: fix divide error in fbcon_switch
I do not see that commit in Linus's tree, do you?
confused,
greg k-h
[ Upstream commit 4e3bbb33e6f36e4b05be1b1b9b02e3dd5aaa3e69 ]
SOCK_TSTAMP_NEW (timespec64 instead of timespec) is also used for
hardware time stamps (configured via SO_TIMESTAMPING_NEW).
User space (ptp4l) first configures hardware time stamping via
SO_TIMESTAMPING_NEW which sets SOCK_TSTAMP_NEW. In the next step, ptp4l
disables SO_TIMESTAMPNS(_NEW) (software time stamps), but this must not
switch hardware time stamps back to "32 bit mode".
This problem happens on 32 bit platforms where the libc has already
switched to struct timespec64 (from SO_TIMExxx_OLD to SO_TIMExxx_NEW
socket options). ptp4l complains with "missing timestamp on transmitted
peer delay request" because the wrong format is received (and
discarded).
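The user-space sequence described above looks roughly like this (a sketch;
error handling omitted, flag choice illustrative):

    int ts = SOF_TIMESTAMPING_TX_HARDWARE | SOF_TIMESTAMPING_RX_HARDWARE |
             SOF_TIMESTAMPING_RAW_HARDWARE;
    int off = 0;

    /* 1) enable hardware time stamps; this sets SOCK_TSTAMP_NEW */
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING_NEW, &ts, sizeof(ts));

    /* 2) disable software time stamps; before this fix the kernel also
     *    cleared SOCK_TSTAMP_NEW here, so subsequent hardware time
     *    stamps arrived in the old 32 bit timespec layout */
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPNS_NEW, &off, sizeof(off));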
Fixes: 887feae36aee ("socket: Add SO_TIMESTAMP[NS]_NEW")
Signed-off-by: Christian Eggers <ceggers(a)arri.de>
Acked-by: Willem de Bruijn <willemb(a)google.com>
Acked-by: Deepa Dinamani <deepa.kernel(a)gmail.com>
---
Hi Greg,
I just got your E-mail(s) that this patch has been applied to 5.8 and 5.9.
This is a backport for the same problem on 5.4. It does the same as the
upstream patch, only the affected code is in a different place here. Please
decide for yourself whether the Acked-by: tags (from the upstream patch) should
be kept or removed.
This backport is only required for 5.4; older kernels like 4.19 are not
affected.
regards
Christian
net/core/sock.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/net/core/sock.c b/net/core/sock.c
index 9a186d2ad36d..1eda7337b881 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -923,7 +923,6 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
} else {
sock_reset_flag(sk, SOCK_RCVTSTAMP);
sock_reset_flag(sk, SOCK_RCVTSTAMPNS);
- sock_reset_flag(sk, SOCK_TSTAMP_NEW);
}
break;
--
Christian Eggers
Embedded software developer
Arnold & Richter Cine Technik GmbH & Co. Betriebs KG
Sitz: Muenchen - Registergericht: Amtsgericht Muenchen - Handelsregisternummer: HRA 57918
Persoenlich haftender Gesellschafter: Arnold & Richter Cine Technik GmbH
Sitz: Muenchen - Registergericht: Amtsgericht Muenchen - Handelsregisternummer: HRB 54477
Geschaeftsfuehrer: Dr. Michael Neuhaeuser; Stephan Schenk; Walter Trauninger; Markus Zeiler
Attached is the syzkaller C repro.
Tested-by: Saeed Mirzamohammadi <saeed.mirzamohammadi(a)oracle.com>
> On Oct 20, 2020, at 9:45 AM, Saeed Mirzamohammadi <saeed.mirzamohammadi(a)oracle.com> wrote:
>
> Thanks! Yes, that looks good to me.
>
> Saeed
>
>> On Oct 20, 2020, at 4:50 AM, Pablo Neira Ayuso <pablo(a)netfilter.org> wrote:
>>
>> On Mon, Oct 19, 2020 at 10:25:32AM -0700, saeed.mirzamohammadi(a)oracle.com wrote:
>>> From: Saeed Mirzamohammadi <saeed.mirzamohammadi(a)oracle.com>
>>>
>>> This patch fixes the issue due to:
>>>
>>> BUG: KASAN: slab-out-of-bounds in nft_flow_rule_create+0x622/0x6a2
>>> net/netfilter/nf_tables_offload.c:40
>>> Read of size 8 at addr ffff888103910b58 by task syz-executor227/16244
>>>
>>> The error happens when expr->ops is accessed early on before performing the boundary check and after nft_expr_next() moves the expr to go out-of-bounds.
>>>
>>> This patch checks the boundary condition before expr->ops that fixes the slab-out-of-bounds Read issue.
>>
>> Thanks. I made a slight variant of your patch.
>>
>> I'm attaching it, it is also fixing the problem but it introduced
>> nft_expr_more() and use it everywhere.
>>
>> Let me know if this looks fine to you.
>> <0001-netfilter-fix-KASAN-slab-out-of-bounds-Read-in-nft_f.patch>
>
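For context, the nft_expr_more() helper mentioned above boils down to checking
the bound before touching expr->ops, roughly as follows (a sketch; the exact
upstream definition may differ):

    static inline bool nft_expr_more(const struct nft_rule *rule,
                                     const struct nft_expr *expr)
    {
            return expr != nft_expr_last(rule) && expr->ops;
    }

    /* callers then iterate with:
     *   for (expr = nft_expr_first(rule); nft_expr_more(rule, expr);
     *        expr = nft_expr_next(expr))
     */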
This commit:
fd01b2597941 SUNRPC: ECONNREFUSED should cause a rebind.
(originally applied to v4.14-rc1) didn't appear to get a stable cc,
perhaps because it wasn't considered a common problem at the time.
A patch I'm shortly about to post, cc stable, depends on the above, so
could it please be cherry-picked for stable?
It applies cleanly to both v4.4.240 & v4.9.240
thank you,
calum.
Dear Stable kernel maintainers,
Please consider cherry picking
commit e81e07244325 ("objtool: Support Clang non-section symbols in
ORC generation")
to linux-5.4.y and linux-4.19.y. This allows us to use LLVM_IAS=1 for
x86_64 Android kernel builds without warning.
Its partner patch (8782e7cab51b6b) was already backported to
linux-5.4.y as 8c627d4b15de9, and linux-4.19.y as 6e575122cd956.
https://github.com/ClangBuiltLinux/linux/issues/669
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
--
Thanks,
~Nick Desaulniers
From: Kan Liang <kan.liang(a)linux.intel.com>
The event CYCLE_ACTIVITY.STALLS_MEM_ANY (0x14a3) should be available on
all 8 GP counters on ICL, but it's only scheduled on the first four
counters due to the current ICL constraint table.
Add a line for the CYCLE_ACTIVITY.STALLS_MEM_ANY event in the ICL
constraint table.
Correct the comments for the CYCLE_ACTIVITY.CYCLES_MEM_ANY event.
Fixes: 6017608936c1 ("perf/x86/intel: Add Icelake support")
Reported-by: Andi Kleen <ak(a)linux.intel.com>
Signed-off-by: Kan Liang <kan.liang(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
---
arch/x86/events/intel/core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index c72e4904e056..b31ebb5f7fc4 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -257,7 +257,8 @@ static struct event_constraint intel_icl_event_constraints[] = {
INTEL_EVENT_CONSTRAINT_RANGE(0x48, 0x54, 0xf),
INTEL_EVENT_CONSTRAINT_RANGE(0x60, 0x8b, 0xf),
INTEL_UEVENT_CONSTRAINT(0x04a3, 0xff), /* CYCLE_ACTIVITY.STALLS_TOTAL */
- INTEL_UEVENT_CONSTRAINT(0x10a3, 0xff), /* CYCLE_ACTIVITY.STALLS_MEM_ANY */
+ INTEL_UEVENT_CONSTRAINT(0x10a3, 0xff), /* CYCLE_ACTIVITY.CYCLES_MEM_ANY */
+ INTEL_UEVENT_CONSTRAINT(0x14a3, 0xff), /* CYCLE_ACTIVITY.STALLS_MEM_ANY */
INTEL_EVENT_CONSTRAINT(0xa3, 0xf), /* CYCLE_ACTIVITY.* */
INTEL_EVENT_CONSTRAINT_RANGE(0xa8, 0xb0, 0xf),
INTEL_EVENT_CONSTRAINT_RANGE(0xb7, 0xbd, 0xf),
--
2.17.1
In our last attempt to fix races in the panfrost_job_timedout() path we
overlooked the case where a re-submitted job immediately triggers a
fault. This led to a situation where we try to stop a scheduler that's
not resumed yet and lose the 'timedout' event without restarting the
timeout, thus blocking the whole queue.
Let's fix that by tracking timeouts occurring between the
drm_sched_resubmit_jobs() and drm_sched_start() calls.
v2:
- Fix another race (reported by Steven)
Fixes: 1a11a88cfd9a ("drm/panfrost: Fix job timeout handling")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Boris Brezillon <boris.brezillon(a)collabora.com>
---
drivers/gpu/drm/panfrost/panfrost_job.c | 61 +++++++++++++++++--------
1 file changed, 43 insertions(+), 18 deletions(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index d0469e944143..0f9a34f5c6d0 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -26,6 +26,7 @@
struct panfrost_queue_state {
struct drm_gpu_scheduler sched;
bool stopped;
+ bool timedout;
struct mutex lock;
u64 fence_context;
u64 emit_seqno;
@@ -383,11 +384,33 @@ static bool panfrost_scheduler_stop(struct panfrost_queue_state *queue,
queue->stopped = true;
stopped = true;
}
+ queue->timedout = true;
mutex_unlock(&queue->lock);
return stopped;
}
+static void panfrost_scheduler_start(struct panfrost_queue_state *queue)
+{
+ if (WARN_ON(!queue->stopped))
+ return;
+
+ mutex_lock(&queue->lock);
+ drm_sched_start(&queue->sched, true);
+
+ /*
+ * We might have missed fault-timeouts (AKA immediate timeouts) while
+ * the scheduler was stopped. Let's fake a new fault to trigger an
+ * immediate reset.
+ */
+ if (queue->timedout)
+ drm_sched_fault(&queue->sched);
+
+ queue->timedout = false;
+ queue->stopped = false;
+ mutex_unlock(&queue->lock);
+}
+
static void panfrost_job_timedout(struct drm_sched_job *sched_job)
{
struct panfrost_job *job = to_panfrost_job(sched_job);
@@ -422,27 +445,20 @@ static void panfrost_job_timedout(struct drm_sched_job *sched_job)
struct drm_gpu_scheduler *sched = &pfdev->js->queue[i].sched;
/*
- * If the queue is still active, make sure we wait for any
- * pending timeouts.
+ * Stop the scheduler and wait for any pending timeout handler
+ * to return.
*/
- if (!pfdev->js->queue[i].stopped)
+ panfrost_scheduler_stop(&pfdev->js->queue[i], NULL);
+ if (i != js)
cancel_delayed_work_sync(&sched->work_tdr);
/*
- * If the scheduler was not already stopped, there's a tiny
- * chance a timeout has expired just before we stopped it, and
- * drm_sched_stop() does not flush pending works. Let's flush
- * them now so the timeout handler doesn't get called in the
- * middle of a reset.
+ * We do another stop after cancel_delayed_work_sync() to make
+ * sure we don't race against another thread finishing its
+ * reset (the restart queue steps are not protected by the
+ * reset lock).
*/
- if (panfrost_scheduler_stop(&pfdev->js->queue[i], NULL))
- cancel_delayed_work_sync(&sched->work_tdr);
-
- /*
- * Now that we cancelled the pending timeouts, we can safely
- * reset the stopped state.
- */
- pfdev->js->queue[i].stopped = false;
+ panfrost_scheduler_stop(&pfdev->js->queue[i], NULL);
}
spin_lock_irqsave(&pfdev->js->job_lock, flags);
@@ -457,14 +473,23 @@ static void panfrost_job_timedout(struct drm_sched_job *sched_job)
panfrost_device_reset(pfdev);
- for (i = 0; i < NUM_JOB_SLOTS; i++)
+ for (i = 0; i < NUM_JOB_SLOTS; i++) {
+ /*
+ * The GPU is idle, and the scheduler is stopped, we can safely
+ * reset the ->timedout state without taking any lock. We need
+ * to do that before calling drm_sched_resubmit_jobs() though,
+ * because the resubmission might trigger immediate faults
+ * which we want to catch.
+ */
+ pfdev->js->queue[i].timedout = false;
drm_sched_resubmit_jobs(&pfdev->js->queue[i].sched);
+ }
mutex_unlock(&pfdev->reset_lock);
/* restart scheduler after GPU is usable again */
for (i = 0; i < NUM_JOB_SLOTS; i++)
- drm_sched_start(&pfdev->js->queue[i].sched, true);
+ panfrost_scheduler_start(&pfdev->js->queue[i]);
}
static const struct drm_sched_backend_ops panfrost_sched_ops = {
--
2.26.2
From: Alexander Usyskin <alexander.usyskin(a)intel.com>
A receive callback is queued while the client is still connected
but can still be called after the client was disconnected. Upon
disconnect, cl->me_cl is set to NULL, hence we need to check
that the ME client is not NULL in mei_cl_mtu() to avoid a
NULL dereference.
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Alexander Usyskin <alexander.usyskin(a)intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler(a)intel.com>
---
drivers/misc/mei/client.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/misc/mei/client.h b/drivers/misc/mei/client.h
index 64143d4ec758..9e08a9843bba 100644
--- a/drivers/misc/mei/client.h
+++ b/drivers/misc/mei/client.h
@@ -182,11 +182,11 @@ static inline u8 mei_cl_me_id(const struct mei_cl *cl)
*
* @cl: host client
*
- * Return: mtu
+ * Return: mtu or 0 if client is not connected
*/
static inline size_t mei_cl_mtu(const struct mei_cl *cl)
{
- return cl->me_cl->props.max_msg_length;
+ return cl->me_cl ? cl->me_cl->props.max_msg_length : 0;
}
/**
--
2.25.4
Generic Desktop DPAD usages are mapped by hid-input such that only the first
DPAD usage maps to usage type EV_ABS and the code of an axis. If the HID
descriptor has DPAD UP/DOWN/LEFT/RIGHT usages and each usage is 1 bit in
size, then only the first one will generate an input event; the remaining
usages are assigned to a hat direction only.
The HID input event handler should check the HID report value and generate
the event for its hat direction.
Test: Connect HID device with Generic Desktop DPAD usage and press the
DPAD to generate input events.
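For reference, the hat_dir index computed in this path selects an (x, y) pair
from the hid_hat_to_axis[] table in hid-input.c, roughly as below (a sketch;
index 0 means released, 1 is up, then clockwise):

    static const struct { __s32 x, y; } hid_hat_to_axis[] = {
            { 0,  0}, { 0, -1}, { 1, -1}, { 1,  0}, { 1,  1},
            { 0,  1}, {-1,  1}, {-1,  0}, {-1, -1},
    };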
Signed-off-by: Chris Ye <lzye(a)google.com>
---
drivers/hid/hid-input.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/drivers/hid/hid-input.c b/drivers/hid/hid-input.c
index 9770db624bfa..6c1007de3409 100644
--- a/drivers/hid/hid-input.c
+++ b/drivers/hid/hid-input.c
@@ -1269,7 +1269,7 @@ void hidinput_hid_event(struct hid_device *hid, struct hid_field *field, struct
struct input_dev *input;
unsigned *quirks = &hid->quirks;
- if (!usage->type)
+ if (!usage->type && !field->dpad)
return;
if (usage->type == EV_PWR) {
@@ -1286,9 +1286,17 @@ void hidinput_hid_event(struct hid_device *hid, struct hid_field *field, struct
int hat_dir = usage->hat_dir;
if (!hat_dir)
hat_dir = (value - usage->hat_min) * 8 / (usage->hat_max - usage->hat_min + 1) + 1;
- if (hat_dir < 0 || hat_dir > 8) hat_dir = 0;
- input_event(input, usage->type, usage->code , hid_hat_to_axis[hat_dir].x);
- input_event(input, usage->type, usage->code + 1, hid_hat_to_axis[hat_dir].y);
+ if (hat_dir < 0 || hat_dir > 8 || value == 0)
+ hat_dir = 0;
+ if (field->dpad) {
+ input_event(input, EV_ABS, field->dpad, hid_hat_to_axis[hat_dir].x);
+ input_event(input, EV_ABS, field->dpad + 1, hid_hat_to_axis[hat_dir].y);
+ } else {
+ input_event(input, usage->type, usage->code,
+ hid_hat_to_axis[hat_dir].x);
+ input_event(input, usage->type, usage->code + 1,
+ hid_hat_to_axis[hat_dir].y);
+ }
return;
}
--
2.29.1.341.ge80a0c044ae-goog
From: Chris Ye <lzye(a)google.com>
Generic Desktop DPAD usages are mapped by hid-input such that only the first
DPAD usage maps to usage type EV_ABS and the code of an axis. If the HID
descriptor has DPAD UP/DOWN/LEFT/RIGHT usages and each usage is 1 bit in
size, then only the first one will generate an input event; the remaining
usages are assigned to a hat direction only.
The HID input event handler should check the HID report value and generate
the event for its hat direction.
Test: Connect HID device with Generic Desktop DPAD usage and press the
DPAD to generate input events.
Signed-off-by: Chris Ye <lzye(a)google.com>
---
drivers/hid/hid-input.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/drivers/hid/hid-input.c b/drivers/hid/hid-input.c
index 9770db624bfa..6c1007de3409 100644
--- a/drivers/hid/hid-input.c
+++ b/drivers/hid/hid-input.c
@@ -1269,7 +1269,7 @@ void hidinput_hid_event(struct hid_device *hid, struct hid_field *field, struct
struct input_dev *input;
unsigned *quirks = &hid->quirks;
- if (!usage->type)
+ if (!usage->type && !field->dpad)
return;
if (usage->type == EV_PWR) {
@@ -1286,9 +1286,17 @@ void hidinput_hid_event(struct hid_device *hid, struct hid_field *field, struct
int hat_dir = usage->hat_dir;
if (!hat_dir)
hat_dir = (value - usage->hat_min) * 8 / (usage->hat_max - usage->hat_min + 1) + 1;
- if (hat_dir < 0 || hat_dir > 8) hat_dir = 0;
- input_event(input, usage->type, usage->code , hid_hat_to_axis[hat_dir].x);
- input_event(input, usage->type, usage->code + 1, hid_hat_to_axis[hat_dir].y);
+ if (hat_dir < 0 || hat_dir > 8 || value == 0)
+ hat_dir = 0;
+ if (field->dpad) {
+ input_event(input, EV_ABS, field->dpad, hid_hat_to_axis[hat_dir].x);
+ input_event(input, EV_ABS, field->dpad + 1, hid_hat_to_axis[hat_dir].y);
+ } else {
+ input_event(input, usage->type, usage->code,
+ hid_hat_to_axis[hat_dir].x);
+ input_event(input, usage->type, usage->code + 1,
+ hid_hat_to_axis[hat_dir].y);
+ }
return;
}
--
2.29.1.341.ge80a0c044ae-goog
The patch titled
Subject: mm: memcg: link page counters to root if use_hierarchy is false
has been added to the -mm tree. Its filename is
mm-memcg-link-page-counters-to-root-if-use_hierarchy-is-false.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-link-page-counters-to-ro…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-memcg-link-page-counters-to-ro…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Roman Gushchin <guro(a)fb.com>
Subject: mm: memcg: link page counters to root if use_hierarchy is false
Richard reported a warning which can be reproduced by running the LTP
madvise6 test (cgroup v1 in the non-hierarchical mode should be used):
[ 9.841552] ------------[ cut here ]------------
[ 9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 9.841982] Modules linked in:
[ 9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
[ 9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[ 9.842571] Workqueue: events drain_local_stock
[ 9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
[ 9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
[ 9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
[ 9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
[ 9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
[ 9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
[ 9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
[ 9.844696] FS: 0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
[ 9.844915] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
[ 9.845319] Call Trace:
[ 9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[ 9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[ 9.845684] drain_local_stock (mm/memcontrol.c:2255)
[ 9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
[ 9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
[ 9.846034] ? process_one_work (kernel/workqueue.c:2358)
[ 9.846162] kthread (kernel/kthread.c:292)
[ 9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
[ 9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
[ 9.846531] ---[ end trace 8b5647c1eba9d18a ]---
The problem occurs because in the non-hierarchical mode non-root page
counters are not linked to the root page counters, so charges are not
propagated to the root memory cgroup.
After the removal of the original memory cgroup and the reparenting of its
object cgroup, the root cgroup might then be uncharged by draining an objcg
stock, for example. This leads to an eventual underflow of the charge and
triggers the warning shown above.
Fix it by linking all page counters to corresponding root page counters in
the non-hierarchical mode.
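To make the underflow mechanism concrete, here is a minimal sketch of how
page counter charges and uncharges walk the parent chain. It is reduced
from mm/page_counter.c; the trimmed struct and the *_sketch helpers are
illustrative only, not the exact kernel code:

struct page_counter {
        atomic_long_t usage;
        struct page_counter *parent;    /* NULL when the counter is unlinked */
};

/* A charge is applied to the counter and to every ancestor it is linked to. */
static void page_counter_charge_sketch(struct page_counter *counter,
                                       unsigned long nr_pages)
{
        struct page_counter *c;

        for (c = counter; c; c = c->parent)
                atomic_long_add(nr_pages, &c->usage);
}

/* An uncharge walks the same chain; dropping below zero fires the warning. */
static void page_counter_uncharge_sketch(struct page_counter *counter,
                                         unsigned long nr_pages)
{
        struct page_counter *c;

        for (c = counter; c; c = c->parent)
                WARN_ON_ONCE(atomic_long_sub_return(nr_pages, &c->usage) < 0);
}

With use_hierarchy == false the child counter's parent was NULL, so its
charges never reached the root counter, while after the objcg is reparented
the eventual uncharge is applied to the root counter directly and drives its
usage negative. Linking the non-root counters to the root counters makes the
charge and uncharge paths symmetric again.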
Please note that in the non-hierarchical mode all objcgs are always
reparented to the root memory cgroup, even if the hierarchy has more than
one level. This patch doesn't change that.
The patch also doesn't affect how the hierarchical mode works, which is the
only sane and truly supported mode now.
Thanks to Richard for reporting, debugging and providing an alternative
version of the fix!
Link: https://lkml.kernel.org/r/20201026231326.3212225-1-guro@fb.com
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
Signed-off-by: Roman Gushchin <guro(a)fb.com>
Debugged-by: Richard Palethorpe <rpalethorpe(a)suse.com>
Reported-by: <ltp(a)lists.linux.it>
Reviewed-by: Shakeel Butt <shakeelb(a)google.com>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
--- a/mm/memcontrol.c~mm-memcg-link-page-counters-to-root-if-use_hierarchy-is-false
+++ a/mm/memcontrol.c
@@ -5345,17 +5345,22 @@ mem_cgroup_css_alloc(struct cgroup_subsy
memcg->swappiness = mem_cgroup_swappiness(parent);
memcg->oom_kill_disable = parent->oom_kill_disable;
}
- if (parent && parent->use_hierarchy) {
+ if (!parent) {
+ page_counter_init(&memcg->memory, NULL);
+ page_counter_init(&memcg->swap, NULL);
+ page_counter_init(&memcg->kmem, NULL);
+ page_counter_init(&memcg->tcpmem, NULL);
+ } else if (parent->use_hierarchy) {
memcg->use_hierarchy = true;
page_counter_init(&memcg->memory, &parent->memory);
page_counter_init(&memcg->swap, &parent->swap);
page_counter_init(&memcg->kmem, &parent->kmem);
page_counter_init(&memcg->tcpmem, &parent->tcpmem);
} else {
- page_counter_init(&memcg->memory, NULL);
- page_counter_init(&memcg->swap, NULL);
- page_counter_init(&memcg->kmem, NULL);
- page_counter_init(&memcg->tcpmem, NULL);
+ page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
+ page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
+ page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
+ page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
/*
* Deeper hierachy with use_hierarchy == false doesn't make
* much sense so let cgroup subsystem know about this
_
Patches currently in -mm which might be from guro(a)fb.com are
mm-memcg-link-page-counters-to-root-if-use_hierarchy-is-false.patch
mm-memcontrol-use-helpers-to-read-pages-memcg-data.patch
mm-memcontrol-slab-use-helpers-to-access-slab-pages-memcg_data.patch
mm-introduce-page-memcg-flags.patch
mm-convert-page-kmemcg-type-to-a-page-memcg-flag.patch
mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch
mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix.patch
On Wed, Oct 28, 2020 at 10:08:53AM -0700, Guenter Roeck wrote:
> On Tue, Oct 27, 2020 at 02:47:35PM +0100, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 4.14.203 release.
> > There are 191 patches in this series, all will be posted as a response
> > to this one. If anyone has any issues with these being applied, please
> > let me know.
> >
> > Responses should be made by Thu, 29 Oct 2020 13:48:36 +0000.
> > Anything received after that time might be too late.
> >
>
> Build results:
> total: 168 pass: 166 fail: 2
> Failed builds:
> powerpc:defconfig
> powerpc:allmodconfig
> Qemu test results:
> total: 404 pass: 385 fail: 19
> Failed tests:
> <various powerpc64>
>
> Error log:
> arch/powerpc/platforms/powernv/opal-dump.c: In function 'process_dump':
> arch/powerpc/platforms/powernv/opal-dump.c:409:7: error: void value not ignored as it ought to be
>
> Tested-by: Guenter Roeck <linux(a)roeck-us.net>
>
> Guenter
The patch titled
Subject: mm/slub: fix panic in slab_alloc_node()
has been added to the -mm tree. Its filename is
mm-slub-fix-panic-in-slab_alloc_node.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-slub-fix-panic-in-slab_alloc_n…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-slub-fix-panic-in-slab_alloc_n…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Laurent Dufour <ldufour(a)linux.ibm.com>
Subject: mm/slub: fix panic in slab_alloc_node()
While doing a memory hot-unplug operation on a PowerPC VM running 1024 CPUs
with 11TB of RAM, I hit the following panic:
BUG: Kernel NULL pointer dereference on read at 0x00000007
Faulting instruction address: 0xc000000000456048
Oops: Kernel access of bad area, sig: 11 [#2]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp
CPU: 160 PID: 1 Comm: systemd Tainted: G D 5.9.0 #1
NIP: c000000000456048 LR: c000000000455fd4 CTR: c00000000047b350
REGS: c00006028d1b77a0 TRAP: 0300 Tainted: G D (5.9.0)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004228 XER: 00000000
CFAR: c00000000000f1b0 DAR: 0000000000000007 DSISR: 40000000 IRQMASK: 0
GPR00: c000000000455fd4 c00006028d1b7a30 c000000001bec800 0000000000000000
GPR04: 0000000000000dc0 0000000000000000 00000000000374ef c00007c53df99320
GPR08: 000007c53c980000 0000000000000000 000007c53c980000 0000000000000000
GPR12: 0000000000004400 c00000001e8e4400 0000000000000000 0000000000000f6a
GPR16: 0000000000000000 c000000001c25930 c000000001d62528 00000000000000c1
GPR20: c000000001d62538 c00006be469e9000 0000000fffffffe0 c0000000003c0ff8
GPR24: 0000000000000018 0000000000000000 0000000000000dc0 0000000000000000
GPR28: c00007c513755700 c000000001c236a4 c00007bc4001f800 0000000000000001
NIP [c000000000456048] __kmalloc_node+0x108/0x790
LR [c000000000455fd4] __kmalloc_node+0x94/0x790
Call Trace:
[c00006028d1b7a30] [c00007c51af92000] 0xc00007c51af92000 (unreliable)
[c00006028d1b7aa0] [c0000000003c0ff8] kvmalloc_node+0x58/0x110
[c00006028d1b7ae0] [c00000000047b45c] mem_cgroup_css_online+0x10c/0x270
[c00006028d1b7b30] [c000000000241fd8] online_css+0x48/0xd0
[c00006028d1b7b60] [c00000000024af14] cgroup_apply_control_enable+0x2c4/0x470
[c00006028d1b7c40] [c00000000024e838] cgroup_mkdir+0x408/0x5f0
[c00006028d1b7cb0] [c0000000005a4ef0] kernfs_iop_mkdir+0x90/0x100
[c00006028d1b7cf0] [c0000000004b8168] vfs_mkdir+0x138/0x250
[c00006028d1b7d40] [c0000000004baf04] do_mkdirat+0x154/0x1c0
[c00006028d1b7dc0] [c000000000032b38] system_call_exception+0xf8/0x200
[c00006028d1b7e20] [c00000000000c740] system_call_common+0xf0/0x27c
Instruction dump:
e93e0000 e90d0030 39290008 7cc9402a e94d0030 e93e0000 7ce95214 7f89502a
2fbc0000 419e0018 41920230 e9270010 <89290007> 7f994800 419e0220 7ee6bb78
This points to the following code:
mm/slub.c:2851
if (unlikely(!object || !node_match(page, node))) {
c000000000456038: 00 00 bc 2f cmpdi cr7,r28,0
c00000000045603c: 18 00 9e 41 beq cr7,c000000000456054 <__kmalloc_node+0x114>
node_match():
mm/slub.c:2491
if (node != NUMA_NO_NODE && page_to_nid(page) != node)
c000000000456040: 30 02 92 41 beq cr4,c000000000456270 <__kmalloc_node+0x330>
page_to_nid():
include/linux/mm.h:1294
c000000000456044: 10 00 27 e9 ld r9,16(r7)
c000000000456048: 07 00 29 89 lbz r9,7(r9) <<<< r9 = NULL
node_match():
mm/slub.c:2491
c00000000045604c: 00 48 99 7f cmpw cr7,r25,r9
c000000000456050: 20 02 9e 41 beq cr7,c000000000456270 <__kmalloc_node+0x330>
The panic occurred in slab_alloc_node() when checking for the page's node:
object = c->freelist;
page = c->page;
if (unlikely(!object || !node_match(page, node))) {
object = __slab_alloc(s, gfpflags, node, addr, c);
stat(s, ALLOC_SLOWPATH);
The issue is that object is not NULL while page is NULL, which is odd but
can happen if the cache flush occurred after loading object but before
loading page. Thus, checking the page pointer is required as well.
The cache flush is done through an inter-processor interrupt when a piece
of memory is off-lined. That interrupt is triggered when a memory
hot-unplug operation is initiated: offline_pages() calls SLUB's
MEM_GOING_OFFLINE callback slab_mem_going_offline_callback(), which in turn
calls flush_cpu_slab(). If that interrupt lands between the read of
c->freelist and the read of c->page, it can lead to exactly this situation.
That situation is expected, and the later call to this_cpu_cmpxchg_double()
will detect the change to c->freelist and redo the whole operation.
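For reference, here is a condensed, annotated sketch of the affected fast
path, simplified from slab_alloc_node() in mm/slub.c with the fix applied;
the tid/c setup loop and statistics are trimmed, so treat it as an
illustration rather than a verbatim copy:

redo:
        tid = this_cpu_read(s->cpu_slab->tid);
        c = raw_cpu_ptr(s->cpu_slab);

        object = c->freelist;
        /*
         * The flush_cpu_slab() IPI can run here: it deactivates the per-cpu
         * slab, clearing both c->freelist and c->page and bumping c->tid,
         * so the read below sees page == NULL while object is already stale.
         */
        page = c->page;

        /* The added !page test keeps node_match() from dereferencing NULL. */
        if (unlikely(!object || !page || !node_match(page, node))) {
                object = __slab_alloc(s, gfpflags, node, addr, c);
        } else {
                /*
                 * A stale object is harmless on this path: the flush changed
                 * c->freelist and c->tid, so this_cpu_cmpxchg_double() fails
                 * and the whole sequence restarts from redo.
                 */
                if (unlikely(!this_cpu_cmpxchg_double(s->cpu_slab->freelist,
                                                      s->cpu_slab->tid,
                                                      object, tid,
                                                      get_freepointer_safe(s, object),
                                                      next_tid(tid))))
                        goto redo;
        }

The one-line diff below only adds the !page test; everything else in the
fast path is unchanged.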
Commit 6159d0f5c03e ("mm/slub.c: page is always non-NULL in node_match()")
removed the check on the page pointer, assuming that page is always valid
when node_match() is called. That assumption does not hold in this
particular case, so check page before calling node_match() here.
Link: https://lkml.kernel.org/r/20201027190406.33283-1-ldufour@linux.ibm.com
Fixes: 6159d0f5c03e ("mm/slub.c: page is always non-NULL in node_match()")
Signed-off-by: Laurent Dufour <ldufour(a)linux.ibm.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Wei Yang <richard.weiyang(a)gmail.com>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: Nathan Lynch <nathanl(a)linux.ibm.com>
Cc: Scott Cheloha <cheloha(a)linux.ibm.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/slub.c~mm-slub-fix-panic-in-slab_alloc_node
+++ a/mm/slub.c
@@ -2852,7 +2852,7 @@ redo:
object = c->freelist;
page = c->page;
- if (unlikely(!object || !node_match(page, node))) {
+ if (unlikely(!object || !page || !node_match(page, node))) {
object = __slab_alloc(s, gfpflags, node, addr, c);
} else {
void *next_object = get_freepointer_safe(s, object);
_
Patches currently in -mm which might be from ldufour(a)linux.ibm.com are
mm-slub-fix-panic-in-slab_alloc_node.patch