On 9/22/22 2:01 PM, Dave Hansen wrote:
> On 9/22/22 11:53, Rafael J. Wysocki wrote:
>> Acked-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
>>
>> or do you want me to pick this up?
>
> I'll just stick it in x86/urgent.
>
> It's modifying code in a x86 #ifdef. I'll call it a small enclave of
> sovereign x86 territory in ACPI land, just like an embassy. ;)
Can it be cc:stable@vger.kernel.org, since it applies cleanly as far
back as this v5.4 commit?:
commit fa583f71a99c85e52781ed877c82c8757437b680
Author: Yin Fengwei <fengwei.yin(a)intel.com>
Date: Thu Oct 24 15:04:20 2019 +0800
ACPI: processor_idle: Skip dummy wait if kernel is in guest
Thanks,
Kim
From: Tianyu Lan <Tianyu.Lan(a)microsoft.com>
commit 82806744fd7dde603b64c151eeddaa4ee62193fd upstream.
swiotlb_find_slots() skips slots according to the io tlb aligned mask
calculated from the min align mask and the original physical address
offset. This affects the max mapping size: the mapping size cannot
reach IO_TLB_SEGSIZE * IO_TLB_SIZE when the original offset is
non-zero. This causes a system boot failure in Hyper-V Isolation VMs,
where swiotlb force is enabled. The SCSI layer uses the return value
of dma_max_mapping_size() to set the max segment size, which
ultimately calls swiotlb_max_mapping_size(). The Hyper-V storage
driver sets the min align mask to 4k - 1, so the SCSI layer may pass a
256k request buffer with a 0~4k offset, and the Hyper-V storage driver
then cannot get a swiotlb bounce buffer via the DMA API:
swiotlb_find_slots() cannot find a 256k bounce buffer with that
offset. Make swiotlb_max_mapping_size() take the min align mask into
account.
Signed-off-by: Tianyu Lan <Tianyu.Lan(a)microsoft.com>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Rishabh Bhatnagar <risbhat(a)amazon.com>
---
kernel/dma/swiotlb.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 018f140aaaf4..a9849670bdb5 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -709,7 +709,18 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
size_t swiotlb_max_mapping_size(struct device *dev)
{
- return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE;
+ int min_align_mask = dma_get_min_align_mask(dev);
+ int min_align = 0;
+
+ /*
+ * swiotlb_find_slots() skips slots according to
+ * min align mask. This affects max mapping size.
+ * Take it into account here.
+ */
+ if (min_align_mask)
+ min_align = roundup(min_align_mask, IO_TLB_SIZE);
+
+ return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE - min_align;
}
bool is_swiotlb_active(struct device *dev)
--
2.37.1
From: Tianyu Lan <Tianyu.Lan(a)microsoft.com>
commit 82806744fd7dde603b64c151eeddaa4ee62193fd upstream.
swiotlb_find_slots() skips slots according to the io tlb aligned mask
calculated from the min align mask and the original physical address
offset. This affects the max mapping size: the mapping size cannot
reach IO_TLB_SEGSIZE * IO_TLB_SIZE when the original offset is
non-zero. This causes a system boot failure in Hyper-V Isolation VMs,
where swiotlb force is enabled. The SCSI layer uses the return value
of dma_max_mapping_size() to set the max segment size, which
ultimately calls swiotlb_max_mapping_size(). The Hyper-V storage
driver sets the min align mask to 4k - 1, so the SCSI layer may pass a
256k request buffer with a 0~4k offset, and the Hyper-V storage driver
then cannot get a swiotlb bounce buffer via the DMA API:
swiotlb_find_slots() cannot find a 256k bounce buffer with that
offset. Make swiotlb_max_mapping_size() take the min align mask into
account.
Signed-off-by: Tianyu Lan <Tianyu.Lan(a)microsoft.com>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Rishabh Bhatnagar <risbhat(a)amazon.com>
---
kernel/dma/swiotlb.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 4a9831d01f0e..d897d161366a 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -734,7 +734,18 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
size_t swiotlb_max_mapping_size(struct device *dev)
{
- return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE;
+ int min_align_mask = dma_get_min_align_mask(dev);
+ int min_align = 0;
+
+ /*
+ * swiotlb_find_slots() skips slots according to
+ * min align mask. This affects max mapping size.
+ * Take it into account here.
+ */
+ if (min_align_mask)
+ min_align = roundup(min_align_mask, IO_TLB_SIZE);
+
+ return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE - min_align;
}
bool is_swiotlb_active(void)
--
2.37.1
Processors based on the Zen microarchitecture support IOPORT based deeper
C-states. The ACPI idle driver reads the
acpi_gbl_FADT.xpm_timer_block.address in the IOPORT based C-state exit
path which is claimed to be a "Dummy wait op" and has been around since
ACPI's introduction to Linux dating back to Andy Grover's Mar 14, 2002
posting [1].
Old, circa 2002 chipsets have a bug which was elaborated by Andreas Mohr
back in 2006 in commit b488f02156d3d ("ACPI: restore comment justifying
'extra' P_LVLx access") where the commit log claims:
"this dummy read was about: STPCLK# doesn't get asserted in time on
(some) chipsets, which is why we need to have a dummy I/O read to delay
further instruction processing until the CPU is fully stopped."
This workaround is very painful on modern systems with a large number of
cores. The "inl()" can take thousands of cycles. Sampling certain
workloads with IBS on AMD Zen3 system shows that a significant amount of
time is spent in the dummy op, which incorrectly gets accounted as
C-State residency. A large C-State residency value can prime the cpuidle
governor to recommend a deeper C-State during the subsequent idle
instances, starting a vicious cycle, leading to performance degradation
on workloads that rapidly switch between busy and idle phases.
(For the extent of the performance degradation refer link [2])
The dummy wait is unnecessary on processors based on the Zen
microarchitecture (AMD family 17h+ and HYGON). Skip it to prevent
polluting the C-state residency information. Among the pre-family 17h
AMD processors, there has been at least one report of an AMD Athlon on a
VIA chipset (circa 2006) where this problem was seen (see [3] for
report by Andreas Mohr).
Modern Intel processors use MWAIT based C-States in the intel_idle driver
and are not impacted by this code path. For older Intel processors that
use the acpi_idle driver, a workaround was suggested by Dave Hansen and
Rafael J. Wysocki to regard all Intel chipsets using the IOPORT based
C-state management as being affected by this problem (see [4] for
workaround proposed).
For these reasons, mark all the Intel processors and pre-family 17h
AMD processors with X86_BUG_STPCLK. In the acpi_idle driver, restrict the
dummy wait during IOPORT based C-state transitions to only these
processors.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/c… [1]
Link: https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/ [2]
Link: https://lore.kernel.org/lkml/Yyy6l94G0O2B7Yh1@rhlx01.hs-esslingen.de/ [3]
Link: https://lore.kernel.org/lkml/88c17568-8694-940a-0f1f-9d345e8dcbdb@intel.com/ [4]
Suggested-by: Calvin Ong <calvin.ong(a)amd.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Len Brown <lenb(a)kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
CC: Pu Wen <puwen(a)hygon.cn>
Cc: stable(a)vger.kernel.org
Signed-off-by: K Prateek Nayak <kprateek.nayak(a)amd.com>
---
v1->v2:
o Introduce X86_BUG_STPCLK to mark chipsets as being affected by the
STPCLK# signal assertion issue.
o Mark all Intel chipsets and pre fam-17h AMD chipsets as being affected
by the X86_BUG_STPCLK.
o Skip dummy xpm_timer_block read in IOPORT based C-state exit path in
ACPI processor_idle if chipset is not affected by X86_BUG_STPCLK.
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/amd.c | 12 ++++++++++++
arch/x86/kernel/cpu/intel.c | 12 ++++++++++++
drivers/acpi/processor_idle.c | 8 ++++++++
4 files changed, 33 insertions(+)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ef4775c6db01..fcd3617ed315 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -460,5 +460,6 @@
#define X86_BUG_MMIO_UNKNOWN X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
#define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */
#define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
+#define X86_BUG_STPCLK X86_BUG(29) /* STPCLK# signal does not get asserted in time during IOPORT based C-state entry */
#endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 48276c0e479d..8cb5887a53a3 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -988,6 +988,18 @@ static void init_amd(struct cpuinfo_x86 *c)
if (!cpu_has(c, X86_FEATURE_XENPV))
set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
+ /*
+ * CPUs based on the Zen microarchitecture (Fam 17h onward) can
+ * guarantee that STPCLK# signal is asserted in time after the
+ * P_LVL2 read to freeze execution after an IOPORT based C-state
+ * entry. Among the older AMD processors, there has been at least
+ * one report of an AMD Athlon processor on a VIA chipset
+ * (circa 2006) having this issue. Mark all these older AMD
+ * processor families as being affected.
+ */
+ if (c->x86 < 0x17)
+ set_cpu_bug(c, X86_BUG_STPCLK);
+
/*
* Turn on the Instructions Retired free counter on machines not
* susceptible to erratum #1054 "Instructions Retired Performance
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 2d7ea5480ec3..96fe1320c238 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -696,6 +696,18 @@ static void init_intel(struct cpuinfo_x86 *c)
((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
set_cpu_bug(c, X86_BUG_MONITOR);
+ /*
+ * Intel chipsets prior to Nehalem used the ACPI processor_idle
+ * driver for C-state management. Some of these processors that
+ * used IOPORT based C-states could not guarantee that STPCLK#
+ * signal gets asserted in time after P_LVL2 read to freeze
+ * execution properly. Since a clear cut-off point is not known
+ * as to when this bug was solved, mark all the chipsets as
+ * being affected. Only the ones that use IOPORT based C-state
+ * transitions via the acpi_idle driver will be impacted.
+ */
+ set_cpu_bug(c, X86_BUG_STPCLK);
+
#ifdef CONFIG_X86_64
if (c->x86 == 15)
c->x86_cache_alignment = c->x86_clflush_size * 2;
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 16a1663d02d4..493f9ccdb72d 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -528,6 +528,14 @@ static int acpi_idle_bm_check(void)
static void wait_for_freeze(void)
{
#ifdef CONFIG_X86
+ /*
+ * A dummy wait operation is only required for those chipsets
+ * that cannot assert STPCLK# signal in time after P_LVL2 read.
+ * If a chipset is not affected by this problem, skip it.
+ */
+ if (!static_cpu_has_bug(X86_BUG_STPCLK))
+ return;
+
/* No delay is needed if we are in guest */
if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
return;
--
2.25.1
The quilt patch titled
Subject: x86/uaccess: avoid check_object_size() in copy_from_user_nmi()
has been removed from the -mm tree. Its filename was
x86-uaccess-avoid-check_object_size-in-copy_from_user_nmi.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Kees Cook <keescook(a)chromium.org>
Subject: x86/uaccess: avoid check_object_size() in copy_from_user_nmi()
Date: Mon, 19 Sep 2022 13:16:48 -0700
The check_object_size() helper under CONFIG_HARDENED_USERCOPY is designed
to skip any checks where the length is known at compile time as a
reasonable heuristic to avoid "likely known-good" cases. However, it can
only do this when the copy_*_user() helpers are, themselves, inline too.
Using find_vmap_area() requires taking a spinlock. The
check_object_size() helper can call find_vmap_area() when the destination
is in vmap memory. If show_regs() is called in interrupt context, it will
attempt a call to copy_from_user_nmi(), which may call check_object_size()
and then find_vmap_area(). If something in normal context happens to be
in the middle of calling find_vmap_area() (with the spinlock held), the
interrupt handler will hang forever.
The copy_from_user_nmi() call is actually being called with a fixed-size
length, so check_object_size() should never have been called in the first
place. Given the narrow constraints, just replace the
__copy_from_user_inatomic() call with an open-coded version that calls
only into the sanitizers and not check_object_size(), followed by a call
to raw_copy_from_user().
[akpm(a)linux-foundation.org: no instrument_copy_from_user() in my tree...]
Link: https://lkml.kernel.org/r/20220919201648.2250764-1-keescook@chromium.org
Link: https://lore.kernel.org/all/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9…
Fixes: 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns")
Signed-off-by: Kees Cook <keescook(a)chromium.org>
Reported-by: Yu Zhao <yuzhao(a)google.com>
Reported-by: Florian Lehner <dev(a)der-flo.net>
Suggested-by: Andrew Morton <akpm(a)linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Tested-by: Florian Lehner <dev(a)der-flo.net>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Josh Poimboeuf <jpoimboe(a)kernel.org>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/x86/lib/usercopy.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/lib/usercopy.c~x86-uaccess-avoid-check_object_size-in-copy_from_user_nmi
+++ a/arch/x86/lib/usercopy.c
@@ -44,7 +44,7 @@ copy_from_user_nmi(void *to, const void
* called from other contexts.
*/
pagefault_disable();
- ret = __copy_from_user_inatomic(to, from, n);
+ ret = raw_copy_from_user(to, from, n);
pagefault_enable();
return ret;
_
Patches currently in -mm which might be from keescook(a)chromium.org are