There are few uses of CoCo that don't rely on working cryptography and
hence a working RNG. Unfortunately, the CoCo threat model means that the
VM host cannot be trusted and may actively work against guests to
extract secrets or manipulate computation. Since a malicious host can
modify or observe nearly all inputs to guests, the only remaining source
of entropy for CoCo guests is RDRAND.
If RDRAND is broken -- due to CPU hardware fault -- the RNG as a whole
is meant to gracefully continue gathering entropy from other sources,
but since there aren't other sources on CoCo, this is catastrophic.
This is mostly a concern at boot time when initially seeding the RNG, as
after that the consequences of a broken RDRAND are much more
theoretical.
So, try at boot to seed the RNG using 256 bits of RDRAND output. If this
fails, panic(). This will also trigger if the system is booted without
RDRAND, as RDRAND is essential for a safe CoCo boot.
This patch is deliberately written to be "just a CoCo x86 driver
feature" and not part of the RNG itself. Many device drivers and
platforms have some desire to contribute something to the RNG, and
add_device_randomness() is specifically meant for this purpose. Any
driver can call this with seed data of any quality, or even garbage
quality, and it can only possibly make the quality of the RNG better or
have no effect, but can never make it worse. Rather than trying to
build something into the core of the RNG, this patch interprets the
particular CoCo issue as just a CoCo issue, and therefore separates this
all out into driver (well, arch/platform) code.
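As an illustration of that interface (a hedged sketch, not part of this
patch; the real CoCo call site is cc_random_init() below, and the
function name here is made up), any driver can mix in whatever
device-specific bytes it has on hand:

#include <linux/random.h>

/*
 * Illustration only: feed device-specific data -- serial numbers, MAC
 * addresses, timestamps -- into the input pool. No entropy is credited
 * for it, so low-quality data can only help or do nothing, never hurt.
 */
static void example_seed_rng(const void *id, size_t id_len)
{
	add_device_randomness(id, id_len);
}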
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Daniel P. Berrangé <berrange(a)redhat.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: H. Peter Anvin <hpa(a)zytor.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Reviewed-by: Elena Reshetova <elena.reshetova(a)intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reviewed-by: Theodore Ts'o <tytso(a)mit.edu>
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
---
Changes v3->v4:
- Add stable@ tag and reviewed-by lines.
- Add comment for Dave explaining where the "32" comes from.
arch/x86/coco/core.c | 40 +++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/coco.h | 2 ++
arch/x86/kernel/setup.c | 2 ++
3 files changed, 44 insertions(+)
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index eeec9986570e..0e988bff4aec 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -3,13 +3,16 @@
* Confidential Computing Platform Capability checks
*
* Copyright (C) 2021 Advanced Micro Devices, Inc.
+ * Copyright (C) 2024 Jason A. Donenfeld <Jason(a)zx2c4.com>. All Rights Reserved.
*
* Author: Tom Lendacky <thomas.lendacky(a)amd.com>
*/
#include <linux/export.h>
#include <linux/cc_platform.h>
+#include <linux/random.h>
+#include <asm/archrandom.h>
#include <asm/coco.h>
#include <asm/processor.h>
@@ -153,3 +156,40 @@ __init void cc_set_mask(u64 mask)
{
cc_mask = mask;
}
+
+__init void cc_random_init(void)
+{
+ /*
+ * The seed is 32 bytes (in units of longs), which is 256 bits, which
+ * is the security level that the RNG is targeting.
+ */
+ unsigned long rng_seed[32 / sizeof(long)];
+ size_t i, longs;
+
+ if (cc_vendor == CC_VENDOR_NONE)
+ return;
+
+ /*
+ * Since the CoCo threat model includes the host, the only reliable
+ * source of entropy that can be neither observed nor manipulated is
+ * RDRAND. Usually, RDRAND failure is considered tolerable, but since
+ * CoCo guests have no other unobservable source of entropy, it's
+ * important to at least ensure the RNG gets some initial random seeds.
+ */
+ for (i = 0; i < ARRAY_SIZE(rng_seed); i += longs) {
+ longs = arch_get_random_longs(&rng_seed[i], ARRAY_SIZE(rng_seed) - i);
+
+ /*
+ * A zero return value means that the guest doesn't have RDRAND
+ * or the CPU is physically broken, and in both cases that
+ * means most crypto inside of the CoCo instance will be
+ * broken, defeating the purpose of CoCo in the first place. So
+ * just panic here because it's absolutely unsafe to continue
+ * executing.
+ */
+ if (longs == 0)
+ panic("RDRAND is defective.");
+ }
+ add_device_randomness(rng_seed, sizeof(rng_seed));
+ memzero_explicit(rng_seed, sizeof(rng_seed));
+}
diff --git a/arch/x86/include/asm/coco.h b/arch/x86/include/asm/coco.h
index 76c310b19b11..e9d059449885 100644
--- a/arch/x86/include/asm/coco.h
+++ b/arch/x86/include/asm/coco.h
@@ -15,6 +15,7 @@ extern enum cc_vendor cc_vendor;
void cc_set_mask(u64 mask);
u64 cc_mkenc(u64 val);
u64 cc_mkdec(u64 val);
+void cc_random_init(void);
#else
#define cc_vendor (CC_VENDOR_NONE)
@@ -27,6 +28,7 @@ static inline u64 cc_mkdec(u64 val)
{
return val;
}
+static inline void cc_random_init(void) { }
#endif
#endif /* _ASM_X86_COCO_H */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 84201071dfac..30a653cfc7d2 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -36,6 +36,7 @@
#include <asm/bios_ebda.h>
#include <asm/bugs.h>
#include <asm/cacheinfo.h>
+#include <asm/coco.h>
#include <asm/cpu.h>
#include <asm/efi.h>
#include <asm/gart.h>
@@ -994,6 +995,7 @@ void __init setup_arch(char **cmdline_p)
* memory size.
*/
mem_encrypt_setup_arch();
+ cc_random_init();
efi_fake_memmap();
efi_find_mirror();
--
2.43.2
Here are the recently merged MDS improvement patches adapted to the latest
stable tree. I've only compile-tested them, but since I have also done
similar backports for older kernels I'm confident they should work.
The main difference is in the definition of the CLEAR_CPU_BUFFERS macro:
since 5.4 doesn't contain the alternative relocation handling logic, the verw
instruction is moved out of the alternative definition, and instead the
alternative contains a jump that skips over the verw instruction. That way
the relocation is handled by the toolchain rather than the kernel.
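For reference, the backported macro ends up with roughly this shape (a
sketch of the approach described above, not the exact hunk; label naming
may differ slightly in the actual patch):

.macro CLEAR_CPU_BUFFERS
	/* Default path jumps over the VERW; when X86_FEATURE_CLEAR_CPU_BUF
	 * is set the jump is patched out, so the VERW executes. */
	ALTERNATIVE "jmp .Lskip_verw_\@", "", X86_FEATURE_CLEAR_CPU_BUF
	/* The mds_verw_sel relocation is resolved by the toolchain here,
	 * not by the kernel's alternatives code. */
	verw _ASM_RIP(mds_verw_sel)
.Lskip_verw_\@:
.endm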
Since I don't know if I will have time to work on the other branches, this
patchset can be used as a basis for the rest of the stable kernels. The main
difference would be which bit is used for CLEAR_CPU_BUFFERS. For kernel 6.6
the 2nd patch can be used verbatim from upstream (unlike this modified
version) since the alternative relocation handling did land in v6.5.
However, even if it is used as-is from this patchset, that's not a problem.
V2:
Added upstream commit id to individual patches.
H. Peter Anvin (Intel) (1):
x86/asm: Add _ASM_RIP() macro for x86-64 (%rip) suffix
Pawan Gupta (5):
x86/bugs: Add asm helpers for executing VERW
x86/entry_64: Add VERW just before userspace transition
x86/entry_32: Add VERW just before userspace transition
x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
KVM/VMX: Move VERW closer to VMentry for MDS mitigation
Sean Christopherson (1):
KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH
Documentation/x86/mds.rst | 38 ++++++++++++++++++++--------
arch/x86/entry/Makefile | 2 +-
arch/x86/entry/common.c | 2 --
arch/x86/entry/entry.S | 23 +++++++++++++++++
arch/x86/entry/entry_32.S | 3 +++
arch/x86/entry/entry_64.S | 10 ++++++++
arch/x86/entry/entry_64_compat.S | 1 +
arch/x86/include/asm/asm.h | 6 ++++-
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/include/asm/irqflags.h | 1 +
arch/x86/include/asm/nospec-branch.h | 26 ++++++++++---------
arch/x86/kernel/cpu/bugs.c | 15 +++++------
arch/x86/kernel/nmi.c | 3 ---
arch/x86/kvm/vmx/run_flags.h | 7 +++--
arch/x86/kvm/vmx/vmenter.S | 9 ++++---
arch/x86/kvm/vmx/vmx.c | 12 ++++++---
16 files changed, 111 insertions(+), 49 deletions(-)
create mode 100644 arch/x86/entry/entry.S
--
2.34.1
From: Charan Teja Kalla <quic_charante(a)quicinc.com>
This fix is applicable to the LTS kernel 6.1.y. In the latest kernels, this
race issue is fixed by the patch series [1] and [2]. The right thing to do
here would have been to propagate these changes from the latest kernel to
the stable branch 6.1.y. However, these changes seem too intrusive to be
picked for stable branches. Hence, the proposed fix can be taken as an
alternative to backporting the patch series.
[1] https://lore.kernel.org/all/0-v8-81230027b2fa+9d-iommu_all_defdom_jgg@nvidi…
[2] https://lore.kernel.org/all/0-v5-1b99ae392328+44574-iommu_err_unwind_jgg@nv…
Issue:
A race condition is observed when arm_smmu_device_probe and
modprobe of client devices happen in parallel. This results
in the allocation of a new default domain for the iommu group
even though it was previously allocated and the respective IOVA
domain (iovad) was initialized. However, for this newly allocated
default domain, the iovad will not be initialized. As a result, devices
requesting DMA allocations will use this uninitialized iovad,
thereby causing a NULL pointer dereference.
Flow:
- During arm_smmu_device_probe, bus_iommu_probe() will be called
as part of iommu_device_register(). This results in the device probe,
__iommu_probe_device().
- When the modprobe of the client device happens in parallel, it
sets up the DMA configuration for the device using of_dma_configure_id(),
which in turn calls iommu_probe_device(). Later, the default domain is
allocated and attached using iommu_alloc_default_domain() and
__iommu_attach_device() respectively. It then ends up initializing a
mapping domain (IOVA domain) and rcaches for the device via
arch_setup_dma_ops()->iommu_setup_dma_ops().
- Now, in the bus_iommu_probe() path, it again tries to allocate
a default domain via probe_alloc_default_domain(). This results in
allocating a new default domain (along with an IOVA domain) via
__iommu_domain_alloc(). However, this newly allocated IOVA domain
will not be initialized.
- Now, when the same client device tries DMA allocations via
iommu_dma_alloc(), it ends up accessing the rcaches of the newly
allocated IOVA domain, which is not initialized. This results
in a NULL pointer dereference.
Fix this issue by adding a check in probe_alloc_default_domain()
to see if the iommu_group already has a default domain allocated
and initialized.
Signed-off-by: Charan Teja Kalla <quic_charante(a)quicinc.com>
Co-developed-by: Nikhil V <quic_nprakash(a)quicinc.com>
Signed-off-by: Nikhil V <quic_nprakash(a)quicinc.com>
---
drivers/iommu/iommu.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8b3897239477..83736824f17d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1741,6 +1741,9 @@ static void probe_alloc_default_domain(struct bus_type *bus,
{
struct __group_domain_type gtype;
+ if (group->default_domain)
+ return;
+
memset(&gtype, 0, sizeof(gtype));
/* Ask for default domain requirements of all devices in the group */
--
2.17.1
RISC-V PLIC cannot "end-of-interrupt" (EOI) disabled interrupts, as
explained in the description of Interrupt Completion in the PLIC spec:
"The PLIC signals it has completed executing an interrupt handler by
writing the interrupt ID it received from the claim to the claim/complete
register. The PLIC does not check whether the completion ID is the same
as the last claim ID for that target. If the completion ID does not match
an interrupt source that *is currently enabled* for the target, the
completion is silently ignored."
Commit 69ea463021be ("irqchip/sifive-plic: Fixup EOI failed when masked")
ensured that EOI is successful by enabling interrupt first, before EOI.
Commit a1706a1c5062 ("irqchip/sifive-plic: Separate the enable and mask
operations") removed the interrupt enabling code from the previous
commit, because it assumes that the interrupt is already enabled at the
point of EOI. However, this is incorrect: there is a window between a hart
claiming an interrupt and irq_desc->lock being acquired, and the interrupt
can be disabled during this window. Thus, EOI can be invoked while the
interrupt is disabled, effectively nullifying the EOI. This results in the
interrupt never being asserted again, and the device that uses this
interrupt appears frozen.
Make sure that interrupt is really enabled before EOI.
Fixes: a1706a1c5062 ("irqchip/sifive-plic: Separate the enable and mask operations")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Nam Cao <namcao(a)linutronix.de>
---
v2:
- add unlikely() for optimization
- re-word commit message to make it clearer
drivers/irqchip/irq-sifive-plic.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/irqchip/irq-sifive-plic.c b/drivers/irqchip/irq-sifive-plic.c
index e1484905b7bd..0a233e9d9607 100644
--- a/drivers/irqchip/irq-sifive-plic.c
+++ b/drivers/irqchip/irq-sifive-plic.c
@@ -148,7 +148,13 @@ static void plic_irq_eoi(struct irq_data *d)
{
struct plic_handler *handler = this_cpu_ptr(&plic_handlers);
- writel(d->hwirq, handler->hart_base + CONTEXT_CLAIM);
+ if (unlikely(irqd_irq_disabled(d))) {
+ plic_toggle(handler, d->hwirq, 1);
+ writel(d->hwirq, handler->hart_base + CONTEXT_CLAIM);
+ plic_toggle(handler, d->hwirq, 0);
+ } else {
+ writel(d->hwirq, handler->hart_base + CONTEXT_CLAIM);
+ }
}
#ifdef CONFIG_SMP
--
2.39.2
[Resending as per Russell's request...]
Hello,
I noticed after a recent kernel update that my ARM926 system started
segfaulting on any execve() after calling prctl(PR_SET_MDWE). After
some investigation it appears that ARMv5 is incapable of providing the
appropriate protections for MDWE, since any readable memory is also
implicitly executable.
(Note that I'm not an expert in either ARM arch details or the mm
subsystem, so please bear with me if I've botched something in the
above analysis.)
The prctl_set_mdwe() function already had some special-case logic
added disabling it on PARISC (commit 793838138c15, "prctl: Disable
prctl(PR_SET_MDWE) on parisc"); this patch series (1) generalizes that
check to use an arch_*() function, and (2) adds a corresponding
override for ARM to disable MDWE on pre-ARMv6 CPUs.
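A rough sketch of what the arch hook might look like -- the function name
and the ARM helpers here are my reading of the series, not quoted verbatim
from it:

/* include/linux/mman.h: default to "supported" unless an arch overrides it */
#ifndef arch_memory_deny_write_exec_supported
static inline bool arch_memory_deny_write_exec_supported(void)
{
	return true;
}
#endif

/* arch/arm/include/asm/mman.h: MDWE is only enforceable from ARMv6 onwards */
#include <asm/system_info.h>

static inline bool arch_memory_deny_write_exec_supported(void)
{
	return cpu_architecture() >= CPU_ARCH_ARMv6;
}
#define arch_memory_deny_write_exec_supported arch_memory_deny_write_exec_supported

prctl_set_mdwe() would then reject PR_SET_MDWE (e.g. with -EINVAL) when the
hook returns false.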
With the series applied, prctl(PR_SET_MDWE) is rejected on ARMv5 and
subsequent execve() calls (as well as mmap(PROT_READ|PROT_WRITE)) can
succeed instead of unconditionally failing; on ARMv6 the prctl works
as it did previously.
Thanks,
Zev
[0] https://lore.kernel.org/all/2023112456-linked-nape-bf19@gregkh/
Zev Weiss (2):
prctl: Generalize PR_SET_MDWE support check to be per-arch
ARM: prctl: Reject PR_SET_MDWE on pre-ARMv6
arch/arm/include/asm/mman.h | 14 ++++++++++++++
arch/parisc/include/asm/mman.h | 14 ++++++++++++++
include/linux/mman.h | 8 ++++++++
kernel/sys.c | 7 +++++--
4 files changed, 41 insertions(+), 2 deletions(-)
create mode 100644 arch/arm/include/asm/mman.h
create mode 100644 arch/parisc/include/asm/mman.h
--
2.43.2
Commit 3a2dbc510c43 ("driver core: fw_devlink: Don't purge child
fwnode's consumer links") introduced the possibility of using the
supplier's parent device instead of the supplier itself.
In that case the supplier fwnode used is not updated and is no longer
consistent with the supplier device used.
Use the fwnode consistent with the supplier device when checking flags.
Fixes: 3a2dbc510c43 ("driver core: fw_devlink: Don't purge child fwnode's consumer links")
Cc: stable(a)vger.kernel.org
Signed-off-by: Herve Codina <herve.codina(a)bootlin.com>
---
Changes v2 -> v3:
Do not update the supplier handle in order to keep the original handle
for debug traces.
Changes v1 -> v2:
Remove sup_handle check and related pr_debug() call as sup_handle cannot be
invalid if sup_dev is valid.
drivers/base/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 4d8b315c48a1..440b52ec027f 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2082,7 +2082,7 @@ static int fw_devlink_create_devlink(struct device *con,
* supplier device indefinitely.
*/
if (sup_dev->links.status == DL_DEV_NO_DRIVER &&
- sup_handle->flags & FWNODE_FLAG_INITIALIZED) {
+ sup_dev->fwnode->flags & FWNODE_FLAG_INITIALIZED) {
dev_dbg(con,
"Not linking %pfwf - dev might never probe\n",
sup_handle);
--
2.41.0
After upgrading from 6.7.0 to 6.7.1 a couple of my systems with md
RAID-5 arrays started experiencing hangs. It starts with some
processes which write to the array getting stuck. The whole system
eventually becomes unresponsive and unclean shutdown must be performed
(poweroff and reboot don't work).
While trying to diagnose the issue, I noticed that the md0_raid5
kernel thread consumes 100% CPU after the issue occurs. No relevant
warnings or errors were found in dmesg.
On 6.7.1, I can reproduce the issue somewhat reliably by copying a
large amount of data to the array. I am unable to reproduce the issue
at all on 6.7.0. The bisection was a bit difficult since I don't have
a 100% reliable method to reproduce the problem, but with some
perseverance I eventually managed to whittle it down to commit
0de40f76d567 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
raid5d"). After reverting that commit (i.e. reapplying the reverted
commit) on top of 6.7.1 I can no longer reproduce the problem at all.
Some details that might be relevant:
- Both systems are running MD RAID-5 with a journal device.
- mdadm in monitor mode is always running on both systems.
- Both systems were previously running 6.7.0 and earlier just fine.
- The older of the two systems has been running a raid5 array without
incident for many years (kernel going back to at least 5.1) -- this
is the first raid5 issue it has encountered.
Please let me know if there is any other helpful information that I
might be able to provide.
-- Dan
#regzbot introduced: 0de40f76d567133b871cd6ad46bb87afbce46983