From: Kai Huang <kai.huang@intel.com>
[ Upstream commit 10df8607bf1a22249d21859f56eeb61e9a033313 ]
On TDX platforms, dirty cacheline aliases with and without encryption bits can coexist, and the CPU can flush them back to memory in random order. During kexec, the caches must be flushed before jumping to the new kernel; otherwise the dirty cachelines could silently corrupt the memory used by the new kernel due to the different encryption properties.
A percpu boolean is used to mark whether the cache of a given CPU may be in an incoherent state, and kexec performs WBINVD on the CPUs with that boolean set.
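For context, the kexec side is expected to consume that boolean roughly as in the sketch below. This is an illustrative reconstruction, not part of this patch: kexec_flush_incoherent_cache() is an invented name, and the real consumers live in the kexec/CPU-stop paths (see the code references later in this mail).

    /*
     * Illustrative sketch only (not part of this patch): how a kexec/CPU-stop
     * path consumes the percpu boolean.  kexec_flush_incoherent_cache() is an
     * invented name; the real consumers in arch/x86/kernel/process.c and the
     * relocation path may differ in detail.
     */
    #include <linux/percpu.h>
    #include <asm/special_insns.h>

    DECLARE_PER_CPU(bool, cache_state_incoherent);

    static inline void kexec_flush_incoherent_cache(void)
    {
            /*
             * WBINVD writes back and invalidates all cachelines, including
             * dirty aliases created through TDX private (KeyID) mappings,
             * so they cannot be written back behind the new kernel's back.
             */
            if (this_cpu_read(cache_state_incoherent)) {
                    wbinvd();
                    this_cpu_write(cache_state_incoherent, false);
            }
    }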
For TDX, only the TDX module or the TDX guests can generate dirty cachelines of TDX private memory, i.e., such cachelines are only generated when the kernel does a SEAMCALL.
Set that boolean when the kernel does a SEAMCALL so that kexec can flush the cache correctly.
The kernel provides both the __seamcall*() assembly functions and the seamcall*() wrappers, which additionally handle the "running out of entropy" error in a retry loop. Most SEAMCALLs are made via seamcall*(); the exceptions are TDH.VP.ENTER and TDH.PHYMEM.PAGE.RDMD, which call the __seamcall*() variants directly.
To cover the two special cases, add a new __seamcall_dirty_cache() helper which only sets the percpu boolean and then calls the given __seamcall*() function, and change the two special cases to use the new helper. To cover all other SEAMCALLs, change seamcall*() to call the new helper.
The SEAMCALLs invoked via seamcall*() can be made from both task context and IRQ-disabled context. Given that a SEAMCALL is just a lengthy instruction (e.g., thousands of cycles) from the kernel's point of view, and preempt_{disable|enable}() is cheap compared to it, simply disable preemption unconditionally while setting the boolean and making the SEAMCALL.
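For illustration, a typical call site looks the same after this change; only the wrapper's internals differ. The leaf number and function name below are invented for this sketch and are not taken from the patch.

    /*
     * Hypothetical caller sketch: TDH_EXAMPLE_LEAF and do_example_seamcall()
     * are invented for illustration and are not part of the patch.  Existing
     * callers of the seamcall*() wrappers stay unchanged; sc_retry() now sets
     * the percpu boolean and disables preemption around the actual SEAMCALL.
     */
    #include <asm/tdx.h>

    #define TDH_EXAMPLE_LEAF 99 /* invented leaf number, illustration only */

    static u64 do_example_seamcall(struct tdx_module_args *args)
    {
            /* Retries on TDX_RND_NO_ENTROPY, with the cache flag set first */
            return sc_retry(__seamcall_ret, TDH_EXAMPLE_LEAF, args);
    }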
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Link: https://lore.kernel.org/all/20250901160930.1785244-4-pbonzini%40redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
Why this fixes a real bug
- TDX can leave dirty cachelines for private memory under different encryption attributes (KeyID aliases). Any CPU that has executed SEAMCALLs may hold such dirty cachelines, and during kexec they can later be written back in unpredictable order and silently corrupt the new kernel’s memory. Marking the CPU’s cache state as “incoherent” before executing a SEAMCALL ensures kexec will WBINVD on that CPU and avoid the corruption.
What changed (key points with code references)
- New helper marks per-CPU cache incoherent before any SEAMCALL:
  - arch/x86/include/asm/tdx.h:111 sets `this_cpu_write(cache_state_incoherent, true)` in `__seamcall_dirty_cache()` and asserts preemption is disabled (lines 111–128).
- Wrap all `seamcall*()` paths in a preemption-disabled critical section:
  - arch/x86/include/asm/tdx.h:130–147 uses `preempt_disable()`/`preempt_enable()` in `sc_retry()` so the same CPU that sets the flag executes the SEAMCALL, avoiding migration races.
- Convert the special direct callers to use the new helper:
  - arch/x86/virt/vmx/tdx/tdx.c:1271 changes `paddr_is_tdx_private()` to call `__seamcall_dirty_cache(__seamcall_ret, TDH_PHYMEM_PAGE_RDMD, ...)`.
  - arch/x86/virt/vmx/tdx/tdx.c:1522 changes `tdh_vp_enter()` to call `__seamcall_dirty_cache(__seamcall_saved_ret, TDH_VP_ENTER, ...)`.
- Consumers of the per-CPU flag during kexec/CPU stop (a rough sketch of this hand-off follows this list):
  - arch/x86/kernel/process.c:99 defines `cache_state_incoherent`, and `stop_this_cpu()` issues WBINVD if it is set (arch/x86/kernel/process.c:840).
  - arch/x86/kernel/machine_kexec_64.c:449 sets `RELOC_KERNEL_CACHE_INCOHERENT` when the per-CPU flag is set so `relocate_kernel_64.S` executes WBINVD on the relocation path.
  - The TDX-specific flush routine will WBINVD and clear the flag if needed (arch/x86/virt/vmx/tdx/tdx.c:1872–1887).
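The kexec hand-off referenced above can be sketched roughly as follows. This is an illustrative reconstruction based purely on the file references: the helper name is invented, and the real code in arch/x86/kernel/machine_kexec_64.c may structure this differently.

    /*
     * Sketch of the relocation hand-off: build_relocate_kernel_flags() is an
     * invented helper name.  The point is that the percpu boolean set by
     * __seamcall_dirty_cache() becomes a flag telling the relocate_kernel()
     * assembly to run WBINVD before jumping into the new kernel.
     */
    #include <linux/percpu.h>
    #include <asm/processor.h>

    static unsigned int build_relocate_kernel_flags(void)
    {
            unsigned int flags = 0;

            if (this_cpu_read(cache_state_incoherent))
                    flags |= RELOC_KERNEL_CACHE_INCOHERENT;

            return flags;
    }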
Why it’s safe to backport
- Scope-limited: touches only TDX host paths and the seamcall wrappers; no ABI or architectural changes.
- Minimal risk: setting a per-CPU boolean and wrapping SEAMCALLs with preempt disable. SEAMCALLs are long; the added preemption control is negligible overhead and avoids CPU-migration races.
- Correctness across contexts: SEAMCALLs can happen with IRQs disabled; the helper asserts preemption is off, and the wrappers explicitly ensure it. The two special direct-call sites run in contexts where IRQs are off or preemption is already disabled.
- Aligns with existing kexec logic: stable trees already check `cache_state_incoherent` during CPU stop and relocation (arch/x86/kernel/process.c:840, arch/x86/kernel/machine_kexec_64.c:449).
Dependencies/assumptions for stable trees
- Requires the per-CPU `cache_state_incoherent` infrastructure and its kexec consumers:
  - Declaration: arch/x86/include/asm/processor.h:734
  - Definition/usage: arch/x86/kernel/process.c:99, arch/x86/kernel/process.c:840
  - Kexec integration: arch/x86/kernel/machine_kexec_64.c:449 and arch/x86/kernel/relocate_kernel_64.S (WBINVD when `RELOC_KERNEL_CACHE_INCOHERENT` is set)
Summary
- This is a focused, low-risk bugfix preventing silent memory corruption on TDX hosts during kexec, by correctly marking and subsequently flushing CPUs that might have generated dirty private cachelines during SEAMCALLs. It satisfies the stable backport criteria (user-visible correctness fix, minimal change, localized impact).
 arch/x86/include/asm/tdx.h  | 25 ++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.c |  4 ++--
 2 files changed, 26 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 7ddef3a698668..0922265c6bdcb 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -102,10 +102,31 @@ u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
 u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
 void tdx_init(void);
 
+#include <linux/preempt.h>
 #include <asm/archrandom.h>
+#include <asm/processor.h>
 
 typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
 
+static __always_inline u64 __seamcall_dirty_cache(sc_func_t func, u64 fn,
+						  struct tdx_module_args *args)
+{
+	lockdep_assert_preemption_disabled();
+
+	/*
+	 * SEAMCALLs are made to the TDX module and can generate dirty
+	 * cachelines of TDX private memory.  Mark cache state incoherent
+	 * so that the cache can be flushed during kexec.
+	 *
+	 * This needs to be done before actually making the SEAMCALL,
+	 * because kexec-ing CPU could send NMI to stop remote CPUs,
+	 * in which case even disabling IRQ won't help here.
+	 */
+	this_cpu_write(cache_state_incoherent, true);
+
+	return func(fn, args);
+}
+
 static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
 				    struct tdx_module_args *args)
 {
@@ -113,7 +134,9 @@ static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
 	u64 ret;
 
 	do {
-		ret = func(fn, args);
+		preempt_disable();
+		ret = __seamcall_dirty_cache(func, fn, args);
+		preempt_enable();
 	} while (ret == TDX_RND_NO_ENTROPY && --retry);
 
 	return ret;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c7a9a087ccaf5..3ea6f587c81a3 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1266,7 +1266,7 @@ static bool paddr_is_tdx_private(unsigned long phys)
 		return false;
 
 	/* Get page type from the TDX module */
-	sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
+	sret = __seamcall_dirty_cache(__seamcall_ret, TDH_PHYMEM_PAGE_RDMD, &args);
 
 	/*
 	 * The SEAMCALL will not return success unless there is a
@@ -1522,7 +1522,7 @@ noinstr __flatten u64 tdh_vp_enter(struct tdx_vp *td, struct tdx_module_args *ar
 {
 	args->rcx = tdx_tdvpr_pa(td);
 
-	return __seamcall_saved_ret(TDH_VP_ENTER, args);
+	return __seamcall_dirty_cache(__seamcall_saved_ret, TDH_VP_ENTER, args);
 }
 EXPORT_SYMBOL_GPL(tdh_vp_enter);
 