Changes since v1 [1]: * Fix the fact that devmem_is_allowed() == 2 does not prevent mmap access (Kees) * Rather than teach devmem_is_allowed() == 2 to map zero pages in the mmap case, just fail (Nikolay)
[1]: http://lore.kernel.org/67f5b75c37143_71fe2949b@dwillia2-xfh.jf.intel.com.not...
--- The story starts with Nikolay reporting an SEPT violation due to mismatched encrypted/non-encrypted mappings of the BIOS data space [2].
An initial suggestion to just make sure that the BIOS data space is mapped consistently [3] ran into another issue that TDX and SEV-SNP disagree about when that space can be mapped as encrypted.
Then, in response to a partial patch to allow SEV-SNP to block BIOS data space for other reasons [4], Dave asked why not just give up on /dev/mem access entirely in the confidential VM case [5].
Enter this series to:
1/ Close a subtle hole whereby /dev/mem that is supposed return zeros in lieu of access only enforces that for read()/write()
2/ Use that new closed hole to reliably disable all /dev/mem access for confidential x86 VMs
[2]: http://lore.kernel.org/20250318113604.297726-1-nik.borisov@suse.com [3]: http://lore.kernel.org/174346288005.2166708.14425674491111625620.stgit@dwill... [4]: http://lore.kernel.org/20250403120228.2344377-1-naveen@kernel.org [5]: http://lore.kernel.org/fd683daa-d953-48ca-8c5d-6f4688ad442c@intel.com ---
Dan Williams (3): x86/devmem: Remove duplicate range_is_allowed() definition devmem: Block mmap access when read/write access is restricted x86/devmem: Restrict /dev/mem access for potentially unaccepted memory by default
arch/x86/Kconfig | 2 ++ arch/x86/include/asm/x86_init.h | 2 ++ arch/x86/kernel/x86_init.c | 6 ++++++ arch/x86/mm/init.c | 23 +++++++++++++++++------ arch/x86/mm/pat/memtype.c | 31 ++++--------------------------- drivers/char/mem.c | 18 ------------------ include/linux/io.h | 26 ++++++++++++++++++++++++++ 7 files changed, 57 insertions(+), 51 deletions(-)
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
Nikolay reports [1] that accessing BIOS data (first 1MB of the physical address space) via /dev/mem results in an SEPT violation.
The cause is ioremap() (via xlate_dev_mem_ptr()) establishes an unencrypted mapping where the kernel had established an encrypted mapping previously.
An initial attempt to fix this revealed that TDX and SEV-SNP have different expectations about which and when address ranges can be mapped via /dev/mem.
Rather than develop a precise set of allowed /dev/mem capable TVM address ranges, teach devmem_is_allowed() to always restrict access to the BIOS data space. This means return 0s for read(), drop write(), and -EPERM mmap(). This can still be later relaxed as specific needs arise, but in the meantime, close off this source of mismatched IORES_MAP_ENCRYPTED expectations.
Cc: x86@kernel.org Cc: Ingo Molnar mingo@kernel.org Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Vishal Annapurve vannapurve@google.com Cc: Kirill Shutemov kirill.shutemov@linux.intel.com Reported-by: Nikolay Borisov nik.borisov@suse.com Closes: http://lore.kernel.org/20250318113604.297726-1-nik.borisov@suse.com [1] Reviewed-by: Nikolay Borisov nik.borisov@suse.com Fixes: 9aa6ea69852c ("x86/tdx: Make pages shared in ioremap()") Cc: stable@vger.kernel.org Signed-off-by: Dan Williams dan.j.williams@intel.com --- arch/x86/Kconfig | 2 ++ arch/x86/include/asm/x86_init.h | 2 ++ arch/x86/kernel/x86_init.c | 6 ++++++ arch/x86/mm/init.c | 23 +++++++++++++++++------ 4 files changed, 27 insertions(+), 6 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 4b9f378e05f6..12a1b5acd55b 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -891,6 +891,7 @@ config INTEL_TDX_GUEST depends on X86_X2APIC depends on EFI_STUB depends on PARAVIRT + depends on STRICT_DEVMEM select ARCH_HAS_CC_PLATFORM select X86_MEM_ENCRYPT select X86_MCE @@ -1510,6 +1511,7 @@ config AMD_MEM_ENCRYPT bool "AMD Secure Memory Encryption (SME) support" depends on X86_64 && CPU_SUP_AMD depends on EFI_STUB + depends on STRICT_DEVMEM select DMA_COHERENT_POOL select ARCH_USE_MEMREMAP_PROT select INSTRUCTION_DECODER diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h index 213cf5379a5a..0ae436b34b88 100644 --- a/arch/x86/include/asm/x86_init.h +++ b/arch/x86/include/asm/x86_init.h @@ -305,6 +305,7 @@ struct x86_hyper_runtime { * semantics. * @realmode_reserve: reserve memory for realmode trampoline * @realmode_init: initialize realmode trampoline + * @devmem_is_allowed restrict /dev/mem and PCI sysfs resource access * @hyper: x86 hypervisor specific runtime callbacks */ struct x86_platform_ops { @@ -323,6 +324,7 @@ struct x86_platform_ops { void (*set_legacy_features)(void); void (*realmode_reserve)(void); void (*realmode_init)(void); + bool (*devmem_is_allowed)(unsigned long pfn); struct x86_hyper_runtime hyper; struct x86_guest guest; }; diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c index 0a2bbd674a6d..346301375bd4 100644 --- a/arch/x86/kernel/x86_init.c +++ b/arch/x86/kernel/x86_init.c @@ -143,6 +143,11 @@ static void enc_kexec_begin_noop(void) {} static void enc_kexec_finish_noop(void) {} static bool is_private_mmio_noop(u64 addr) {return false; }
+static bool platform_devmem_is_allowed(unsigned long pfn) +{ + return !cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT); +} + struct x86_platform_ops x86_platform __ro_after_init = { .calibrate_cpu = native_calibrate_cpu_early, .calibrate_tsc = native_calibrate_tsc, @@ -156,6 +161,7 @@ struct x86_platform_ops x86_platform __ro_after_init = { .restore_sched_clock_state = tsc_restore_sched_clock_state, .realmode_reserve = reserve_real_mode, .realmode_init = init_real_mode, + .devmem_is_allowed = platform_devmem_is_allowed, .hyper.pin_vcpu = x86_op_int_noop, .hyper.is_private_mmio = is_private_mmio_noop,
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index bfa444a7dbb0..df5435c8dbea 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -861,18 +861,23 @@ void __init poking_init(void) * area traditionally contains BIOS code and data regions used by X, dosemu, * and similar apps. Since they map the entire memory range, the whole range * must be allowed (for mapping), but any areas that would otherwise be - * disallowed are flagged as being "zero filled" instead of rejected. + * disallowed are flagged as being "zero filled" instead of rejected, for + * read()/write(). + * * Access has to be given to non-kernel-ram areas as well, these contain the * PCI mmio resources as well as potential bios/acpi data regions. */ int devmem_is_allowed(unsigned long pagenr) { + bool platform_allowed = x86_platform.devmem_is_allowed(pagenr); + if (region_intersects(PFN_PHYS(pagenr), PAGE_SIZE, IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE) != REGION_DISJOINT) { /* - * For disallowed memory regions in the low 1MB range, - * request that the page be shown as all zeros. + * For disallowed memory regions in the low 1MB range, request + * that the page be shown as all zeros for read()/write(), fail + * mmap() */ if (pagenr < 256) return 2; @@ -885,14 +890,20 @@ int devmem_is_allowed(unsigned long pagenr) * restricted resource under CONFIG_STRICT_DEVMEM. */ if (iomem_is_exclusive(pagenr << PAGE_SHIFT)) { - /* Low 1MB bypasses iomem restrictions. */ - if (pagenr < 256) + /* + * Low 1MB bypasses iomem restrictions unless the platform says + * the physical address is not suitable for direct access. + */ + if (pagenr < 256) { + if (!platform_allowed) + return 2; return 1; + }
return 0; }
- return 1; + return platform_allowed; }
void free_init_pages(const char *what, unsigned long begin, unsigned long end)
On Thu, Apr 10, 2025 at 06:22:38PM -0700, Dan Williams wrote:
Nikolay reports [1] that accessing BIOS data (first 1MB of the physical address space) via /dev/mem results in an SEPT violation.
The cause is ioremap() (via xlate_dev_mem_ptr()) establishes an unencrypted mapping where the kernel had established an encrypted mapping previously.
An initial attempt to fix this revealed that TDX and SEV-SNP have different expectations about which and when address ranges can be mapped via /dev/mem.
Rather than develop a precise set of allowed /dev/mem capable TVM address ranges, teach devmem_is_allowed() to always restrict access to the BIOS data space.
This patch does more than just restrict the BIOS data space - it rejects all accesses to /dev/mem _apart_ from the first 1MB. That should be made clear here.
This means return 0s for read(), drop write(), and -EPERM mmap(). This can still be later relaxed as specific needs arise, but in the meantime, close off this source of mismatched IORES_MAP_ENCRYPTED expectations.
Cc: x86@kernel.org Cc: Ingo Molnar mingo@kernel.org Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Vishal Annapurve vannapurve@google.com Cc: Kirill Shutemov kirill.shutemov@linux.intel.com Reported-by: Nikolay Borisov nik.borisov@suse.com Closes: http://lore.kernel.org/20250318113604.297726-1-nik.borisov@suse.com [1] Reviewed-by: Nikolay Borisov nik.borisov@suse.com Fixes: 9aa6ea69852c ("x86/tdx: Make pages shared in ioremap()") Cc: stable@vger.kernel.org Signed-off-by: Dan Williams dan.j.williams@intel.com
arch/x86/Kconfig | 2 ++ arch/x86/include/asm/x86_init.h | 2 ++ arch/x86/kernel/x86_init.c | 6 ++++++ arch/x86/mm/init.c | 23 +++++++++++++++++------ 4 files changed, 27 insertions(+), 6 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 4b9f378e05f6..12a1b5acd55b 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -891,6 +891,7 @@ config INTEL_TDX_GUEST depends on X86_X2APIC depends on EFI_STUB depends on PARAVIRT
- depends on STRICT_DEVMEM select ARCH_HAS_CC_PLATFORM select X86_MEM_ENCRYPT select X86_MCE
@@ -1510,6 +1511,7 @@ config AMD_MEM_ENCRYPT bool "AMD Secure Memory Encryption (SME) support" depends on X86_64 && CPU_SUP_AMD depends on EFI_STUB
- depends on STRICT_DEVMEM select DMA_COHERENT_POOL select ARCH_USE_MEMREMAP_PROT select INSTRUCTION_DECODER
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h index 213cf5379a5a..0ae436b34b88 100644 --- a/arch/x86/include/asm/x86_init.h +++ b/arch/x86/include/asm/x86_init.h @@ -305,6 +305,7 @@ struct x86_hyper_runtime {
semantics.
- @realmode_reserve: reserve memory for realmode trampoline
- @realmode_init: initialize realmode trampoline
*/
- @devmem_is_allowed restrict /dev/mem and PCI sysfs resource access
- @hyper: x86 hypervisor specific runtime callbacks
struct x86_platform_ops { @@ -323,6 +324,7 @@ struct x86_platform_ops { void (*set_legacy_features)(void); void (*realmode_reserve)(void); void (*realmode_init)(void);
- bool (*devmem_is_allowed)(unsigned long pfn); struct x86_hyper_runtime hyper; struct x86_guest guest;
}; diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c index 0a2bbd674a6d..346301375bd4 100644 --- a/arch/x86/kernel/x86_init.c +++ b/arch/x86/kernel/x86_init.c @@ -143,6 +143,11 @@ static void enc_kexec_begin_noop(void) {} static void enc_kexec_finish_noop(void) {} static bool is_private_mmio_noop(u64 addr) {return false; } +static bool platform_devmem_is_allowed(unsigned long pfn) +{
- return !cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT);
+}
struct x86_platform_ops x86_platform __ro_after_init = { .calibrate_cpu = native_calibrate_cpu_early, .calibrate_tsc = native_calibrate_tsc, @@ -156,6 +161,7 @@ struct x86_platform_ops x86_platform __ro_after_init = { .restore_sched_clock_state = tsc_restore_sched_clock_state, .realmode_reserve = reserve_real_mode, .realmode_init = init_real_mode,
- .devmem_is_allowed = platform_devmem_is_allowed, .hyper.pin_vcpu = x86_op_int_noop, .hyper.is_private_mmio = is_private_mmio_noop,
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index bfa444a7dbb0..df5435c8dbea 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -861,18 +861,23 @@ void __init poking_init(void)
- area traditionally contains BIOS code and data regions used by X, dosemu,
- and similar apps. Since they map the entire memory range, the whole range
- must be allowed (for mapping), but any areas that would otherwise be
- disallowed are flagged as being "zero filled" instead of rejected.
- disallowed are flagged as being "zero filled" instead of rejected, for
- read()/write().
*/
- Access has to be given to non-kernel-ram areas as well, these contain the
- PCI mmio resources as well as potential bios/acpi data regions.
int devmem_is_allowed(unsigned long pagenr) {
- bool platform_allowed = x86_platform.devmem_is_allowed(pagenr);
If we are going to do this, I don't see the point of having an x86_platform_op. It may be better to simply gate this on cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) directly here.
Thanks, Naveen
if (region_intersects(PFN_PHYS(pagenr), PAGE_SIZE, IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE) != REGION_DISJOINT) { /*
* For disallowed memory regions in the low 1MB range,
* request that the page be shown as all zeros.
* For disallowed memory regions in the low 1MB range, request
* that the page be shown as all zeros for read()/write(), fail
*/ if (pagenr < 256) return 2;* mmap()
@@ -885,14 +890,20 @@ int devmem_is_allowed(unsigned long pagenr) * restricted resource under CONFIG_STRICT_DEVMEM. */ if (iomem_is_exclusive(pagenr << PAGE_SHIFT)) {
/* Low 1MB bypasses iomem restrictions. */
if (pagenr < 256)
/*
* Low 1MB bypasses iomem restrictions unless the platform says
* the physical address is not suitable for direct access.
*/
if (pagenr < 256) {
if (!platform_allowed)
return 2; return 1;
}
return 0; }
- return 1;
- return platform_allowed;
} void free_init_pages(const char *what, unsigned long begin, unsigned long end)
Naveen N Rao wrote:
On Thu, Apr 10, 2025 at 06:22:38PM -0700, Dan Williams wrote:
Nikolay reports [1] that accessing BIOS data (first 1MB of the physical address space) via /dev/mem results in an SEPT violation.
The cause is ioremap() (via xlate_dev_mem_ptr()) establishes an unencrypted mapping where the kernel had established an encrypted mapping previously.
An initial attempt to fix this revealed that TDX and SEV-SNP have different expectations about which and when address ranges can be mapped via /dev/mem.
Rather than develop a precise set of allowed /dev/mem capable TVM address ranges, teach devmem_is_allowed() to always restrict access to the BIOS data space.
This patch does more than just restrict the BIOS data space - it rejects all accesses to /dev/mem _apart_ from the first 1MB. That should be made clear here.
Agree, and per the follow on conversation [1] even that low 1MB access to return zeroes is not helpful. Confidential Computing userspace should drop its dependency on interfaces that are difficult to make compatible with consistent encryption policy and acceptance status.
http://lore.kernel.org/67f98e27a799e_7205294e3@dwillia2-xfh.jf.intel.com.not... [1]
[..]
int devmem_is_allowed(unsigned long pagenr) {
- bool platform_allowed = x86_platform.devmem_is_allowed(pagenr);
If we are going to do this, I don't see the point of having an x86_platform_op. It may be better to simply gate this on cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) directly here.
That is fair, no point in premature flexibility.
On 11.04.25 г. 4:22 ч., Dan Williams wrote:
Changes since v1 [1]:
- Fix the fact that devmem_is_allowed() == 2 does not prevent mmap access (Kees)
- Rather than teach devmem_is_allowed() == 2 to map zero pages in the mmap case, just fail (Nikolay)
The story starts with Nikolay reporting an SEPT violation due to mismatched encrypted/non-encrypted mappings of the BIOS data space [2].
An initial suggestion to just make sure that the BIOS data space is mapped consistently [3] ran into another issue that TDX and SEV-SNP disagree about when that space can be mapped as encrypted.
Then, in response to a partial patch to allow SEV-SNP to block BIOS data space for other reasons [4], Dave asked why not just give up on /dev/mem access entirely in the confidential VM case [5].
Enter this series to:
1/ Close a subtle hole whereby /dev/mem that is supposed return zeros in lieu of access only enforces that for read()/write()
2/ Use that new closed hole to reliably disable all /dev/mem access for confidential x86 VMs
[5]: http://lore.kernel.org/fd683daa-d953-48ca-8c5d-6f4688ad442c@intel.com
Dan Williams (3): x86/devmem: Remove duplicate range_is_allowed() definition devmem: Block mmap access when read/write access is restricted x86/devmem: Restrict /dev/mem access for potentially unaccepted memory by default
arch/x86/Kconfig | 2 ++ arch/x86/include/asm/x86_init.h | 2 ++ arch/x86/kernel/x86_init.c | 6 ++++++ arch/x86/mm/init.c | 23 +++++++++++++++++------ arch/x86/mm/pat/memtype.c | 31 ++++--------------------------- drivers/char/mem.c | 18 ------------------ include/linux/io.h | 26 ++++++++++++++++++++++++++ 7 files changed, 57 insertions(+), 51 deletions(-)
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
Reviewed-by: Nikolay Borisov nik.borisov@suse.com
linux-stable-mirror@lists.linaro.org