On Wed, Apr 30, 2025, at 04:46, Dan Williams wrote:
> While there is an existing mitigation to simulate and redirect access to the BIOS data area with STRICT_DEVMEM=y, it is insufficient. Specifically, STRICT_DEVMEM=y traps read(2) access to the BIOS data area, and returns a zeroed buffer. However, it turns out the kernel fails to enforce the same via mmap(2), and a direct mapping is established. This is a hole, and unfortunately userspace has learned to exploit it [2].
As far as I can tell, this was a deliberate design choice in commit a4866aa81251 ("mm: Tighten x86 /dev/mem with zeroing reads"), which did not try to forbid the access completely but mainly avoided triggering the hardened usercopy check.
> The simplest option for now is arrange for /dev/mem to always behave as if lockdown is enabled for confidential guests. Require confidential guest userspace to jettison legacy dependencies on /dev/mem similar to how other legacy mechanisms are jettisoned for confidential operation. Recall that modern methods for BIOS data access are available like /sys/firmware/dmi/tables.
Restricting /dev/mem further is a good idea, but it would be nice if that could be done without adding yet another special case.
An even more radical approach would be to just disallow CONFIG_DEVMEM for any configuration that includes ARCH_HAS_CC_PLATFORM, but that may go a little too far.
The existing rules that I can see are:
- read/write is only allowed on actual (lowmem) RAM, not on MMIO
  registers, enforced by valid_phys_addr_range()
- with STRICT_DEVMEM, read/write is disallowed on both RAM and MMIO
- as an exception, x86 additionally allows read/write on the low 1MB
  MMIO region and 32-bit PCI MMIO BAR space, with a custom
  xlate_dev_mem_ptr() that calls either memremap() or ioremap() on the
  physical address
- as another exception from that, the low 1MB on x86 behaves like
  /dev/zero for memory pages when STRICT_DEVMEM is set, and ignores
  conflicting drivers for MMIO registers
- the PowerPC sys_rtas syscall has another exception in order to ignore
  STRICT_DEVMEM and write to a portion of kernel memory to talk to
  firmware
- on the mmap() side, x86 has another special case to allow mapping RAM
  in the first 1MB despite STRICT_DEVMEM
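
To make the read/mmap asymmetry for RAM pages on x86 easier to see, here is a small userspace model of the rules above as I read them. It is only a sketch, not kernel code: the enum and function names are made up for illustration, and it deliberately ignores MMIO, the PCI BAR space and the PowerPC case.

```c
#include <stdbool.h>

/*
 * Hypothetical model, not kernel code: what userspace gets when it
 * touches a RAM page through /dev/mem on x86, per the rules above.
 */
enum devmem_op { DEVMEM_READ, DEVMEM_MMAP };
enum devmem_outcome { DEVMEM_REAL, DEVMEM_ZEROS, DEVMEM_DENIED };

enum devmem_outcome
devmem_ram_outcome(enum devmem_op op, bool strict_devmem, bool low_1mb)
{
	/* Without STRICT_DEVMEM, RAM is accessible either way. */
	if (!strict_devmem)
		return DEVMEM_REAL;

	/* With STRICT_DEVMEM, RAM above the first 1MB is off limits. */
	if (!low_1mb)
		return DEVMEM_DENIED;

	/*
	 * Low 1MB special cases: read(2) sees zeroes, while mmap(2)
	 * still establishes a mapping of the real page.
	 */
	return op == DEVMEM_READ ? DEVMEM_ZEROS : DEVMEM_REAL;
}
```

In this model, devmem_ram_outcome(DEVMEM_READ, true, true) gives DEVMEM_ZEROS while devmem_ram_outcome(DEVMEM_MMAP, true, true) gives DEVMEM_REAL, which is the inconsistency described in the patch description.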
How about changing x86 to work more like the others and removing the special cases for the first 1MB and for the 32-bit PCI BAR space? If Xorg and dmidecode are able to do this differently, maybe the hacks can just go away, or be guarded by a Kconfig option that is mutually exclusive with ARCH_HAS_CC_PLATFORM?
> @@ -595,6 +596,15 @@ static int open_port(struct inode *inode, struct file *filp)
>  	if (rc)
>  		return rc;
> +	/*
> +	 * Enforce encrypted mapping consistency and avoid unaccepted
> +	 * memory conflicts, "lockdown" /dev/mem for confidential
> +	 * guests.
> +	 */
> +	if (IS_ENABLED(CONFIG_STRICT_DEVMEM) &&
> +	    cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
> +		return -EPERM;
The description only talks about /dev/mem, but it looks like this blocks /dev/port as well. Blocking /dev/port may also be a good idea, but I don't see why that would be conditional on CC_ATTR_GUEST_MEM_ENCRYPT.
When CONFIG_DEVMEM=y and CONFIG_STRICT_DEVMEM=n, doesn't this still have the same problem for CC guests?
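
To spell out the case I mean, here is the condition from the hunk above pulled into a standalone function. This is just a sketch: the function name is hypothetical, and the two bool parameters stand in for IS_ENABLED(CONFIG_STRICT_DEVMEM) and cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT).

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the check added to open_port().
 * strict_devmem models IS_ENABLED(CONFIG_STRICT_DEVMEM),
 * cc_guest models cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT).
 */
int open_port_cc_check(bool strict_devmem, bool cc_guest)
{
	/* The open is refused only when both conditions hold. */
	if (strict_devmem && cc_guest)
		return -EPERM;

	return 0;
}
```

With strict_devmem false and cc_guest true this returns 0, so a confidential guest built with CONFIG_STRICT_DEVMEM=n can still open the device.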
Arnd