Arnd Bergmann wrote:
On Wed, Apr 30, 2025, at 04:46, Dan Williams wrote:
While there is an existing mitigation to simulate and redirect access to the BIOS data area with STRICT_DEVMEM=y, it is insufficient. Specifically, STRICT_DEVMEM=y traps read(2) access to the BIOS data area, and returns a zeroed buffer. However, it turns out the kernel fails to enforce the same via mmap(2), and a direct mapping is established. This is a hole, and unfortunately userspace has learned to exploit it [2].
As far as I can tell, this was a deliberate design choice in commit a4866aa81251 ("mm: Tighten x86 /dev/mem with zeroing reads"), which did not try to forbid it completely but mainly avoids triggering the hardened usercopy check.
I would say not a "design choice", but rather a known leftover hole that nobody has had the initiative to close since 2022.
https://lore.kernel.org/all/202204071526.37364B5E3@keescook/
The simplest option for now is arrange for /dev/mem to always behave as if lockdown is enabled for confidential guests. Require confidential guest userspace to jettison legacy dependencies on /dev/mem similar to how other legacy mechanisms are jettisoned for confidential operation. Recall that modern methods for BIOS data access are available like /sys/firmware/dmi/tables.
Restricting /dev/mem further is a good idea, but it would be nice if that could be done without adding yet another special case.
An even more radical approach would be to just disallow CONFIG_DEVMEM for any configuration that includes ARCH_HAS_CC_PLATFORM, but that may go a little too far.
Right, for example the policy could go as far as to always require generic LOCKDOWN_KERNEL for confidential guests, but a distro likely wants to be able to build confidential guests and bare metal host kernels from the same kernel config. At a minimum it seems difficult to get away from a runtime "is_confidential_guest()" check.
The other observation is that generic LOCKDOWN_KERNEL is about protecting against root being able to compromise platform integrity where confidential computing is full trust within the TEE, including root to run amok, and no trust outside that.
The existing rules that I can see are:
- readl/write is only allowed on actual (lowmem) RAM, not on MMIO registers, enforced by valid_phys_addr_range()
- with STRICT_DEVMEM, read/write is disallowed on both RAM and MMIO
- an an exception, x86 additionally allows read/write on the low 1MB MMIO region and 32-bit PCI MMIO BAR space, with a custom xlate_dev_mem_ptr() that calls either memremap() or ioremap() on the physical address.
- as another exception from that, the low 1MB on x86 behaves like /dev/zero for memory pages when STRICT_DEVMEM is set, and ignores conflicting drivers for MMIO registers
- The PowerPC sys_rtas syscall has another exception in order to ignore the STRICT_DEVMEM and write to a portion of kernel memory to talk to firmware
- on the mmap() side, x86 has another special to allow mapping RAM in the first 1MB despite STRICT_DEVMEM
How about changing x86 to work more like the others and removing the special cases for the first 1MB and for the 32-bit PCI BAR space? If Xorg, and dmidecode are able to do this differently, maybe the hacks can just go away, or be guarded by a Kconfig option that is mutually exclusive with ARCH_HAS_CC_PLATFORM?
I see the 1MB MMIO special-case in x86::devmem_is_allowed(), but where is the 32-bit PCI BAR space workaround? Just to make sure I am not missing a detail here.
Note, this devmem exclusion effort previously went through a phase of always returning "0" (no access) from x86::devmem_is_allowed() [1]. The rationale for hacking the special case into open_port() was to maintain ABI consistency with LOCKDOWN_KERNEL. Right now x86 userspace expects either LOCKDOWN_KERNEL semantics, or read(2) returns zero and mmap(2) is unrestricted. If devmem_is_allowed() always says "no" then userspace is introduced to a new failure mode.
I am open to rip the band-aid off and see what happens, but Robustness Principle suggested mimicking semantics that LOCKDOWN_KERNEL has already socialized.
[1]: http://lore.kernel.org/67f8a1a15cc29_7205294d7@dwillia2-xfh.jf.intel.com.not...
@@ -595,6 +596,15 @@ static int open_port(struct inode *inode, struct file *filp) if (rc) return rc;
- /*
* Enforce encrypted mapping consistency and avoid unaccepted
* memory conflicts, "lockdown" /dev/mem for confidential
* guests.
*/
- if (IS_ENABLED(CONFIG_STRICT_DEVMEM) &&
cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
return -EPERM;
The description only talks about /dev/mem, but it looks like this blocks /dev/port as well. Blocking /dev/port may also be a good idea, but I don't see why that would be conditional on CC_ATTR_GUEST_MEM_ENCRYPT.
That is more a side effect of wanting to mimic the security_locked_down(LOCKDOWN_DEV_MEM) behavior. That hook implies all of /dev/{mem,kmem,port} follow the same policy.
When CONFIG_DEVMEM=y and CONFIG_STRICT_DEVMEM=n, doesn't this still have the same problem for CC guests?
It does, but that's the point. The CC guests that need the exclusion have "select STRICT_DEVMEM", and *maybe* some CC guest arch could tolerate raw devmem access.
However, that is unlikely. The observations here and from Greg point to security_locked_down() should be providing the answer here. Which completes the full circle back towards Nikolay's original proposal of allowing lockdown policy to handle a bitmap of options [2].
Nikolay, part of me is glad to have done the full exploration of the problem space here. I learned something. At the same time it is humbling to realize I could have saved everyone's time just supporting your effort. Please pick up your lockdown bitmap proposal and consider this thread a long-winded Reviewed-by.
[2]: http://lore.kernel.org/20250321102422.640271-1-nik.borisov@suse.com
BTW, you can avoid the IO_STRICT_DEVMEM complexity by including LOCKDOWN_PCI_ACCESS in the list of LOCKDOWN bits that CC guests always enable. If someone wants to enable confidential userspace PCI drivers they can do the work to switch from LOCKDOWN_PCI_ACCESS to IO_STRICT_DEVMEM.