Re: [PATCH v5] x86/devmem: Drop /dev/mem access for confidential guests

1 May 2025


      Arnd Bergmann wrote:
...
On Wed, Apr 30, 2025, at 04:46, Dan Williams wrote:
...
While there is an existing mitigation to simulate and redirect access to
the BIOS data area with STRICT_DEVMEM=y, it is insufficient.
Specifically, STRICT_DEVMEM=y traps read(2) access to the BIOS data
area, and returns a zeroed buffer.  However, it turns out the kernel
fails to enforce the same via mmap(2), and a direct mapping is
established. This is a hole, and unfortunately userspace has learned to
exploit it [2].
As far as I can tell, this was a deliberate design choice in
commit a4866aa81251 ("mm: Tighten x86 /dev/mem with zeroing reads"),
which did not try to forbid it completely but mainly avoids triggering
the hardened usercopy check.
I would say not a "design choice", but rather a known leftover hole that
nobody has had the initiative to close since 2022.
https://lore.kernel.org/all/202204071526.37364B5E3@keescook/
...
...
The simplest option for now is arrange for /dev/mem to always behave as
if lockdown is enabled for confidential guests. Require confidential
guest userspace to jettison legacy dependencies on /dev/mem similar to
how other legacy mechanisms are jettisoned for confidential operation.
Recall that modern methods for BIOS data access are available like
/sys/firmware/dmi/tables.
Restricting /dev/mem further is a good idea, but it would be nice
if that could be done without adding yet another special case.
An even more radical approach would be to just disallow CONFIG_DEVMEM
for any configuration that includes ARCH_HAS_CC_PLATFORM, but that
may go a little too far.
Right, for example the policy could go as far as to always require
generic LOCKDOWN_KERNEL for confidential guests, but a distro likely
wants to be able to build confidential guests and bare metal host
kernels from the same kernel config. At a minimum it seems difficult to
get away from a runtime "is_confidential_guest()" check.
The other observation is that generic LOCKDOWN_KERNEL is about
protecting against root being able to compromise platform integrity
where confidential computing is full trust within the TEE, including
root to run amok, and no trust outside that.
...
The existing rules that I can see are:

readl/write is only allowed on actual (lowmem) RAM, not
on MMIO registers, enforced by valid_phys_addr_range()
with STRICT_DEVMEM, read/write is disallowed on both
RAM and MMIO
an an exception, x86 additionally allows read/write on the
low 1MB MMIO region and 32-bit PCI MMIO BAR space, with
a custom xlate_dev_mem_ptr() that calls either memremap()
or ioremap() on the physical address.
as another exception from that, the low 1MB on x86 behaves
like /dev/zero for memory pages when STRICT_DEVMEM
is set, and ignores conflicting drivers for MMIO registers
The PowerPC sys_rtas syscall has another exception in
order to ignore the STRICT_DEVMEM and write to a portion
of kernel memory to talk to firmware
on the mmap() side, x86 has another special to allow
mapping RAM in the first 1MB despite STRICT_DEVMEM

How about changing x86 to work more like the others and
removing the special cases for the first 1MB and for the
32-bit PCI BAR space? If Xorg, and dmidecode are able to
do this differently, maybe the hacks can just go away, or
be guarded by a Kconfig option that is mutually exclusive
with ARCH_HAS_CC_PLATFORM?
I see the 1MB MMIO special-case in x86::devmem_is_allowed(), but where
is the 32-bit PCI BAR space workaround? Just to make sure I am not
missing a detail here.
Note, this devmem exclusion effort previously went through a phase of
always returning "0" (no access) from x86::devmem_is_allowed() [1]. The
rationale for hacking the special case into open_port() was to maintain
ABI consistency with LOCKDOWN_KERNEL. Right now x86 userspace expects
either LOCKDOWN_KERNEL semantics, or read(2) returns zero and mmap(2) is
unrestricted. If devmem_is_allowed() always says "no" then userspace is
introduced to a new failure mode.
I am open to rip the band-aid off and see what happens, but Robustness
Principle suggested mimicking semantics that LOCKDOWN_KERNEL has already
socialized.
[1]: http://lore.kernel.org/67f8a1a15cc29_7205294d7@dwillia2-xfh.jf.intel.com.not...
...
...
@@ -595,6 +596,15 @@ static int open_port(struct inode *inode, struct 
file *filp)
   if (rc)
   	return rc;

/*
* Enforce encrypted mapping consistency and avoid unaccepted


* memory conflicts, "lockdown" /dev/mem for confidential


* guests.


*/


if (IS_ENABLED(CONFIG_STRICT_DEVMEM) &&
   cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))


return -EPERM;




The description only talks about /dev/mem, but it looks like this
blocks /dev/port as well. Blocking /dev/port may also be a good
idea, but I don't see why that would be conditional on
CC_ATTR_GUEST_MEM_ENCRYPT.
That is more a side effect of wanting to mimic the
security_locked_down(LOCKDOWN_DEV_MEM) behavior. That hook implies all
of /dev/{mem,kmem,port} follow the same policy.
...
When CONFIG_DEVMEM=y and CONFIG_STRICT_DEVMEM=n, doesn't this still
have the same problem for CC guests?
It does, but that's the point. The CC guests that need the exclusion
have "select STRICT_DEVMEM", and *maybe* some CC guest arch could
tolerate raw devmem access.
However, that is unlikely. The observations here and from Greg point to
security_locked_down() should be providing the answer here. Which
completes the full circle back towards Nikolay's original proposal of
allowing lockdown policy to handle a bitmap of options [2].
Nikolay, part of me is glad to have done the full exploration of the
problem space here. I learned something. At the same time it is humbling
to realize I could have saved everyone's time just supporting your
effort. Please pick up your lockdown bitmap proposal and consider this
thread a long-winded Reviewed-by.
[2]: http://lore.kernel.org/20250321102422.640271-1-nik.borisov@suse.com
BTW, you can avoid the IO_STRICT_DEVMEM complexity by including
LOCKDOWN_PCI_ACCESS in the list of LOCKDOWN bits that CC guests always
enable. If someone wants to enable confidential userspace PCI drivers
they can do the work to switch from LOCKDOWN_PCI_ACCESS to
IO_STRICT_DEVMEM.

    

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH v5] x86/devmem: Drop /dev/mem access for confidential guests