On Thu, May 1, 2025, at 02:56, Dan Williams wrote:
Arnd Bergmann wrote:
On Wed, Apr 30, 2025, at 04:46, Dan Williams wrote:
While there is an existing mitigation to simulate and redirect access to the BIOS data area with STRICT_DEVMEM=y, it is insufficient. Specifically, STRICT_DEVMEM=y traps read(2) access to the BIOS data area, and returns a zeroed buffer. However, it turns out the kernel fails to enforce the same via mmap(2), and a direct mapping is established. This is a hole, and unfortunately userspace has learned to exploit it [2].
As far as I can tell, this was a deliberate design choice in commit a4866aa81251 ("mm: Tighten x86 /dev/mem with zeroing reads"), which did not try to forbid it completely but mainly avoids triggering the hardened usercopy check.
I would say not a "design choice", but rather a known leftover hole that nobody has had the initiative to close since 2022.
https://lore.kernel.org/all/202204071526.37364B5E3@keescook/
Ok, I see.
The existing rules that I can see are:
- readl/write is only allowed on actual (lowmem) RAM, not on MMIO registers, enforced by valid_phys_addr_range()
- with STRICT_DEVMEM, read/write is disallowed on both RAM and MMIO
- an an exception, x86 additionally allows read/write on the low 1MB MMIO region and 32-bit PCI MMIO BAR space, with a custom xlate_dev_mem_ptr() that calls either memremap() or ioremap() on the physical address.
- as another exception from that, the low 1MB on x86 behaves like /dev/zero for memory pages when STRICT_DEVMEM is set, and ignores conflicting drivers for MMIO registers
- The PowerPC sys_rtas syscall has another exception in order to ignore the STRICT_DEVMEM and write to a portion of kernel memory to talk to firmware
- on the mmap() side, x86 has another special to allow mapping RAM in the first 1MB despite STRICT_DEVMEM
How about changing x86 to work more like the others and removing the special cases for the first 1MB and for the 32-bit PCI BAR space? If Xorg, and dmidecode are able to do this differently, maybe the hacks can just go away, or be guarded by a Kconfig option that is mutually exclusive with ARCH_HAS_CC_PLATFORM?
I see the 1MB MMIO special-case in x86::devmem_is_allowed(), but where is the 32-bit PCI BAR space workaround? Just to make sure I am not missing a detail here.
The main difference on x86 is the xlate_dev_mem_ptr() function that does an extra memremap() of the physical address, everything else just does a phys_to_virt(). The only other architecture with an xlate_dev_mem_ptr() implementation is s390, which uses it to work around the first physical page being different per CPU. ia64 had something similar to x86 but is gone now.
The other bit of the puzzle is that memremap() on x86 silently falls back to ioremap() for non-RAM pages. This was originally added in 2008 commit e045fb2a988a ("x86: PAT avoid aliasing in /dev/mem read/write"). I'm not sure what happened exactly, but I suspect that the low 1MB was already mapped at the time through a cached mapping, while the PCI MMIO hole was perhaps not mapped. On x86-32, the 32-bit PCI BAR area should not be included here (since it's above high_memory), but the 16MB hold may be.
The address is first checked by valid_phys_addr_range(), which is defined in an architecture specific way, so it's possible that there are additional architectures on which that includes MMIO ranges that can be accesses through phys_to_virt(), but I could not find any.
The default valid_phys_addr_range() just checks against 'high_memory' to see whether the address is in the linear map. This works on all architectures that don't have holes in the memory map for MMIO (most of them) or that don't just ioremap() all MMIO space into the hole (most of the rest).
arm64 and loongarch check memblock allow known RAM both in the linear map and outside of it, while arm32 and sh explicitly exclude addresses before PHYS_OFFSET on machines where RAM does not start at address 0.
Arnd