On 8/6/25 23:03, Dave Hansen wrote:
On 8/5/25 22:25, Lu Baolu wrote:
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware shares and walks the CPU's page tables. The Linux x86 architecture maps the kernel address space into the upper portion of every process’s page table. Consequently, in an SVA context, the IOMMU hardware can walk and cache kernel space mappings. However, the Linux kernel currently lacks a notification mechanism for kernel space mapping changes. This means the IOMMU driver is not aware of such changes, leading to a break in IOMMU cache coherence.
FWIW, I wouldn't use the term "cache coherence" in this context. I'd probably just call them "stale IOTLB entries".
I also think this overstates the problem. There is currently no problem with "kernel space mapping changes". The issue is solely when kernel page table pages are freed and reused.
Modern IOMMUs often cache page table entries of the intermediate-level page table as long as the entry is valid, no matter the permissions, to optimize walk performance. Currently the IOMMU driver is notified only for changes of user VA mappings, so the IOMMU's internal caches may retain stale entries for kernel VA. When kernel page table mappings are changed (e.g., by vfree()), but the IOMMU's internal caches retain stale entries, a Use-After-Free (UAF) vulnerability condition arises.
If these freed page table pages are reallocated for a different purpose, potentially by an attacker, the IOMMU could misinterpret the new data as valid page table entries. This allows the IOMMU to walk into attacker-controlled memory, leading to arbitrary physical memory DMA access or privilege escalation.
Note that it's not just use-after-free. It's literally that the IOMMU will keep writing Accessed and Dirty bits while it thinks the page is still a page table. The IOMMU will sit there happily setting bits. So, it's _write_ after free too.
To mitigate this, introduce a new iommu interface to flush IOMMU caches. This interface should be invoked from architecture-specific code that manages combined user and kernel page tables, whenever a kernel page table update is done and the CPU TLB needs to be flushed.
There's one tidbit missing from this:
Currently SVA contexts are all unprivileged. They can only access user mappings and never kernel mappings. However, IOMMUs still walk kernel-only page tables all the way down to the leaf where they realize that the entry is a kernel mapping and error out.
Thank you for the guidance. I will improve the commit message accordingly, as follows.
iommu/sva: Invalidate stale IOTLB entries for kernel address space
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware shares and walks the CPU's page tables. The x86 architecture maps the kernel's virtual address space into the upper portion of every process's page table. Consequently, in an SVA context, the IOMMU hardware can walk and cache kernel page table entries.
The Linux kernel currently lacks a notification mechanism for kernel page table changes, specifically when page table pages are freed and reused. The IOMMU driver is only notified of changes to user virtual address mappings. This can cause the IOMMU's internal caches to retain stale entries for kernel VA.
Use-After-Free (UAF) and Write-After-Free (WAF) conditions arise when kernel page table pages are freed and later reallocated. The IOMMU could misinterpret the new data as valid page table entries and walk into attacker-controlled memory, leading to arbitrary physical memory DMA access or privilege escalation. The write side comes from the IOMMU potentially continuing to set Accessed and Dirty bits in the freed memory while it walks the stale page tables.
Currently, SVA contexts are unprivileged and cannot access kernel mappings. However, the IOMMU will still walk kernel-only page tables all the way down to the leaf entries, where it realizes the mapping is for the kernel and errors out. This means the IOMMU still caches these intermediate page table entries, making the described vulnerability a real concern.
To mitigate this, a new IOMMU interface is introduced to flush IOTLB entries for the kernel address space. This interface is invoked from the x86 architecture code that manages combined user and kernel page tables, specifically when a kernel page table update requires a CPU TLB flush.
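For illustration only, the ordering described above can be modelled in a few lines of userspace C. This is a rough sketch, not the patch itself: every name here (toy_iommu, toy_sva_invalidate_kva_range(), toy_free_kernel_pt_page() and so on) is hypothetical, and a single cached pointer stands in for the IOMMU's paging-structure cache. The only point it makes is that the IOMMU flush has to happen before the page table page can be handed back for reuse.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model. None of these names are real kernel APIs. */
#define MAX_SVA_DOMAINS 4

/* Hypothetical per-domain state: one cached intermediate page-table page. */
struct toy_iommu {
	bool in_use;
	const void *cached_pt_page; /* page-table page the IOMMU still remembers */
};

static struct toy_iommu sva_domains[MAX_SVA_DOMAINS];

/* Model the IOMMU caching an intermediate kernel page-table page on a walk. */
static void toy_iommu_walk(struct toy_iommu *iommu, const void *pt_page)
{
	assert(iommu != NULL);
	iommu->in_use = true;
	iommu->cached_pt_page = pt_page;
}

/* Hypothetical flush hook: drop cached kernel entries in every SVA domain. */
static void toy_sva_invalidate_kva_range(void)
{
	for (size_t i = 0; i < MAX_SVA_DOMAINS; i++)
		if (sva_domains[i].in_use)
			sva_domains[i].cached_pt_page = NULL;
}

/* True if any SVA domain still holds a cached reference to pt_page. */
static bool toy_page_is_stale_cached(const void *pt_page)
{
	for (size_t i = 0; i < MAX_SVA_DOMAINS; i++)
		if (sva_domains[i].in_use &&
		    sva_domains[i].cached_pt_page == pt_page)
			return true;
	return false;
}

/*
 * Model of the arch-side path: the CPU TLB flush is elided, the IOMMU
 * flush runs next, and only after that may the page be reused.
 */
static void toy_free_kernel_pt_page(const void *pt_page)
{
	toy_sva_invalidate_kva_range();
	/* The caller may return pt_page to the allocator only if this holds. */
	assert(!toy_page_is_stale_cached(pt_page));
}
```

Without the call to the flush hook, the cached pointer in the model would outlive the page, which is the stale-entry situation described above.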
Thanks, baolu