Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush

11 Jul 2025


On Fri, Jul 11, 2025 at 11:00:06AM +0800, Baolu Lu wrote:
...
Hi Peter Z,
On 7/10/25 21:54, Peter Zijlstra wrote:
...
On Wed, Jul 09, 2025 at 02:28:00PM +0800, Lu Baolu wrote:
...
The vmalloc() and vfree() functions manage virtually contiguous, but not
necessarily physically contiguous, kernel memory regions. When vfree()
unmaps such a region, it tears down the associated kernel page table
entries and frees the physical pages.
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
shares and walks the CPU's page tables. Architectures like x86 share
static kernel address mappings across all user page tables, allowing the
IOMMU to access the kernel portion of these tables.
Modern IOMMUs often cache page table entries to optimize walk performance,
even for intermediate page table levels. If kernel page table mappings are
changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
entries, a Use-After-Free (UAF) vulnerability condition arises. If these
freed page table pages are reallocated for a different purpose, potentially
by an attacker, the IOMMU could misinterpret the new data as valid page
table entries. This allows the IOMMU to walk into attacker-controlled
memory, leading to arbitrary physical memory DMA access or privilege
escalation.
To mitigate this, introduce a new iommu interface to flush IOMMU caches
and fence pending page table walks when kernel page mappings are updated.
This interface should be invoked from architecture-specific code that
manages combined user and kernel page tables.
I must say I liked the kPTI based idea better. Having to iterate and
invalidate an unspecified number of IOMMUs from non-preemptible context
seems 'unfortunate'.
The cache invalidation path in IOMMU drivers is already critical and
operates within a non-preemptible context. This approach is, in fact,
already utilized for user-space page table updates since the beginning
of SVA support.
OK, fair enough I suppose. What kind of delays are we talking about
here? The fact that you basically have an unbounded list of IOMMUs
(although in practice I suppose it is limited by the number of GPUs and
other fancy stuff you can stick in your machine) does slightly worry me.
At some point the low latency folks are going to come hunting you down.
Do you have a plan on how to deal with this; or are we throwing up our
hands and say, the hardware sucks, deal with it?
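
To make the latency question concrete, here is a minimal userspace sketch of
the pattern under discussion: one kernel-range invalidation has to visit every
IOMMU on a list, so the cost grows with the number of IOMMUs. This is a toy
model under stated assumptions, not the patch's code. The struct toy_iommu type
and toy_invalidate_kva_range() are hypothetical names made up for illustration,
and a single cached range per IOMMU stands in for the real hardware caches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one IOMMU and the single kernel VA range it caches. */
struct toy_iommu {
	unsigned long cached_start;	/* cached range start (inclusive) */
	unsigned long cached_end;	/* cached range end (exclusive) */
	bool cache_valid;		/* true while the cached entry may be used */
	struct toy_iommu *next;		/* next IOMMU on the (unbounded) list */
};

/*
 * Walk every IOMMU on the list and drop any cached entry that overlaps
 * [start, end). Returns how many IOMMUs were visited, which is the
 * quantity that bounds the time spent in the caller's context.
 */
static size_t toy_invalidate_kva_range(struct toy_iommu *head,
				       unsigned long start, unsigned long end)
{
	size_t visited = 0;

	for (struct toy_iommu *iommu = head; iommu; iommu = iommu->next) {
		visited++;
		if (iommu->cache_valid &&
		    iommu->cached_start < end && start < iommu->cached_end)
			iommu->cache_valid = false;
	}
	return visited;
}
```

In this model every flush is O(number of IOMMUs) even when only one of them
holds an overlapping entry, which is the shape of the delay being asked about.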
