On Wed, Jul 09, 2025 at 02:28:00PM +0800, Lu Baolu wrote:
> The vmalloc() and vfree() functions manage virtually contiguous, but not necessarily physically contiguous, kernel memory regions. When vfree() unmaps such a region, it tears down the associated kernel page table entries and frees the physical pages.
>
> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware shares and walks the CPU's page tables. Architectures like x86 share static kernel address mappings across all user page tables, allowing the IOMMU to access the kernel portion of these tables.
>
> Modern IOMMUs often cache page table entries to optimize walk performance, even for intermediate page table levels. If kernel page table mappings are changed (e.g., by vfree()) while the IOMMU's internal caches still hold stale entries, a use-after-free (UAF) condition arises. If the freed page table pages are then reallocated for a different purpose, potentially by an attacker, the IOMMU could misinterpret the new data as valid page table entries. This allows the IOMMU to walk into attacker-controlled memory, leading to arbitrary physical memory DMA access or privilege escalation.
>
> To mitigate this, introduce a new IOMMU interface to flush IOMMU caches and fence pending page table walks when kernel page mappings are updated. This interface should be invoked from architecture-specific code that manages combined user and kernel page tables.
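For illustration, a minimal sketch of what the architecture-side call site could look like. flush_tlb_kernel_range() is x86's existing kernel-range flush path; the actual wiring lives in the arch-specific patches of the series and is an assumption here:

	/* arch/x86/mm/tlb.c -- illustrative placement only */
	void flush_tlb_kernel_range(unsigned long start, unsigned long end)
	{
		/* ... existing CPU TLB shootdown ... */

		/* Also drop any stale IOMMU translations for this KVA range. */
		iommu_sva_invalidate_kva_range(start, end);
	}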
I must say I liked the kPTI based idea better. Having to iterate and invalidate an unspecified number of IOMMUs from non-preemptible context seems 'unfortunate'.
Why was this approach chosen over the kPTI one, where we keep a page-table root that simply does not include the kernel bits, and therefore the IOMMU will never see them (change) and we'll never have to invalidate?
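For reference, a very rough sketch of that alternative: x86's kPTI already maintains a second, user-only half of each PGD (see kernel_to_user_pgdp() in arch/x86/include/asm/pgtable.h), so SVA could hand the IOMMU a page-table root that simply lacks the kernel bits. The helper below is hypothetical, not something proposed in the patch:

	/* Hypothetical: pick a page-table root the IOMMU may walk. */
	static pgd_t *iommu_sva_pgd(struct mm_struct *mm)
	{
	#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
		/* The kPTI user half carries no regular kernel mappings. */
		return kernel_to_user_pgdp(mm->pgd);
	#else
		return mm->pgd;	/* kernel half visible; KVA flushes still needed */
	#endif
	}

How the PASID entry would be programmed with that root, and what happens when kPTI is compiled out or disabled at runtime, is exactly the trade-off under discussion.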
> @@ -132,8 +136,15 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
>  	if (ret)
>  		goto out_free_domain;
>  	domain->users = 1;
> -	list_add(&domain->next, &mm->iommu_mm->sva_domains);
> +	if (list_empty(&iommu_mm->sva_domains)) {
> +		scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> +			if (list_empty(&iommu_sva_mms))
> +				static_branch_enable(&iommu_sva_present);
> +			list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
> +		}
> +	}
> +	list_add(&domain->next, &iommu_mm->sva_domains);
>  
>  out:
>  	refcount_set(&handle->users, 1);
>  	mutex_unlock(&iommu_sva_lock);
>
> @@ -175,6 +186,15 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
>  		list_del(&domain->next);
>  		iommu_domain_free(domain);
>  	}
> +
> +	if (list_empty(&iommu_mm->sva_domains)) {
> +		scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> +			list_del(&iommu_mm->mm_list_elm);
> +			if (list_empty(&iommu_sva_mms))
> +				static_branch_disable(&iommu_sva_present);
> +		}
> +	}
> +
>  	mutex_unlock(&iommu_sva_lock);
>  	kfree(handle);
>  }
This seems an odd coding style choice; why the extra unneeded indentation? That is, what's wrong with:
	if (list_empty()) {
		guard(spinlock_irqsave)(&iommu_mms_lock);
		list_del();
		if (list_empty())
			static_branch_disable();
	}
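With the arguments from the hunk above filled back in, the unbind side would read (same logic, one indentation level less):

	if (list_empty(&iommu_mm->sva_domains)) {
		guard(spinlock_irqsave)(&iommu_mms_lock);
		list_del(&iommu_mm->mm_list_elm);
		if (list_empty(&iommu_sva_mms))
			static_branch_disable(&iommu_sva_present);
	}

guard() drops the lock when the enclosing scope ends, so the explicit scoped_guard() block buys nothing here except the extra indentation.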
> @@ -312,3 +332,15 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
>  
>  	return domain;
>  }
> +
> +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
> +{
> +	struct iommu_mm_data *iommu_mm;
> +
> +	if (!static_branch_unlikely(&iommu_sva_present))
> +		return;
> +
> +	guard(spinlock_irqsave)(&iommu_mms_lock);
> +	list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
> +		mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, start, end);
> +}
This is absolutely the wrong way to use static_branch. You want them in inline functions guarding the function call, not inside the function call.
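Concretely, the usual static_branch pattern is an inline wrapper in the header that guards an out-of-line implementation; the DECLARE/inline split and the __-prefixed name below are a sketch, not taken from the patch:

	/* header -- sketch only */
	DECLARE_STATIC_KEY_FALSE(iommu_sva_present);
	void __iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);

	static inline void iommu_sva_invalidate_kva_range(unsigned long start,
							  unsigned long end)
	{
		if (static_branch_unlikely(&iommu_sva_present))
			__iommu_sva_invalidate_kva_range(start, end);
	}

The static key then patches the branch at each call site, so when no SVA domains exist the kernel's TLB flush paths pay for a NOP rather than a function call that immediately returns.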