On Tue, Oct 7, 2025 at 12:12 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
[All the precursor patches are merged now and AMD/RISCV/VTD conversions are written]
Currently each of the iommu page table formats duplicates all of the logic to maintain the page table and perform map/unmap/etc operations. There are several different versions of the algorithms between all the different formats. The io-pgtable system provides an interface to help isolate the page table code from the iommu driver, but doesn't provide tools to implement the common algorithms.
This makes it very hard to improve the state of the page table code under the iommu domains, as any proposed improvement needs to alter a large number of different driver code paths. Combined with a lack of software-based testing, this makes progress in this area very slow.
iommufd wants several new page table operations:
- More efficient map/unmap operations, using iommufd's batching logic
- unmap that returns the physical addresses into a batch as it progresses
- cut that allows splitting areas so large pages can have holes poked in them dynamically (ie guestmemfd hitless shared/private transitions)
- More aggressive freeing of table memory to avoid waste
- Fragmenting large pages so that dirty tracking can be more granular
- Reassembling large pages so that VMs can run at full IO performance in migration/dirty tracking error flows
- KHO integration for kernel live upgrade
Together these are algorithmically complex enough to be a very significant task to go and implement in all the page table formats we support. Just the "server"-focused drivers use almost all the formats (ARMv8 S1&S2 / x86 PAE / AMDv1 / VT-D SS / RISCV).
Instead of doing the duplicated work, this series takes the first step to consolidate the algorithms into one place. In spirit it is similar to the work Christoph did a few years back to pull the redundant get_user_pages() implementations out of the arch code into core MM. This unlocked a great deal of improvement in that space in the following years. I would like to see the same benefit in iommu as well.
My first RFC showed a bigger picture with almost all formats and more algorithms. This series reorganizes that to be narrowly focused on just enough to convert the AMD driver to use the new mechanism.
kunit tests are provided that allow good testing of the algorithms and all formats on x86, nothing is arch specific.
AMD is one of the simpler options as the HW is quite uniform with few different options/bugs while still requiring the complicated contiguous pages support. The HW also has a very simple range based invalidation approach that is easy to implement.
The AMD v1 and AMD v2 page table formats are implemented bit-for-bit identical to the current code, tested using a compare kunit test that checks against the io-pgtable version (on github, see below).
Updating the AMD driver to replace the io-pgtable layer with the new stuff is fairly straightforward now. The layering is fixed up in the new version so that all the invalidation goes through function pointers.
Several small fixing patches have come out of this as I've been fixing the problems that the test suite uncovers in the current code, and implementing the fixed version in iommupt.
On performance, there is quite a wide variety of implementation designs across all the drivers. Looking at some key performance numbers across the main formats:
iommu_map():
 pgsz     , avg new,old ns , min new,old ns , min % (+ve is better)
 2^12     ,   53,66        ,   51,63        ,  19.19  (AMDv1)
 256*2^12 ,  386,1909      ,  367,1795      ,  79.79
 256*2^21 ,  362,1633      ,  355,1556      ,  77.77
 2^12     ,   56,62        ,   52,59        ,  11.11  (AMDv2)
 256*2^12 ,  405,1355      ,  357,1292      ,  72.72
 256*2^21 ,  393,1160      ,  358,1114      ,  67.67
 2^12     ,   55,65        ,   53,62        ,  14.14  (VTD second stage)
 256*2^12 ,  391,518       ,  332,512       ,  35.35
 256*2^21 ,  383,635       ,  336,624       ,  46.46
 2^12     ,   57,65        ,   55,63        ,  12.12  (ARM 64 bit)
 256*2^12 ,  380,389       ,  361,369       ,   2.02
 256*2^21 ,  358,419       ,  345,400       ,  13.13

iommu_unmap():
 pgsz     , avg new,old ns , min new,old ns , min % (+ve is better)
 2^12     ,   69,88        ,   65,85        ,  23.23  (AMDv1)
 256*2^12 ,  353,6498      ,  331,6029      ,  94.94
 256*2^21 ,  373,6014      ,  360,5706      ,  93.93
 2^12     ,   71,72        ,   66,69        ,   4.04  (AMDv2)
 256*2^12 ,  228,891       ,  206,871       ,  76.76
 256*2^21 ,  254,721       ,  245,711       ,  65.65
 2^12     ,   69,87        ,   65,82        ,  20.20  (VTD second stage)
 256*2^12 ,  210,321       ,  200,315       ,  36.36
 256*2^21 ,  255,349       ,  238,342       ,  30.30
 2^12     ,   72,77        ,   68,74        ,   8.08  (ARM 64 bit)
 256*2^12 ,  521,357       ,  447,346       , -29.29
 256*2^21 ,  489,358       ,  433,345       , -25.25
- Above numbers include additional patches to remove the iommu_pgsize() overheads. gcc 13.3.0, i7-12700
This version provides fairly consistent performance across formats. ARM unmap performance is quite different because this version supports contiguous pages and uses a very different algorithm for unmapping. Though why it is so much worse than AMDv1 I haven't figured out yet.
The per-format commits include a more detailed chart.
There is a second branch: https://github.com/jgunthorpe/linux/commits/iommu_pt_all
Containing supporting work and future steps:
- ARM short descriptor (32 bit), ARM long descriptor (64 bit) formats
- RISCV format and RISCV conversion https://github.com/jgunthorpe/linux/commits/iommu_pt_riscv
- Support for a DMA incoherent HW page table walker
- VT-D second stage format and VT-D conversion https://github.com/jgunthorpe/linux/commits/iommu_pt_vtd
- DART v1 & v2 format
- Draft of an iommufd 'cut' operation to break down huge pages
- A compare test that checks the iommupt formats against the io-pgtable interface, including updating AMD to have a working io-pgtable implementation and patches to give VT-D an io-pgtable for testing.
- A performance test to micro-benchmark map and unmap against io-pgtable
My strategy is to go one by one for the drivers:
- AMD driver conversion
- RISCV page table and driver
- Intel VT-D driver and VTDSS page table
- Flushing improvements for RISCV
- ARM SMMUv3
And concurrently work on the algorithm side:
- debugfs content dump, like VT-D has
- Cut support
- Increase/Decrease page size support
- map/unmap batching
- KHO
I'm wondering if this work could be used in the future to expand page_table_check to support IOMMU page tables. For instance, every time a PTE is inserted or removed, an external state machine could check for false sharing or improper logic. This approach could significantly help with preventing and detecting memory corruption early.
The main requirement would be to define a common logic for all IOMMU page tables. For the existing page_table_check, we use double-mapping detection logic [1]. If we were to implement something similar for the IOMMU, what kind of logic could we apply?
Pasha