The AMD IOMMU host page table implementation supports dynamic page table levels (up to 6 levels), starting with a 3-level configuration that is expanded as larger IOVAs need to be mapped. The kernel maintains a root pointer and the current page table level (mode) so that alloc_pte()/fetch_pte() can walk the page table correctly.
The IOMMU IOVA allocator initially starts with 32-bit addresses and, once that space is exhausted, switches to 64-bit addresses (the maximum address is determined by the IOMMU and device DMA capability). To support the larger IOVA space, the AMD IOMMU driver increases the page table level.
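For context, alloc_pte() grows the table roughly as follows when the requested IOVA does not fit under the current top level (simplified sketch; the exact code differs across kernel versions):

	while (address > PM_LEVEL_SIZE(pgtable->mode)) {
		/* Grow the page table by one level; fails only on -ENOMEM. */
		if (!increase_address_space(pgtable, address, gfp))
			return NULL;
	}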
But in the unmap path (iommu_v1_unmap_pages()), fetch_pte() reads pgtable->[root/mode] without a lock. So in an extreme corner case, while increase_address_space() is updating pgtable->[root/mode], fetch_pte() can read a mismatched page table level (pgtable->mode). It then compares that value with the level encoded in the page table and returns NULL. This causes the iommu_unmap op to fail, and the upper layer may retry or log a WARN_ON.
CPU 0                                        CPU 1
------                                       ------
map pages                                    unmap pages
alloc_pte()
 -> increase_address_space()                 iommu_v1_unmap_pages()
                                              -> fetch_pte()
    pgtable->root = pte (new root value)
                                                 READ pgtable->[mode/root]
                                                 Reads new root, old mode
    Updates mode (pgtable->mode += 1)
Since page table level updates are infrequent and already serialized with a spinlock on the write side, add a seqcount to enable lock-free reads of root/mode on the read path.
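For reference, the read/retry pattern from <linux/seqlock.h> that the hunks below apply looks roughly like this (variable names illustrative, not an exact copy of the patch):

	/* Writer side: runs under the existing domain spinlock. */
	write_seqcount_begin(&pgtable->seqcount);
	pgtable->root  = pte;	/* publish the new root ...        */
	pgtable->mode += 1;	/* ... and the new level together. */
	write_seqcount_end(&pgtable->seqcount);

	/* Reader side: retry until root and mode are read consistently. */
	do {
		seq   = read_seqcount_begin(&pgtable->seqcount);
		level = pgtable->mode - 1;
		pte   = &pgtable->root[PM_LEVEL_INDEX(level, address)];
	} while (read_seqcount_retry(&pgtable->seqcount, seq));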
Reported-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Cc: stable@vger.kernel.org
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
---
 drivers/iommu/amd/amd_iommu_types.h |  1 +
 drivers/iommu/amd/io_pgtable.c      | 25 +++++++++++++++++++++----
 2 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 5219d7ddfdaa..95f63c5f6159 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -555,6 +555,7 @@ struct gcr3_tbl_info {
 };

 struct amd_io_pgtable {
+	seqcount_t		seqcount;	/* Protects root/mode update */
 	struct io_pgtable	pgtbl;
 	int			mode;
 	u64			*root;
diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index a91e71f981ef..70c2f5b1631b 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -17,6 +17,7 @@
 #include <linux/slab.h>
 #include <linux/types.h>
 #include <linux/dma-mapping.h>
+#include <linux/seqlock.h>

 #include <asm/barrier.h>

@@ -130,8 +131,11 @@ static bool increase_address_space(struct amd_io_pgtable *pgtable,

 	*pte = PM_LEVEL_PDE(pgtable->mode, iommu_virt_to_phys(pgtable->root));

+	write_seqcount_begin(&pgtable->seqcount);
 	pgtable->root  = pte;
 	pgtable->mode += 1;
+	write_seqcount_end(&pgtable->seqcount);
+
 	amd_iommu_update_and_flush_device_table(domain);

 	pte = NULL;

@@ -153,6 +157,7 @@ static u64 *alloc_pte(struct amd_io_pgtable *pgtable,
 {
 	unsigned long last_addr = address + (page_size - 1);
 	struct io_pgtable_cfg *cfg = &pgtable->pgtbl.cfg;
+	unsigned int seqcount;
 	int level, end_lvl;
 	u64 *pte, *page;

@@ -170,8 +175,14 @@ static u64 *alloc_pte(struct amd_io_pgtable *pgtable,
 	}

-	level = pgtable->mode - 1;
-	pte   = &pgtable->root[PM_LEVEL_INDEX(level, address)];
+	do {
+		seqcount = read_seqcount_begin(&pgtable->seqcount);
+
+		level = pgtable->mode - 1;
+		pte   = &pgtable->root[PM_LEVEL_INDEX(level, address)];
+	} while (read_seqcount_retry(&pgtable->seqcount, seqcount));
+
 	address = PAGE_SIZE_ALIGN(address, page_size);
 	end_lvl = PAGE_SIZE_LEVEL(page_size);

@@ -249,6 +260,7 @@ static u64 *fetch_pte(struct amd_io_pgtable *pgtable,
 		      unsigned long *page_size)
 {
 	int level;
+	unsigned int seqcount;
 	u64 *pte;

 	*page_size = 0;
@@ -256,8 +268,12 @@ static u64 *fetch_pte(struct amd_io_pgtable *pgtable,
 	if (address > PM_LEVEL_SIZE(pgtable->mode))
 		return NULL;

-	level = pgtable->mode - 1;
-	pte   = &pgtable->root[PM_LEVEL_INDEX(level, address)];
+	do {
+		seqcount = read_seqcount_begin(&pgtable->seqcount);
+		level = pgtable->mode - 1;
+		pte   = &pgtable->root[PM_LEVEL_INDEX(level, address)];
+	} while (read_seqcount_retry(&pgtable->seqcount, seqcount));
+
 	*page_size = PTE_LEVEL_PAGE_SIZE(level);

 	while (level > 0) {

@@ -541,6 +557,7 @@ static struct io_pgtable *v1_alloc_pgtable(struct io_pgtable_cfg *cfg, void *coo
 	if (!pgtable->root)
 		return NULL;
 	pgtable->mode = PAGE_MODE_3_LEVEL;
+	seqcount_init(&pgtable->seqcount);

 	cfg->pgsize_bitmap = amd_iommu_pgsize_bitmap;
 	cfg->ias           = IOMMU_IN_ADDR_BIT_SIZE;
Hey Vasant,
On Thu, Sep 11, 2025 at 12:14:15PM +0000, Vasant Hegde wrote:
Reported-by: Alejandro Jimenez alejandro.j.jimenez@oracle.com Cc: stable@vger.kernel.org Cc: Joao Martins joao.m.martins@oracle.com Cc: Suravee Suthikulpanit suravee.suthikulpanit@amd.com Signed-off-by: Vasant Hegde vasant.hegde@amd.com
Can you please send again with a Fixes tag?
Joerg
Joerg,
On 9/13/2025 11:40 AM, Jörg Rödel wrote:
Hey Vasant,
On Thu, Sep 11, 2025 at 12:14:15PM +0000, Vasant Hegde wrote:
Reported-by: Alejandro Jimenez alejandro.j.jimenez@oracle.com Cc: stable@vger.kernel.org Cc: Joao Martins joao.m.martins@oracle.com Cc: Suravee Suthikulpanit suravee.suthikulpanit@amd.com Signed-off-by: Vasant Hegde vasant.hegde@amd.com
Can you please send again with a Fixes tag?
I was not sure which commit I should pick for the "Fixes" tag, as this issue has probably existed since we introduced increase_address_space(), but a lot has changed since then. So I just Cc'ed stable.
Looking into the git history, I think the commit below is more appropriate. I will update and resend the patch.
commit 754265bcab78a9014f0f99cd35e0d610fcd7dfa7
Author: Joerg Roedel <jroedel@suse.de>
Date:   Fri Sep 6 10:39:54 2019 +0200

    iommu/amd: Fix race in increase_address_space()
-Vasant
On Thu, Sep 11, 2025 at 12:14:15PM +0000, Vasant Hegde wrote:
The IOMMU IOVA allocator initially starts with 32-bit addresses and, once that space is exhausted, switches to 64-bit addresses (the maximum address is determined by the IOMMU and device DMA capability). To support the larger IOVA space, the AMD IOMMU driver increases the page table level.
Is this the case? I thought I saw something saying that the allocator starts from high addresses?
But in the unmap path (iommu_v1_unmap_pages()), fetch_pte() reads pgtable->[root/mode] without a lock. So in an extreme corner case, while increase_address_space() is updating pgtable->[root/mode], fetch_pte() can read a mismatched page table level (pgtable->mode). It then compares that value with the level encoded in the page table and returns NULL. This causes the iommu_unmap op to fail, and the upper layer may retry or log a WARN_ON.
Yep, definitely a bug. I spotted it already and fixed it in iommupt; you can read about it here:
https://lore.kernel.org/linux-iommu/13-v5-116c4948af3d+68091-iommu_pt_jgg@nv...
CPU 0                                        CPU 1
------                                       ------
map pages                                    unmap pages
alloc_pte()
 -> increase_address_space()                 iommu_v1_unmap_pages()
                                              -> fetch_pte()
    pgtable->root = pte (new root value)
                                                 READ pgtable->[mode/root]
                                                 Reads new root, old mode
    Updates mode (pgtable->mode += 1)
This doesn't solve the whole problem. Yes, reading the two values coherently is important, but we must also serialize parallel map such that map only returns once the IOMMU is actually programmed with the new roots.
I don't see that in this fix.
IMHO, unless someone is actually hitting this, I'd leave it and focus on merging iommupt, which fully fixes this without adding any locks to the fast path.
Jason
Jason,
On 9/18/2025 7:47 PM, Jason Gunthorpe wrote:
On Thu, Sep 11, 2025 at 12:14:15PM +0000, Vasant Hegde wrote:
The IOMMU IOVA allocator initially starts with 32-bit addresses and, once that space is exhausted, switches to 64-bit addresses (the maximum address is determined by the IOMMU and device DMA capability). To support the larger IOVA space, the AMD IOMMU driver increases the page table level.
Is this the case? I thought I saw something saying that the allocator starts from high addresses?
Right. By default we start with 32-bit addresses (allocating downward from 0xffff_ffff towards 0x0), and once that space is full we move to 64-bit addresses (assuming the device supports 64-bit DMA). At that point increase_address_space() gets called.
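(For the curious, this is dma-iommu allocator behaviour, roughly sketched below; simplified, not part of the AMD driver itself, and details vary by kernel version.)

	/* Try to give (PCI) devices an address below 4GB first ... */
	if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
		iova = alloc_iova_fast(iovad, iova_len,
				       DMA_BIT_MASK(32) >> shift, false);
	/* ... and fall back to the full DMA limit once that space is exhausted. */
	if (!iova)
		iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift, true);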
But in the unmap path (iommu_v1_unmap_pages()), fetch_pte() reads pgtable->[root/mode] without a lock. So in an extreme corner case, while increase_address_space() is updating pgtable->[root/mode], fetch_pte() can read a mismatched page table level (pgtable->mode). It then compares that value with the level encoded in the page table and returns NULL. This causes the iommu_unmap op to fail, and the upper layer may retry or log a WARN_ON.
Yep, definitely a bug. I spotted it already and fixed it in iommupt; you can read about it here:
https://lore.kernel.org/linux-iommu/13-v5-116c4948af3d+68091-iommu_pt_jgg@nv...
Nice. Will take a look this week.
CPU 0                                        CPU 1
------                                       ------
map pages                                    unmap pages
alloc_pte()
 -> increase_address_space()                 iommu_v1_unmap_pages()
                                              -> fetch_pte()
    pgtable->root = pte (new root value)
                                                 READ pgtable->[mode/root]
                                                 Reads new root, old mode
    Updates mode (pgtable->mode += 1)
This doesn't solve the whole problem. Yes, reading the two values coherently is important, but we must also serialize parallel map such that map only returns once the IOMMU is actually programmed with the new roots.
I don't see that in this fix.
IMHO, unless someone is actually hitting this, I'd leave it and focus on merging iommupt, which fully fixes this without adding any locks to the fast path.
Unfortunately yes. We had a customer report this issue.
-Vasant
On Mon, Sep 22, 2025 at 03:35:07PM +0530, Vasant Hegde wrote:
IMHO, unless someone is actually hitting this, I'd leave it and focus on merging iommupt, which fully fixes this without adding any locks to the fast path.
Unfortunately yes. We had a customer report this issue.
If a customer is actually hitting this, then you definitely need to solve the whole problem, including the map race.
I guess adding a lock makes it easier to deal with, but this:
+	write_seqcount_begin(&pgtable->seqcount);
 	pgtable->root  = pte;
 	pgtable->mode += 1;
+	write_seqcount_end(&pgtable->seqcount);
+
 	amd_iommu_update_and_flush_device_table(domain);
Is out of order.
The DTE has to be updated and caches flushed *before* writing the new information to the pgtable for other threads to observe.
You can look at what I did, but broadly you have to feed the new top as a function argument through to set_dte:
 static void set_dte_entry(struct amd_iommu *iommu,
-			  struct iommu_dev_data *dev_data)
+			  struct iommu_dev_data *dev_data,
+			  phys_addr_t top_paddr, unsigned int top_level)
 {
 	u16 domid;
So the above can be ordered correctly.
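To make the ordering concrete, a rough sketch of increase_address_space() reordered along those lines (the flush helper name here is hypothetical; this is neither the posted patch nor the iommupt code):

	*pte = PM_LEVEL_PDE(pgtable->mode, iommu_virt_to_phys(pgtable->root));

	/*
	 * 1. Program the hardware first: write the new root/level into the
	 *    DTEs and flush the device table caches.  Hypothetical helper,
	 *    standing in for set_dte_entry() being fed the new top as
	 *    arguments, as shown above.
	 */
	update_and_flush_dte_with_new_top(domain, iommu_virt_to_phys(pte),
					  pgtable->mode + 1);

	/*
	 * 2. Only then publish the new root/mode so that concurrent
	 *    alloc_pte()/fetch_pte() walkers start using them.
	 */
	write_seqcount_begin(&pgtable->seqcount);
	pgtable->root  = pte;
	pgtable->mode += 1;
	write_seqcount_end(&pgtable->seqcount);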
You may also want to think about just disabling this optimization and always starting with a 6-level table.
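For reference, that would amount to something like the following in v1_alloc_pgtable() (sketch only; it assumes the level-growing path is then unused):

	/*
	 * Sketch: always start at the maximum supported level, so root and
	 * mode never change after setup and increase_address_space() is
	 * never needed.
	 */
	pgtable->mode = PAGE_MODE_6_LEVEL;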
Jason