Jason,
On 9/3/2025 11:16 PM, Jason Gunthorpe wrote:
map is slightly complicated because it has to handle a number of special edge cases:
- Overmapping a previously shared table with an OA - requires validating and freeing the possibly empty tables
- Doing the above across an entire to-be-created contiguous entry
- Installing a new shared table level concurrently with another thread
- Expanding the table by adding more top levels
Table expansion is a unique feature of AMDv1. This version is quite similar, except that it handles racing with a concurrent lockless map. The table top pointer and starting level are encoded in a single uintptr_t, which ensures we can READ_ONCE() without tearing. Any op will do the READ_ONCE() and use that fixed point as its starting point. Concurrent expansion is handled with a table-global spinlock.
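The single-word encoding of the top pointer and starting level described above can be sketched in user-space C roughly as follows. This is only an illustration under stated assumptions: every demo_* name and the 3-bit level mask are hypothetical, and C11 atomics stand in for the kernel's READ_ONCE() and the spinlock-protected update. It is not the code from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Assumption for this sketch: table memory is at least 8-byte aligned,
 * so the low 3 bits of its address are zero and can carry the level.
 */
#define DEMO_TOP_LEVEL_MASK ((uintptr_t)0x7)

struct demo_table_top {
	_Atomic uintptr_t top_of_table;	/* table pointer | starting level */
};

/* Fold a table pointer and its level into one word. */
static uintptr_t demo_pack_top(void *table, unsigned int level)
{
	uintptr_t p = (uintptr_t)table;

	assert((p & DEMO_TOP_LEVEL_MASK) == 0);
	assert(level <= DEMO_TOP_LEVEL_MASK);
	return p | level;
}

static void *demo_top_table(uintptr_t top)
{
	return (void *)(top & ~DEMO_TOP_LEVEL_MASK);
}

static unsigned int demo_top_level(uintptr_t top)
{
	return (unsigned int)(top & DEMO_TOP_LEVEL_MASK);
}

/*
 * Reader side: a single load yields a pointer and level that always
 * belong together, even if another thread expands the table right after.
 */
static uintptr_t demo_read_top(struct demo_table_top *t)
{
	return atomic_load_explicit(&t->top_of_table, memory_order_acquire);
}

/*
 * Writer side: the caller is assumed to hold whatever lock serializes
 * expansion; the new top is published with a single store.
 */
static void demo_publish_top(struct demo_table_top *t, void *table,
			     unsigned int level)
{
	atomic_store_explicit(&t->top_of_table, demo_pack_top(table, level),
			      memory_order_release);
}
```

An operation would take one snapshot with demo_read_top() and decode both fields from that same value, so it never pairs an old pointer with a new level.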
When inserting a new table entry, map checks that the entire portion of the table is empty. This includes freeing any empty lower tables that will be overwritten by an OA. A separate free list is used while checking and collecting all the empty lower tables so that writing the new entry is uninterrupted: either the new entry is fully written or nothing changes.
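The "collect on a separate free list, then write, then free" ordering described above can be sketched in user-space C as below. This is a simplified illustration, not the patch's implementation: the demo_* types and functions are hypothetical, the table is a toy four-entry structure, and concurrency and IOTLB invalidation are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_ITEMS 4

struct demo_table {
	struct demo_table *child[DEMO_ITEMS];	/* lower table, NULL if none */
	uint64_t leaf[DEMO_ITEMS];		/* nonzero means a mapped OA */
	struct demo_table *free_next;		/* link for the deferred free list */
};

/*
 * Return true if the subtree under @t holds no leaf entries. Every table
 * visited is pushed onto @list, but nothing is freed or unlinked here.
 */
static bool demo_collect_empty(struct demo_table *t, struct demo_table **list)
{
	for (size_t i = 0; i < DEMO_ITEMS; i++) {
		if (t->leaf[i])
			return false;
		if (t->child[i] && !demo_collect_empty(t->child[i], list))
			return false;
	}
	t->free_next = *list;
	*list = t;
	return true;
}

/* Install @oa at @parent[@idx], replacing an empty lower table if present. */
static int demo_map_over(struct demo_table *parent, size_t idx, uint64_t oa)
{
	struct demo_table *list = NULL;

	if (parent->leaf[idx])
		return -1;	/* already mapped */
	if (parent->child[idx] &&
	    !demo_collect_empty(parent->child[idx], &list))
		return -1;	/* validation failed: table left untouched */

	/* Commit point: validation is done, so the write cannot fail. */
	parent->child[idx] = NULL;
	parent->leaf[idx] = oa;

	/* Only now release the tables that the new entry replaced. */
	while (list) {
		struct demo_table *next = list->free_next;

		free(list);
		list = next;
	}
	return 0;
}
```

Because the list only links tables together during validation, a failed check can simply drop it and the page table is exactly as it was before the call.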
A special fast path for PAGE_SIZE is implemented that does a direct walk to the leaf level and installs a single entry. This gives ~15% improvement for iommu_map() when mapping lists of single pages.
This version sits under the iommu_domain_ops as map_pages() but does not require the external page size calculation. The implementation is actually map_range() and can do arbitrary ranges, internally handling all the validation and supporting any arrangement of page sizes. A future series can optimize iommu_map() to take advantage of this.
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
 drivers/iommu/generic_pt/iommu_pt.h | 481 ++++++++++++++++++++++++++++
 include/linux/generic_pt/iommu.h    |  58 ++++
 2 files changed, 539 insertions(+)
.../...
+static int __map_range_leaf(struct pt_range *range, void *arg,
+			    unsigned int level, struct pt_table_p *table)
+{
+	struct pt_state pts = pt_init(range, level, table);
+	struct pt_iommu_map_args *map = arg;
+	unsigned int leaf_pgsize_lg2 = map->leaf_pgsize_lg2;
+	unsigned int start_index;
+	pt_oaddr_t oa = map->oa;
+	unsigned int step;
+	bool need_contig;
+	int ret = 0;
+	PT_WARN_ON(map->leaf_level != level);
+	PT_WARN_ON(!pt_can_have_leaf(&pts));
+	step = log2_to_int_t(unsigned int,
+			     leaf_pgsize_lg2 - pt_table_item_lg2sz(&pts));
+	need_contig = leaf_pgsize_lg2 != pt_table_item_lg2sz(&pts);
+	_pt_iter_first(&pts);
+	start_index = pts.index;
+	do {
+		pts.type = pt_load_entry_raw(&pts);
+		if (pts.type != PT_ENTRY_EMPTY || need_contig) {
+			if (pts.index != start_index)
+				pt_index_to_va(&pts);
+			ret = clear_contig(&pts, map->iotlb_gather, step,
+					   leaf_pgsize_lg2);
+			if (ret)
+				break;
+		}
+		PT_WARN_ON(compute_best_pgsize(&pts, oa) != leaf_pgsize_lg2);

If I select CONFIG_DEBUG_GENERIC_PT=y and boot an AMD system with V1 (host page table), we hit this warning in some cases. The code path looks OK. Maybe we should silence this warning?

[ 31.985383] pt_iommu_amdv1_map_pages : oa 0x208b95d000 va 0xfef80000 last_va 0xfef9ffff pgsz_lg 0xc pgsize 0x1000 pgcount 0x20
[ 31.985384] __map_range_leaf oa 0x208b95e000 va 0xfef80000 last_va 0xfef9ffff pgsize 0xd leaf_pgsize 0xc possible_sz 0x1ff000
[ 31.985391] ------------[ cut here ]------------
[ 31.985392] WARNING: CPU: 359 PID: 2540 at drivers/iommu/generic_pt/fmt/../iommu_pt.h:493 __map_range_leaf+0x636/0x860
[ 31.985399] Modules linked in:
[ 31.985402] CPU: 359 UID: 0 PID: 2540 Comm: systemd-udevd Not tainted 6.17.0-rc3-genricpt+ #444 VOLUNTARY
[ 31.985405] Hardware name: AMD Corporation Titanite_4G/Titanite_4G, BIOS RTI100EB 12/05/2024
[ 31.985406] RIP: 0010:__map_range_leaf+0x636/0x860
[ 31.985409] Code: 49 89 6e 18 48 8b 54 24 58 65 48 2b 15 6b 4d b8 01 0f 85 2a 02 00 00 48 83 c4 60 5b 5d 41 5c 41 5d 41 5e 41 5f e9 55 2e 67 ff <0f> 0b e9 07 fe ff ff 0f b6 48 21 e9 e5 fb ff ff 48 8b 7c 24 18 44
[ 31.985411] RSP: 0018:ff78b42ad7063558 EFLAGS: 00010297
[ 31.985413] RAX: 0000000000000000 RBX: ff453e2c423cdc08 RCX: 000000000000000d
[ 31.985414] RDX: 0000000000000000 RSI: 0000000000002000 RDI: ffffff7fffffffff
[ 31.985415] RBP: 000000208b95e000 R08: 00000000fef9ffff R09: 00000000fffeffff
[ 31.985416] R10: 000000000000000c R11: ff453e6b4c696000 R12: 0000000000003000
[ 31.985417] R13: ff78b42ad7063770 R14: ff78b42ad7063748 R15: 000000000000000c
[ 31.985418] FS: 00007f46c7e888c0(0000) GS:ff453e6aabbc2000(0000) knlGS:0000000000000000
[ 31.985420] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 31.985421] CR2: 00007f46c7e03000 CR3: 0000000141f6b002 CR4: 0000000000771ef0
[ 31.985422] PKRU: 55555554
[ 31.985423] Call Trace:
[ 31.985424] <TASK>
[ 31.985426] __map_range+0x399/0x5a0
[ 31.985429] ? down_trylock+0x20/0x30
[ 31.985434] __map_range+0x1af/0x5a0
[ 31.985436] ? _printk+0x52/0x70
[ 31.985441] pt_iommu_amdv1_map_pages+0x6e6/0xca0
[ 31.985444] ? srso_alias_return_thunk+0x5/0xfbef5
[ 31.985448] ? iommu_map_nosync+0x129/0x230
[ 31.985451] iommu_map_nosync+0x129/0x230
[ 31.985454] blk_rq_dma_map_iter_start+0x186/0x1c0
[ 31.985458] nvme_prep_rq+0x4ff/0x8b0
[ 31.985461] ? srso_alias_return_thunk+0x5/0xfbef5
[ 31.985463] nvme_queue_rqs+0xc0/0x1d0
[ 31.985466] blk_mq_dispatch_queue_requests+0xf2/0x140
[ 31.985469] blk_mq_flush_plug_list+0x71/0x170
[ 31.985472] __blk_flush_plug+0xcc/0x120
[ 31.985476] blk_finish_plug+0x1f/0x30
[ 31.985478] read_pages+0x1a8/0x260
[ 31.985483] ? filemap_add_folio+0xae/0xd0
[ 31.985485] page_cache_ra_unbounded+0x174/0x230
[ 31.985488] force_page_cache_ra+0x89/0xb0
[ 31.985491] filemap_get_pages+0x12a/0x720
[ 31.985494] filemap_read+0xda/0x3e0
[ 31.985497] ? srso_alias_return_thunk+0x5/0xfbef5
[ 31.985499] ? alloc_pages_mpol+0x76/0x140
[ 31.985502] ? srso_alias_return_thunk+0x5/0xfbef5
[ 31.985504] ? mod_memcg_lruvec_state+0x96/0x1a0
[ 31.985507] ? srso_alias_return_thunk+0x5/0xfbef5
[ 31.985509] ? __lruvec_stat_mod_folio+0x6d/0xa0
[ 31.985511] ? srso_alias_return_thunk+0x5/0xfbef5
[ 31.985512] ? srso_alias_return_thunk+0x5/0xfbef5
[ 31.985514] ? set_ptes.constprop.0+0x36/0x80
[ 31.985517] ? srso_alias_return_thunk+0x5/0xfbef5
[ 31.985519] ? __handle_mm_fault+0xa2c/0x14d0
[ 31.985522] blkdev_read_iter+0x6f/0x140
[ 31.985525] vfs_read+0x207/0x330
[ 31.985528] ksys_read+0x5c/0xd0
[ 31.985530] do_syscall_64+0x50/0x1e0
[ 31.985533] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 31.985535] RIP: 0033:0x7f46c8576852
[ 31.985537] Code: c0 e9 b2 fe ff ff 50 48 8d 3d 1a b4 0c 00 e8 a5 1d 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[ 31.985538] RSP: 002b:00007ffc06f9c638 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 31.985540] RAX: ffffffffffffffda RBX: 00007f46c7e02028 RCX: 00007f46c8576852
[ 31.985541] RDX: 0000000000040000 RSI: 00007f46c7e02038 RDI: 000000000000000c
[ 31.985542] RBP: 0000555f80925280 R08: 00007f46c7e02010 R09: 00007f46c7e02010
[ 31.985543] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000040000
[ 31.985544] R13: 0000000000040000 R14: 00007f46c7e02010 R15: 0000555f809252d0
[ 31.985546] </TASK>
[ 31.985547] ---[ end trace 0000000000000000 ]---

-Vasant