On 8/6/25 09:09, Jason Gunthorpe wrote:
You can't do this approach without also pushing the pages to be freed onto a list and deferring the free until the work runs. This is broadly what the normal mm user flow is doing.
FWIW, I think the simplest way to do this is to plop an unconditional schedule_work() in pte_free_kernel(). The work function will invalidate the IOTLBs and then free the page.
Keep the schedule_work() unconditional to keep it simple. The schedule_work() is way cheaper than all the system-wide TLB invalidation IPIs that have to be sent as well. No need to add complexity to optimize out something that's already in the noise.
That works too, but now you have to allocate memory or you are dead. Is that OK these days, and is it safe in this code, which seems fairly closely tied to memory management?
The MM side avoided this by putting the list and the rcu_head in the struct page.
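(These days that spot is described by struct ptdesc, which overlays struct page; roughly, paraphrasing the relevant union from include/linux/mm_types.h:)

struct ptdesc {
        unsigned long __page_flags;

        union {
                struct rcu_head pt_rcu_head;
                struct list_head pt_list;
                /* ... */
        };
        /* ... */
};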
I don't think you need to allocate memory. A little static structure that uses the page->list and has a lock should do. Logically something like this:
struct kernel_pgtable_work {
        struct list_head list;
        spinlock_t lock;
        struct work_struct work;
} kernel_pte_work;

void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
{
        struct page *page = ptdesc_magic();

        guard(spinlock)(&kernel_pte_work.lock);
        list_add(&page->list, &kernel_pte_work.list);
        schedule_work(&kernel_pte_work.work);
}

static void work_func(struct work_struct *work)
{
        LIST_HEAD(pages);
        struct page *page, *next;

        iommu_sva_invalidate_kva();

        /* Detach the whole batch so the frees happen outside the lock. */
        scoped_guard(spinlock, &kernel_pte_work.lock)
                list_splice_init(&kernel_pte_work.list, &pages);

        list_for_each_entry_safe(page, next, &pages, list)
                free_whatever(page);
}
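Splicing everything onto a local list under the lock keeps the actual frees out from under the spinlock and leaves kernel_pte_work.list empty and ready for the next batch.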
The only wrinkle is that pte_free_kernel() itself still has a pte and 'ptdesc', not a 'struct page'. But there is ptdesc->pt_list, which should be unused at this point, especially for non-pgd pages on x86.
So, either go over to the 'struct page' earlier (maybe by open-coding pagetable_dtor_free()?), or just use the ptdesc.
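If it stays on the ptdesc, a minimal sketch (assuming virt_to_ptdesc() is usable here and that the work function then does pagetable_free() on each entry after the IOTLB invalidation) would be something like:

void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
{
        struct ptdesc *ptdesc = virt_to_ptdesc(pte);

        /* The dtor half of pagetable_dtor_free(); the free itself is deferred. */
        pagetable_dtor(ptdesc);

        guard(spinlock)(&kernel_pte_work.lock);
        list_add(&ptdesc->pt_list, &kernel_pte_work.list);
        schedule_work(&kernel_pte_work.work);
}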