From: Baolu Lu baolu.lu@linux.intel.com Sent: Friday, August 15, 2025 5:17 PM
On 8/8/2025 10:57 AM, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, August 8, 2025 3:52 AM
On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote:
+static void kernel_pte_work_func(struct work_struct *work)
+{
+	struct page *page, *next;
+
+	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
+
+	guard(spinlock)(&kernel_pte_work.lock);
+	list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) {
+		list_del_init(&page->lru);
Please don't add new usages of lru, we are trying to get rid of this. :(
I think the memory should be struct ptdesc, use that..
btw with this change we should also defer the free of the pmd page:
pud_free_pmd_page()
	...
	for (i = 0; i < PTRS_PER_PMD; i++) {
		if (!pmd_none(pmd_sv[i])) {
			pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]);
			pte_free_kernel(&init_mm, pte);
		}
	}

	free_page((unsigned long)pmd_sv);
Otherwise the risk still exists if the pmd page is repurposed before the pte work is scheduled.
You're right that freeing higher-level page table pages also requires an IOTLB flush before the pages are freed. But I question the practical risk of the race given the extremely small time window.

It's already extremely difficult to conduct a real attack even without this fix. I'm not sure what criteria determine how small a window is acceptable in this specific case, but leaving an incomplete fix in the code doesn't sound clean...

If this is a real concern, a potential mitigation would be to clear the U/S bits in all page table entries for the kernel address space. But I am not confident in making that change at this time, as I am unsure of the side effects it might cause.
I think there was already consensus that clearing the U/S bits in all entries doesn't prevent the IOMMU from caching them and setting A/D bits on the freed page table.
another observation - pte_free_kernel is not used in remove_pagetable() and __change_page_attr(). Is it straightforward to put it in those paths, or do we need to duplicate some deferring logic there?
The remove_pagetable() function is called in the path where memory is hot-removed from the system, right? If so, there should be no issue, as the threat model here is a page table page being freed and repurposed while it's still cached in the IOTLB. In the hot-remove case, the memory is removed and will not be reused, so that's fine as far as I can see.
what about the case where the page is hot-added back while a stale entry pointing to it is still valid in the IOMMU, theoretically? :-)
The same applies to __change_page_attr(), which only changes the attributes of a page table entry while the underlying page remains in use.
it may lead to cpa_collapse_large_pages() if changing the attribute results in all adjacent 4K pages in a 2M range having the same attribute. Then the page table might be freed:
cpa_collapse_large_pages():
	list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) {
		list_del(&ptdesc->pt_list);
		__free_page(ptdesc_page(ptdesc));
	}