On 11/10/22 12:59, Nadav Amit wrote:
On Nov 7, 2022, at 5:19 PM, Mike Kravetz mike.kravetz@oracle.com wrote:
madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page tables associated with the address range. For hugetlb vmas, zap_page_range will call __unmap_hugepage_range_final. However, __unmap_hugepage_range_final assumes the passed vma is about to be removed and deletes the vma_lock to prevent pmd sharing as the vma is on the way out. In the case of madvise(MADV_DONTNEED) the vma remains, but the missing vma_lock prevents pmd sharing and could potentially lead to issues with truncation/fault races.
I understand the problem in general. Please consider my feedback as partial though.
Thanks for taking a look!
@@ -5203,32 +5194,50 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb, unsigned long end, struct page *ref_page, zap_flags_t zap_flags) {
- bool final = zap_flags & ZAP_FLAG_UNMAP;
Not sure why caching final in local variable helps.
No particular reason. It can be eliminated.
void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, struct page *ref_page, zap_flags_t zap_flags) {
struct mmu_notifier_range range; struct mmu_gather tlb;
mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
start, end);
adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end); tlb_gather_mmu(&tlb, vma->vm_mm);
__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
Is there a reason for not using range.start and range.end?
After calling adjust_range_if_pmd_sharing_possible, range.start - range.end could be much greater than the range we actually want to unmap. The range gets adjusted to account for pmd sharing if that is POSSIBLE. It does not know for sure if we will actually 'unshare a pmd'.
I suppose adjust_range_if_pmd_sharing_possible could be modified to actually check if unmapping will result in unsharing, but it does not do that today.
It is just that every inconsistency is worrying…
@@ -1734,6 +1734,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr lru_add_drain(); mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm, address, address + size);
- if (is_vm_hugetlb_page(vma))
adjust_range_if_pmd_sharing_possible(vma, &range.start,
tlb_gather_mmu(&tlb, vma->vm_mm); update_hiwater_rss(vma->vm_mm); mmu_notifier_invalidate_range_start(&range);&range.end);
@@ -1742,6 +1745,12 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr tlb_finish_mmu(&tlb); }
+void zap_vma_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size)
+{
- __zap_page_range_single(vma, address, size, NULL);
Ugh. So zap_vma_range() would actually be emitted as a wrapper function that only calls __zap_page_range_single() (or worse __zap_page_range_single() which is large would be inlined), unless you use LTO.
Another option is to declare __zap_page_range_size() in the header and move this one to the header as inline wrapper.
Ok, I will keep that in mind.
However, Peter seems to prefer just exposing the current zap_page_range_single. The issue with exposing zap_page_range_single as it is today, is that we also need to expose 'struct zap_details' which currently is not visible outside mm/memory.c.