On Wed, May 28, 2025 at 10:26:04PM +0200, David Hildenbrand wrote:
Digging a bit:
commit 56c9cfb13c9b6516017eea4e8cbe22ea02e07ee6
Author: Naoya Horiguchi <nao.horiguchi@gmail.com>
Date:   Fri Sep 10 13:23:04 2010 +0900
    hugetlb, rmap: fix confusing page locking in hugetlb_cow()

    The "if (!trylock_page)" block in the avoidcopy path of hugetlb_cow()
    looks confusing and is buggy. Originally this trylock_page() was
    intended to make sure that old_page is locked even when old_page !=
    pagecache_page, because then only pagecache_page is locked.
Added the comment
/*
* hugetlb_cow() requires page locks of pte_page(entry) and
* pagecache_page, so here we need take the former one
* when page != pagecache_page or !pagecache_page.
* Note that locking order is always pagecache_page -> page,
* so no worry about deadlock.
*/
And
commit 0fe6e20b9c4c53b3e97096ee73a0857f60aad43f
Author: Naoya Horiguchi <nao.horiguchi@gmail.com>
Date:   Fri May 28 09:29:16 2010 +0900
    hugetlb, rmap: add reverse mapping for hugepage

    This patch adds reverse mapping feature for hugepage by introducing
    mapcount for shared/private-mapped hugepage and anon_vma for
    private-mapped hugepage.

    While hugepage is not currently swappable, reverse mapping can be
    useful for memory error handler.

    Without this patch, memory error handler cannot identify processes
    using the bad hugepage nor unmap it from them. That is:
    - for shared hugepage:
      we can collect processes using a hugepage through pagecache,
      but can not unmap the hugepage because of the lack of mapcount.
    - for privately mapped hugepage:
      we can neither collect processes nor unmap the hugepage.
    This patch solves these problems.

    This patch include the bug fix given by commit 23be7468e8, so reverts it.
Added the real locking magic.
Yes, I have been checking "hugetlb, rmap: add reverse mapping for hugepage", which added locking the now-so-called 'old_folio' in case hugetlbfs_pagecache_page() didn't return anything.
Because in hugetlb_wp() (hugetlb_cow() back then), this was added:
@@ -2286,8 +2299,11 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 retry_avoidcopy:
 	/* If no-one else is actually using this page, avoid the copy
 	 * and just make the page writable */
-	avoidcopy = (page_count(old_page) == 1);
+	avoidcopy = (page_mapcount(old_page) == 1);
 	if (avoidcopy) {
+		if (!trylock_page(old_page))
+			if (PageAnon(old_page))
+				page_move_anon_rmap(old_page, vma, address);
So, as you mentioned, it was done to keep the rmap stable, as I guess the rmap code tests the page lock.
Not that much changed regarding locking until COW support was added in
commit 1e8f889b10d8d2223105719e36ce45688fedbd59
Author: David Gibson <david@gibson.dropbear.id.au>
Date:   Fri Jan 6 00:10:44 2006 -0800
    [PATCH] Hugetlb: Copy on Write support

    Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE
    can be supported.  This helps us to safely use hugetlb pages in many
    more applications.  The patch makes the following changes.  If needed,
    I also have it broken out according to the following paragraphs.
Confusing.
Locking the *old_folio* when calling hugetlb_wp() makes sense when it is an anon folio because we might want to call folio_move_anon_rmap() to adjust the rmap root.
Yes, this is clear.
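Just to spell out the dependency (with a hypothetical helper of my own, nothing that exists in hugetlb.c): folio_move_anon_rmap() retargets the folio's anon_vma, and that is only safe while the folio lock is held, so rmap walkers cannot see the mapping change under them:

static void reuse_anon_folio(struct folio *old_folio,
			     struct vm_area_struct *vma)
{
	/* Hypothetical helper, just to illustrate the invariant:
	 * the anon_vma may only be retargeted under the folio lock. */
	VM_WARN_ON_ONCE(!folio_test_locked(old_folio));
	folio_move_anon_rmap(old_folio, vma);
}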
Locking the pagecache folio when calling hugetlb_wp() if old_folio is an anon folio ... does not make sense to me.
I think this one is also clear.
Locking the pagecache folio when calling hugetlb_wp() if old_folio is a pagecache folio ... also doesn't quite make sense to me. Again, we don't take the lock for ordinary pages, so what's special about hugetlb for the last case (reservations, I assume?).
So, this case is when pagecache_folio == old_folio.
I guess we are talking about resv_maps? But I think we cannot interfere there: for the reserves to be modified, the page has to go away.
Now, I have been checking this one too:
commit 04f2cbe35699d22dbf428373682ead85ca1240f5
Author: Mel Gorman <mel@csn.ul.ie>
Date:   Wed Jul 23 21:27:25 2008 -0700
    hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed
And I think it is interesting. That one added this chunk in hugetlb_fault():
@@ -1126,8 +1283,15 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_lock(&mm->page_table_lock);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (likely(pte_same(entry, huge_ptep_get(ptep))))
-		if (write_access && !pte_write(entry))
-			ret = hugetlb_cow(mm, vma, address, ptep, entry);
+		if (write_access && !pte_write(entry)) {
+			struct page *page;
+			page = hugetlbfs_pagecache_page(vma, address);
+			ret = hugetlb_cow(mm, vma, address, ptep, entry, page);
+			if (page) {
+				unlock_page(page);
+				put_page(page);
+			}
+		}
So, it finds and locks the page in the pagecache, and calls hugetlb_cow().
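(For reference, the lookup side is basically a find_lock_page() on the inode's mapping; a minimal sketch, with the index computation elided and a helper name of my own rather than the kernel's:

static struct page *pagecache_page_sketch(struct address_space *mapping,
					  pgoff_t idx)
{
	/* Returns the page locked and with an extra reference, or NULL */
	return find_lock_page(mapping, idx);
}

which is also why the hunk above has to unlock_page()/put_page() once hugetlb_cow() returns.)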
hugetlb_fault() takes hugetlb_instantiation_mutex, and there is a comment saying:
	/*
	 * Serialize hugepage allocation and instantiation, so that we don't
	 * get spurious allocation failures if two CPUs race to instantiate
	 * the same page in the page cache.
	 */
	mutex_lock(&hugetlb_instantiation_mutex);
But it does not say anything about truncation. Actually, checking the truncation code from back then, neither truncate_hugepages() nor any of its callers took the hugetlb_instantiation_mutex, the way it is done today (e.g. the current remove_inode_hugepages() code).
Back then, truncate_hugepages() relied only on lock_page():
static void truncate_hugepages(struct inode *inode, loff_t lstart)
{
	...
	...
	lock_page(page);
	truncate_huge_page(page);
	unlock_page(page);
}
Today, on the other hand, remove_inode_hugepages() takes the mutex and also the lock, and only then zaps the page and does its thing with the resv_map.
So I think that we should not even need the lock for hugetlb_wp() when pagecache_folio == old_folio (pagecache), because the mutex already protects us from the page going away (e.g. being truncated), right? Besides, we hold a reference on that folio, since filemap_lock_hugetlb_folio() locks the folio and increases its refcount.
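Roughly, the serialization I have in mind looks like this (a sketch from memory, not verbatim kernel code; error handling and the actual removal are elided):

	/* fault side (hugetlb_fault() today) */
	hash = hugetlb_fault_mutex_hash(mapping, idx);
	mutex_lock(&hugetlb_fault_mutex_table[hash]);
	folio = filemap_lock_hugetlb_folio(h, mapping, idx);	/* locked + extra ref */
	/* ... hugetlb_wp() runs with the mutex held ... */
	mutex_unlock(&hugetlb_fault_mutex_table[hash]);

	/* truncation side (remove_inode_hugepages() today) */
	hash = hugetlb_fault_mutex_hash(mapping, idx);
	mutex_lock(&hugetlb_fault_mutex_table[hash]);
	folio_lock(folio);
	/* unmap, remove from the page cache, fix up the resv_map */
	folio_unlock(folio);
	mutex_unlock(&hugetlb_fault_mutex_table[hash]);

So truncation cannot pull the folio out from under the fault path while the fault mutex is held, and the folio lock plus the extra reference pin it on top of that.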
All in all, I am leaning towards the lock not being needed, but it's getting late here...