There is an ABBA deadlock scenario between hugetlb_fault() and hugetlb_wp() on the pagecache folio's lock and the hugetlb global mutex, which is reproducible with syzkaller [1]. As the stack traces below reveal, process-1 tries to take the hugetlb global mutex (A3) while holding the pagecache folio's lock. Process-2 holds the hugetlb global mutex but tries to take the pagecache folio's lock.
Process-1                                   Process-2
=========                                   =========
hugetlb_fault
  mutex_lock (A1)
  filemap_lock_hugetlb_folio (B1)
  hugetlb_wp
    alloc_hugetlb_folio #error
    mutex_unlock (A2)
                                            hugetlb_fault
                                              mutex_lock (A4)
                                              filemap_lock_hugetlb_folio (B4)
    unmap_ref_private
    mutex_lock (A3)
Fix it by releasing the pagecache folio's lock at (A2) in process-1 so that the lock is available to process-2 at (B4), avoiding the deadlock. In process-1, a new variable is added to track whether the pagecache folio's lock has been released by its child function hugetlb_wp(), to avoid a double unlock in hugetlb_fault(). Similar changes are applied to hugetlb_no_page().
Link: https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=d... [1]
Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
Cc: stable@vger.kernel.org
Cc: Hugh Dickins <hughd@google.com>
Cc: Florent Revest <revest@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Gavin Guo <gavinguo@igalia.com>
---
V1 -> V2 suggested by Oscar Salvador:
- Use folio_test_locked to replace the unnecessary parameter passing.

V2 -> V3
- Dropped the approach suggested by Oscar.
- Refined the code and git commit message as suggested by Gavin Shan.
 mm/hugetlb.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6a3cf7935c14..560b9b35262a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6137,7 +6137,8 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
  * Keep the pte_same checks anyway to make transition from the mutex easier.
  */
 static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
-		struct vm_fault *vmf)
+		struct vm_fault *vmf,
+		bool *pagecache_folio_locked)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct mm_struct *mm = vma->vm_mm;
@@ -6234,6 +6235,18 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 		u32 hash;

 		folio_put(old_folio);
+		/*
+		 * The pagecache_folio has to be unlocked to avoid
+		 * deadlock and we won't re-lock it in hugetlb_wp(). The
+		 * pagecache_folio could be truncated after being
+		 * unlocked. So its state should not be reliable
+		 * subsequently.
+		 */
+		if (pagecache_folio) {
+			folio_unlock(pagecache_folio);
+			if (pagecache_folio_locked)
+				*pagecache_folio_locked = false;
+		}
 		/*
 		 * Drop hugetlb_fault_mutex and vma_lock before
 		 * unmapping.  unmapping needs to hold vma_lock
@@ -6588,7 +6601,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 		hugetlb_count_add(pages_per_huge_page(h), mm);
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_wp(folio, vmf);
+		ret = hugetlb_wp(folio, vmf, NULL);
 	}

 	spin_unlock(vmf->ptl);
@@ -6660,6 +6673,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct address_space *mapping;
 	int need_wait_lock = 0;
+	bool pagecache_folio_locked = true;
 	struct vm_fault vmf = {
 		.vma = vma,
 		.address = address & huge_page_mask(h),
@@ -6814,7 +6828,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 		if (!huge_pte_write(vmf.orig_pte)) {
-			ret = hugetlb_wp(pagecache_folio, &vmf);
+			ret = hugetlb_wp(pagecache_folio, &vmf,
+					 &pagecache_folio_locked);
 			goto out_put_page;
 		} else if (likely(flags & FAULT_FLAG_WRITE)) {
 			vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
@@ -6832,7 +6847,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(vmf.ptl);

 	if (pagecache_folio) {
-		folio_unlock(pagecache_folio);
+		if (pagecache_folio_locked)
+			folio_unlock(pagecache_folio);
+
 		folio_put(pagecache_folio);
 	}
 out_mutex:
base-commit: 914873bc7df913db988284876c16257e6ab772c6
On Wed, May 28, 2025 at 10:33:26AM +0800, Gavin Guo wrote:
...
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6a3cf7935c14..560b9b35262a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6137,7 +6137,8 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>   * Keep the pte_same checks anyway to make transition from the mutex easier.
>   */
>  static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> -		struct vm_fault *vmf)
> +		struct vm_fault *vmf,
> +		bool *pagecache_folio_locked)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct mm_struct *mm = vma->vm_mm;
> @@ -6234,6 +6235,18 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>  		u32 hash;
>
>  		folio_put(old_folio);
> +		/*
> +		 * The pagecache_folio has to be unlocked to avoid
> +		 * deadlock and we won't re-lock it in hugetlb_wp(). The
> +		 * pagecache_folio could be truncated after being
> +		 * unlocked. So its state should not be reliable
> +		 * subsequently.
> +		 */
> +		if (pagecache_folio) {
> +			folio_unlock(pagecache_folio);
> +			if (pagecache_folio_locked)
> +				*pagecache_folio_locked = false;
> +		}
I am having a problem with this patch, as I think it keeps carrying on an assumption that is not true.
I was discussing this matter yesterday with Peter Xu (CCed now), who also has some experience in this field.
Exactly what does pagecache_folio's lock protect us against when pagecache_folio != old_folio?
There are two cases here:
1) pagecache_folio == old_folio (original page in the pagecache)
2) pagecache_folio != old_folio (original page has already been mapped privately and CoWed, old_folio contains the new folio)
For case 1), we need to hold the lock because we are copying old_folio to the new one in hugetlb_wp(). That is clear.
But for case 2), unless I am missing something, we do not really need the pagecache_folio's lock at all, do we? (only old_folio's one) The only reason pagecache_folio gets looked up in the pagecache is to check whether the current task has mapped and faulted in the file privately, which means that a reservation has been consumed (a new folio was allocated). That is what the whole dance about "old_folio != pagecache_folio && HPAGE_RESV_OWNER" in hugetlb_wp() is about.
And the original mapping cannot really go away either from under us, as remove_inode_hugepages() needs to take the mutex in order to evict it, which would be the only reason counters like resv_huge_pages (adjusted in remove_inode_hugepages()->hugetlb_unreserve_pages()) would interfere with alloc_hugetlb_folio() from hugetlb_wp().
So, again, unless I am missing something there is no need for the pagecache_folio lock when pagecache_folio != old_folio, let alone the need to hold it throughout hugetlb_wp(). I think we could just look up the cache, and unlock it right away.
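Roughly what I have in mind, as a completely untested sketch against the current hugetlb_fault() code (keep the reference so the folio cannot go away under us, but drop the lock right after the lookup):

	pagecache_folio = filemap_lock_hugetlb_folio(h, mapping, vmf.pgoff);
	if (IS_ERR(pagecache_folio)) {
		pagecache_folio = NULL;
	} else {
		/* We only needed to know it is (still) in the pagecache. */
		folio_unlock(pagecache_folio);
	}
	...
	/* and on the way out, only the reference is dropped: */
	if (pagecache_folio)
		folio_put(pagecache_folio);

The fault mutex should keep truncation away in the meantime, as per the above.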
So, the current situation (previous to this patch) is already misleading for case 2).
And comments like:
	/*
	 * The pagecache_folio has to be unlocked to avoid
	 * deadlock and we won't re-lock it in hugetlb_wp(). The
	 * pagecache_folio could be truncated after being
	 * unlocked. So its state should not be reliable
	 * subsequently.
	 */
keep carrying on the assumption that we need the lock.
Now, if the above is true, I would much rather see this reworked (I have some ideas I discussed with Peter yesterday), than keep it as is.
Let me also CC David who tends to have a good overview in this.
On Wed, May 28, 2025 at 11:27:46AM +0200, Oscar Salvador wrote:
> On Wed, May 28, 2025 at 10:33:26AM +0800, Gavin Guo wrote:
> > ...
> I am having a problem with this patch, as I think it keeps carrying on an assumption that is not true.
> I was discussing this matter yesterday with Peter Xu (CCed now), who also has some experience in this field.
> Exactly what does pagecache_folio's lock protect us against when pagecache_folio != old_folio?
> There are two cases here:
> - pagecache_folio == old_folio (original page in the pagecache)
> - pagecache_folio != old_folio (original page has already been mapped privately and CoWed, old_folio contains the new folio)
> For case 1), we need to hold the lock because we are copying old_folio to the new one in hugetlb_wp(). That is clear.
So I'm not 100% sure we need the folio lock even for copy; IIUC a refcount would be enough?
> But for case 2), unless I am missing something, we do not really need the pagecache_folio's lock at all, do we? (only old_folio's one) The only reason pagecache_folio gets looked up in the pagecache is to check whether the current task has mapped and faulted in the file privately, which means that a reservation has been consumed (a new folio was allocated). That is what the whole dance about "old_folio != pagecache_folio && HPAGE_RESV_OWNER" in hugetlb_wp() is about.
> And the original mapping cannot really go away either from under us, as remove_inode_hugepages() needs to take the mutex in order to evict it, which would be the only reason counters like resv_huge_pages (adjusted in remove_inode_hugepages()->hugetlb_unreserve_pages()) would interfere with alloc_hugetlb_folio() from hugetlb_wp().
> So, again, unless I am missing something there is no need for the pagecache_folio lock when pagecache_folio != old_folio, let alone the need to hold it throughout hugetlb_wp(). I think we could just look up the cache, and unlock it right away.
> So, the current situation (previous to this patch) is already misleading for case 2).
> And comments like:
>
> 	/*
> 	 * The pagecache_folio has to be unlocked to avoid
> 	 * deadlock and we won't re-lock it in hugetlb_wp(). The
> 	 * pagecache_folio could be truncated after being
> 	 * unlocked. So its state should not be reliable
> 	 * subsequently.
> 	 */
>
> keep carrying on the assumption that we need the lock.
> Now, if the above is true, I would much rather see this reworked (I have some ideas I discussed with Peter yesterday), than keep it as is.
Yes, just to reply in public: I also am not aware of why the folio lock is needed, considering hugetlb has the fault mutex. I'm not sure if we should rely more on the fault mutex, but that doesn't sound like an immediate concern.
It may depend on whether my above understanding was correct.. and only if so, maybe we could avoid locking the folio completely.
Thanks,
> Let me also CC David who tends to have a good overview in this.
On 28.05.25 17:03, Peter Xu wrote:
> On Wed, May 28, 2025 at 11:27:46AM +0200, Oscar Salvador wrote:
> > On Wed, May 28, 2025 at 10:33:26AM +0800, Gavin Guo wrote:
> > > ...
> > I am having a problem with this patch, as I think it keeps carrying on an assumption that is not true.
> > I was discussing this matter yesterday with Peter Xu (CCed now), who also has some experience in this field.
> > Exactly what does pagecache_folio's lock protect us against when pagecache_folio != old_folio?
> > There are two cases here:
> > - pagecache_folio == old_folio (original page in the pagecache)
> > - pagecache_folio != old_folio (original page has already been mapped privately and CoWed, old_folio contains the new folio)
> > For case 1), we need to hold the lock because we are copying old_folio to the new one in hugetlb_wp(). That is clear.
> So I'm not 100% sure we need the folio lock even for copy; IIUC a refcount would be enough?
The introducing patches seem to talk about blocking concurrent migration / rmap walks.
Maybe also concurrent fallocate(PUNCH_HOLE) is a problem regarding reservations? Not sure ...
For 2) I am also not sure if we need the pagecache folio locked; I doubt it ... but this code is not the easiest to follow.
On Wed, May 28, 2025 at 05:09:26PM +0200, David Hildenbrand wrote:
> On 28.05.25 17:03, Peter Xu wrote:
> > So I'm not 100% sure we need the folio lock even for copy; IIUC a refcount would be enough?
> The introducing patches seem to talk about blocking concurrent migration / rmap walks.
I thought the main reason was that PageLock protects us against writes, so when copying (in case of copying the underlying file), we want the file contents to be stable throughout the copy?
> Maybe also concurrent fallocate(PUNCH_HOLE) is a problem regarding reservations? Not sure ...
fallocate()->hugetlb_vmdelete_list() tries to grab the vma lock in write-mode, and hugetlb_wp() grabs the lock in read-mode, so we should be covered?
Also, hugetlbfs_punch_hole()->remove_inode_hugepages() will try to grab the mutex.
The only fishy thing I see is hugetlbfs_zero_partial_page().
But that is for old_page, and as I said, I thought the main reason was to protect us against writes during the copy.
> For 2) I am also not sure if we need the pagecache folio locked; I doubt it ... but this code is not the easiest to follow.
I have been staring at that code and thinking about potential scenarios for a few days now, and I cannot convince myself that we need pagecache_folio's lock when pagecache_folio != old_folio, because as a matter of fact I cannot think of anything it protects us against.
I plan to rework this in a more sane way, or at least a less obfuscated one, and then Gavin can fire his syzkaller to check whether we are good.
On Wed, May 28, 2025 at 11:45 AM Oscar Salvador <osalvador@suse.de> wrote:
> On Wed, May 28, 2025 at 05:09:26PM +0200, David Hildenbrand wrote:
> > On 28.05.25 17:03, Peter Xu wrote:
> > > So I'm not 100% sure we need the folio lock even for copy; IIUC a refcount would be enough?
> > The introducing patches seem to talk about blocking concurrent migration / rmap walks.
> I thought the main reason was that PageLock protects us against writes, so when copying (in case of copying the underlying file), we want the file contents to be stable throughout the copy?
> > Maybe also concurrent fallocate(PUNCH_HOLE) is a problem regarding reservations? Not sure ...
> fallocate()->hugetlb_vmdelete_list() tries to grab the vma lock in write-mode, and hugetlb_wp() grabs the lock in read-mode, so we should be covered?
> Also, hugetlbfs_punch_hole()->remove_inode_hugepages() will try to grab the mutex.
> The only fishy thing I see is hugetlbfs_zero_partial_page().
> But that is for old_page, and as I said, I thought the main reason was to protect us against writes during the copy.
> > For 2) I am also not sure if we need the pagecache folio locked; I doubt it ... but this code is not the easiest to follow.
> I have been staring at that code and thinking about potential scenarios for a few days now, and I cannot convince myself that we need pagecache_folio's lock when pagecache_folio != old_folio, because as a matter of fact I cannot think of anything it protects us against.
Hi Oscar,
Have you thought about the UFFDIO_CONTINUE case (hugetlb_mfill_atomic_pte())?
I'm slightly concerned that, if you aren't holding pagecache_folio's lock, there might be issues where hugetlb_mfill_atomic_pte() proceeds to map a hugetlb page that it is not supposed to. (For example, if the fault handler does not generally hold pagecache_folio's lock, hugetlb_mfill_atomic_pte() will see a page in the pagecache and map it, even though it may not have been zeroed yet.)
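Roughly the interleaving I'm worried about (hypothetical, not something I have reproduced):

fault path (without pagecache folio lock)   UFFDIO_CONTINUE
=========================================   ===============
allocates a new folio
adds it to the pagecache
                                            hugetlb_mfill_atomic_pte
                                              finds the folio in the pagecache
                                              maps it, although its contents
                                              have not been zeroed yet
zeroes the folio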
I haven't had enough time to fully think through this case, but just want to make sure it has been considered.
Thanks!
> I plan to rework this in a more sane way, or at least a less obfuscated one, and then Gavin can fire his syzkaller to check whether we are good.
On Wed, May 28, 2025 at 12:14:28PM -0400, James Houghton wrote:
[...]
> > > For 2) I am also not sure if we need the pagecache folio locked; I doubt it ... but this code is not the easiest to follow.
> > I have been staring at that code and thinking about potential scenarios for a few days now, and I cannot convince myself that we need pagecache_folio's lock when pagecache_folio != old_folio, because as a matter of fact I cannot think of anything it protects us against.
> Hi Oscar,
Hey, James,
> Have you thought about the UFFDIO_CONTINUE case (hugetlb_mfill_atomic_pte())?
> I'm slightly concerned that, if you aren't holding pagecache_folio's lock, there might be issues where hugetlb_mfill_atomic_pte() proceeds to map a hugetlb page that it is not supposed to. (For example, if the fault handler does not generally hold pagecache_folio's lock, hugetlb_mfill_atomic_pte() will see a page in the pagecache and map it, even though it may not have been zeroed yet.)
> I haven't had enough time to fully think through this case, but just want to make sure it has been considered.
AFAIU we're talking about two separate code paths. IIUC you're talking about a fresh new hugetlb folio being allocated, but then that's what hugetlb_no_page() does. Folio lock required there.
Here IIUC Oscar's context is only in hugetlb_wp() where there's a niche use case to compare whether a VM_PRIVATE mapping has already CoWed once from the pagecache, and whether we need the folio lock for the pagecache lookup. Aka, this one:
	if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
	    !(vma->vm_flags & VM_MAYSHARE) && !huge_pte_write(vmf.orig_pte)) {
		if (vma_needs_reservation(h, vma, vmf.address) < 0) {
			ret = VM_FAULT_OOM;
			goto out_mutex;
		}
		/* Just decrements count, does not deallocate */
		vma_end_reservation(h, vma, vmf.address);

		pagecache_folio = filemap_lock_hugetlb_folio(h, mapping,    <---
							     vmf.pgoff);
		if (IS_ERR(pagecache_folio))
			pagecache_folio = NULL;
	}
Thanks,
On Wed, May 28, 2025 at 05:45:42PM +0200, Oscar Salvador wrote:
> I thought the main reason was that PageLock protects us against writes, so when copying (in case of copying the underlying file), we want the file contents to be stable throughout the copy?
The folio can already be mapped writable in other VM_SHARED vmas.. which means userspace is free to write whatever it wants while the kernel is copying, right?
IIUC there's no way to make sure the folio content is stable as long as it can be mapped; CoW should just happen, and the result of the copied page is unpredictable if there are concurrent writes.
IMHO it's the userspace's job if it wants to make sure the folio (when triggering CoW) copies a stable piece of content.
That's also why I was thinking maybe we don't need the folio lock at all. We will still need a refcount, though, for the pagecache folio, to make sure it won't get freed concurrently.
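As a very rough, untested sketch of the copy path, reusing the names from hugetlb_wp() (error handling elided):

	/*
	 * Pin old_folio across the copy instead of holding its lock.
	 * The reference keeps the folio from being freed; concurrent
	 * writers can still change the contents, but as said above
	 * that is true for ordinary CoW pages as well.
	 */
	folio_get(old_folio);
	copy_user_large_folio(new_folio, old_folio, vmf->real_address, vma);
	folio_put(old_folio);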
Thanks,
On 28.05.25 17:45, Oscar Salvador wrote:
> On Wed, May 28, 2025 at 05:09:26PM +0200, David Hildenbrand wrote:
> > On 28.05.25 17:03, Peter Xu wrote:
> > > So I'm not 100% sure we need the folio lock even for copy; IIUC a refcount would be enough?
> > The introducing patches seem to talk about blocking concurrent migration / rmap walks.
> I thought the main reason was that PageLock protects us against writes, so when copying (in case of copying the underlying file), we want the file contents to be stable throughout the copy?
Well, we don't do the same for ordinary pages, so why should we for hugetlb?
See wp_page_copy().
If you have a MAP_PRIVATE mapping of a file and modify the pagecache pages concurrently (write to another MAP_SHARED mapping, write() ...), there are no guarantees about one observing any specific page state.
At least not that I am aware of ;)
> > Maybe also concurrent fallocate(PUNCH_HOLE) is a problem regarding reservations? Not sure ...
> fallocate()->hugetlb_vmdelete_list() tries to grab the vma lock in write-mode, and hugetlb_wp() grabs the lock in read-mode, so we should be covered?
Yeah, maybe that's the case nowadays. Maybe it wasn't in the past ...
> Also, hugetlbfs_punch_hole()->remove_inode_hugepages() will try to grab the mutex.
> The only fishy thing I see is hugetlbfs_zero_partial_page().
> But that is for old_page, and as I said, I thought the main reason was to protect us against writes during the copy.
See above, I really wouldn't understand why that is required.
> > For 2) I am also not sure if we need the pagecache folio locked; I doubt it ... but this code is not the easiest to follow.
> I have been staring at that code and thinking about potential scenarios for a few days now, and I cannot convince myself that we need pagecache_folio's lock when pagecache_folio != old_folio, because as a matter of fact I cannot think of anything it protects us against.
> I plan to rework this in a more sane way, or at least a less obfuscated one, and then Gavin can fire his syzkaller to check whether we are good.
On 28.05.25 22:00, David Hildenbrand wrote:
[...]
Digging a bit:
commit 56c9cfb13c9b6516017eea4e8cbe22ea02e07ee6
Author: Naoya Horiguchi <nao.horiguchi@gmail.com>
Date:   Fri Sep 10 13:23:04 2010 +0900

    hugetlb, rmap: fix confusing page locking in hugetlb_cow()

    The "if (!trylock_page)" block in the avoidcopy path of hugetlb_cow()
    looks confusing and is buggy. Originally this trylock_page() was
    intended to make sure that old_page is locked even when old_page !=
    pagecache_page, because then only pagecache_page is locked.
Added the comment
+	/*
+	 * hugetlb_cow() requires page locks of pte_page(entry) and
+	 * pagecache_page, so here we need take the former one
+	 * when page != pagecache_page or !pagecache_page.
+	 * Note that locking order is always pagecache_page -> page,
+	 * so no worry about deadlock.
+	 */
And
commit 0fe6e20b9c4c53b3e97096ee73a0857f60aad43f
Author: Naoya Horiguchi <nao.horiguchi@gmail.com>
Date:   Fri May 28 09:29:16 2010 +0900

    hugetlb, rmap: add reverse mapping for hugepage

    This patch adds reverse mapping feature for hugepage by introducing
    mapcount for shared/private-mapped hugepage and anon_vma for
    private-mapped hugepage.

    While hugepage is not currently swappable, reverse mapping can be
    useful for memory error handler.

    Without this patch, memory error handler cannot identify processes
    using the bad hugepage nor unmap it from them. That is:
    - for shared hugepage:
      we can collect processes using a hugepage through pagecache, but
      can not unmap the hugepage because of the lack of mapcount.
    - for privately mapped hugepage:
      we can neither collect processes nor unmap the hugepage.
    This patch solves these problems.

    This patch include the bug fix given by commit 23be7468e8, so reverts it.
Added the real locking magic.
Not that much changed regarding locking until COW support was added in
commit 1e8f889b10d8d2223105719e36ce45688fedbd59
Author: David Gibson <david@gibson.dropbear.id.au>
Date:   Fri Jan 6 00:10:44 2006 -0800

    [PATCH] Hugetlb: Copy on Write support

    Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE
    can be supported. This helps us to safely use hugetlb pages in many
    more applications. The patch makes the following changes. If needed,
    I also have it broken out according to the following paragraphs.
Confusing.
Locking the *old_folio* when calling hugetlb_wp() makes sense when it is an anon folio because we might want to call folio_move_anon_rmap() to adjust the rmap root.
Locking the pagecache folio when calling hugetlb_wp() if old_folio is an anon folio ... does not make sense to me.
Locking the pagecache folio when calling hugetlb_wp() if old_folio is a pagecache folio ... also doesn't quite make sense to me.
Again, we don't take the lock for ordinary pages, so what's special about hugetlb for the last case (reservations, I assume?).
On Wed, May 28, 2025 at 10:26:04PM +0200, David Hildenbrand wrote:
> Digging a bit:
> commit 56c9cfb13c9b6516017eea4e8cbe22ea02e07ee6
> Author: Naoya Horiguchi <nao.horiguchi@gmail.com>
> Date:   Fri Sep 10 13:23:04 2010 +0900
>
>     hugetlb, rmap: fix confusing page locking in hugetlb_cow()
>
>     The "if (!trylock_page)" block in the avoidcopy path of hugetlb_cow()
>     looks confusing and is buggy. Originally this trylock_page() was
>     intended to make sure that old_page is locked even when old_page !=
>     pagecache_page, because then only pagecache_page is locked.
> Added the comment
> 	/*
> 	 * hugetlb_cow() requires page locks of pte_page(entry) and
> 	 * pagecache_page, so here we need take the former one
> 	 * when page != pagecache_page or !pagecache_page.
> 	 * Note that locking order is always pagecache_page -> page,
> 	 * so no worry about deadlock.
> 	 */
> And
> commit 0fe6e20b9c4c53b3e97096ee73a0857f60aad43f
> Author: Naoya Horiguchi <nao.horiguchi@gmail.com>
> Date:   Fri May 28 09:29:16 2010 +0900
>
>     hugetlb, rmap: add reverse mapping for hugepage
> [...]
> Added the real locking magic.
Yes, I have been checking "hugetlb, rmap: add reverse mapping for hugepage", which added locking the now-so-called 'old_folio' in case hugetlbfs_pagecache_page() didn't return anything.
Because in hugetlb_wp(), this was added:
@@ -2286,8 +2299,11 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 retry_avoidcopy:
 	/* If no-one else is actually using this page, avoid the copy
 	 * and just make the page writable */
-	avoidcopy = (page_count(old_page) == 1);
+	avoidcopy = (page_mapcount(old_page) == 1);
 	if (avoidcopy) {
+		if (!trylock_page(old_page))
+			if (PageAnon(old_page))
+				page_move_anon_rmap(old_page, vma, address);
So, as you mentioned, it was done to keep the rmap stable, as I guess the rmap code tests the PageLock.
> Not that much changed regarding locking until COW support was added in
> commit 1e8f889b10d8d2223105719e36ce45688fedbd59
> Author: David Gibson <david@gibson.dropbear.id.au>
> Date:   Fri Jan 6 00:10:44 2006 -0800
>
>     [PATCH] Hugetlb: Copy on Write support
> [...]
> Confusing.
> Locking the *old_folio* when calling hugetlb_wp() makes sense when it is an anon folio because we might want to call folio_move_anon_rmap() to adjust the rmap root.
Yes, this is clear.
> Locking the pagecache folio when calling hugetlb_wp() if old_folio is an anon folio ... does not make sense to me.
I think this one is also clear.
> Locking the pagecache folio when calling hugetlb_wp() if old_folio is a pagecache folio ... also doesn't quite make sense to me.
> Again, we don't take the lock for ordinary pages, so what's special about hugetlb for the last case (reservations, I assume?).
So, this case is when pagecache_folio == old_folio.
I guess we are talking about resv_maps? But I think we cannot interfere there. For the reserves to be modified the page has to go away.
Now, I have been checking this one too:
commit 04f2cbe35699d22dbf428373682ead85ca1240f5
Author: Mel Gorman <mel@csn.ul.ie>
Date:   Wed Jul 23 21:27:25 2008 -0700

    hugetlb: guarantee that COW faults for a process that called
    mmap(MAP_PRIVATE) on hugetlbfs will succeed
And I think it is interesting. That one added this chunk in hugetlb_fault():
@@ -1126,8 +1283,15 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_lock(&mm->page_table_lock);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (likely(pte_same(entry, huge_ptep_get(ptep))))
-		if (write_access && !pte_write(entry))
-			ret = hugetlb_cow(mm, vma, address, ptep, entry);
+		if (write_access && !pte_write(entry)) {
+			struct page *page;
+			page = hugetlbfs_pagecache_page(vma, address);
+			ret = hugetlb_cow(mm, vma, address, ptep, entry, page);
+			if (page) {
+				unlock_page(page);
+				put_page(page);
+			}
+		}
So, it finds and locks the page in the pagecache, and calls hugetlb_cow().
hugetlb_fault() takes hugetlb_instantiation_mutex, and there is a comment saying:
	/*
	 * Serialize hugepage allocation and instantiation, so that we don't
	 * get spurious allocation failures if two CPUs race to instantiate
	 * the same page in the page cache.
	 */
	mutex_lock(&hugetlb_instantiation_mutex);
But it does not say anything about truncation. Actually, checking the truncation code from back then, neither truncate_hugepages() nor any of its callers took the hugetlb_instantiation_mutex, unlike what is done today (e.g. the current remove_inode_hugepages() code).
Back then, truncate_hugepages() relied only on lock_page():
static void truncate_hugepages(struct inode *inode, loff_t lstart)
{
	...
	lock_page(page);
	truncate_huge_page(page);
	unlock_page(page);
	...
}
While today, remove_inode_hugepages() takes the mutex, and also the lock. And then zaps the page and does its thing with resv_maps.
So I think that we should not even need the lock for hugetlb_wp() when pagecache_folio == old_folio (pagecache), because the mutex already protects us from the page going away (e.g. being truncated), right? Besides, we hold a reference on that page, since filemap_lock_hugetlb_folio() locks the page and increases its refcount.
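To put it as one of those traces, the serialization I have in mind (assuming I am reading remove_inode_hugepages() right):

hugetlb_fault / hugetlb_wp                  remove_inode_hugepages
==========================                  ======================
mutex_lock (fault mutex)
filemap_lock_hugetlb_folio
                                            mutex_lock (fault mutex) #blocks
alloc_hugetlb_folio / CoW / ...
mutex_unlock (fault mutex)
                                            folio_lock
                                            remove the folio, adjust resv_map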
All in all, I am leaning towards it not being needed, but it's getting late here..