The patch fixes a deadlock that can be triggered by an internal syzkaller [1] reproducer and captured by a bpftrace script [2] and its log [3] in this scenario:
Process 1                              Process 2
---                                    ---
hugetlb_fault
  mutex_lock(B) // take B
  filemap_lock_hugetlb_folio
    filemap_lock_folio
      __filemap_get_folio
        folio_lock(A) // take A
  hugetlb_wp
    mutex_unlock(B) // release B
    ...                                hugetlb_fault
    ...                                  mutex_lock(B) // take B
                                         filemap_lock_hugetlb_folio
                                           filemap_lock_folio
                                             __filemap_get_folio
                                               folio_lock(A) // blocked
    unmap_ref_private
    ...
    mutex_lock(B) // retake and blocked
This is an ABBA deadlock involving two locks:
- Lock A: pagecache_folio lock
- Lock B: hugetlb_fault_mutex_table lock
The deadlock occurs between two processes as follows:

1. The first process (let’s call it Process 1) is handling a copy-on-write (COW) operation on a hugepage via hugetlb_wp. Due to insufficient reserved hugetlb pages, Process 1, owner of the reserved hugetlb page, attempts to unmap a hugepage owned by another process (non-owner) to satisfy the reservation. Before unmapping, Process 1 acquires Lock B (hugetlb_fault_mutex_table lock) and then Lock A (pagecache_folio lock). To proceed with the unmap, it releases Lock B but retains Lock A. After the unmap, Process 1 tries to reacquire Lock B. However, at this point, Lock B has already been acquired by another process.

2. The second process (Process 2) enters the hugetlb_fault handler during the unmap operation. It successfully acquires Lock B (hugetlb_fault_mutex_table lock) that was just released by Process 1, but then attempts to acquire Lock A (pagecache_folio lock), which is still held by Process 1.

As a result, Process 1 (holding Lock A) is blocked waiting for Lock B (held by Process 2), while Process 2 (holding Lock B) is blocked waiting for Lock A (held by Process 1), forming an ABBA deadlock.
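The interleaving described above can be replayed in a small user-space model. This is only a sketch under stated assumptions, not kernel code: `model_lock`, `model_trylock`, `model_unlock` and `replay_deadlocks` are hypothetical names, each lock simply records its owner, and "blocked" is modelled as a failed trylock. The `drop_a_before_unmap` switch models the first hunk of the patch, where Lock A is released before Lock B is dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: each lock records its current owner. */
enum { NOBODY = 0, PROC1 = 1, PROC2 = 2 };

struct model_lock {
	int owner;	/* current holder, or NOBODY */
};

/* Returns false where a real task would have to block. */
static bool model_trylock(struct model_lock *l, int who)
{
	if (l->owner != NOBODY)
		return false;
	l->owner = who;
	return true;
}

static void model_unlock(struct model_lock *l, int who)
{
	assert(l->owner == who);	/* only the holder unlocks in this model */
	l->owner = NOBODY;
}

/*
 * Replay the interleaving from the description.
 * Returns true if each process waits for a lock the other one holds.
 */
static bool replay_deadlocks(bool drop_a_before_unmap)
{
	struct model_lock A = { NOBODY };	/* pagecache_folio lock */
	struct model_lock B = { NOBODY };	/* hugetlb_fault_mutex_table lock */
	bool p1_blocked, p2_blocked;

	/* Process 1: hugetlb_fault takes B, then A. */
	model_trylock(&B, PROC1);
	model_trylock(&A, PROC1);

	/* Process 1: hugetlb_wp releases B before unmapping. */
	if (drop_a_before_unmap)
		model_unlock(&A, PROC1);
	model_unlock(&B, PROC1);

	/* Process 2: hugetlb_fault takes B, then tries A. */
	model_trylock(&B, PROC2);
	p2_blocked = !model_trylock(&A, PROC2);

	/* Process 1: after the unmap, tries to retake B. */
	p1_blocked = !model_trylock(&B, PROC1);

	return p1_blocked && p2_blocked;
}
```

Calling `replay_deadlocks(false)` reproduces the ABBA cycle. With `replay_deadlocks(true)` Process 1 still has to wait for Lock B, but Process 2 is no longer blocked on Lock A, so the wait can end.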
The error message:
INFO: task repro_20250402_:13229 blocked for more than 64 seconds.
      Not tainted 6.15.0-rc3+ #24
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:repro_20250402_ state:D stack:25856 pid:13229 tgid:13228 ppid:3513 task_flags:0x400040 flags:0x00004006
Call Trace:
 <TASK>
 __schedule+0x1755/0x4f50
 schedule+0x158/0x330
 schedule_preempt_disabled+0x15/0x30
 __mutex_lock+0x75f/0xeb0
 hugetlb_wp+0xf88/0x3440
 hugetlb_fault+0x14c8/0x2c30
 trace_clock_x86_tsc+0x20/0x20
 do_user_addr_fault+0x61d/0x1490
 exc_page_fault+0x64/0x100
 asm_exc_page_fault+0x26/0x30
RIP: 0010:__put_user_4+0xd/0x20
 copy_process+0x1f4a/0x3d60
 kernel_clone+0x210/0x8f0
 __x64_sys_clone+0x18d/0x1f0
 do_syscall_64+0x6a/0x120
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x41b26d
 </TASK>
INFO: task repro_20250402_:13229 is blocked on a mutex likely owned by task repro_20250402_:13250.
task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 ppid:3513 task_flags:0x400040 flags:0x00000006
Call Trace:
 <TASK>
 __schedule+0x1755/0x4f50
 schedule+0x158/0x330
 io_schedule+0x92/0x110
 folio_wait_bit_common+0x69a/0xba0
 __filemap_get_folio+0x154/0xb70
 hugetlb_fault+0xa50/0x2c30
 trace_clock_x86_tsc+0x20/0x20
 do_user_addr_fault+0xace/0x1490
 exc_page_fault+0x64/0x100
 asm_exc_page_fault+0x26/0x30
RIP: 0033:0x402619
 </TASK>
INFO: task repro_20250402_:13250 blocked for more than 65 seconds.
      Not tainted 6.15.0-rc3+ #24
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 ppid:3513 task_flags:0x400040 flags:0x00000006
Call Trace:
 <TASK>
 __schedule+0x1755/0x4f50
 schedule+0x158/0x330
 io_schedule+0x92/0x110
 folio_wait_bit_common+0x69a/0xba0
 __filemap_get_folio+0x154/0xb70
 hugetlb_fault+0xa50/0x2c30
 trace_clock_x86_tsc+0x20/0x20
 do_user_addr_fault+0xace/0x1490
 exc_page_fault+0x64/0x100
 asm_exc_page_fault+0x26/0x30
RIP: 0033:0x402619
 </TASK>
Showing all locks held in the system:
1 lock held by khungtaskd/35:
 #0: ffffffff879a7440 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x30/0x180
2 locks held by repro_20250402_/13229:
 #0: ffff888017d801e0 (&mm->mmap_lock){++++}-{4:4}, at: lock_mm_and_find_vma+0x37/0x300
 #1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_wp+0xf88/0x3440
3 locks held by repro_20250402_/13250:
 #0: ffff8880177f3d08 (vm_lock){++++}-{0:0}, at: do_user_addr_fault+0x41b/0x1490
 #1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_fault+0x3b8/0x2c30
 #2: ffff8880129500e8 (&resv_map->rw_sema){++++}-{4:4}, at: hugetlb_fault+0x494/0x2c30
Link: https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=d... [1]
Link: https://github.com/bboymimi/bpftracer/blob/master/scripts/hugetlb_lock_debug... [2]
Link: https://drive.google.com/file/d/1bWq2-8o-BJAuhoHWX7zAhI6ggfhVzQUI/view?usp=s... [3]
Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
Cc: stable@vger.kernel.org
Cc: Hugh Dickins <hughd@google.com>
Cc: Florent Revest <revest@google.com>
Cc: Gavin Shan <gshan@redhat.com>
Suggested-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Gavin Guo <gavinguo@igalia.com>
---
V1 -> V2
Suggested-by Oscar Salvador:
- Use folio_test_locked to replace the unnecessary parameter passing.
 mm/hugetlb.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7ae38bfb9096..ed501f134eff 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6226,6 +6226,12 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 			u32 hash;
 
 			folio_put(old_folio);
+			/*
+			 * The pagecache_folio needs to be unlocked to avoid
+			 * deadlock when the child unmaps the folio.
+			 */
+			if (pagecache_folio)
+				folio_unlock(pagecache_folio);
 			/*
 			 * Drop hugetlb_fault_mutex and vma_lock before
 			 * unmapping.  unmapping needs to hold vma_lock
@@ -6823,8 +6829,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 out_ptl:
 	spin_unlock(vmf.ptl);
 
+	/*
+	 * hugetlb_wp() might have already unlocked pagecache_folio, so
+	 * skip it if that is the case.
+	 */
 	if (pagecache_folio) {
-		folio_unlock(pagecache_folio);
+		if (folio_test_locked(pagecache_folio))
+			folio_unlock(pagecache_folio);
 		folio_put(pagecache_folio);
 	}
 out_mutex:
base-commit: 4a95bc121ccdaee04c4d72f84dbfa6b880a514b6
On Wed, May 21, 2025 at 07:57:27PM +0800, Gavin Guo wrote:
> ...
> Signed-off-by: Gavin Guo <gavinguo@igalia.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
On Wed, 21 May 2025, Gavin Guo wrote:
> ...
> V1 -> V2
> Suggested-by Oscar Salvador:
> - Use folio_test_locked to replace the unnecessary parameter passing.
>
>  mm/hugetlb.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7ae38bfb9096..ed501f134eff 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6226,6 +6226,12 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>  			u32 hash;
>  
>  			folio_put(old_folio);
> +			/*
> +			 * The pagecache_folio needs to be unlocked to avoid
> +			 * deadlock when the child unmaps the folio.
> +			 */
> +			if (pagecache_folio)
> +				folio_unlock(pagecache_folio);
>  			/*
>  			 * Drop hugetlb_fault_mutex and vma_lock before
>  			 * unmapping.  unmapping needs to hold vma_lock
> @@ -6823,8 +6829,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  out_ptl:
>  	spin_unlock(vmf.ptl);
>  
> +	/*
> +	 * hugetlb_wp() might have already unlocked pagecache_folio, so
> +	 * skip it if that is the case.
> +	 */
>  	if (pagecache_folio) {
> -		folio_unlock(pagecache_folio);
> +		if (folio_test_locked(pagecache_folio))
> +			folio_unlock(pagecache_folio);
>  		folio_put(pagecache_folio);
>  	}
>  out_mutex:
NAK!
I have not (and shall not) review V1, but was hoping someone else would save me from rejecting this V2 idea immediately.
Unless you have a very strong argument why this folio is invisible to the rest of the world, including speculative accessors like compaction (and the name "pagecache_folio" suggests very much the reverse): the pattern of unlocking a lock when you see it locked is like (or worse than) having no locking at all - it is potentially unlocking someone else's lock.
Hugh
On Wed, May 21, 2025 at 08:10:46AM -0700, Hugh Dickins wrote:
> Unless you have a very strong argument why this folio is invisible to the rest of the world, including speculative accessors like compaction (and the name "pagecache_folio" suggests very much the reverse): the pattern of unlocking a lock when you see it locked is like (or worse than) having no locking at all - it is potentially unlocking someone else's lock.
hugetlb_fault() locks 'pagecache_folio' and unlocks it after returning from hugetlb_wp(). This patch introduces the possibility that hugetlb_wp() can also unlock it for the reasons explained. So, when hugetlb_wp() returns back to hugetlb_fault(), we
 1) either still hold the lock (because hugetlb_fault() took it)
 2) or we do not anymore because hugetlb_wp() unlocked it for us.
So it is not that we are unlocking anything blindly, because if the lock is still 'taken' (folio_test_locked() returned true) it is because we, hugetlb_fault(), took it and we are still holding it.
On Wed, 21 May 2025, Oscar Salvador wrote:
> hugetlb_fault() locks 'pagecache_folio' and unlocks it after returning from hugetlb_wp(). This patch introduces the possibility that hugetlb_wp() can also unlock it for the reasons explained. So, when hugetlb_wp() returns back to hugetlb_fault(), we
>
>  1) either still hold the lock (because hugetlb_fault() took it)
>  2) or we do not anymore because hugetlb_wp() unlocked it for us.
>
> So it is not that we are unlocking anything blindly, because if the lock is still 'taken' (folio_test_locked() returned true) it is because we, hugetlb_fault() took it and we are still holding it.
If we unlocked it, anyone else could have taken it immediately after.
Hugh
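The window being pointed out here can be shown with a toy model. This is a hedged sketch, not kernel code: `model_lock`, `model_test_locked` and the other names are hypothetical, and the `owner` field exists only so the model can report whose lock was released. It assumes, as the objection implies, that a "locked" test says nothing about who holds the lock.

```c
#include <assert.h>
#include <stdbool.h>

enum { NOBODY = 0, FAULT_PATH = 1, OTHER_TASK = 2 };

/* Toy stand-in for a folio lock; owner is tracked for reporting only. */
struct model_lock {
	int owner;
};

/* Answers "is it locked?", not "is it locked by me?". */
static bool model_test_locked(const struct model_lock *l)
{
	return l->owner != NOBODY;
}

static bool model_trylock(struct model_lock *l, int who)
{
	if (l->owner != NOBODY)
		return false;
	l->owner = who;
	return true;
}

/* Unconditional release: no ownership check, like clearing a bit. */
static void model_unlock(struct model_lock *l)
{
	l->owner = NOBODY;
}

/*
 * Returns true if the fault path ends up releasing a lock that
 * OTHER_TASK was holding.
 */
static bool unlock_if_locked_steals(void)
{
	struct model_lock folio = { NOBODY };
	bool stolen = false;

	model_trylock(&folio, FAULT_PATH);	/* hugetlb_fault locks the folio */
	model_unlock(&folio);			/* hugetlb_wp unlocks it early */
	model_trylock(&folio, OTHER_TASK);	/* someone else locks it */

	/* Back in hugetlb_fault: "if it is locked, unlock it". */
	if (model_test_locked(&folio)) {
		stolen = (folio.owner == OTHER_TASK);
		model_unlock(&folio);
	}
	return stolen;
}
```

In this model `unlock_if_locked_steals()` returns true: the final unlock releases the lock taken by `OTHER_TASK`, which is the "unlocking someone else's lock" case.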
On Wed, May 21, 2025 at 08:58:32AM -0700, Hugh Dickins wrote:
> If we unlocked it, anyone else could have taken it immediately after.
Sorry Hugh, I was being dumb, of course you are right.
Then, maybe v1 was not really a bad idea, but we might need to think of a better idea overall.
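V1 is not shown in this thread; the changelog only says it used parameter passing. As a rough illustration of that general idea, and not of the actual V1 patch, the sketch below lets the callee report through a flag whether the caller still holds the lock, so the caller decides from its own state instead of testing the lock. All names (`model_wp`, `fault_with_flag`, `caller_holds_lock`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { NOBODY = 0, FAULT_PATH = 1, OTHER_TASK = 2 };

/* Toy model only: the lock records its current owner. */
struct model_lock {
	int owner;
};

static bool model_trylock(struct model_lock *l, int who)
{
	if (l->owner != NOBODY)
		return false;
	l->owner = who;
	return true;
}

static void model_unlock(struct model_lock *l, int who)
{
	assert(l->owner == who);	/* catches releasing a foreign lock */
	l->owner = NOBODY;
}

/* Hypothetical callee: drops the caller's lock and says so. */
static void model_wp(struct model_lock *folio, bool *caller_holds_lock)
{
	model_unlock(folio, FAULT_PATH);
	*caller_holds_lock = false;
}

/* Returns the owner of the lock once the fault path has finished. */
static int fault_with_flag(void)
{
	struct model_lock folio = { NOBODY };
	bool holds_lock;

	holds_lock = model_trylock(&folio, FAULT_PATH);
	model_wp(&folio, &holds_lock);
	model_trylock(&folio, OTHER_TASK);	/* a racing locker gets in */

	if (holds_lock)				/* decided by our own state */
		model_unlock(&folio, FAULT_PATH);
	return folio.owner;
}
```

Here `fault_with_flag()` returns `OTHER_TASK`: the racing locker keeps its lock because the fault path never consults the lock state to decide whether to unlock.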
On 5/21/25 23:58, Hugh Dickins wrote:
> If we unlocked it, anyone else could have taken it immediately after.
>
> Hugh
Sigh, I should have thought of that as well. Next time, I'll be more careful.