An async dio write to a sparse file can generate a lot of extents, and when we unlink this file (using rm), the kernel can stay busy unmapping and freeing those extents as part of transaction processing. Add cond_resched() in xfs_defer_finish_noroll() to avoid soft lockup messages. Here is a call trace of such a soft lockup.
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [rm:81335]
CPU: 1 PID: 81335 Comm: rm Kdump: loaded Tainted: G L X 5.14.21-150500.53-default
NIP [c00800001b174768] xfs_extent_busy_trim+0xc0/0x2a0 [xfs]
LR [c00800001b1746f4] xfs_extent_busy_trim+0x4c/0x2a0 [xfs]
Call Trace:
0xc0000000a8268340 (unreliable)
xfs_alloc_compute_aligned+0x5c/0x150 [xfs]
xfs_alloc_ag_vextent_size+0x1dc/0x8c0 [xfs]
xfs_alloc_ag_vextent+0x17c/0x1c0 [xfs]
xfs_alloc_fix_freelist+0x274/0x4b0 [xfs]
xfs_free_extent_fix_freelist+0x84/0xe0 [xfs]
__xfs_free_extent+0xa0/0x240 [xfs]
xfs_trans_free_extent+0x6c/0x140 [xfs]
xfs_defer_finish_noroll+0x2b0/0x650 [xfs]
xfs_inactive_truncate+0xe8/0x140 [xfs]
xfs_fs_destroy_inode+0xdc/0x320 [xfs]
destroy_inode+0x6c/0xc0
do_unlinkat+0x1fc/0x410
sys_unlinkat+0x58/0xb0
system_call_exception+0x15c/0x330
system_call_common+0xec/0x250
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
cc: stable@vger.kernel.org
cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/xfs/libxfs/xfs_defer.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index c13276095cc0..cb185b97447d 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -705,6 +705,7 @@ xfs_defer_finish_noroll(
 		error = xfs_defer_finish_one(*tp, dfp);
 		if (error && error != -EAGAIN)
 			goto out_shutdown;
+		cond_resched();
 	}

 	/* Requeue the paused items in the outgoing transaction. */
--
2.44.0
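For context on why this one-liner helps: cond_resched() is a voluntary preemption point, and on kernels without full preemption a loop that never passes through one can hold the CPU until the soft-lockup watchdog fires (after roughly 2 * watchdog_thresh seconds, ~20s by default, which matches the 23s in the trace above). A minimal generic sketch of the pattern, with a hypothetical free_next_extent() helper rather than real XFS code:

#include <linux/sched.h>

/* Hypothetical per-extent worker standing in for the real defer work. */
int free_next_extent(unsigned long idx);

/*
 * Each iteration is short, but with millions of extents the loop as a
 * whole can run for tens of seconds without a scheduling point.
 */
int free_all_extents(unsigned long nr_extents)
{
	unsigned long i;
	int error;

	for (i = 0; i < nr_extents; i++) {
		error = free_next_extent(i);
		if (error)
			return error;
		/*
		 * Yields the CPU only if the scheduler has flagged this
		 * task as needing to reschedule, so it is nearly free
		 * in the common case.
		 */
		cond_resched();
	}
	return 0;
}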
On Sun, Apr 21, 2024 at 01:19:44PM +0530, Ritesh Harjani (IBM) wrote:
An async dio write to a sparse file can generate a lot of extents, and when we unlink this file (using rm), the kernel can stay busy unmapping and freeing those extents as part of transaction processing. Add cond_resched() in xfs_defer_finish_noroll() to avoid soft lockup messages. Here is a call trace of such a soft lockup.
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [rm:81335]
CPU: 1 PID: 81335 Comm: rm Kdump: loaded Tainted: G L X 5.14.21-150500.53-default
Can you reproduce this on a current TOT kernel? 5.14 is pretty old, and this stack trace:
NIP [c00800001b174768] xfs_extent_busy_trim+0xc0/0x2a0 [xfs]
LR [c00800001b1746f4] xfs_extent_busy_trim+0x4c/0x2a0 [xfs]
Call Trace:
0xc0000000a8268340 (unreliable)
xfs_alloc_compute_aligned+0x5c/0x150 [xfs]
xfs_alloc_ag_vextent_size+0x1dc/0x8c0 [xfs]
xfs_alloc_ag_vextent+0x17c/0x1c0 [xfs]
xfs_alloc_fix_freelist+0x274/0x4b0 [xfs]
xfs_free_extent_fix_freelist+0x84/0xe0 [xfs]
__xfs_free_extent+0xa0/0x240 [xfs]
xfs_trans_free_extent+0x6c/0x140 [xfs]
xfs_defer_finish_noroll+0x2b0/0x650 [xfs]
xfs_inactive_truncate+0xe8/0x140 [xfs]
xfs_fs_destroy_inode+0xdc/0x320 [xfs]
destroy_inode+0x6c/0xc0
.... doesn't exist anymore.
xfs_inactive_truncate() is now done from a background inodegc thread, not directly in destroy_inode().
I also suspect that any sort of cond_resched() should be in the top layer loop in xfs_bunmapi_range(), not hidden deep in the defer code. The problem is the number of extents being processed without yielding, not the time spent processing each individual deferred work chain to free the extent. Hence the explicit rescheduling should be at the top level loop, where it can be easily explained and understood, not hidden deep inside the defer chain mechanism....
-Dave.
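To illustrate the placement argument above, a generic sketch (all names here are made up; this is not XFS code) of the difference between the two candidate spots: the outer loop's iteration count is unbounded, one pass per batch of extents, while the defer-chain mechanism it invokes does a bounded amount of work per call, so the yield belongs in the outer loop:

#include <linux/sched.h>
#include <linux/types.h>

/* Hypothetical helpers; this is not XFS code. */
struct unmap_ctx;
bool extents_remain(struct unmap_ctx *ctx);
int unmap_one_batch(struct unmap_ctx *ctx);
int finish_defer_chain(struct unmap_ctx *ctx);

int truncate_all_extents(struct unmap_ctx *ctx)
{
	int error;

	/* Unbounded: can iterate millions of times for a sparse file. */
	while (extents_remain(ctx)) {
		error = unmap_one_batch(ctx);
		if (error)
			return error;

		/* Bounded work per call; no yield hidden inside it. */
		error = finish_defer_chain(ctx);
		if (error)
			return error;

		/*
		 * Reschedule at the level where "one iteration == one
		 * batch of extents" makes the scheduling point easy to
		 * explain, rather than inside finish_defer_chain().
		 */
		cond_resched();
	}
	return 0;
}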
Dave Chinner <david@fromorbit.com> writes:
On Sun, Apr 21, 2024 at 01:19:44PM +0530, Ritesh Harjani (IBM) wrote:
An async dio write to a sparse file can generate a lot of extents, and when we unlink this file (using rm), the kernel can stay busy unmapping and freeing those extents as part of transaction processing. Add cond_resched() in xfs_defer_finish_noroll() to avoid soft lockup messages. Here is a call trace of such a soft lockup.
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [rm:81335]
CPU: 1 PID: 81335 Comm: rm Kdump: loaded Tainted: G L X 5.14.21-150500.53-default
Can you reproduce this on a current TOT kernel? 5.14 is pretty old, and this stack trace:
Yes, I was able to reproduce this on the upstream kernel too -
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:82435]
CPU: 1 PID: 82435 Comm: kworker/1:0 Tainted: G S L 6.9.0-rc5-0-default #1
Workqueue: xfs-inodegc/sda2 xfs_inodegc_worker
NIP [c000000000beea10] xfs_extent_busy_trim+0x100/0x290
LR [c000000000bee958] xfs_extent_busy_trim+0x48/0x290
Call Trace:
xfs_alloc_get_rec+0x54/0x1b0 (unreliable)
xfs_alloc_compute_aligned+0x5c/0x144
xfs_alloc_ag_vextent_size+0x238/0x8d4
xfs_alloc_fix_freelist+0x540/0x694
xfs_free_extent_fix_freelist+0x84/0xe0
__xfs_free_extent+0x74/0x1ec
xfs_extent_free_finish_item+0xcc/0x214
xfs_defer_finish_one+0x194/0x388
xfs_defer_finish_noroll+0x1b4/0x5c8
xfs_defer_finish+0x2c/0xc4
xfs_bunmapi_range+0xa4/0x100
xfs_itruncate_extents_flags+0x1b8/0x2f4
xfs_inactive_truncate+0xe0/0x124
xfs_inactive+0x30c/0x3e0
xfs_inodegc_worker+0x140/0x234
process_scheduled_works+0x240/0x57c
worker_thread+0x198/0x468
kthread+0x138/0x140
start_kernel_thread+0x14/0x18
NIP [c00800001b174768] xfs_extent_busy_trim+0xc0/0x2a0 [xfs]
LR [c00800001b1746f4] xfs_extent_busy_trim+0x4c/0x2a0 [xfs]
Call Trace:
0xc0000000a8268340 (unreliable)
xfs_alloc_compute_aligned+0x5c/0x150 [xfs]
xfs_alloc_ag_vextent_size+0x1dc/0x8c0 [xfs]
xfs_alloc_ag_vextent+0x17c/0x1c0 [xfs]
xfs_alloc_fix_freelist+0x274/0x4b0 [xfs]
xfs_free_extent_fix_freelist+0x84/0xe0 [xfs]
__xfs_free_extent+0xa0/0x240 [xfs]
xfs_trans_free_extent+0x6c/0x140 [xfs]
xfs_defer_finish_noroll+0x2b0/0x650 [xfs]
xfs_inactive_truncate+0xe8/0x140 [xfs]
xfs_fs_destroy_inode+0xdc/0x320 [xfs]
destroy_inode+0x6c/0xc0
.... doesn't exist anymore.
xfs_inactive_truncate() is now done from a background inodegc thread, not directly in destroy_inode().
I also suspect that any sort of cond_resched() should be in the top layer loop in xfs_bunmapi_range(), not hidden deep in the defer code. The problem is the number of extents being processed without yielding, not the time spent processing each individual deferred work chain to free the extent. Hence the explicit rescheduling should be at the top level loop, where it can be easily explained and understood, not hidden deep inside the defer chain mechanism....
Yes, sure. I will then submit a v2 with this diff (after I verify the fix once on the setup, just for my sanity):
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 656c95a22f2e..44d5381bc66f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6354,6 +6354,7 @@ xfs_bunmapi_range(
 		error = xfs_defer_finish(tpp);
 		if (error)
 			goto out;
+		cond_resched();
 	}
 out:
 	return error;
-ritesh
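With that change applied, the loop in xfs_bunmapi_range() ends up looking roughly like the sketch below (paraphrased from the diff context and the 6.9-era fs/xfs/libxfs/xfs_bmap.c, so treat the surrounding details as an approximation rather than verbatim source): each pass unmaps a bounded number of extents, finishes the deferred frees, and now yields once per batch.

	/* Approximate shape of the patched loop; not verbatim source. */
	while (unmap_len > 0) {
		error = __xfs_bunmapi(*tpp, ip, startoff, &unmap_len, flags,
				XFS_ITRUNC_MAX_EXTENTS);
		if (error)
			goto out;

		/* Free the extents unmapped above before the next batch. */
		error = xfs_defer_finish(tpp);
		if (error)
			goto out;

		/* v2 change: one scheduling point per batch of extents. */
		cond_resched();
	}
out:
	return error;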