Re: rm hanging, v6.1.35

11 Jul 2023


      On Tue, Jul 11, 2023 at 07:53:35AM +0700, Bagas Sanjaya wrote:
...
On Tue, Jul 11, 2023 at 07:53:54AM +1000, Chris Dunlop wrote:
...
Hi,
This box is newly booted into linux v6.1.35 (2 days ago), it was previously
running v5.15.118 without any problems (other than that fixed by
"5e672cd69f0a xfs: non-blocking inodegc pushes", the reason for the
upgrade).
I have rm operations on two files that have been stuck for in excess of 22
hours and 18 hours respectively:
$ ps -opid,lstart,state,wchan=WCHAN-xxxxxxxxxxxxxxx,cmd -C rm
    PID                  STARTED S WCHAN-xxxxxxxxxxxxxxx CMD
2379355 Mon Jul 10 09:07:57 2023 D vfs_unlink            /bin/rm -rf /aaa/5539_tmp
2392421 Mon Jul 10 09:18:27 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
2485728 Mon Jul 10 09:28:57 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
2488254 Mon Jul 10 09:39:27 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
2491180 Mon Jul 10 09:49:58 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
3014914 Mon Jul 10 13:00:33 2023 D vfs_unlink            /bin/rm -rf /bbb/5541_tmp
3095893 Mon Jul 10 13:11:03 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
3098809 Mon Jul 10 13:21:35 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
3101387 Mon Jul 10 13:32:06 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
3195017 Mon Jul 10 13:42:37 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
The "rm"s are run from a process that's obviously tried a few times to get
rid of these files.
...
There's nothing extraordinary about the files in terms of size:
$ ls -ltrn --full-time /aaa/5539_tmp /bbb/5541_tmp
-rw-rw-rw- 1 1482 1482 7870643 2023-07-10 06:07:58.684036505 +1000 /aaa/5539_tmp
-rw-rw-rw- 1 1482 1482  701240 2023-07-10 10:00:34.181064549 +1000 /bbb/5541_tmp
As hinted by the WCHAN in the ps output above, each "primary" rm (i.e. the
first one run on each file) stack trace looks like:
[<0>] vfs_unlink+0x48/0x270
[<0>] do_unlinkat+0x1f5/0x290
[<0>] __x64_sys_unlinkat+0x3b/0x60
[<0>] do_syscall_64+0x34/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
This looks to be stuck on the target inode lock (i.e. the locks for
the inodes at /aaa/5539_tmp and /bbb/5541_tmp).
What's holding these inode locks? This hasn't even got to XFS yet
here, so there's something else going on in the background. Attached
the full output of 'echo w > /proc/sysrq-trigger' and 'echo t >
/proc/sysrq-trigger', please?
...
...
And each "secondary" rm (i.e. the subsequent ones on each file) stack trace
looks like:
== blog-230710-xfs-rm-stuckd
[<0>] down_write_nested+0xdc/0x100
[<0>] do_unlinkat+0x10d/0x290
[<0>] __x64_sys_unlinkat+0x3b/0x60
[<0>] do_syscall_64+0x34/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
These are likely all stuck on the parent directory inode lock (i.e.
/aaa and /bbb).
...
...
Where to from here?
I'm guessing only a reboot is going to unstick this. Anything I should be
looking at before reverting to v5.15.118?
...subsequent to starting writing all this down I have another two sets of
rms stuck, again on unremarkable files, and on two more separate
filesystems.
What's an "unremarkable file" look like? Is is a reflink copy of
something else, a hard link, a small/large regular data file or something else?
...
...
...oh. And an 'ls' on those files is hanging. The reboot has become more
urgent.
Yup, that's most likely getting stuck on the directory locks that
the unlinks are holding....
-Dave.
-- 
Dave Chinner
david@fromorbit.com

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: rm hanging, v6.1.35