When I run "losetup --verbose --partscan --read-only --find /mnt/gemini.61rn.3T/Backups/debian.raw" on a 4.14.103 system, losetup hangs for exactly 3 minutes. After the hang the loopback device works like it should. The /mnt/gemini.61rn.3T mount is an ext4 fs on dm-crypt on a spinning SATA disk.
The hang was introduced in 4.14.95, and there are several loop-related patches in 4.14.95. I bisected it down to commit c1e63df4f30c3918476ac9bc594355b0e9629893 "loop: Get rid of loop_index_mutex". Reverting that commit from 4.14.103 also fixes the problem.
This could be a problem in just the 4.14 stable series. I haven't tested any other series.
Here is the output of dmesg when losetup hangs:

[Feb26 21:59] systemd-udevd[996]: seq 2791 '/devices/virtual/block/loop0' is taking a long time
[Feb26 22:01] INFO: task losetup:7694 blocked for more than 120 seconds.
[ +0.000009] Not tainted 4.14.103 #25
[ +0.000002] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.000001] losetup D
[ +0.000002] 0 7694 7687 0x00080000
[ +0.000002] Call Trace:
[ +0.000005] ? __schedule+0x273/0x870
[ +0.000003] schedule+0x2f/0x90
[ +0.000002] schedule_preempt_disabled+0x11/0x20
[ +0.000002] __mutex_lock.isra.2+0x32c/0x540
[ +0.000003] ? __wake_up_common_lock+0x8a/0xc0
[ +0.000004] blkdev_reread_part+0x16/0x30
[ +0.000100] loop_reread_partitions+0x27/0x30
[ +0.000004] loop_set_status+0x335/0x410
[ +0.000003] loop_set_status64+0x4b/0x80
[ +0.000003] lo_ioctl+0x1e7/0x7d0
[ +0.000003] blkdev_ioctl+0x446/0x9d0
[ +0.000004] block_ioctl+0x39/0x40
[ +0.000004] do_vfs_ioctl+0xa4/0x650
[ +0.000002] SyS_ioctl+0x74/0x80
[ +0.000004] do_syscall_64+0x6e/0x170
[ +0.000003] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ +0.000003] RIP: 0033:0x7f455c7912a7
[ +0.000002] RSP: 002b:00007ffe5a9d6908 EFLAGS: 00000246
[ +0.000001] ORIG_RAX: 0000000000000010
[ +0.000003] RAX: ffffffffffffffda RBX: 00007ffe5a9d6ab0 RCX: 00007f455c7912a7
[ +0.000002] RDX: 00007ffe5a9d6b50 RSI: 0000000000004c04 RDI: 0000000000000004
[ +0.000002] RBP: 0000000000000004 R08: 0000000000000008 R09: 696265642f737075
[ +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f455ce816b8
[ +0.000002] R13: 0000000000000003 R14: 00007ffe5a9d6b50 R15: 00007ffe5a9d6930
[ +6.555728] systemd-udevd[996]: seq 2791 '/devices/virtual/block/loop0' killed
[ +0.000247] systemd-udevd[996]: worker [7697] terminated by signal 9 (KILL)
[ +0.000005] systemd-udevd[996]: worker [7697] failed while handling '/devices/virtual/block/loop0'
[ +0.052335] loop0: p1 p2 < p5 >
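The call trace above can also be read mechanically. The following is a rough sketch, not an established tool: it pulls the `symbol+0xoffset/0xsize` entries out of the dmesg text and reports which function asked for the mutex that losetup is sleeping on. The helper names parse_frames and mutex_caller are hypothetical, and the sample is a short excerpt copied from the trace in this report. Frames prefixed with "?" are treated as unreliable and skipped when walking the chain.

```python
import re

# Matches entries such as "blkdev_reread_part+0x16/0x30", optionally
# preceded by "? " (an address the unwinder could not verify).
FRAME_RE = re.compile(r"(\? )?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Excerpt copied from the call trace in this report (timestamps dropped).
TRACE = """
? __schedule+0x273/0x870
schedule+0x2f/0x90
schedule_preempt_disabled+0x11/0x20
__mutex_lock.isra.2+0x32c/0x540
? __wake_up_common_lock+0x8a/0xc0
blkdev_reread_part+0x16/0x30
loop_reread_partitions+0x27/0x30
loop_set_status+0x335/0x410
"""


def parse_frames(text):
    """Return (function_name, is_reliable) pairs in trace order."""
    return [(m.group(2), m.group(1) is None) for m in FRAME_RE.finditer(text)]


def mutex_caller(frames):
    """Return the first reliable frame below the __mutex_lock frame, if any."""
    reliable = [name for name, ok in frames if ok]
    for i, name in enumerate(reliable):
        if name.startswith("__mutex_lock") and i + 1 < len(reliable):
            return reliable[i + 1]
    return None


if __name__ == "__main__":
    print(mutex_caller(parse_frames(TRACE)))
```

The regular expression does not depend on line breaks, so it should work equally on dmesg output that has been pasted as a single line.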
From that backtrace it looks like the problem is related to --partscan. Among the loop-related patches in 4.14.95, commit 57da9a9742200f391d1cf93fea389f7ddc25ec9a says:
Note that syzbot is also reporting circular locking dependency between bdev->bd_mutex and lo->lo_ctl_mutex [2] which is caused by calling blkdev_reread_part() with lock held. This patch does not address it.
[2] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d1588...
To me it looks like that is what is causing the hang. The syzkaller report says that the fix is "loop: Fix deadlock when calling blkdev_reread_part()" but I don't see that commit in the 4.14 series.
This is an x86_64 Gentoo system. Here is the .config I use: http://sprunge.us/u7YNBt
Hello!
Thanks for the detailed report and bisection!
On Wed 27-02-19 00:35:26, Thomas Lindroth wrote:
> When I run "losetup --verbose --partscan --read-only --find /mnt/gemini.61rn.3T/Backups/debian.raw" on a 4.14.103 system, losetup hangs for exactly 3 minutes. After the hang the loopback device works like it should. The /mnt/gemini.61rn.3T mount is an ext4 fs on dm-crypt on a spinning SATA disk.
> The hang was introduced in 4.14.95, and there are several loop-related patches in 4.14.95. I bisected it down to commit c1e63df4f30c3918476ac9bc594355b0e9629893 "loop: Get rid of loop_index_mutex". Reverting that commit from 4.14.103 also fixes the problem.
So as you mention, not all of the loop device deadlocks got fixed in stable kernels, as some changes were too intrusive for the stable tree. Unfortunately, commit 0a42e99b58a "loop: Get rid of loop_index_mutex", which did get backported, makes some deadlocks much easier to hit, as I am finding now that I'm looking into it. For example, when partitions are reread in loop_set_status(), it takes just one process trying to open the loop device to deadlock the kernel.
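To illustrate the kind of lock inversion described here, the following is a minimal user-space sketch in Python, not kernel code. It assumes (a reading of the backtrace and of the commit message quoted earlier in the thread, not something confirmed here) that the ioctl path holds the loop control mutex and then wants bd_mutex for the partition reread, while an opener holds bd_mutex and then wants the loop control mutex. The name simulate_abba and the kill_after parameter are hypothetical; the opener's timeout merely stands in for udev killing its worker.

```python
import threading
import time


def simulate_abba(kill_after=0.2):
    """Model two paths taking the same two locks in opposite order."""
    bd_mutex = threading.Lock()    # stands in for bdev->bd_mutex
    ctl_mutex = threading.Lock()   # stands in for the loop control mutex
    both_hold_first = threading.Barrier(2)
    result = {}

    def ioctl_path():
        # Assumed order for the ioctl caller: control mutex, then bd_mutex.
        with ctl_mutex:
            both_hold_first.wait()  # make sure the opener holds bd_mutex
            start = time.monotonic()
            got = bd_mutex.acquire(timeout=10 * kill_after)
            result["ioctl_blocked_for"] = time.monotonic() - start
            result["ioctl_got_bd_mutex"] = got
            if got:
                bd_mutex.release()

    def open_path():
        # Assumed order for the opener: bd_mutex, then control mutex.
        with bd_mutex:
            both_hold_first.wait()  # make sure the ioctl holds ctl_mutex
            # Giving up after kill_after seconds models the worker being killed.
            got = ctl_mutex.acquire(timeout=kill_after)
            result["open_got_ctl_mutex"] = got
            if got:
                ctl_mutex.release()

    threads = [threading.Thread(target=ioctl_path),
               threading.Thread(target=open_path)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print(simulate_abba())
```

In this model the ioctl side only makes progress once the opener gives up and drops bd_mutex, which would be consistent with losetup resuming right after the udev worker is killed in the report.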
Actually, that commit was already reverted in 4.4 stable because I pointed out to Greg earlier that it has doubtful benefit without the followup fixes. Sadly, it remained in the other stable branches. Going through the active branches, the summary seems to be:
3.18 and older: never applied
4.4: already reverted
4.9: needs revert
4.14: needs revert
4.19 and newer: followup fixes applied
So Greg, can you please revert the same three commits that you've reverted in 4.4 also in the 4.9 and 4.14 stable trees? These are:
0a42e99b58a "loop: Get rid of loop_index_mutex"
967d1dc144b "loop: Fold __loop_release into loop_release"
628bd859470 "loop: Fix double mutex_unlock(&loop_ctl_mutex) in loop_control_ioctl()"
Thanks!
Honza
On Wed, Feb 27, 2019 at 11:30:22AM +0100, Jan Kara wrote:
> Hello!
> Thanks for the detailed report and bisection!
> On Wed 27-02-19 00:35:26, Thomas Lindroth wrote:
> > When I run "losetup --verbose --partscan --read-only --find /mnt/gemini.61rn.3T/Backups/debian.raw" on a 4.14.103 system, losetup hangs for exactly 3 minutes. After the hang the loopback device works like it should. The /mnt/gemini.61rn.3T mount is an ext4 fs on dm-crypt on a spinning SATA disk.
> > The hang was introduced in 4.14.95, and there are several loop-related patches in 4.14.95. I bisected it down to commit c1e63df4f30c3918476ac9bc594355b0e9629893 "loop: Get rid of loop_index_mutex". Reverting that commit from 4.14.103 also fixes the problem.
> So as you mention, not all of the loop device deadlocks got fixed in stable kernels, as some changes were too intrusive for the stable tree. Unfortunately, commit 0a42e99b58a "loop: Get rid of loop_index_mutex", which did get backported, makes some deadlocks much easier to hit, as I am finding now that I'm looking into it. For example, when partitions are reread in loop_set_status(), it takes just one process trying to open the loop device to deadlock the kernel.
> Actually, that commit was already reverted in 4.4 stable because I pointed out to Greg earlier that it has doubtful benefit without the followup fixes. Sadly, it remained in the other stable branches. Going through the active branches, the summary seems to be:
> 3.18 and older: never applied
> 4.4: already reverted
> 4.9: needs revert
> 4.14: needs revert
> 4.19 and newer: followup fixes applied
> So Greg, can you please revert the same three commits that you've reverted in 4.4 also in the 4.9 and 4.14 stable trees? These are:
> 0a42e99b58a "loop: Get rid of loop_index_mutex"
> 967d1dc144b "loop: Fold __loop_release into loop_release"
> 628bd859470 "loop: Fix double mutex_unlock(&loop_ctl_mutex) in loop_control_ioctl()"
Now all reverted, sorry about this.
greg k-h
linux-stable-mirror@lists.linaro.org