On Tue, 2024-01-23 at 08:19 -0500, Jeff Layton wrote:
On Tue, 2024-01-23 at 12:46 +0100, Sedat Dilek wrote:
On Tue, Jan 23, 2024 at 12:16 PM Jeff Layton <jlayton@kernel.org> wrote:
On Tue, 2024-01-23 at 07:39 +0100, Linux regression tracking (Thorsten Leemhuis) wrote:
[a quick follow-up with an important correction from the reporter, for those I added to the list of recipients]
On 23.01.24 06:37, Linux regression tracking (Thorsten Leemhuis) wrote:
On 23.01.24 05:40, Paul Thompson wrote:
With my longstanding configuration, kernels up to 6.6.9 work fine. Kernels 6.6.1[0123] and 6.7.[01] all lock up in early (OpenRC) init, before even the virtual filesystems are mounted.
The last thing visible on the console is the nfsclient service being started and:
Call to flock failed: Function not implemented. (twice)
Then the machine is unresponsive, Num Lock doesn't toggle the keyboard LED, and the Alt-SysRq chords appear to do nothing.
The problem is solved by changing my 6.6.9 config option from:
# CONFIG_FILE_LOCKING is not set
to:
CONFIG_FILE_LOCKING=y
(This option is under File Systems > Enable POSIX file locking API)
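As an aside, "Function not implemented" is the strerror() text for ENOSYS, which is what flock(2) returns when the kernel was built without the flock syscall (CONFIG_FILE_LOCKING=n). A minimal check (my own sketch, not part of the report) that can confirm this on a running kernel:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
        /* Try to take (and immediately drop) an exclusive flock on a scratch
         * file; on a CONFIG_FILE_LOCKING=n kernel this should fail with
         * ENOSYS ("Function not implemented"). */
        int fd = open("/tmp/locktest", O_CREAT | O_RDWR, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (flock(fd, LOCK_EX | LOCK_NB) < 0)
                fprintf(stderr, "flock: %s\n", strerror(errno));
        else
                printf("flock succeeded; kernel file locking is available\n");
        close(fd);
        return 0;
}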
The reporter replied out-of-thread: https://lore.kernel.org/all/Za9TRtSjubbX0bVu@squish.home.loc/
""" Now I feel stupid or like Im losing it, but I went back and grepped for the CONFIG_FILE_LOCKING in my old Configs, and it was turned on in all but 6.6.9. So, somehow I turned that off *after I built 6.6.9? Argh. I just built 6.6.4 with it unset and that locked up too. Sorry if this is just noise, though one would have hoped the failure was less severe... """
Ok, so not necessarily a regression? It might be helpful to know the earliest kernel you can boot with CONFIG_FILE_LOCKING turned off.
I'll give reproducing this a try later, though.
Quote from Paul: " Now I feel stupid, or like I'm losing it, but I went back and grepped for CONFIG_FILE_LOCKING in my old configs, and it was turned on in all but 6.6.9. So, somehow I turned that off *after* I built 6.6.9? Argh. I just built 6.6.4 with it unset and that locked up too. Sorry if this is just noise, though one would have hoped the failure was less severe... "
-Sedat-
https://lore.kernel.org/all/Za9TRtSjubbX0bVu@squish.home.loc/#t
Ok, I can reproduce this in KVM, which should make this a bit simpler:
I tried turning off CONFIG_FILE_LOCKING on mainline kernels and it also hung at boot for me here (I think it was trying to bring up the NVMe disks attached to this host):
[ OK ] Reached target sysinit.target - System Initialization.
[ OK ] Finished dracut-pre-mount.service - dracut pre-mount hook.
[ OK ] Started plymouth-start.service - Show Plymouth Boot Screen.
[ OK ] Started systemd-ask-password-plymo…quests to Plymouth Directory Watch.
[ OK ] Reached target paths.target - Path Units.
[ OK ] Reached target basic.target - Basic System.
[    4.647183] cryptd: max_cpu_qlen set to 1000
[    4.650280] AVX2 version of gcm_enc/dec engaged.
[    4.651252] AES CTR mode by8 optimization enabled
         Starting systemd-vconsole-setup.service - Virtual Console Setup...
[FAILED] Failed to start systemd-vconsole-s…up.service - Virtual Console Setup.
See 'systemctl status systemd-vconsole-setup.service' for details.
[    5.777176] virtio_blk virtio3: 8/0/0 default/read/poll queues
[    5.784633] virtio_blk virtio3: [vda] 41943040 512-byte logical blocks (21.5 GB/20.0 GiB)
[    5.791351]  vda: vda1 vda2 vda3
[    5.792672] virtio_blk virtio6: 8/0/0 default/read/poll queues
[    5.801796] virtio_blk virtio6: [vdb] 209715200 512-byte logical blocks (107 GB/100 GiB)
[    5.807839] virtio_blk virtio7: 8/0/0 default/read/poll queues
[    5.813098] virtio_blk virtio7: [vdc] 209715200 512-byte logical blocks (107 GB/100 GiB)
[    5.818500] virtio_blk virtio8: 8/0/0 default/read/poll queues
[    5.823969] virtio_blk virtio8: [vdd] 209715200 512-byte logical blocks (107 GB/100 GiB)
[    5.829217] virtio_blk virtio9: 8/0/0 default/read/poll queues
[    5.834636] virtio_blk virtio9: [vde] 209715200 512-byte logical blocks (107 GB/100 GiB)
[ **] Job dev-disk-by\x2duuid-5a8a135f\x2…art running (13min 46s / no limit)
The last part will just keep spinning forever.
I've gone back as far as v6.0, and I see the same behavior. I then tried changing the disks in the VM to be attached by virtio instead of NVMe, and that also didn't help.
That said, I'm using a Fedora 39 cloud image here. I'm not sure it's reasonable to expect that to boot properly with file locking disabled. Paul, what distro are you running? When you say that it's hung, are you seeing similar behavior?
FWIW, I grabbed a dump of the VM's memory and took a quick look with crash. All of the tasks are either idle, or waiting in epoll. Perhaps there is some subtle dependency between epoll and CONFIG_FILE_LOCKING?
PID: 190      TASK: ffff8fa846eb3080  CPU: 7    COMMAND: "systemd-journal"
 #0 [ffffb5560063bd18] __schedule at ffffffffa10e8d39
 #1 [ffffb5560063bd88] schedule at ffffffffa10e9491
 #2 [ffffb5560063bda0] schedule_hrtimeout_range_clock at ffffffffa10eff99
 #3 [ffffb5560063be10] do_epoll_wait at ffffffffa0a08106
 #4 [ffffb5560063bee8] __x64_sys_epoll_wait at ffffffffa0a0872d
 #5 [ffffb5560063bf38] do_syscall_64 at ffffffffa10d3af4
 #6 [ffffb5560063bf50] entry_SYSCALL_64_after_hwframe at ffffffffa12000e6
    RIP: 00007f975753cac7  RSP: 00007ffe07ab17b8  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: 000000000000001e  RCX: 00007f975753cac7
    RDX: 000000000000001e  RSI: 000055d723ad8ca0  RDI: 0000000000000007
    RBP: 00007ffe07ab18d0   R8: 000055d723ad79ac   R9: 0000000000000007
    R10: 00000000ffffffff  R11: 0000000000000202  R12: 000055d723ad8ca0
    R13: 0000000000000010  R14: 000055d723ad33b0  R15: ffffffffffffffff
    ORIG_RAX: 00000000000000e8  CS: 0033  SS: 002b
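For what it's worth, here's a minimal sketch of the kind of dependency I have in mind (purely illustrative, not taken from the dump or from systemd's actual code): a "service" that only signals readiness after taking an flock, with a "manager" parked in epoll_wait on the other end of a pipe. If flock() fails with ENOSYS and the error path never sends the readiness byte, the waiter sits in epoll_wait indefinitely, which would look a lot like the backtrace above:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
        int pipefd[2];

        if (pipe(pipefd) < 0) {
                perror("pipe");
                return 1;
        }

        pid_t pid = fork();
        if (pid < 0) {
                perror("fork");
                return 1;
        }

        if (pid == 0) {         /* "service": take a lock, then signal readiness */
                close(pipefd[0]);
                int fd = open("/tmp/flock-demo.lock", O_CREAT | O_RDWR, 0644);
                if (fd < 0) {
                        perror("open");
                        _exit(1);
                }
                if (flock(fd, LOCK_EX) == 0) {
                        if (write(pipefd[1], "r", 1) != 1)      /* readiness byte */
                                _exit(1);
                } else {
                        /* e.g. ENOSYS with CONFIG_FILE_LOCKING=n: readiness is
                         * never signalled; keep the pipe open and stall */
                        fprintf(stderr, "flock: %s\n", strerror(errno));
                        pause();
                }
                _exit(0);
        }

        /* "manager": wait for the readiness byte via epoll, with no timeout */
        close(pipefd[1]);
        int ep = epoll_create1(0);
        if (ep < 0) {
                perror("epoll_create1");
                return 1;
        }
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
        if (epoll_ctl(ep, EPOLL_CTL_ADD, pipefd[0], &ev) < 0) {
                perror("epoll_ctl");
                return 1;
        }
        struct epoll_event out;
        int n = epoll_wait(ep, &out, 1, -1);    /* blocks forever if the byte never arrives */
        printf("epoll_wait returned %d\n", n);
        return 0;
}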
Whether this is a regression or not, a lot of userland software relies on file locking these days. Maybe this is a good time to consider getting rid of CONFIG_FILE_LOCKING and just hardcoding it on.
By disabling it, it looks like you save 4 bytes in struct inode. I'm not sure that's worth the hassle of having to deal with the extra test matrix dimension. In a really stripped down configuration where you don't need file locking, are you likely to have a lot of inodes in core anyway?
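Going from memory of include/linux/fs.h here (so treat the exact member and its placement as approximate), the per-inode cost is the lock-context member that is only compiled in when CONFIG_FILE_LOCKING is set, roughly:

/* Rough sketch of the CONFIG_FILE_LOCKING-gated member of struct inode,
 * from memory of include/linux/fs.h; this is a stand-in, not the real
 * definition, and the exact layout varies by kernel version. */
struct file_lock_context;

struct inode_sketch {
        /* ... many other fields ... */
#ifdef CONFIG_FILE_LOCKING
        struct file_lock_context *i_flctx;      /* per-inode lock state */
#endif
        /* ... */
};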
I guess you also save a little kernel text, but I still have to wonder if it's worth it.