[REGRESSION] 6.6.10+ and 6.7+ kernels lock up early in init.


Paul Thompson

23 Jan 2024

4:40 a.m.

Hi;

With my longstanding configuration, kernels up to 6.6.9 work fine. Kernels 6.6.1[0123] and 6.7.[01] all lock up in early (OpenRC) init, before even the virtual filesystems are mounted.

The last thing visible on the console is the nfsclient service being started and:

Call to flock failed: Function not implemented. (twice)

Then the machine is unresponsive: numlock doesn't toggle the keyboard LED, and the alt-sysrq chords appear to do nothing.
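
[Editorial sketch] "Function not implemented" is the standard text for the ENOSYS error code, so the console message suggests that flock(2) itself is unavailable on the running kernel. A minimal Python probe for that condition could look like the following. The function name `flock_status` and its return strings are illustrative assumptions, not something taken from OpenRC or from this thread.

```python
import errno
import fcntl
import tempfile


def flock_status() -> str:
    """Report whether flock(2) works on a scratch file.

    Returns "supported", "not implemented" (ENOSYS), or "error: <NAME>"
    for any other failure. The labels are hypothetical, chosen for this sketch.
    """
    with tempfile.TemporaryFile() as handle:
        try:
            # Non-blocking exclusive lock on a private temporary file.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno == errno.ENOSYS:
                # Matches the "Function not implemented" console message.
                return "not implemented"
            return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
        fcntl.flock(handle, fcntl.LOCK_UN)
        return "supported"


if __name__ == "__main__":
    print(f"flock(2): {flock_status()}")
```

On a kernel built with file locking enabled this would normally be expected to report "supported"; the ENOSYS branch is the case described in the report.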

The problem is solved by changing my 6.6.9 config option:

# CONFIG_FILE_LOCKING is not set

to

CONFIG_FILE_LOCKING=y

(This option is under File Systems > Enable POSIX file locking API)
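
[Editorial sketch] The two spellings above are how a kernel .config file records a disabled and an enabled option. Assuming a saved config is available as text, a small Python helper can tell the states apart. The helper name `option_state` and its return labels are hypothetical and only serve this illustration.

```python
import re


def option_state(config_text: str, option: str = "CONFIG_FILE_LOCKING") -> str:
    """Classify one option in kernel .config text.

    Returns the assigned value (for example "y"), "unset" for the
    "# OPTION is not set" comment form, or "absent" if the option
    does not appear at all.
    """
    assigned = re.compile(rf"^{re.escape(option)}=(.*)$")
    unset_line = f"# {option} is not set"
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if line == unset_line:
            return "unset"
        match = assigned.match(line)
        if match:
            return match.group(1)
    return "absent"


if __name__ == "__main__":
    print(option_state("# CONFIG_FILE_LOCKING is not set\n"))
    print(option_state("CONFIG_FILE_LOCKING=y\n"))
```

Running the same helper over each saved config from older builds is one way to check when the option was actually switched off.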

I do not recall why I unset that, but it was working for, I think, the entire 6.6 series until 6.6.10. Anyway, I thought I would mention it in case anyone else hits it.

Paul


Linux regression tracking (Thorsten Leemhuis)

23 Jan 2024

5:37 a.m.

Thx for the report.

CCing a few people to let them know about this. Among them Jeff, who had a few fs patches that were backported to 6.6.10 (at the end of the list below).

FWIW, in case anyone wonders what went into that stable release, here is a slightly trimmed down list:

$ git log v6.6.9..v6.6.10 --oneline | grep -v -e wifi -e ftrace -e kexec -e ksmb -e 'platform/' -e tracing: -e netfilter: -e mptcp
c9a51ebb4bac69 Linux 6.6.10
baa88944038bbe ring-buffer: Fix wake ups when buffer_percent is set to 100
c62b9a2daf2866 Revert "nvme-fc: fix race between error recovery and creating association"
d16c5d215b53b3 mm/memory-failure: check the mapcount of the precise page
8c7da70d9ae4c1 mm/memory-failure: cast index to loff_t before shifting it
07550b1461d4d0 mm: migrate high-order folios in swap cache correctly
d16eb52c176ccf mm/filemap: avoid buffered read/write race to read inconsistent data
09141f08fdf69a selftests: secretmem: floor the memory size to the multiple of page_size
2c30b8b105d690 maple_tree: do not preallocate nodes for slot stores
b5f63f5e8a6820 block: renumber QUEUE_FLAG_HW_WC
183c8972b6a6f8 linux/export: Ensure natural alignment of kcrctab array
466e9af1550724 linux/export: Fix alignment for 64-bit ksymtab entries
28d6cde17f2191 virtio_ring: fix syncs DMA memory with different direction
9a49874443307c fs: cifs: Fix atime update check
23171df51f601c client: convert to new timestamp accessors
5b5599a7eee5e6 fs: new accessor methods for atime and mtime

Ciao, Thorsten

Linux regression tracking (Thorsten Leemhuis)

6:39 a.m.

[a quick follow up with an important correction from the reporter for those I added to the list of recipients]

The reporter replied out-of-thread: https://lore.kernel.org/all/Za9TRtSjubbX0bVu@squish.home.loc/

""" Now I feel stupid or like I'm losing it, but I went back and grepped for CONFIG_FILE_LOCKING in my old configs, and it was turned on in all but 6.6.9. So, somehow I turned that off *after* I built 6.6.9? Argh. I just built 6.6.4 with it unset and that locked up too. Sorry if this is just noise, though one would have hoped the failure was less severe... """

Ciao, Thorsten

Jeff Layton

11:16 a.m.

On Tue, 2024-01-23 at 07:39 +0100, Linux regression tracking (Thorsten Leemhuis) wrote:

The reporter replied out-of-thread: https://lore.kernel.org/all/Za9TRtSjubbX0bVu@squish.home.loc/

""" Now I feel stupid or like I'm losing it, but I went back and grepped for CONFIG_FILE_LOCKING in my old configs, and it was turned on in all but 6.6.9. So, somehow I turned that off *after* I built 6.6.9? Argh. I just built 6.6.4 with it unset and that locked up too. Sorry if this is just noise, though one would have hoped the failure was less severe... """

Ok, so not necessarily a regression? It might be helpful to know the earliest kernel you can boot with CONFIG_FILE_LOCKING turned off.

I'll try reproducing this later, though.

-- Jeff Layton jlayton@kernel.org

Sedat Dilek

11:46 a.m.

On Tue, Jan 23, 2024 at 12:16 PM Jeff Layton jlayton@kernel.org wrote:

Ok, so not necessarily a regression? It might be helpful to know the earliest kernel you can boot with CONFIG_FILE_LOCKING turned off.

I'll try reproducing this later, though.

Quote from Paul: "Now I feel stupid or like I'm losing it, but I went back and grepped for CONFIG_FILE_LOCKING in my old configs, and it was turned on in all but 6.6.9. So, somehow I turned that off *after* I built 6.6.9? Argh. I just built 6.6.4 with it unset and that locked up too. Sorry if this is just noise, though one would have hoped the failure was less severe..."

-Sedat-

https://lore.kernel.org/all/Za9TRtSjubbX0bVu@squish.home.loc/#t

Jeff Layton

1:19 p.m.

Ok, I can reproduce this in KVM, which should make this a bit simpler:

I tried turning off CONFIG_FILE_LOCKING on mainline kernels and it also hung for me at boot here (I think it was trying to enable the nvme disks attached to this host):

[ OK ] Reached target sysinit.target - System Initialization.
[ OK ] Finished dracut-pre-mount.service - dracut pre-mount hook.
[ OK ] Started plymouth-start.service - Show Plymouth Boot Screen.
[ OK ] Started systemd-ask-password-plymo…quests to Plymouth Directory Watch.
[ OK ] Reached target paths.target - Path Units.
[ OK ] Reached target basic.target - Basic System.
[ 4.647183] cryptd: max_cpu_qlen set to 1000
[ 4.650280] AVX2 version of gcm_enc/dec engaged.
[ 4.651252] AES CTR mode by8 optimization enabled
Starting systemd-vconsole-setup.service - Virtual Console Setup...
[FAILED] Failed to start systemd-vconsole-s…up.service - Virtual Console Setup.
See 'systemctl status systemd-vconsole-setup.service' for details.
[ 5.777176] virtio_blk virtio3: 8/0/0 default/read/poll queues
[ 5.784633] virtio_blk virtio3: [vda] 41943040 512-byte logical blocks (21.5 GB/20.0 GiB)
[ 5.791351] vda: vda1 vda2 vda3
[ 5.792672] virtio_blk virtio6: 8/0/0 default/read/poll queues
[ 5.801796] virtio_blk virtio6: [vdb] 209715200 512-byte logical blocks (107 GB/100 GiB)
[ 5.807839] virtio_blk virtio7: 8/0/0 default/read/poll queues
[ 5.813098] virtio_blk virtio7: [vdc] 209715200 512-byte logical blocks (107 GB/100 GiB)
[ 5.818500] virtio_blk virtio8: 8/0/0 default/read/poll queues
[ 5.823969] virtio_blk virtio8: [vdd] 209715200 512-byte logical blocks (107 GB/100 GiB)
[ 5.829217] virtio_blk virtio9: 8/0/0 default/read/poll queues
[ 5.834636] virtio_blk virtio9: [vde] 209715200 512-byte logical blocks (107 GB/100 GiB)
[ **] Job dev-disk-by\x2duuid-5a8a135f\x2…art running (13min 46s / no limit)

The last part will just keep spinning forever.

I've gone back as far as v6.0, and I see the same behavior. I then tried changing the disks in the VM to be attached by virtio instead of NVMe, and that also didn't help.

That said, I'm using a Fedora 39 cloud image here. I'm not sure it's reasonable to expect that to boot properly with file locking disabled.

Paul, what distro are you running? When you say that it's hung, are you seeing similar behavior?

-- Jeff Layton jlayton@kernel.org

Jeff Layton

1:57 p.m.

FWIW, I grabbed a dump of the VM's memory and took a quick look with crash. All of the tasks are either idle, or waiting in epoll. Perhaps there is some subtle dependency between epoll and CONFIG_FILE_LOCKING?

PID: 190 TASK: ffff8fa846eb3080 CPU: 7 COMMAND: "systemd-journal"
 #0 [ffffb5560063bd18] __schedule at ffffffffa10e8d39
 #1 [ffffb5560063bd88] schedule at ffffffffa10e9491
 #2 [ffffb5560063bda0] schedule_hrtimeout_range_clock at ffffffffa10eff99
 #3 [ffffb5560063be10] do_epoll_wait at ffffffffa0a08106
 #4 [ffffb5560063bee8] __x64_sys_epoll_wait at ffffffffa0a0872d
 #5 [ffffb5560063bf38] do_syscall_64 at ffffffffa10d3af4
 #6 [ffffb5560063bf50] entry_SYSCALL_64_after_hwframe at ffffffffa12000e6
    RIP: 00007f975753cac7 RSP: 00007ffe07ab17b8 RFLAGS: 00000202
    RAX: ffffffffffffffda RBX: 000000000000001e RCX: 00007f975753cac7
    RDX: 000000000000001e RSI: 000055d723ad8ca0 RDI: 0000000000000007
    RBP: 00007ffe07ab18d0 R8: 000055d723ad79ac R9: 0000000000000007
    R10: 00000000ffffffff R11: 0000000000000202 R12: 000055d723ad8ca0
    R13: 0000000000000010 R14: 000055d723ad33b0 R15: ffffffffffffffff
    ORIG_RAX: 00000000000000e8 CS: 0033 SS: 002b

Whether this is a regression or not, a lot of userland software relies on file locking these days. Maybe this is a good time to consider getting rid of CONFIG_FILE_LOCKING and just hardcoding it on.

By disabling it, it looks like you save 4 bytes in struct inode. I'm not sure that's worth the hassle of having to deal with the extra test matrix dimension. In a really stripped down configuration where you don't need file locking, are you likely to have a lot of inodes in core anyway?

I guess you also save a little kernel text too, but I still have to wonder if it's worth it.

-- Jeff Layton jlayton@kernel.org

linux-stable-mirror@lists.linaro.org
