Hi,
Since kernel 6.3.0 (and also 6.4rc3), on a ThinkPad Z13 system with Arch Linux, I've noticed that the amd_sfh driver spews a lot of stack traces during boot. Sometimes it is an oops:
BUG: unable to handle page fault for address: 000000000001000f
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 8 PID: 457 Comm: (udev-worker) Not tainted 6.3.3-arch1-1 #1 fa7b7e0107004b3021a57a74b951e0a25e7e8584
Hardware name: LENOVO 21D2CTO1WW/21D2CTO1WW, BIOS N3GET47W (1.27 ) 12/08/2022
RIP: 0010:amd_sfh_get_report+0x1e/0x110 [amd_sfh]
Code: 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 8b 87 60 1d 00 00 48 8b 68 08 <8b> 45 10 85 c0 0f 84 a9 00 00 00 49 89 fc 41 89 f7 41 89 d6 31 db
RSP: 0018:ffffb164426f3a20 EFLAGS: 00010246
RAX: ffff9b0ae6b7bd00 RBX: ffff9b0ac0f46000 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000002 RDI: ffff9b0ac0f46000
RBP: 000000000000ffff R08: ffffb164426f3ab8 R09: ffffb164426f3ab8
R10: 000000000020031b R11: ffff9b0ace40ac00 R12: ffff9b0ace40ac00
R13: 0000000000000002 R14: 0000000000000002 R15: ffff9b0acd213010
FS: 00007fe9ceb82200(0000) GS:ffff9b1122000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000001000f CR3: 000000010940c000 CR4: 0000000000750ee0
PKRU: 55555554
Call Trace:
 <TASK>
 amdtp_hid_request+0x36/0x50 [amd_sfh 2e3095779aada9fdb1764f08ca578ccb14e41fe4]
 sensor_hub_get_feature+0xad/0x170 [hid_sensor_hub d6157999c9d260a1bfa6f27d4a0dc2c3e2c5654e]
 hid_sensor_parse_common_attributes+0x217/0x310 [hid_sensor_iio_common 07a7935272aa9c7a28193b574580b3e953a64ec4]
 hid_gyro_3d_probe+0x7f/0x2e0 [hid_sensor_gyro_3d 9f2eb51294a1f0c0315b365f335617cbaef01eab]
 platform_probe+0x44/0xa0
 really_probe+0x19e/0x3e0
 ? __pfx___driver_attach+0x10/0x10
 __driver_probe_device+0x78/0x160
 driver_probe_device+0x1f/0x90
 __driver_attach+0xd2/0x1c0
 bus_for_each_dev+0x88/0xd0
 bus_add_driver+0x116/0x220
 driver_register+0x59/0x100
 ? __pfx_init_module+0x10/0x10 [hid_sensor_gyro_3d 9f2eb51294a1f0c0315b365f335617cbaef01eab]
 do_one_initcall+0x5d/0x240
 do_init_module+0x4a/0x200
 __do_sys_init_module+0x17f/0x1b0
 do_syscall_64+0x60/0x90
 ? ksys_read+0x6f/0xf0
 ? syscall_exit_to_user_mode+0x1b/0x40
 ? do_syscall_64+0x6c/0x90
 ? exc_page_fault+0x7c/0x180
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fe9ce721f9e
Code: 48 8b 0d bd ed 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8a ed 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd280dd828 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
RAX: ffffffffffffffda RBX: 000055b72a37f630 RCX: 00007fe9ce721f9e
RDX: 00007fe9cec7a343 RSI: 00000000000077f8 RDI: 000055b72a56c7f0
RBP: 00007fe9cec7a343 R08: 00000000000077f8 R09: 0000000000000000
R10: 000000000001a0a1 R11: 0000000000000246 R12: 0000000000020000
R13: 000055b72a363b90 R14: 000055b72a37f630 R15: 000055b72a36a070
 </TASK>
Modules linked in: hid_sensor_accel_3d(+) hid_sensor_gyro_3d(+) qrtr hid_sensor_trigger snd_sof industrialio_triggered_buffer ath11k_pci(+) kfifo_buf snd_sof_utils hid_sensor_iio_common joydev ath11k industrialio snd_soc_core mousedev qmi_helpers snd_compress hid_sensor_hub snd_hda_scodec_cs35l41_spi ac97_bus snd_hda_codec_realtek(+) snd_pcm_dmaengine intel_rapl_msr snd_hda_codec_hdmi snd_hda_codec_generic intel_rapl_common mac80211 snd_pci_ps btusb snd_rpl_pci_acp6x btrtl snd_hda_intel edac_mce_amd uvcvideo btbcm snd_acp_pci snd_intel_dspcfg snd_pci_acp6x videobuf2_vmalloc snd_intel_sdw_acpi libarc4 uvc btintel snd_usb_audio(+) snd_pci_acp5x videobuf2_memops btmtk snd_hda_codec kvm_amd videobuf2_v4l2 snd_hda_scodec_cs35l41_i2c snd_usbmidi_lib snd_hda_scodec_cs35l41 snd_rn_pci_acp3x ucsi_acpi bluetooth videodev snd_hda_core typec_ucsi snd_acp_config snd_hda_cs_dsp_ctls wacom(+) hid_multitouch cfg80211 snd_rawmidi sp5100_tco kvm snd_seq_device cs_dsp videobuf2_common typec ecdh_generic snd_soc_acpi think_lmi snd_hwdep snd_pcm irqbypass crc16 snd_soc_cs35l41_lib mhi thunderbolt firmware_attributes_class snd_pci_acp3x amd_sfh(+) k10temp psmouse roles rapl i2c_piix4 mc snd_timer wmi_bmof serial_multi_instantiate i2c_hid_acpi acpi_tad i2c_hid amd_pmf amd_pmc mac_hid sch_fq tcp_bbr dm_multipath i2c_dev crypto_user fuse loop zram ip_tables x_tables xfs libcrc32c crc32c_generic dm_crypt cbc encrypted_keys trusted asn1_encoder tee usbhid dm_mod amdgpu i2c_algo_bit serio_raw thinkpad_acpi drm_ttm_helper atkbd libps2 crct10dif_pclmul vivaldi_fmap crc32_pclmul ledtrig_audio crc32c_intel polyval_clmulni ttm polyval_generic drm_buddy nvme gf128mul platform_profile gpu_sched ghash_clmulni_intel sha512_ssse3 snd aesni_intel soundcore drm_display_helper crypto_simd rfkill nvme_core xhci_pci cryptd cec ccp xhci_pci_renesas i8042 video nvme_common serio wmi
CR2: 000000000001000f
---[ end trace 0000000000000000 ]---
RIP: 0010:amd_sfh_get_report+0x1e/0x110 [amd_sfh]
Code: 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 8b 87 60 1d 00 00 48 8b 68 08 <8b> 45 10 85 c0 0f 84 a9 00 00 00 49 89 fc 41 89 f7 41 89 d6 31 db
RSP: 0018:ffffb164426f3a20 EFLAGS: 00010246
RAX: ffff9b0ae6b7bd00 RBX: ffff9b0ac0f46000 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000002 RDI: ffff9b0ac0f46000
RBP: 000000000000ffff R08: ffffb164426f3ab8 R09: ffffb164426f3ab8
R10: 000000000020031b R11: ffff9b0ace40ac00 R12: ffff9b0ace40ac00
R13: 0000000000000002 R14: 0000000000000002 R15: ffff9b0acd213010
FS: 00007fe9ceb82200(0000) GS:ffff9b1122000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000001000f CR3: 000000010940c000 CR4: 0000000000750ee0
PKRU: 55555554
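As a cross-check on the oops above, the faulting address can be reproduced from the register dump. The sketch below is only a reading aid: it is written in Python, handles just the one instruction form that appears here, and its helper names are made up for illustration. It decodes the bytes at the <8b> marker as a 32-bit load from 0x10(%rbp) and adds the displacement to the logged RBP value. The result matches CR2, which suggests RBP held a bogus pointer (0xffff) rather than a valid kernel address; the instruction just before it (48 8b 68 08, i.e. mov 0x8(%rax),%rbp) is where that value was loaded.

```python
import re

# Fields copied from the first oops above; only the parts used here are kept.
OOPS = (
    "Code: 48 8b 87 60 1d 00 00 48 8b 68 08 <8b> 45 10 85 c0 "
    "RBP: 000000000000ffff "
    "CR2: 000000000001000f"
)

# x86-64 register numbering for the ModRM byte (no REX prefix).
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_faulting_load(oops):
    """Decode the instruction at the <..> marker.

    Only the form `8b /r` with an 8-bit displacement (mov r32, [reg+disp8])
    is handled, which is all this oops needs.
    """
    match = re.search(r"<([0-9a-f]{2})>((?: [0-9a-f]{2})+)", oops)
    opcode = int(match.group(1), 16)
    modrm, disp = (int(b, 16) for b in match.group(2).split()[:2])
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0x8B or mod != 1 or rm == 4:
        raise ValueError("instruction form not handled by this sketch")
    if disp >= 0x80:  # disp8 is sign-extended
        disp -= 0x100
    return REGS[reg], REGS[rm], disp


def register(oops, name):
    """Return the logged value of a register such as RBP or CR2."""
    return int(re.search(rf"\b{name.upper()}: ([0-9a-f]+)", oops).group(1), 16)


dest, base, disp = decode_faulting_load(OOPS)
address = register(OOPS, base) + disp
print(f"load into e{dest[1:]} from {disp:#x}(%{base}) = {address:#x}")
print("matches CR2:", address == register(OOPS, "cr2"))
```

This does not say why the pointer is bad, only that the crash is a dereference of a small garbage value inside amd_sfh_get_report rather than a fault in the HID core.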
Sometimes it is a list corruption in the same function with a similar stack:
------------[ cut here ]------------
list_add corruption. next is NULL.
WARNING: CPU: 5 PID: 433 at lib/list_debug.c:25 __list_add_valid+0x57/0xa0
...
CPU: 5 PID: 433 Comm: (udev-worker) Not tainted 6.4.0-rc3-1-mainline #1 b60166e85cb97a6631db26f9dcda0196ed7a0c93
Hardware name: LENOVO 21D2CTO1WW/21D2CTO1WW, BIOS N3GET47W (1.27 ) 12/08/2022
RIP: 0010:__list_add_valid+0x57/0xa0
Code: 01 00 00 00 c3 cc cc cc cc 48 c7 c7 58 91 e6 9a e8 1e b9 a8 ff 0f 0b 31 c0 c3 cc cc cc cc 48 c7 c7 80 91 e6 9a e8 09 b9 a8 ff <0f> 0b eb e9 48 89 c1 48 c7 c7 a8 91 e6 9a e8 f6 b8 a8 ff 0f 0b eb
RSP: 0018:ffffad9dc0c7bb10 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff92d5a8099448 RCX: 0000000000000027
RDX: ffff92dbe1f61688 RSI: 0000000000000001 RDI: ffff92dbe1f61680
RBP: ffff92d59ea93508 R08: 0000000000000000 R09: ffffad9dc0c7b9a0
R10: 0000000000000003 R11: ffffffff9b6ca808 R12: 0000000000000000
R13: ffff92d5a8099440 R14: ffff92d59ea93760 R15: 0000000000000002
FS: 00007fbaf0262200(0000) GS:ffff92dbe1f40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005651de666000 CR3: 000000011cfee000 CR4: 0000000000750ee0
PKRU: 55555554
Call Trace:
 <TASK>
 amd_sfh_get_report+0xba/0x110 [amd_sfh 78bf82e66cdb2ccf24cbe871a0835ef4eedddb17]
 amdtp_hid_request+0x36/0x50 [amd_sfh 78bf82e66cdb2ccf24cbe871a0835ef4eedddb17]
 sensor_hub_get_feature+0xad/0x170 [hid_sensor_hub 30e53e2c49ea1702e2482c0b3860e22265679e39]
 hid_sensor_parse_common_attributes+0x217/0x310 [hid_sensor_iio_common ed7fba7a4d4147d48156e6a4b2a034ad3fc94350]
 hid_gyro_3d_probe+0x7f/0x2e0 [hid_sensor_gyro_3d 10978a2cdfc8979f2a7366fcd005e0ea826088eb]
 platform_probe+0x44/0xa0
 really_probe+0x19e/0x3e0
 ? __pfx___driver_attach+0x10/0x10
 __driver_probe_device+0x78/0x160
 driver_probe_device+0x1f/0x90
 __driver_attach+0xd2/0x1c0
 bus_for_each_dev+0x88/0xd0
 bus_add_driver+0x116/0x220
 driver_register+0x59/0x100
 ? __pfx_hid_gyro_3d_platform_driver_init+0x10/0x10 [hid_sensor_gyro_3d 10978a2cdfc8979f2a7366fcd005e0ea826088eb]
 do_one_initcall+0x5d/0x240
 do_init_module+0x60/0x240
 __do_sys_init_module+0x17f/0x1b0
 do_syscall_64+0x60/0x90
 ? exc_page_fault+0x7f/0x180
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fbaf06c0f9e
Code: 48 8b 0d bd ed 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8a ed 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc5ce88528 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
RAX: ffffffffffffffda RBX: 00005651de36dff0 RCX: 00007fbaf06c0f9e
RDX: 00007fbaf0ba9343 RSI: 00000000000079f0 RDI: 00005651de646fe0
RBP: 00007fbaf0ba9343 R08: 00000000000079f0 R09: 0000000000000000
R10: 0000000000019fb1 R11: 0000000000000246 R12: 0000000000020000
R13: 00005651de45fb10 R14: 00005651de36dff0 R15: 00005651de44d5f0
 </TASK>
---[ end trace 0000000000000000 ]---
This occurs during almost every boot. When it happens there is usually a (udev-worker) process lingering forever, which is unkillable and even prevents shutdown.
Looking at past journals, this never happened before 6.3, so I believe it is a regression.
Relevant device:
63:00.7 Signal processing controller [1180]: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub [1022:15e4]
	Subsystem: Lenovo Sensor Fusion Hub [17aa:22f1]
	Kernel driver in use: pcie_mp2_amd
	Kernel modules: amd_sfh
I would appreciate it if someone could take a look at this.
Best regards, Haochen Tong
On Wed, May 24, 2023 at 01:27:57AM +0800, Haochen Tong wrote:
> Hi,
> Since kernel 6.3.0 (and also 6.4rc3), on a ThinkPad Z13 system with Arch Linux, I've noticed that the amd_sfh driver spews a lot of stack traces during boot. Sometimes it is an oops:
What was the last kernel version before this regression occurred? Do you mean v6.2?
Thanks.
Hi,
On 5/24/23 11:58, Bagas Sanjaya wrote:
> On Wed, May 24, 2023 at 01:27:57AM +0800, Haochen Tong wrote:
>> Hi,
>> Since kernel 6.3.0 (and also 6.4rc3), on a ThinkPad Z13 system with Arch Linux, I've noticed that the amd_sfh driver spews a lot of stack traces during boot. Sometimes it is an oops:
> What was the last kernel version before this regression occurred? Do you mean v6.2?
I was using 6.2.12 (Arch Linux distro kernel) before seeing this regression.
> Thanks.
On Wed, May 24, 2023 at 02:10:31PM +0800, Haochen Tong wrote:
>> What was the last kernel version before this regression occurred? Do you mean v6.2?
> I was using 6.2.12 (Arch Linux distro kernel) before seeing this regression.
Can you perform bisection to find the culprit that introduces the regression? Since you're on Arch Linux, see its wiki article [1] for instructions.
Thanks.
[1]: https://wiki.archlinux.org/title/Bisecting_bugs_with_Git
Hello,
chiming in here as I'm experiencing what looks like the exact same issue, also on a Lenovo Z13 notebook, also on Arch: an oops during startup in task udev-worker, followed by udev-worker blocking all attempts to suspend or cleanly shut down/reboot the machine. In fact, I first noticed because the machine kept running out of battery after it had supposedly been put into standby but never actually got there. Only then did I notice the error on boot.
bisect result:
904e28c6de083fa4834cdbd0026470ddc30676fc is the first bad commit
commit 904e28c6de083fa4834cdbd0026470ddc30676fc
Merge: a738688177dc 2f7f4efb9411
Author: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Date: Wed Feb 22 10:44:31 2023 +0100
    Merge branch 'for-6.3/hid-bpf' into for-linus
    Initial support of HID-BPF (Benjamin Tissoires)
    The history is a little long for this series, as it was intended to be sent for v6.2. However some last minute issues forced us to postpone it to v6.3.
    Conflicts:
    * drivers/hid/i2c-hid/Kconfig:
      commit bf7660dab30d ("HID: stop drivers from selecting CONFIG_HID")
      conflicts with commit 2afac81dd165 ("HID: fix I2C_HID not selected when I2C_HID_OF_ELAN is")
      the resolution is simple enough: just drop the "default" and "select" lines as the new commit from Arnd is doing
BR Malte
> On Wed, May 24, 2023 at 02:10:31PM +0800, Haochen Tong wrote:
>>> What was the last kernel version before this regression occurred? Do you mean v6.2?
>> I was using 6.2.12 (Arch Linux distro kernel) before seeing this regression.
> Can you perform bisection to find the culprit that introduces the regression? Since you're on Arch Linux, see its wiki article [1] for instructions.
> Thanks.
On Mon, Jun 05, 2023 at 01:24:25PM +0200, Malte Starostik wrote:
> Hello,
> chiming in here as I'm experiencing what looks like the exact same issue, also on a Lenovo Z13 notebook, also on Arch: an oops during startup in task udev-worker, followed by udev-worker blocking all attempts to suspend or cleanly shut down/reboot the machine. In fact, I first noticed because the machine kept running out of battery after it had supposedly been put into standby but never actually got there. Only then did I notice the error on boot.
> bisect result:
> 904e28c6de083fa4834cdbd0026470ddc30676fc is the first bad commit
> commit 904e28c6de083fa4834cdbd0026470ddc30676fc
> Merge: a738688177dc 2f7f4efb9411
> Author: Benjamin Tissoires <benjamin.tissoires@redhat.com>
> Date: Wed Feb 22 10:44:31 2023 +0100
>     Merge branch 'for-6.3/hid-bpf' into for-linus
Hmm, this looks like a bad bisect (it landed on HID-BPF, which IMO isn't related to amd_sfh). Can you repeat the bisection?
Anyway, tl;dr:
A: http://en.wikipedia.org/wiki/Top_post
Q: Where do I find info about this thing called top-posting?
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?
A: No.
Q: Should I include quotations after my reply?
And telling regzbot:
#regzbot introduced: 904e28c6de083f
#regzbot title: HID-BPF feature causes amd_sfh kernel oops during boot and suspend/reboot
Thanks.
On 06.06.23 04:36, Bagas Sanjaya wrote:
> Hmm, this looks like a bad bisect (it landed on HID-BPF, which IMO isn't related to amd_sfh). Can you repeat the bisection?
Well, amd_sfh afaics apparently interacts with HID (see trace earlier in the thread), so it's not that far away. But it's a merge commit, which is possible, but doesn't happen every day. So a recheck might really be a good idea.
> Anyway, tl;dr:
> A: http://en.wikipedia.org/wiki/Top_post
> Q: Where do I find info about this thing called top-posting?
[...]
BTW, I'm not sure if this really is helpful. Teaching this to upcoming kernel developers is definitely worth it, but I wonder if pushing this on all reporters might do more harm than good. I also wonder if asking them a bit more kindly might be wiser (e.g. instead of "Anyway, tl;dr:" something like "BTW, please do not top-post:" or something like that maybe).
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
On Jun 06 2023, Linux regression tracking (Thorsten Leemhuis) wrote:
> Well, amd_sfh afaics apparently interacts with HID (see trace earlier in the thread), so it's not that far away. But it's a merge commit, which is possible, but doesn't happen every day. So a recheck might really be a good idea.
Let's not rule out that there is a bad interaction between HID-BPF and AMD SFH. HID-BPF is able to process any incoming HID event, whether it comes from AMD SFH, USB, BT, I2C or anything else.
However, looking at the stack trace in the initial report[0], it seems we are getting the oops/stack traces while we are still in amd_sfh:
list_add corruption. next is NULL.
WARNING: CPU: 5 PID: 433 at lib/list_debug.c:25 __list_add_valid+0x57/0xa0
...
RIP: 0010:__list_add_valid+0x57/0xa0
...
Call Trace:
 <TASK>
 amd_sfh_get_report+0xba/0x110 [amd_sfh 78bf82e66cdb2ccf24cbe871a0835ef4eedddb17]
 ...
If HID-BPF were involved, we should see a call to hid_input_report() IMO. Also AMD SFH calls hid_input_report() in a workqueue, so I would expect a different stack trace.
I suspect commit 7bcfdab3f0c6 ("HID: amd_sfh: if no sensors are enabled, clean up"), because the stack trace points to a bad list_add, which could happen if the object is not correctly initialized.
However, that commit was present in v6.2, so it might not be that one.
Back to the merge commit: the hid-bpf branch was forked during the v6.1 cycle and only later merged into the hid tree. That might be why the bisection lands here: the AMD SFH code in the hid-bpf branch is the one from the v6.1 kernel, and once that branch is merged into the v6.2+ tree, you get different code for that driver.
Cheers, Benjamin
[0] https://lore.kernel.org/regressions/f40e3897-76f1-2cd0-2d83-e48d87130eab@hex...
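To illustrate the point above about a bad list_add on an object that was never initialized, here is a minimal userspace model in Python. It is a sketch under stated assumptions, not the kernel code: the class and function names are invented for this example, and the checks are a simplified approximation of what the kernel's list debugging does (only the "next is NULL" message text is taken from the trace in this thread).

```python
class ListHead:
    """Simplified stand-in for the kernel's struct list_head."""

    def __init__(self):
        # Models zeroed memory: both links are NULL until init_list_head() runs.
        self.next = None
        self.prev = None


def init_list_head(head):
    """An empty list points at itself in both directions."""
    head.next = head
    head.prev = head


def list_add_valid(new, prev, next_):
    """Return an error string if inserting `new` between prev and next_ is unsafe."""
    if prev is None:
        return "list_add corruption. prev is NULL."
    if next_ is None:
        return "list_add corruption. next is NULL."
    if next_.prev is not prev or prev.next is not next_:
        return "list_add corruption. neighbours are inconsistent."
    if new is prev or new is next_:
        return "list_add double add."
    return None


def list_add(new, head):
    """Insert `new` right after `head`, refusing to touch a corrupted list."""
    error = list_add_valid(new, head, head.next)
    if error:
        raise RuntimeError(error)
    nxt = head.next
    nxt.prev = new
    new.next = nxt
    new.prev = head
    head.next = new


def try_add(head):
    """Add one fresh entry to `head` and report what happened."""
    try:
        list_add(ListHead(), head)
        return "ok"
    except RuntimeError as exc:
        return str(exc)


never_initialized = ListHead()
initialized = ListHead()
init_list_head(initialized)

print("never initialized:", try_add(never_initialized))
print("initialized:      ", try_add(initialized))
```

In this model the uninitialized head fails exactly the way the trace reports, while the same insertion succeeds once the head has been initialized. Whether that is what actually happens in amd_sfh_get_report is not established here.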
On 6/6/2023 3:08 AM, Benjamin Tissoires wrote:
> I suspect commit 7bcfdab3f0c6 ("HID: amd_sfh: if no sensors are enabled, clean up"), because the stack trace points to a bad list_add, which could happen if the object is not correctly initialized.
> However, that commit was present in v6.2, so it might not be that one.
If I'm not mistaken, the Z13 doesn't actually have any sensors connected to SFH. So I think the suspicion about 7bcfdab3f0c6 and the theory that this is triggered by HID init make a lot of sense.
Can you try this patch?
diff --git a/drivers/hid/amd-sfh-hid/amd_sfh_client.c b/drivers/hid/amd-sfh-hid/amd_sfh_client.c
index d9b7b01900b5..fa693a5224c6 100644
--- a/drivers/hid/amd-sfh-hid/amd_sfh_client.c
+++ b/drivers/hid/amd-sfh-hid/amd_sfh_client.c
@@ -324,6 +324,7 @@ int amd_sfh_hid_client_init(struct amd_mp2_dev *privdata)
 			devm_kfree(dev, cl_data->report_descr[i]);
 		}
 		dev_warn(dev, "Failed to discover, sensors not enabled is %d\n", cl_data->is_any_sensor_enabled);
+		cl_data->num_hid_devices = 0;
 		return -EOPNOTSUPP;
 	}
 	schedule_delayed_work(&cl_data->work_buffer, msecs_to_jiffies(AMD_SFH_IDLE_LOOP));
On Tuesday, 6 June 2023, 17:25:13 CEST, Limonciello, Mario wrote:
> If I'm not mistaken, the Z13 doesn't actually have any sensors connected to SFH. So I think the suspicion about 7bcfdab3f0c6 and the theory that this is triggered by HID init make a lot of sense.
> Can you try this patch?
I applied this to 9e87b63ed37e202c77aa17d4112da6ae0c7c097c now, which was the origin when I started the whole bisection. Clean rebuild, issue still persists.
Out of 50 boots, I got:
25 clean
22 Oops as posted by the OP
1 same Oops, followed by a panic
1 lockup [1]
1 hanging with just a blank screen
Not sure whether the lockups are related, but [1] mentions modprobe and udev-worker as well, and all problems, including the blank-screen one, appear at roughly the same time during boot. As this is before a graphics mode switch, I suspect the last-mentioned case may be the same as [1], just with the screen blanked. To support the timing correlation: the UVC error for the IR cam shown in the photo (normal boot noise) also appears right before the BUG in the non-lockup bad case.
I do see the dev_warn in dmesg, so the code path modified in your patch is indeed hit:
[ 10.897521] pcie_mp2_amd 0000:63:00.7: Failed to discover, sensors not enabled is 1
[ 10.897533] pcie_mp2_amd: probe of 0000:63:00.7 failed with error -95
BR Malte
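Since only about half of the boots reported above fail, a single clean boot says little about whether a given kernel is good, which is one way a bisection can go astray. The following Python sketch estimates how many consecutive clean boots would be needed before marking a bisect step as good. It assumes that boots are independent and that every bad kernel fails at the rate observed in the 50-boot run; both are assumptions made for this example rather than facts established in this thread, and the function names are made up.

```python
import math


def observed_failure_rate(outcomes):
    """Fraction of boots that did not come up clean."""
    total = sum(outcomes.values())
    return (total - outcomes["clean"]) / total


def clean_boots_needed(failure_rate, max_false_good):
    """Smallest k such that a bad kernel boots clean k times in a row
    with probability at most max_false_good."""
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_good) / math.log(1 - failure_rate))


# Tallies from the 50 test boots reported above.
boots = {"clean": 25, "oops": 22, "oops_then_panic": 1, "lockup": 1, "blank_screen": 1}

rate = observed_failure_rate(boots)
needed = clean_boots_needed(rate, max_false_good=0.01)
print(f"failure rate {rate:.0%}, clean boots needed per bisect step: {needed}")
```

With the numbers above this works out to seven clean boots per step for a 1% chance of wrongly marking a bad kernel as good.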
Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting for once, to make this easily accessible to everyone.
What happens to this? From here it looks like there was no progress to resolve the regression in the past two weeks, but maybe I just missed something.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
#regzbot poke
> I applied this to 9e87b63ed37e202c77aa17d4112da6ae0c7c097c now, which was the origin when I started the whole bisection. Clean rebuild, issue still persists.
> Out of 50 boots, I got:
> 25 clean
> 22 Oops as posted by the OP
> 1 same Oops, followed by a panic
> 1 lockup [1]
> 1 hanging with just a blank screen
> Not sure whether the lockups are related, but [1] mentions modprobe and udev-worker as well, and all problems, including the blank-screen one, appear at roughly the same time during boot. As this is before a graphics mode switch, I suspect the last-mentioned case may be the same as [1], just with the screen blanked. To support the timing correlation: the UVC error for the IR cam shown in the photo (normal boot noise) also appears right before the BUG in the non-lockup bad case.
> I do see the dev_warn in dmesg, so the code path modified in your patch is indeed hit:
> [ 10.897521] pcie_mp2_amd 0000:63:00.7: Failed to discover, sensors not enabled is 1
> [ 10.897533] pcie_mp2_amd: probe of 0000:63:00.7 failed with error -95
> BR Malte
Apologies; for some reason that reply above never reached my inbox. Some server along the way might have deemed it spam.
Anyway, I just double-checked the Z13 I have on hand. I don't have the PCI device for SFH (1022:164a) present on that system.
Can you please double check you are on the latest BIOS?
I'm on the latest release from LVFS, 0.1.57 according to fwupdmgr.
On 6/20/2023 1:50 PM, Limonciello, Mario wrote:
I have a suspicion on commit 7bcfdab3f0c6 ("HID: amd_sfh: if no sensors are enabled, clean up") because the stack trace says that there is a bad list_add, which could happen if the object is not correctly initialized.
However, that commit was present in v6.2, so it might not be that one.
If I'm not mistaken the Z13 doesn't actually have any sensors connected to SFH. So I think the suspicion on 7bcfdab3f0c6 and theory this is triggered by HID init makes a lot of sense.
Can you try this patch?
diff --git a/drivers/hid/amd-sfh-hid/amd_sfh_client.c b/drivers/hid/amd-sfh-hid/amd_sfh_client.c index d9b7b01900b5..fa693a5224c6 100644 --- a/drivers/hid/amd-sfh-hid/amd_sfh_client.c +++ b/drivers/hid/amd-sfh-hid/amd_sfh_client.c @@ -324,6 +324,7 @@ int amd_sfh_hid_client_init(struct amd_mp2_dev *privdata) devm_kfree(dev, cl_data->report_descr[i]); } dev_warn(dev, "Failed to discover, sensors not enabled is %d\n", cl_data->is_any_sensor_enabled); + cl_data->num_hid_devices = 0; return -EOPNOTSUPP; } schedule_delayed_work(&cl_data->work_buffer, msecs_to_jiffies(AMD_SFH_IDLE_LOOP));
I applied this to 9e87b63ed37e202c77aa17d4112da6ae0c7c097c now, which was the origin when I started the whole bisection. Clean rebuild, issue still persists.
Out of 50 boots, I got:
25 clean 22 Oops as posted by the OP 1 same Oops, followed by a panic 1 lockup [1] 1 hanging with just a blank screen
Not sure whether the lockups are related, but [1] mentions modprobe and udev- worker as well and all problems including the blank screen one appear roughly at the same time during boot. As this is before a graphics mode switch, I suspect the last mentioned case may be like [1] while the screen was blanked. To support the timing correlation: the UVC error for the IR cam shown in the photo (normal boot noise) also appears right before the BUG in the non-lockup bad case.
I do see the dev_warn in dmesg, so the code path modified in your patch is indeed hit:
[ 10.897521] pcie_mp2_amd 0000:63:00.7: Failed to discover, sensors not enabled is 1
[ 10.897533] pcie_mp2_amd: probe of 0000:63:00.7 failed with error -95
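As a cross-check, the "-95" in the probe failure can be tied back to the `return -EOPNOTSUPP` in the patched code path. The following is a minimal sketch, assuming Python's errno module on a Linux system (errno numbering is platform specific); the regular expression and the helper name probe_failures are illustrative and not part of any kernel tooling.

```python
import errno
import re

# The two dmesg lines quoted above.
LOG = """\
[ 10.897521] pcie_mp2_amd 0000:63:00.7: Failed to discover, sensors not enabled is 1
[ 10.897533] pcie_mp2_amd: probe of 0000:63:00.7 failed with error -95
"""

PROBE_FAIL = re.compile(
    r"(?P<driver>\S+): probe of (?P<device>\S+) failed with error -(?P<code>\d+)"
)


def probe_failures(log):
    """Return (driver, device, errno value) for each failed probe in log."""
    return [
        (m["driver"], m["device"], int(m["code"]))
        for m in PROBE_FAIL.finditer(log)
    ]


failures = probe_failures(LOG)
for driver, device, code in failures:
    # On Linux EOPNOTSUPP is 95; other platforms may number it differently.
    matches = code == errno.EOPNOTSUPP
    print(f"{driver} {device}: error {code}, EOPNOTSUPP here: {matches}")
```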
BR Malte
Apologies; for some reason I never got the above reply in my inbox. Some server along the way might have deemed it spam.
Anyways; I just double checked the Z13 I have on hand. I don't have the PCI device for SFH (1022:164a) present on the system.
Can you please double check you are on the latest BIOS?
I'm on the latest release from LVFS, 0.1.57 according to fwupdmgr.
Hopefully the newer BIOS fixes it for you, but if it doesn't I did come up with another patch I've sent out that I guess could be another solution.
https://lore.kernel.org/linux-input/20230620200117.22261-1-mario.limonciello...
On Tuesday, 20 June 2023, 22:03:00 CEST, Limonciello, Mario wrote:
On 6/20/2023 1:50 PM, Limonciello, Mario wrote:
Anyways; I just double checked the Z13 I have on hand. I don't have the PCI device for SFH (1022:164a) present on the system.
Can you please double check you are on the latest BIOS?
I'm on the latest release from LVFS, 0.1.57 according to fwupdmgr.
I was on 0.1.27 while running the tests. At least when I saw the errors first, there was no update offered. Haven't rechecked until now.
Hopefully the newer BIOS fixes it for you, but if it doesn't I did come up with another patch I've sent out that I guess could be another solution.
After updating to 0.1.57, it looks like I cannot reproduce the error anymore either.
https://lore.kernel.org/linux-input/20230620200117.22261-1-mario.limonciello@amd.com/T/#u
I tested your patch before performing the firmware update. Still got the Oops just like before.
BR Malte
On 6/20/23 21:20, Linux regression tracking (Thorsten Leemhuis) wrote:
Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting for once, to make this easily accessible to everyone.
What happens to this? From here it looks like there was no progress to resolve the regression in the past two weeks, but maybe I just missed something.
Hi,
I just looked at the journal again and this problem seemed to go away after upgrading from 6.3.3 to 6.3.5. At that time the BIOS version was still 1.27. Now, on 1.57, the device 1022:164a is indeed no longer present.
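For anyone who wants to check whether the device is still exposed after a firmware update, here is a minimal sketch that scans sysfs for a vendor:device pair. It assumes the usual Linux layout in which each entry under /sys/bus/pci/devices has `vendor` and `device` files containing hexadecimal IDs; the function name pci_device_present and its `root` parameter are illustrative choices, the latter so the scan can be pointed at a test directory.

```python
from pathlib import Path


def pci_device_present(vendor, device, root="/sys/bus/pci/devices"):
    """Return True if any PCI function under root has this vendor:device ID."""
    base = Path(root)
    if not base.is_dir():
        return False
    for entry in base.iterdir():
        try:
            found_vendor = int((entry / "vendor").read_text().strip(), 16)
            found_device = int((entry / "device").read_text().strip(), 16)
        except (OSError, ValueError):
            continue  # skip entries without readable ID files
        if (found_vendor, found_device) == (vendor, device):
            return True
    return False


# 1022:164a is the ID mentioned in this thread.
print(pci_device_present(0x1022, 0x164A))
```

On a system without that sysfs directory the function simply reports False rather than raising.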
On 20.06.23 15:20, Linux regression tracking (Thorsten Leemhuis) wrote:
On 07.06.23 00:57, Malte Starostik wrote:
On Tuesday, 6 June 2023, 17:25:13 CEST, Limonciello, Mario wrote:
On 6/6/2023 3:08 AM, Benjamin Tissoires wrote:
On Jun 06 2023, Linux regression tracking (Thorsten Leemhuis) wrote:
On Mon, Jun 05, 2023 at 01:24:25PM +0200, Malte Starostik wrote:
chiming in here as I'm experiencing what looks like the exact same issue, also on a Lenovo Z13 notebook, also on Arch:
Oops during startup in task udev-worker followed by udev-worker blocking all attempts to suspend or cleanly shutdown/reboot the machine
For the record:
#regzbot resolve: fixed in newer firmware and mainline post-6.4; backport possible when needed, but not planned
#regzbot ignore-activity
For details see Mario's explanation here (thx for it, btw): https://lore.kernel.org/all/89ea9fb7-9026-ccb6-ad88-50e1c28b4474@amd.com/
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
On Tuesday, 6 June 2023, 08:56:16 CEST, Linux regression tracking (Thorsten Leemhuis) wrote:
On 06.06.23 04:36, Bagas Sanjaya wrote:
On Mon, Jun 05, 2023 at 01:24:25PM +0200, Malte Starostik wrote:
chiming in here as I'm experiencing what looks like the exact same issue, also on a Lenovo Z13 notebook, also on Arch:
bisect result:
904e28c6de083fa4834cdbd0026470ddc30676fc is the first bad commit
commit 904e28c6de083fa4834cdbd0026470ddc30676fc
Merge: a738688177dc 2f7f4efb9411
Author: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Date:   Wed Feb 22 10:44:31 2023 +0100

    Merge branch 'for-6.3/hid-bpf' into for-linus
Hmm, seems like bad bisect (bisected to HID-BPF which IMO isn't related to amd_sfh). Can you repeat the bisection?
I'm digging further. That merge is what git bisect ended at, but admittedly my git skills, especially with a large codebase, aren't too advanced. While at 904e28c6de083fa4834cdbd0026470ddc30676fc, git show only shows the diff for tools/testing/selftests/Makefile, which can't really be the culprit. However, git diff @~..@ has changes in drivers/hid/amd-sfh-hid/Kconfig (seems innocuous, too), but also some changes to drivers/hid/hid-core.c. Nothing obvious either, but at least it's not too far from the trace.
Well, amd_sfh afaics apparently interacts with HID (see trace earlier in the thread), so it's not that far away. But it's a merge commit, which is possible, but doesn't happen every day. So a recheck might really be a good idea.
I will recheck some more, the Oops only happens with roughly 30 % chance during boot. When it doesn't, there seem to be no other issues until the next boot either. I made sure to reboot a few times after each bisect step, will look deeper into the area.
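A rough calculation shows why an intermittent failure makes bisection fragile. The sketch below assumes each boot fails independently with the "roughly 30 %" probability quoted above (an idealization; real boots may be correlated), and the function names are made up for illustration. It estimates how often a bad commit would be marked good after a given number of clean boots, and how many clean boots are needed to push that risk below a chosen limit.

```python
import math


def false_good_probability(p_fail, boots):
    """Chance that a bad kernel survives `boots` boots without the Oops."""
    return (1.0 - p_fail) ** boots


def boots_needed(p_fail, max_false_good):
    """Smallest number of clean boots that pushes the risk below the limit."""
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_fail))


P_FAIL = 0.30  # "roughly 30 %" from the message above

for boots in (1, 3, 5, 10):
    print(boots, round(false_good_probability(P_FAIL, boots), 3))

print(boots_needed(P_FAIL, 0.05))
```

Under these assumptions, three clean boots still leave about a one-in-three chance (0.343) of mislabeling a bad commit as good at any single bisect step, and nine clean boots are needed to get below 5 %.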
Anyway, tl;dr:
A: http://en.wikipedia.org/wiki/Top_post
Q: Where do I find info about this thing called top-posting?
[...]
BTW, I'm not sure if this really is helpful. Teaching this to upcoming kernel developers is definitely worth it, but I wonder if pushing this on all reporters might do more harm than good. I also wonder if asking them a bit more kindly might be wiser (e.g. instead of "Anyway, tl;dr:" something like "BTW, please do not top-post:" or something like that maybe).
Thanks, and I agree in general. However, my case was in fact even worse :-) I'm totally aware of the badness of top-posting. It happened because I had a draft of the reply: I had set In-Reply-To from the link in the web archive and pasted the previous message from there. A couple of days later, I just pasted the result on top and disregarded the existing text.
BR Malte
On Wed, May 24, 2023 at 05:10:45PM +0700, Bagas Sanjaya wrote:
On Wed, May 24, 2023 at 02:10:31PM +0800, Haochen Tong wrote:
What was the last kernel version before this regression occurred? Do you mean v6.2?
I was using 6.2.12 (Arch Linux distro kernel) before seeing this regression.
Can you perform bisection to find the culprit that introduces the regression? Since you're on Arch Linux, see its wiki article [1] for instructions.
Haochen, any news on this? Has the bisection been done, and is there any result? Another reporter's bisection was possibly bad [1].
Thanks.
[1]: https://lore.kernel.org/regressions/3250319.ancTxkQ2z5@zen/
On 6/6/23 10:39, Bagas Sanjaya wrote:
Hi,
Sorry for the late reply. I haven't gotten enough time for it yet.
I took a look at the git logs, and it doesn't look like the modules involved in the original stack trace (amd_sfh, hid_sensor_hub, hid_sensor_iio_common, hid_sensor_gyro_3d) have received any significant changes between v6.2 and v6.3. IMHO, the bisect done by Malte might indicate that the issue could be a problem outside of these modules.
Also, I upgraded from 6.3.3 to 6.3.5 a week ago and this issue hasn't happened so far in 4 reboots. However, there still don't seem to be any changes regarding these modules, so I'm not sure if it's fixed elsewhere or I'm just being lucky. It would be nice if someone could confirm or disprove this.
Thanks,
On Wed, May 24, 2023 at 01:27:57AM +0800, Haochen Tong wrote:
Thanks for the bug report. I'm adding it to regzbot:
#regzbot ^introduced: v6.2..v6.3
#regzbot title: amd_sfh driver causes kernel oops (udev-worker becomes zombie) on ThinkPad Z13
[TLDR: This mail is primarily relevant for Linux kernel regression tracking. See link in footer if these mails annoy you.]
On 23.05.23 19:27, Haochen Tong wrote:
Since kernel 6.3.0 (and also 6.4rc3), on a ThinkPad Z13 system with Arch Linux, I've noticed that the amd_sfh driver spews a lot of stack traces during boot. Sometimes it is an oops:
For the record:
#regzbot resolve: fixed in newer firmware and mainline post-6.4; backport not planned, as bug unlikely to repeat, but possible when needed
#regzbot ignore-activity
For details see Mario's explanation here (thx for it, btw): https://lore.kernel.org/all/89ea9fb7-9026-ccb6-ad88-50e1c28b4474@amd.com/
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
linux-stable-mirror@lists.linaro.org