Summary: When attempting to bring up or shut down a NIC manually or via network-manager under 5.15, the machine reboots or freezes.
Occurs with: 5.15.4-051504-generic and earlier 5.15 mainline (https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.4/) as well as liquorix flavours.
Does not occur with: 5.14 and 5.13 (both with various flavours).
Hi all,
I'm experiencing a severe bug that causes the machine to reboot or freeze when trying to log in and/or bring up or shut down a NIC. Here's a brief description of the scenarios I've tested:
Scenario 1: enp6s0 managed manually using /etc/network/interfaces, DHCP
a. Issuing ifdown enp6s0 in a terminal throws "/etc/resolvconf/update.d/libc: Warning: /etc/resolv.conf is not a symbolic link to /run/resolvconf/resolv.conf" and causes the machine to reboot after ~10s of showing a blinking cursor.
b. Issuing shutdown -h now, or trying to shut down or reboot the machine via the GUI: shutdown stops on "stop job is running for ifdown enp6s0" and after approx. 10..15s the countdown freezes. Repeated Alt-SysRq-REISUB does not reboot the machine; a hard reset is required.
--
Scenario 2: enp6s0 managed manually using /etc/network/interfaces, STATIC
a. Issuing ifdown enp6s0 in a terminal throws "send_packet: Operation not permitted dhclient.c:3010: Failed to send 300 byte long packet over fallback interface." and causes the machine to reboot after ~10s of showing a blinking cursor.
b. Issuing shutdown -h now, or trying to shut down or reboot the machine via the GUI: shutdown stops on "stop job is running for ifdown enp6s0" and after approx. 10..15s the countdown freezes. Repeated Alt-SysRq-REISUB does not reboot the machine; a hard reset is required.
--
Scenario 3: enp6s0 managed by network manager
a. After booting and logging in via either the GUI or a TTY, the display stays blank, showing only a blinking cursor, and then freezes after 5..10s. Alt-SysRq-REISUB does not reboot the machine; a hard reset is required.
--
Here's a snippet from the journal for Scenario 1a:
Nov 21 10:39:25 computer sudo[5606]: user : TTY=pts/0 ; PWD=/home/user ; USER=root ; COMMAND=/usr/sbin/ifdown enp6s0
Nov 21 10:39:25 computer sudo[5606]: pam_unix(sudo:session): session opened for user root by (uid=0)
-- Reboot --
Nov 21 10:40:14 computer systemd-journald[478]: Journal started
--
I'm running an Alder Lake i9-12900K, but with the E-cores disabled in the BIOS. Here are some more specs, taken with a working kernel:
$ inxi -bxz
System:    Kernel: 5.14.0-19.2-liquorix-amd64 x86_64 bits: 64 compiler: N/A Desktop: Xfce 4.16.3 Distro: Ubuntu 20.04.3 LTS (Focal Fossa)
Machine:   Type: Desktop System: ASUS product: N/A v: N/A serial: N/A Mobo: ASUSTeK model: ROG STRIX Z690-A GAMING WIFI D4 v: Rev 1.xx serial: <filter> UEFI [Legacy]: American Megatrends v: 0707 date: 11/10/2021
CPU:       8-Core: 12th Gen Intel Core i9-12900K type: MT MCP arch: N/A speed: 5381 MHz max: 3201 MHz
Graphics:  Device-1: NVIDIA vendor: Gigabyte driver: nvidia v: 470.86 bus ID: 01:00.0 Display: server: X.Org 1.20.11 driver: nvidia tty: N/A Message: Unable to show advanced data. Required tool glxinfo missing.
Network:   Device-1: Intel vendor: ASUSTeK driver: igc v: kernel port: 4000 bus ID: 06:00.0
Please advise how I may assist in debugging!
Thanks.
On Wed, Nov 24, 2021 at 08:28:39AM +0100, Stefan Dietrich wrote:
> Summary: When attempting to bring up or shut down a NIC manually or via network-manager under 5.15, the machine reboots or freezes.
> Occurs with: 5.15.4-051504-generic and earlier 5.15 mainline (https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.4/) as well as liquorix flavours. Does not occur with: 5.14 and 5.13 (both with various flavours)
Can you use 'git bisect' between 5.14 and 5.15 to find the problem commit?
thanks,
greg k-h
Hi Greg,
I have never done a kernel bisect before, so I need to do some reading first. I will report back a.s.a.p.
Stefan
On Wed, 2021-11-24 at 08:33 +0100, Greg KH wrote:
> On Wed, Nov 24, 2021 at 08:28:39AM +0100, Stefan Dietrich wrote:
>> Summary: When attempting to bring up or shut down a NIC manually or via network-manager under 5.15, the machine reboots or freezes.
>> Occurs with: 5.15.4-051504-generic and earlier 5.15 mainline (https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.4/) as well as liquorix flavours. Does not occur with: 5.14 and 5.13 (both with various flavours)
> Can you use 'git bisect' between 5.14 and 5.15 to find the problem commit?
> thanks,
> greg k-h
Hi all,
six exciting hours and a lot of learning later, here it is. Symptomatically, the critical commit appears for me between 5.14.21-051421-generic and 5.15.0-051500rc2-generic - I did not find an amd64 build for rc1.
Please see the git-bisect output below and let me know how I may further assist in debugging!
Cheers, Stefan
a90ec84837325df4b9a6798c2cc0df202b5680bd is the first bad commit
commit a90ec84837325df4b9a6798c2cc0df202b5680bd
Author: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Date: Mon Jul 26 20:36:57 2021 -0700
igc: Add support for PTP getcrosststamp()
i225 supports PCIe Precision Time Measurement (PTM), allowing us to support the PTP_SYS_OFFSET_PRECISE ioctl() in the driver via the getcrosststamp() function.
The easiest way to expose the PTM registers would be to configure the PTM dialogs to run periodically, but the PTP_SYS_OFFSET_PRECISE ioctl() semantics are more aligned to using a kind of "one-shot" way of retrieving the PTM timestamps. But this causes a bit more code to be written: the trigger registers for the PTM dialogs are not cleared automatically.
i225 can be configured to send "fake" packets with the PTM information, adding support for handling these types of packets is left for the future.
PTM improves the accuracy of time synchronization, for example, using phc2sys, while a simple application is sending packets as fast as possible. First, without .getcrosststamp():
phc2sys[191.382]: enp4s0 sys offset -959 s2 freq -454 delay 4492
phc2sys[191.482]: enp4s0 sys offset 798 s2 freq +1015 delay 4069
phc2sys[191.583]: enp4s0 sys offset 962 s2 freq +1418 delay 3849
phc2sys[191.683]: enp4s0 sys offset 924 s2 freq +1669 delay 3753
phc2sys[191.783]: enp4s0 sys offset 664 s2 freq +1686 delay 3349
phc2sys[191.883]: enp4s0 sys offset 218 s2 freq +1439 delay 2585
phc2sys[191.983]: enp4s0 sys offset 761 s2 freq +2048 delay 3750
phc2sys[192.083]: enp4s0 sys offset 756 s2 freq +2271 delay 4061
phc2sys[192.183]: enp4s0 sys offset 809 s2 freq +2551 delay 4384
phc2sys[192.283]: enp4s0 sys offset -108 s2 freq +1877 delay 2480
phc2sys[192.383]: enp4s0 sys offset -1145 s2 freq +807 delay 4438
phc2sys[192.484]: enp4s0 sys offset 571 s2 freq +2180 delay 3849
phc2sys[192.584]: enp4s0 sys offset 241 s2 freq +2021 delay 3389
phc2sys[192.684]: enp4s0 sys offset 405 s2 freq +2257 delay 3829
phc2sys[192.784]: enp4s0 sys offset 17 s2 freq +1991 delay 3273
phc2sys[192.884]: enp4s0 sys offset 152 s2 freq +2131 delay 3948
phc2sys[192.984]: enp4s0 sys offset -187 s2 freq +1837 delay 3162
phc2sys[193.084]: enp4s0 sys offset -1595 s2 freq +373 delay 4557
phc2sys[193.184]: enp4s0 sys offset 107 s2 freq +1597 delay 3740
phc2sys[193.284]: enp4s0 sys offset 199 s2 freq +1721 delay 4010
phc2sys[193.385]: enp4s0 sys offset -169 s2 freq +1413 delay 3701
phc2sys[193.485]: enp4s0 sys offset -47 s2 freq +1484 delay 3581
phc2sys[193.585]: enp4s0 sys offset -65 s2 freq +1452 delay 3778
phc2sys[193.685]: enp4s0 sys offset 95 s2 freq +1592 delay 3888
phc2sys[193.785]: enp4s0 sys offset 206 s2 freq +1732 delay 4445
phc2sys[193.885]: enp4s0 sys offset -652 s2 freq +936 delay 2521
phc2sys[193.985]: enp4s0 sys offset -203 s2 freq +1189 delay 3391
phc2sys[194.085]: enp4s0 sys offset -376 s2 freq +955 delay 2951
phc2sys[194.185]: enp4s0 sys offset -134 s2 freq +1084 delay 3330
phc2sys[194.285]: enp4s0 sys offset -22 s2 freq +1156 delay 3479
phc2sys[194.386]: enp4s0 sys offset 32 s2 freq +1204 delay 3602
phc2sys[194.486]: enp4s0 sys offset 122 s2 freq +1303 delay 3731
Statistics for this run (total of 2179 lines), in nanoseconds: average: -1.12 stdev: 634.80 max: 1551 min: -2215
With .getcrosststamp() via PCIe PTM:
phc2sys[367.859]: enp4s0 sys offset 6 s2 freq +1727 delay 0
phc2sys[367.959]: enp4s0 sys offset -2 s2 freq +1721 delay 0
phc2sys[368.059]: enp4s0 sys offset 5 s2 freq +1727 delay 0
phc2sys[368.160]: enp4s0 sys offset -1 s2 freq +1723 delay 0
phc2sys[368.260]: enp4s0 sys offset -4 s2 freq +1719 delay 0
phc2sys[368.360]: enp4s0 sys offset -5 s2 freq +1717 delay 0
phc2sys[368.460]: enp4s0 sys offset 1 s2 freq +1722 delay 0
phc2sys[368.560]: enp4s0 sys offset -3 s2 freq +1718 delay 0
phc2sys[368.660]: enp4s0 sys offset 5 s2 freq +1725 delay 0
phc2sys[368.760]: enp4s0 sys offset -1 s2 freq +1721 delay 0
phc2sys[368.860]: enp4s0 sys offset 0 s2 freq +1721 delay 0
phc2sys[368.960]: enp4s0 sys offset 0 s2 freq +1721 delay 0
phc2sys[369.061]: enp4s0 sys offset 4 s2 freq +1725 delay 0
phc2sys[369.161]: enp4s0 sys offset 1 s2 freq +1724 delay 0
phc2sys[369.261]: enp4s0 sys offset 4 s2 freq +1727 delay 0
phc2sys[369.361]: enp4s0 sys offset 8 s2 freq +1732 delay 0
phc2sys[369.461]: enp4s0 sys offset 7 s2 freq +1733 delay 0
phc2sys[369.561]: enp4s0 sys offset 4 s2 freq +1733 delay 0
phc2sys[369.661]: enp4s0 sys offset 1 s2 freq +1731 delay 0
phc2sys[369.761]: enp4s0 sys offset 1 s2 freq +1731 delay 0
phc2sys[369.861]: enp4s0 sys offset -5 s2 freq +1725 delay 0
phc2sys[369.961]: enp4s0 sys offset -4 s2 freq +1725 delay 0
phc2sys[370.062]: enp4s0 sys offset 2 s2 freq +1730 delay 0
phc2sys[370.162]: enp4s0 sys offset -7 s2 freq +1721 delay 0
phc2sys[370.262]: enp4s0 sys offset -3 s2 freq +1723 delay 0
phc2sys[370.362]: enp4s0 sys offset 1 s2 freq +1726 delay 0
phc2sys[370.462]: enp4s0 sys offset -3 s2 freq +1723 delay 0
phc2sys[370.562]: enp4s0 sys offset -1 s2 freq +1724 delay 0
phc2sys[370.662]: enp4s0 sys offset -4 s2 freq +1720 delay 0
phc2sys[370.762]: enp4s0 sys offset -7 s2 freq +1716 delay 0
phc2sys[370.862]: enp4s0 sys offset -2 s2 freq +1719 delay 0
Statistics for this run (total of 2179 lines), in nanoseconds: average: 0.14 stdev: 5.03 max: 48 min: -27
For reference, the statistics for runs without PCIe congestion show that the improvements from enabling PTM are less dramatic. For two runs of 16466 entries:
without PTM: avg -0.04 stdev 10.57 max 39 min -42
with PTM: avg 0.01 stdev 4.20 max 19 min -16
One possible explanation is that when PTM is not enabled, and there's a lot of traffic in the PCIe fabric, some register reads will take more time than the others because of congestion on the PCIe fabric.
When PTM is enabled, even if the PTM dialogs take more time to complete under heavy traffic, the time measurements do not depend on the time to read the registers.
This was implemented following the i225 EAS version 0.993.
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

 drivers/net/ethernet/intel/igc/igc.h         |   1 +
 drivers/net/ethernet/intel/igc/igc_defines.h |  31 +++++
 drivers/net/ethernet/intel/igc/igc_ptp.c     | 179 +++++++++++++++++++++++++++
 drivers/net/ethernet/intel/igc/igc_regs.h    |  23 ++++
 4 files changed, 234 insertions(+)
On Wed, 24 Nov 2021 18:20:40 +0100 Stefan Dietrich wrote:
> Hi all,
> six exciting hours and a lot of learning later, here it is. Symptomatically, the critical commit appears for me between 5.14.21-051421-generic and 5.15.0-051500rc2-generic - I did not find an amd64 build for rc1.
> Please see the git-bisect output below and let me know how I may further assist in debugging!
Well, let's CC those involved, shall we? :)
Thanks for working thru the bisection!
Hi Stefan,
Jakub Kicinski <kuba@kernel.org> writes:
> On Wed, 24 Nov 2021 18:20:40 +0100 Stefan Dietrich wrote:
>> Hi all,
>> six exciting hours and a lot of learning later, here it is. Symptomatically, the critical commit appears for me between 5.14.21-051421-generic and 5.15.0-051500rc2-generic - I did not find an amd64 build for rc1.
>> Please see the git-bisect output below and let me know how I may further assist in debugging!
> Well, let's CC those involved, shall we? :)
> Thanks for working thru the bisection!
>> a90ec84837325df4b9a6798c2cc0df202b5680bd is the first bad commit
>> commit a90ec84837325df4b9a6798c2cc0df202b5680bd
>> Author: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>> Date: Mon Jul 26 20:36:57 2021 -0700
>> igc: Add support for PTP getcrosststamp()
Oh! That's interesting.
Can you try disabling CONFIG_PCIE_PTM in your kernel config? If it works, then it's a point in favor that this commit is indeed the problematic one.
I am still trying to think of what could be causing the lockup you are seeing.
Cheers,
On Wed, 24 Nov 2021 17:07:16 -0800 Vinicius Costa Gomes wrote:
> Oh! That's interesting.
> Can you try disabling CONFIG_PCIE_PTM in your kernel config? If it works, then it's a point in favor that this commit is indeed the problematic one.
> I am still trying to think of what could be causing the lockup you are seeing.
Actually we just had another report pointing at commit f32a21376573 ("ethtool: runtime-resume netdev parent before ethtool ioctl ops"). That seems more likely :(
Hi Vinicius,
thanks - this was spot-on: disabling CONFIG_PCIE_PTM resolves the issue for latest 5.15.4 (stable from git) for both manual and network-manager NIC configuration.
Let me know if I may assist in debugging this further.
Cheers, Stefan
Hi, this is your Linux kernel regression tracker speaking.
On 25.11.21 09:41, Stefan Dietrich wrote:
> thanks - this was spot-on: disabling CONFIG_PCIE_PTM resolves the issue for latest 5.15.4 (stable from git) for both manual and network-manager NIC configuration.
> Let me know if I may assist in debugging this further.
What is the status here? There afaics hasn't been any progress for nearly a week.
Vinicius, do you still have this on your radar? Or was there some progress?
Or is this really related to another issue, as Jakub suspected? Then it might be solved by the patch here:
https://bugzilla.kernel.org/show_bug.cgi?id=215129
Ciao, Thorsten
P.S.: As a Linux kernel regression tracker I'm getting a lot of reports on my table. I can only look briefly into most of them. Unfortunately, I will therefore sometimes get things wrong or miss something important. I hope that's not the case here; if you think it is, don't hesitate to tell me about it in a public reply. That's in everyone's interest, as what I wrote above might be misleading to everyone reading this; any suggestion I gave might thus send someone reading this down the wrong rabbit hole, which none of us wants.
BTW, I have no personal interest in this issue, which is tracked using regzbot, my Linux kernel regression tracking bot (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting this mail to get things rolling again and hence don't need to be CCed on all further activities wrt this regression.
#regzbot poke
Hi,
Thorsten Leemhuis <regressions@leemhuis.info> writes:
> Hi, this is your Linux kernel regression tracker speaking.
> On 25.11.21 09:41, Stefan Dietrich wrote:
>> thanks - this was spot-on: disabling CONFIG_PCIE_PTM resolves the issue for latest 5.15.4 (stable from git) for both manual and network-manager NIC configuration.
>> Let me know if I may assist in debugging this further.
> What is the status here? There afaics hasn't been any progress for nearly a week.
> Vinicius, do you still have this on your radar? Or was there some progress?
> Or is this really related to another issue, as Jakub suspected? Then it might be solved by the patch here:
> https://bugzilla.kernel.org/show_bug.cgi?id=215129
What I am thinking right now is that we are facing a similar problem as the bug above, only in the igc driver. The difference is that it's the PCIe PTM messages (from the PCIe root) that are triggering the deadlock in the suspend/resume path in igc.
I will produce a patch in a few moments, very similar to the one in the bug report, let's see if it helps.
Inspired by: https://bugzilla.kernel.org/show_bug.cgi?id=215129
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
Just to see if it's indeed the same problem as the bug report above.

 drivers/net/ethernet/intel/igc/igc_main.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index 0e19b4d02e62..c58bf557a2a1 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -6619,7 +6619,7 @@ static void igc_deliver_wake_packet(struct net_device *netdev)
 	netif_rx(skb);
 }
 
-static int __maybe_unused igc_resume(struct device *dev)
+static int __maybe_unused __igc_resume(struct device *dev, bool rpm)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct net_device *netdev = pci_get_drvdata(pdev);
@@ -6661,20 +6661,27 @@ static int __maybe_unused igc_resume(struct device *dev)
 
 	wr32(IGC_WUS, ~0);
 
-	rtnl_lock();
+	if (!rpm)
+		rtnl_lock();
 	if (!err && netif_running(netdev))
 		err = __igc_open(netdev, true);
 
 	if (!err)
 		netif_device_attach(netdev);
-	rtnl_unlock();
+	if (!rpm)
+		rtnl_unlock();
 
 	return err;
 }
 
 static int __maybe_unused igc_runtime_resume(struct device *dev)
 {
-	return igc_resume(dev);
+	return __igc_resume(dev, true);
+}
+
+static int __maybe_unused igc_resume(struct device *dev)
+{
+	return __igc_resume(dev, false);
 }
 
 static int __maybe_unused igc_suspend(struct device *dev)
@@ -6738,7 +6745,7 @@ static pci_ers_result_t igc_io_error_detected(struct pci_dev *pdev,
  * @pdev: Pointer to PCI device
  *
  * Restart the card from scratch, as if from a cold-boot. Implementation
- * resembles the first-half of the igc_resume routine.
+ * resembles the first-half of the __igc_resume routine.
  **/
 static pci_ers_result_t igc_io_slot_reset(struct pci_dev *pdev)
 {
@@ -6777,7 +6784,7 @@ static pci_ers_result_t igc_io_slot_reset(struct pci_dev *pdev)
  *
  * This callback is called when the error recovery driver tells us that
  * its OK to resume normal operation. Implementation resembles the
- * second-half of the igc_resume routine.
+ * second-half of the __igc_resume routine.
  */
 static void igc_io_resume(struct pci_dev *pdev)
 {
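For readers following along, the locking pattern the test patch above relies on can be modelled in plain userspace C. This is only a sketch under assumptions: it presumes the hang is a self-deadlock on a non-recursive lock (which the thread proposes but has not confirmed at this point), and every toy_* name below is hypothetical, not a kernel API. The toy lock reports -EDEADLK on a second acquire, where a real mutex would simply block forever.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a non-recursive lock such as the RTNL mutex.
 * Assumption for illustration only: a second acquire by the same
 * context returns -EDEADLK instead of hanging like a real mutex. */
struct toy_lock {
	bool held;
};

static struct toy_lock toy_rtnl;

int toy_lock_acquire(struct toy_lock *lock)
{
	if (lock->held)
		return -EDEADLK;	/* a real mutex would block here forever */
	lock->held = true;
	return 0;
}

void toy_lock_release(struct toy_lock *lock)
{
	assert(lock->held);	/* only a held lock may be released */
	lock->held = false;
}

/* Shaped like __igc_resume(dev, rpm) in the patch: the lock is taken
 * only when the caller is NOT the runtime-PM path (rpm == false). */
int toy_resume(bool rpm)
{
	int err;

	if (!rpm) {
		err = toy_lock_acquire(&toy_rtnl);
		if (err)
			return err;
	}

	/* ... the device would be reopened and reattached here ... */

	if (!rpm)
		toy_lock_release(&toy_rtnl);
	return 0;
}

/* Hypothetical caller that already holds the lock and then triggers a
 * resume, as a runtime-PM resume issued under the lock would. */
int toy_locked_caller(bool rpm)
{
	int err = toy_lock_acquire(&toy_rtnl);

	if (err)
		return err;
	err = toy_resume(rpm);
	toy_lock_release(&toy_rtnl);
	return err;
}
```

In this model, toy_locked_caller(false) corresponds to the pre-patch behaviour (the resume path tries to take a lock its caller already holds), while toy_locked_caller(true) corresponds to the patched runtime-resume path, which skips the lock. toy_resume(false) on its own stands for an ordinary system resume, which still takes and drops the lock itself.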
On Wed, Dec 01, 2021 at 10:57:31AM -0800, Vinicius Costa Gomes wrote:
> Inspired by: https://bugzilla.kernel.org/show_bug.cgi?id=215129
This changelog does not say anything at all, sorry. Please explain what is happening here as the kernel documentation asks you to.
> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> Just to see if it's indeed the same problem as the bug report above.
<formletter>
This is not the correct way to submit patches for inclusion in the stable kernel tree. Please read: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html for how to do this properly.
</formletter>
Greg KH <gregkh@linuxfoundation.org> writes:
> On Wed, Dec 01, 2021 at 10:57:31AM -0800, Vinicius Costa Gomes wrote:
>> Inspired by: https://bugzilla.kernel.org/show_bug.cgi?id=215129
> This changelog does not say anything at all, sorry. Please explain what is happening here as the kernel documentation asks you to.
It was intended as just some patch for the reporter to try while narrowing the problem down. Sorry for the noise.
I should have thought about removing stable from CC.
Thank you,
Hi, this is your Linux kernel regression tracker speaking.
On 24.11.21 08:28, Stefan Dietrich wrote:
> Summary: When attempting to bring up or shut down a NIC manually or via network-manager under 5.15, the machine reboots or freezes.
> Occurs with: 5.15.4-051504-generic and earlier 5.15 mainline (https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.4/) as well as liquorix flavours. Does not occur with: 5.14 and 5.13 (both with various flavours)
Thx for the report. Small detail: you CCed the stable list, but this afaics is a mainline regression. Likely one in the network subsystem, so it might be good to get the mailing list where the network developers hang out in the loop. But as Greg already said: a bisection would help a lot to find the root cause and thus the developers that need to take care of this.
Anyway, to be sure this issue doesn't fall through the cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression tracking bot:
#regzbot ^introduced v5.14..v5.15
#regzbot ignore-activity
Ciao, Thorsten, your Linux kernel regression tracker.
P.S.: If you want to know more about regzbot, check out its web-interface, the getting started guide, and/or the reference documentation:
https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md
The last two documents will explain how you can interact with regzbot yourself if you want to.
Hint for the reporter: when reporting a regression it's in your interest to tell #regzbot about it in the report, as that will ensure the regression gets on the radar of regzbot and the regression tracker. That's in your interest, as they will make sure the report won't fall through the cracks unnoticed.
Hint for developers: you normally don't need to care about regzbot, just fix the issue as you normally would. Just remember to include a 'Link:' tag to the report in the commit message, as explained in Documentation/process/submitting-patches.rst. That aspect was recently made more explicit in commit 1f57bd42b77c: https://git.kernel.org/linus/1f57bd42b77c
P.S.: As a Linux kernel regression tracker I'm getting a lot of reports on my table. I can only look briefly into most of them. Unfortunately therefore I sometimes will get things wrong or miss something important. I hope that's not the case here; if you think it is, don't hesitate to tell me about it in a public reply. That's in everyone's interest, as what I wrote above might be misleading to everyone reading this; any suggestion I gave might thus send someone reading this down the wrong rabbit hole, which none of us wants.
BTW, I have no personal interest in this issue, which is tracked using regzbot, my Linux kernel regression tracking bot (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting this mail to get things rolling again and hence don't need to be CCed on all further activities wrt this regression.
Hi Thorsten,
thanks for the pointer. netdev should be in the loop now.
Stefan
On Wed, 2021-11-24 at 08:28 +0100, Stefan Dietrich wrote:
Summary: When attempting to rise or shut down a NIC manually or via network-manager under 5.15, the machine reboots or freezes.
Occurs with: 5.15.4-051504-generic and earlier 5.15 mainline ( https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.4/) as well as liquorix flavours. Does not occur with: 5.14 and 5.13 (both with various flavours)
Hi all,
I'm experiencing a severe bug that causes the machine to reboot or freeze when trying to login and/or rise/shutdown a NIC. Here's a brief description of scenarios I've tested:
Scenario 1: enp6s0 managed manually using /etc/networking/interfaces, DHCP
a. Issuing ifdown enp6s0 in terminal will throw "/etc/resolvconf/update.d/libc: Warning: /etc/resolv.conf is not a symbolic link to /run/resolvconf/resolv.conf" and cause the machine to reboot after ~10s of showing a blinking cursor
b. Issuing shutdown -h now or trying to shutdown/reboot machine via GUI: shutdown will stop on "stop job is running for ifdown enp6s0" and after approx. 10..15s the countdown freezes. Repeated ALT-SysReq-REISUB does not reboot the machine, a hard reset is required.
--
Scenario 2: enp6s0 managed manually using /etc/networking/interfaces, STATIC
a. Issuing ifdown enp6s0 in terminal will throw "send_packet: Operation not permitted dhclient.c:3010: Failed to send 300 byte long packet over fallback interface." and cause the machine to reboot after ~10s of blinking cursor.
b. Issuing shutdown -h now or trying to shutdown or reboot machine via GUI: shutdown will stop on "stop job is running for ifdown enp6s0" and after approx. 10..15s the countdown freezes. Repeated ALT-SysReq-REISUB does not reboot the machine, a hard reset is required.
--
Scenario 3: enp6s0 managed by network manager
a. After booting and logging in either via GUI or TTY, the display will stay blank and only show a blinking cursor and then freeze after 5..10s. ALT-SysReq-REISUB does not reboot the machine, a hard reset is required.
--
Here's a snippet from the journal for Scenario 1a:
Nov 21 10:39:25 computer sudo[5606]: user : TTY=pts/0 ; PWD=/home/user ; USER=root ; COMMAND=/usr/sbin/ifdown enp6s0
Nov 21 10:39:25 computer sudo[5606]: pam_unix(sudo:session): session opened for user root by (uid=0)
-- Reboot --
Nov 21 10:40:14 computer systemd-journald[478]: Journal started
--
I'm running Alder Lake i9 12900K but I have E-cores disabled in BIOS. Here are some more specs with working kernel:
$ inxi -bxz
System:    Kernel: 5.14.0-19.2-liquorix-amd64 x86_64 bits: 64 compiler: N/A Desktop: Xfce 4.16.3
           Distro: Ubuntu 20.04.3 LTS (Focal Fossa)
Machine:   Type: Desktop System: ASUS product: N/A v: N/A serial: N/A
           Mobo: ASUSTeK model: ROG STRIX Z690-A GAMING WIFI D4 v: Rev 1.xx serial: <filter>
           UEFI [Legacy]: American Megatrends v: 0707 date: 11/10/2021
CPU:       8-Core: 12th Gen Intel Core i9-12900K type: MT MCP arch: N/A speed: 5381 MHz max: 3201 MHz
Graphics:  Device-1: NVIDIA vendor: Gigabyte driver: nvidia v: 470.86 bus ID: 01:00.0
           Display: server: X.Org 1.20.11 driver: nvidia tty: N/A
           Message: Unable to show advanced data. Required tool glxinfo missing.
Network:   Device-1: Intel vendor: ASUSTeK driver: igc v: kernel port: 4000 bus ID: 06:00.0
Please advise how I may assist in debugging!
Thanks.