In 5.10 stable kernels since 5.10.65 certain mlx5 cards are no longer usable (relevant dmesg logs and lspci output are pasted below).
Bisecting tracks the problem down to this commit: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=l...
Here is how lspci -nn identifies the cards:
41:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
41:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
Here are the relevant dmesg logs:
[ 13.409473] mlx5_core 0000:41:00.0: firmware version: 16.31.1014
[ 13.415944] mlx5_core 0000:41:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 13.707425] mlx5_core 0000:41:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[ 13.718221] mlx5_core 0000:41:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[ 13.740607] mlx5_core 0000:41:00.0: Port module event: module 0, Cable plugged
[ 13.759857] mlx5_core 0000:41:00.0: mlx5_pcie_event:294:(pid 586): PCIe slot advertised sufficient power (75W).
[ 17.986973] mlx5_core 0000:41:00.0: E-Switch: cleanup
[ 18.686204] mlx5_core 0000:41:00.0: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
[ 18.701352] mlx5_core: probe of 0000:41:00.0 failed with error -22
[ 18.727364] mlx5_core 0000:41:00.1: firmware version: 16.31.1014
[ 18.743853] mlx5_core 0000:41:00.1: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 19.015349] mlx5_core 0000:41:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[ 19.025157] mlx5_core 0000:41:00.1: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[ 19.053569] mlx5_core 0000:41:00.1: Port module event: module 1, Cable unplugged
[ 19.062093] mlx5_core 0000:41:00.1: mlx5_pcie_event:294:(pid 591): PCIe slot advertised sufficient power (75W).
[ 22.826932] mlx5_core 0000:41:00.1: E-Switch: cleanup
[ 23.544747] mlx5_core 0000:41:00.1: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
[ 23.555071] mlx5_core: probe of 0000:41:00.1 failed with error -22
Please let me know if I can provide any further information.
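For anyone triaging similar reports, a minimal sketch (not part of the original report) of pulling the failed probes out of a dmesg capture and turning the numeric code into its errno name. The function name probe_failures and the regular expression are assumptions made for illustration; only the two sample log lines come from the report above.

```python
import errno
import re

# Matches lines such as:
#   [ 18.701352] mlx5_core: probe of 0000:41:00.0 failed with error -22
PROBE_FAIL = re.compile(
    r"(?P<driver>\w+): probe of (?P<dev>[0-9a-f:.]+) "
    r"failed with error (?P<err>-?\d+)"
)


def probe_failures(log_text):
    """Return (driver, device, errno name) for each failed probe in log_text."""
    results = []
    for match in PROBE_FAIL.finditer(log_text):
        code = abs(int(match.group("err")))
        # Fall back to the bare number if the code has no symbolic name.
        name = errno.errorcode.get(code, str(code))
        results.append((match.group("driver"), match.group("dev"), name))
    return results


sample = """\
[ 18.701352] mlx5_core: probe of 0000:41:00.0 failed with error -22
[ 23.555071] mlx5_core: probe of 0000:41:00.1 failed with error -22
"""

for driver, dev, name in probe_failures(sample):
    print(f"{driver} {dev}: {name}")
```

Error -22 corresponds to EINVAL, so both ports are rejected with "invalid argument" rather than, say, a timeout or an allocation failure.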
On Mon, Sep 20, 2021 at 08:22:44PM +0000, Patrick.Mclean@sony.com wrote:
> In 5.10 stable kernels since 5.10.65 certain mlx5 cards are no longer usable
> [...]
If you revert that single change, do things work properly?
Do newer kernels (5.14, 5.15-rc2) work properly for you as well?
thanks,
greg k-h
> If you revert that single change, do things work properly?
Yes, things work properly after reverting that single change (tested with 5.10.67).
> Do newer kernels (5.14, 5.15-rc2) work properly for you as well?
We tested 5.14.6, and it works as expected.
On Tue, Sep 21, 2021 at 10:22:57PM +0000, Patrick.Mclean@sony.com wrote:
> > If you revert that single change, do things work properly?
> Yes, things work properly after reverting that single change (tested with 5.10.67).
The stable@ kernel is missing commit 3d347b1b19da ("net/mlx5: Add support for devlink traps in mlx5 core driver"), which added mlx5 devlink callbacks (.trap_init and .trap_fini).
I don't know why the commit that you reverted was added to stable@ in the first place. It doesn't fix any bug and has no Fixes tag.
Thanks
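The thread does not spell out how the missing callbacks turn into error -22. One possible reading, sketched here as a toy model in Python and not as kernel code: a registration step checks that the driver's callback table provides the hooks it needs and returns -EINVAL when one is absent, which then aborts the whole load. Every name below (Ops, register_traps, load_one) is hypothetical; only the callback names trap_init and trap_fini and the value -22 come from the thread.

```python
import errno
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Ops:
    """Hypothetical stand-in for a driver's callback table."""
    trap_init: Optional[Callable[[], int]] = None
    trap_fini: Optional[Callable[[], None]] = None


def register_traps(ops: Ops) -> int:
    """Hypothetical registration step: refuse if a needed callback is absent."""
    if ops.trap_init is None or ops.trap_fini is None:
        return -errno.EINVAL
    return ops.trap_init()


def load_one(ops: Ops) -> int:
    """Hypothetical load path: a failing step aborts the whole probe."""
    err = register_traps(ops)
    if err:
        return err
    return 0


# Backport without the commit that adds the callbacks (assumed situation).
without_callbacks = Ops()
# Tree that already carries the callbacks.
with_callbacks = Ops(trap_init=lambda: 0, trap_fini=lambda: None)

print(load_one(without_callbacks))  # -22
print(load_one(with_callbacks))     # 0
```

Under that assumption, a backport that starts registering traps without also bringing in the commit that supplies the callbacks would fail on every probe, which would be consistent with the logs in the original report.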
On Wed, Sep 22, 2021 at 09:21:48AM +0300, Leon Romanovsky wrote:
> > > If you revert that single change, do things work properly?
> > Yes, things work properly after reverting that single change (tested with 5.10.67).
> The stable@ kernel is missing commit 3d347b1b19da ("net/mlx5: Add support for devlink traps in mlx5 core driver"), which added mlx5 devlink callbacks (.trap_init and .trap_fini).
Ok, will go revert this now, thanks for confirming it and letting me know.
> I don't know why the commit that you reverted was added to stable@ in the first place. It doesn't fix any bug and has no Fixes tag.
Looks like it was brought in as a dependency for another fix that required it, as the revert was not clean and I had to do it "by hand".
thanks,
greg k-h
linux-stable-mirror@lists.linaro.org