On Wed, Sep 22, 2021 at 09:21:48AM +0300, Leon Romanovsky wrote:
On Tue, Sep 21, 2021 at 10:22:57PM +0000, Patrick.Mclean@sony.com wrote:
On Mon, Sep 20, 2021 at 08:22:44PM +0000, Patrick.Mclean@sony.com wrote:
In 5.10 stable kernels since 5.10.65 certain mlx5 cards are no longer usable (relevant dmesg logs and lspci output are pasted below).
Bisecting tracks the problem down to this commit: https://urldefense.com/v3/__https://git.kernel.org/pub/scm/linux/kernel/git/...
Here is how lspci -nn identifies the cards:
41:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
41:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
Here are the relevant dmesg logs:
[ 13.409473] mlx5_core 0000:41:00.0: firmware version: 16.31.1014
[ 13.415944] mlx5_core 0000:41:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 13.707425] mlx5_core 0000:41:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[ 13.718221] mlx5_core 0000:41:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[ 13.740607] mlx5_core 0000:41:00.0: Port module event: module 0, Cable plugged
[ 13.759857] mlx5_core 0000:41:00.0: mlx5_pcie_event:294:(pid 586): PCIe slot advertised sufficient power (75W).
[ 17.986973] mlx5_core 0000:41:00.0: E-Switch: cleanup
[ 18.686204] mlx5_core 0000:41:00.0: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
[ 18.701352] mlx5_core: probe of 0000:41:00.0 failed with error -22
[ 18.727364] mlx5_core 0000:41:00.1: firmware version: 16.31.1014
[ 18.743853] mlx5_core 0000:41:00.1: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 19.015349] mlx5_core 0000:41:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[ 19.025157] mlx5_core 0000:41:00.1: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[ 19.053569] mlx5_core 0000:41:00.1: Port module event: module 1, Cable unplugged
[ 19.062093] mlx5_core 0000:41:00.1: mlx5_pcie_event:294:(pid 591): PCIe slot advertised sufficient power (75W).
[ 22.826932] mlx5_core 0000:41:00.1: E-Switch: cleanup
[ 23.544747] mlx5_core 0000:41:00.1: init_one:1371:(pid 803): mlx5_load_one failed with error code -22
[ 23.555071] mlx5_core: probe of 0000:41:00.1 failed with error -22
Please let me know if I can provide any further information.
If you revert that single change, do things work properly?
Yes, things work properly after reverting that single change (tested with 5.10.67).
The stable@ kernel is missing commit 3d347b1b19da ("net/mlx5: Add support for devlink traps in mlx5 core driver"), which added mlx5 devlink callbacks (.trap_init and .trap_fini).
Ok, will go revert this now, thanks for confirming it and letting me know.
I don't know why the commit that you reverted was added to stable@ in the first place. It doesn't fix any bug and has no Fixes tag.
Looks like it was brought in as a dependency for another fix that required it, as the revert was not clean and I had to do it "by hand".
thanks,
greg k-h