On 20/04/2024 8:59, Wang Yugui wrote:
Hi,
I'm announcing the release of the 6.6.28 kernel.
All users of the 6.6 kernel series must upgrade.
The updated 6.6.y git tree can be found at:
	git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.6.y
and can be browsed at the normal kernel.org git web browser:
	https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git%3Ba=summa...
Linux 6.6.28 failed to boot with the following panic (*1) on a server with a Mellanox CX-6 VPI NIC, while 6.6.27 and 6.1.87 boot fine.
After reverting 'net/mlx5: Restore mistakenly dropped parts in register devlink flow', Linux boots fine.
There is a similar discussion on the netdev ML [1]. In short, it seems the following patch is missing from stable, and it is a prerequisite for the bad patch: 0553e753ea9e ("net/mlx5: E-switch, store eswitch pointer before registering devlink_param").
Wang, can you test it out please?
Thanks,
Shay
[1] https://lore.kernel.org/netdev/20240419162842.69433-1-oxana@cloudflare.com/T...
There is already a patch (*2) upstream, but it is not yet in queue-6.6 (for the upcoming 6.6.29).
*1 panic info:
[ 15.114364] BUG: unable to handle page fault for address: 0000000000001118
[ 15.114815] infiniband bnxt_re0: Device registered with IB successfully
[ 15.114822] #PF: supervisor read access in kernel mode
[ 15.134119] #PF: error_code(0x0000) - not-present page
[ 15.139652] PGD 0 P4D 0
[ 15.142553] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 15.143055] infiniband bnxt_re1: Device registered with IB successfully
[ 15.147233] CPU: 1 PID: 1253 Comm: kworker/1:4 Not tainted 6.6.28-1.el7.x86_64 #1
[ 15.147236] Hardware name: Dell Inc. PowerEdge T640/0TWW5Y, BIOS 2.21.0 12/11/2023
[ 15.147238] Workqueue: events work_for_cpu_fn
[ 15.174498] RIP: 0010:esw_port_metadata_get+0x19/0x30 [mlx5_core]
[ 15.181056] Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 48 89 d3 e8 ce 28 9a cc 48 8b 80 b0 09 00 00 <8b> 80 18 11 00 00 88 03 31 c0 80 23 01 5b c3 cc cc cc cc 0f 1f 40
[ 15.200401] RSP: 0000:ffff9ec05bf1fb98 EFLAGS: 00010286
[ 15.205930] RAX: 0000000000000000 RBX: ffff9ec05bf1fbe4 RCX: 0000000000000028
[ 15.213364] RDX: ffff9ec05bf1fbe4 RSI: 0000000000000013 RDI: ffff8bdd1d696000
[ 15.220801] RBP: ffffffffc1134c60 R08: 0000000000000000 R09: 0000000000000000
[ 15.228235] R10: ffff9ec05bf1fbf8 R11: 0000000000001000 R12: ffff8bdd1d696000
[ 15.235671] R13: ffff8bdd9541c720 R14: 0000000000000000 R15: 0000000000000000
[ 15.243098] FS: 0000000000000000(0000) GS:ffff8c3b7ea00000(0000) knlGS:0000000000000000
[ 15.251480] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 15.257520] CR2: 0000000000001118 CR3: 00000004f9220003 CR4: 00000000007706e0
[ 15.264955] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 15.272383] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 15.279800] PKRU: 55555554
[ 15.282790] Call Trace:
[ 15.285523] <TASK>
[ 15.287905] ? __die_body+0x1e/0x60
[ 15.291675] ? page_fault_oops+0x151/0x490
[ 15.296050] ? __update_idle_core+0x27/0xc0
[ 15.300505] ? exc_page_fault+0x6b/0x150
[ 15.304700] ? asm_exc_page_fault+0x26/0x30
[ 15.309149] ? esw_port_metadata_get+0x19/0x30 [mlx5_core]
[ 15.315066] ? esw_port_metadata_get+0x12/0x30 [mlx5_core]
[ 15.320940] devlink_nl_param_fill.constprop.23+0x88/0x5d0
[ 15.326679] ? __alloc_skb+0x87/0x190
[ 15.330594] ? __kmalloc_node_track_caller+0x55/0x130
[ 15.335897] ? __kmalloc_node_track_caller+0x55/0x130
[ 15.341196] ? kmalloc_reserve+0x65/0xf0
[ 15.345370] ? __alloc_skb+0xd9/0x190
[ 15.349280] devlink_param_notify.constprop.20+0x72/0xd0
[ 15.354845] devl_params_register+0x150/0x250
[ 15.359456] esw_offloads_init+0x181/0x1a0 [mlx5_core]
[ 15.364967] mlx5_eswitch_init+0x4be/0x6e0 [mlx5_core]
[ 15.370471] mlx5_init_once+0xf0/0x550 [mlx5_core]
[ 15.375601] mlx5_init_one_devl_locked+0x7a/0x1d0 [mlx5_core]
[ 15.381676] mlx5_init_one+0x2e/0x60 [mlx5_core]
[ 15.386616] probe_one+0x2b6/0x410 [mlx5_core]
[ 15.391382] local_pci_probe+0x45/0xa0
[ 15.395367] work_for_cpu_fn+0x17/0x30
[ 15.399345] process_scheduled_works+0x8a/0x380
[ 15.404102] worker_thread+0x165/0x2d0
[ 15.408082] ? __pfx_worker_thread+0x10/0x10
[ 15.412578] kthread+0xf2/0x120
[ 15.415952] ? __pfx_kthread+0x10/0x10
[ 15.419928] ret_from_fork+0x31/0x40
[ 15.423724] ? __pfx_kthread+0x10/0x10
[ 15.427692] ret_from_fork_asm+0x1b/0x30
[ 15.431827] </TASK>
[ 15.434218] Modules linked in: xor bnxt_re zstd_compress raid6_pq ib_uverbs sd_mod ib_core t10_pi mlx5_core(+) pci_hyperv_intf mlxfw ahci libahci bnx2x mpi3mr psample i40e libata tls bnxt_en megaraid_sas scsi_transport_sas crc32c_intel mgag200 mdio i2c_algo_bit wmi dm_mirror dm_region_hash dm_log dm_mod
[ 15.461684] CR2: 0000000000001118
[ 15.465213] ---[ end trace 0000000000000000 ]---
[ 15.476059] pstore: backend (erst) writing error (-28)
[ 15.481415] RIP: 0010:esw_port_metadata_get+0x19/0x30 [mlx5_core]
[ 15.487856] Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 48 89 d3 e8 ce 28 9a cc 48 8b 80 b0 09 00 00 <8b> 80 18 11 00 00 88 03 31 c0 80 23 01 5b c3 cc cc cc cc 0f 1f 40
[ 15.507043] RSP: 0000:ffff9ec05bf1fb98 EFLAGS: 00010286
[ 15.512493] RAX: 0000000000000000 RBX: ffff9ec05bf1fbe4 RCX: 0000000000000028
[ 15.519852] RDX: ffff9ec05bf1fbe4 RSI: 0000000000000013 RDI: ffff8bdd1d696000
[ 15.527209] RBP: ffffffffc1134c60 R08: 0000000000000000 R09: 0000000000000000
[ 15.534568] R10: ffff9ec05bf1fbf8 R11: 0000000000001000 R12: ffff8bdd1d696000
[ 15.541934] R13: ffff8bdd9541c720 R14: 0000000000000000 R15: 0000000000000000
[ 15.549299] FS: 0000000000000000(0000) GS:ffff8c3b7ea00000(0000) knlGS:0000000000000000
[ 15.557618] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 15.563607] CR2: 0000000000001118 CR3: 00000004f9220003 CR4: 00000000007706e0
[ 15.570981] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 15.578356] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 15.585733] PKRU: 55555554
[ 15.588679] Kernel panic - not syncing: Fatal exception
[ 15.594163] Kernel Offset: 0xbc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
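As a reading aid for the trace above: the bytes at the faulting RIP (the ones starting at the <8b> marker in the "Code:" line) can be decoded by hand to see why CR2 is 0x1118. The following is a minimal Python sketch, not part of the original report. It assumes the six bytes from the marker onward form one x86-64 instruction (opcode 0x8b, a ModRM byte, and a 32-bit displacement), and the variable names are made up for illustration.

```python
import struct

# Six bytes starting at the faulting RIP, copied from the "Code:" line
# (<8b> 80 18 11 00 00). Assumption: this is a single MOV r32, r/m32.
faulting = bytes.fromhex("8b8018110000")

opcode = faulting[0]            # 0x8b = MOV r32, r/m32
modrm = faulting[1]             # 0x80
mod = modrm >> 6                # 2 -> memory operand with 32-bit displacement
reg = (modrm >> 3) & 0x7        # 0 -> destination register EAX
rm = modrm & 0x7                # 0 -> base register RAX

# Little-endian signed 32-bit displacement that follows the ModRM byte.
disp = struct.unpack("<i", faulting[2:6])[0]

rax = 0x0000000000000000        # RAX value from the register dump
cr2 = 0x0000000000001118        # faulting address from the register dump

fault_addr = rax + disp
print(f"mov eax, [rax+{disp:#x}] with rax={rax:#x} -> access at {fault_addr:#x}")
print("matches CR2:", fault_addr == cr2)
```

Under these assumptions the instruction reads a 32-bit field at offset 0x1118 from a pointer held in RAX, and RAX is 0, so the access lands at 0x1118. That looks like a NULL structure pointer being dereferenced inside esw_port_metadata_get(), which would be consistent with the eswitch pointer not yet being stored when the devlink params are registered.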
*2
From bf729988303a27833a86acb561f42b9a3cc12728 Mon Sep 17 00:00:00 2001
From: Shay Drory shayd@nvidia.com
Date: Thu, 11 Apr 2024 14:54:41 +0300
Subject: [PATCH] net/mlx5: Restore mistakenly dropped parts in register devlink flow
Fixes: c6e77aa9dd82 ("net/mlx5: Register devlink first under devlink lock")
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2024/04/20