[PATCH] vdpa/mlx5: Fix crash on shutdown for when no ndev exists

List overview All Threads
Download

newer

older

[PATCH 0/2] tun/tap: set sk_uid...

stable-rc/linux-5.10.y baseline:...

Dragos Tatulea

26 Jul 2023 26 Jul '23

7:07 p.m.

The ndev was accessed on shutdown without a check if it actually exists. This triggered the crash pasted below. This patch simply adds a check before using ndev.

BUG: kernel NULL pointer dereference, address: 0000000000000300 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0-rc2_for_upstream_min_debug_2023_07_17_15_05 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286 RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: 0000000000000017 RDX: 0000000000000001 RSI: 0000000000000212 RDI: ffff888109d28000 RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: ffff888109d28000 R13: ffff888109d28080 R14: 00000000fee1dead R15: 0000000000000000 FS: 00007f4969e0be40(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000300 CR3: 00000001051cd006 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die+0x20/0x60 ? page_fault_oops+0x14c/0x3c0 ? exc_page_fault+0x75/0x140 ? asm_exc_page_fault+0x22/0x30 ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] device_shutdown+0x13e/0x1e0 kernel_restart+0x36/0x90 __do_sys_reboot+0x141/0x210 ? vfs_writev+0xcd/0x140 ? handle_mm_fault+0x161/0x260 ? do_writev+0x6b/0x110 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f496990fb56 RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: 00000000000000a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f496990fb56 RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: 00007fffc7bddf10 R14: 0000000000000000 R15: 00007fffc7bde2b8 </TASK> CR2: 0000000000000300 ---[ end trace 0000000000000000 ]---

Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing") Signed-off-by: Dragos Tatulea dtatulea@nvidia.com --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 9138ef2fb2c8..e2e7ebd71798 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct auxiliary_device *auxdev) mgtdev = auxiliary_get_drvdata(auxdev); ndev = mgtdev->ndev;

- free_irqs(ndev); + if (ndev) + free_irqs(ndev); }

static const struct auxiliary_device_id mlx5v_id_table[] = {

-- 2.41.0

Show replies by date

kernel test robot

26 Jul 26 Jul

7:10 p.m.

Hi,

Thanks for your patch.

FYI: kernel test robot notices the stable kernel rule is not satisfied.

Rule: 'Cc: stable@vger.kernel.org' or 'commit <sha1> upstream.' Subject: [PATCH] vdpa/mlx5: Fix crash on shutdown for when no ndev exists Link: https://lore.kernel.org/stable/20230726190744.14143-1-dtatulea%40nvidia.com

The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html

-- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki

Michael S. Tsirkin

7:26 p.m.

On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea wrote:

...

The ndev was accessed on shutdown without a check if it actually exists. This triggered the crash pasted below. This patch simply adds a check before using ndev.

BUG: kernel NULL pointer dereference, address: 0000000000000300 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0-rc2_for_upstream_min_debug_2023_07_17_15_05 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286 RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: 0000000000000017 RDX: 0000000000000001 RSI: 0000000000000212 RDI: ffff888109d28000 RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: ffff888109d28000 R13: ffff888109d28080 R14: 00000000fee1dead R15: 0000000000000000 FS: 00007f4969e0be40(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000300 CR3: 00000001051cd006 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace:

<TASK> ? __die+0x20/0x60 ? page_fault_oops+0x14c/0x3c0 ? exc_page_fault+0x75/0x140 ? asm_exc_page_fault+0x22/0x30 ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] device_shutdown+0x13e/0x1e0 kernel_restart+0x36/0x90 __do_sys_reboot+0x141/0x210 ? vfs_writev+0xcd/0x140 ? handle_mm_fault+0x161/0x260 ? do_writev+0x6b/0x110 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f496990fb56 RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: 00000000000000a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f496990fb56 RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: 00007fffc7bddf10 R14: 0000000000000000 R15: 00007fffc7bde2b8 </TASK> CR2: 0000000000000300 ---[ end trace 0000000000000000 ]---

Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing") Signed-off-by: Dragos Tatulea dtatulea@nvidia.com

drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 9138ef2fb2c8..e2e7ebd71798 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct auxiliary_device *auxdev) mgtdev = auxiliary_get_drvdata(auxdev); ndev = mgtdev->ndev;

free_irqs(ndev);
if (ndev)
free_irqs(ndev);
}

something I don't get: irqs are allocated in mlx5_vdpa_dev_add why are they not freed in mlx5_vdpa_dev_del?

this is what's creating all this mess.

...

static const struct auxiliary_device_id mlx5v_id_table[] = {

2.41.0

Dragos Tatulea

27 Jul 27 Jul

4:02 p.m.

On Wed, 2023-07-26 at 15:26 -0400, Michael S. Tsirkin wrote:

...

On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea wrote:

...
The ndev was accessed on shutdown without a check if it actually exists. This triggered the crash pasted below. This patch simply adds a check before using ndev.

BUG: kernel NULL pointer dereference, address: 0000000000000300 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0- rc2_for_upstream_min_debug_2023_07_17_15_05 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0- gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286 RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: 0000000000000017 RDX: 0000000000000001 RSI: 0000000000000212 RDI: ffff888109d28000 RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: ffff888109d28000 R13: ffff888109d28080 R14: 00000000fee1dead R15: 0000000000000000 FS: 00007f4969e0be40(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000300 CR3: 00000001051cd006 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die+0x20/0x60 ? page_fault_oops+0x14c/0x3c0 ? exc_page_fault+0x75/0x140 ? asm_exc_page_fault+0x22/0x30 ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] device_shutdown+0x13e/0x1e0 kernel_restart+0x36/0x90 __do_sys_reboot+0x141/0x210 ? vfs_writev+0xcd/0x140 ? handle_mm_fault+0x161/0x260 ? do_writev+0x6b/0x110 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f496990fb56 RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: 00000000000000a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f496990fb56 RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: 00007fffc7bddf10 R14: 0000000000000000 R15: 00007fffc7bde2b8 </TASK> CR2: 0000000000000300 ---[ end trace 0000000000000000 ]---

Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing") Signed-off-by: Dragos Tatulea dtatulea@nvidia.com

drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 9138ef2fb2c8..e2e7ebd71798 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct auxiliary_device *auxdev) mgtdev = auxiliary_get_drvdata(auxdev); ndev = mgtdev->ndev; - free_irqs(ndev); + if (ndev) + free_irqs(ndev); }

something I don't get: irqs are allocated in mlx5_vdpa_dev_add why are they not freed in mlx5_vdpa_dev_del?

That is a good point. I will try to find out. I also don't get why free_irq is called in the vdpa dev .free op instead of mlx5_vdpa_dev_del. Maybe I can change that in a different refactoring.

...

this is what's creating all this mess.

Not quite: mlx5_vdpa_dev_del (which is a .dev_del of for struct vdpa_mgmtdev_ops) doesn't get called on shutdown. At least that's what I see. Or am I missing something?

...

...
static const struct auxiliary_device_id mlx5v_id_table[] = {

2.41.0

Michael S. Tsirkin

4:28 p.m.

On Thu, Jul 27, 2023 at 04:02:16PM +0000, Dragos Tatulea wrote:

...

On Wed, 2023-07-26 at 15:26 -0400, Michael S. Tsirkin wrote:

...
On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea wrote:

...
The ndev was accessed on shutdown without a check if it actually exists. This triggered the crash pasted below. This patch simply adds a check before using ndev.

BUG: kernel NULL pointer dereference, address: 0000000000000300 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0- rc2_for_upstream_min_debug_2023_07_17_15_05 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0- gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286 RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: 0000000000000017 RDX: 0000000000000001 RSI: 0000000000000212 RDI: ffff888109d28000 RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: ffff888109d28000 R13: ffff888109d28080 R14: 00000000fee1dead R15: 0000000000000000 FS: 00007f4969e0be40(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000300 CR3: 00000001051cd006 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die+0x20/0x60 ? page_fault_oops+0x14c/0x3c0 ? exc_page_fault+0x75/0x140 ? asm_exc_page_fault+0x22/0x30 ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] device_shutdown+0x13e/0x1e0 kernel_restart+0x36/0x90 __do_sys_reboot+0x141/0x210 ? vfs_writev+0xcd/0x140 ? handle_mm_fault+0x161/0x260 ? do_writev+0x6b/0x110 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f496990fb56 RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: 00000000000000a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f496990fb56 RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: 00007fffc7bddf10 R14: 0000000000000000 R15: 00007fffc7bde2b8 </TASK> CR2: 0000000000000300 ---[ end trace 0000000000000000 ]---

Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing") Signed-off-by: Dragos Tatulea dtatulea@nvidia.com

drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 9138ef2fb2c8..e2e7ebd71798 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct auxiliary_device *auxdev) mgtdev = auxiliary_get_drvdata(auxdev); ndev = mgtdev->ndev; - free_irqs(ndev); + if (ndev) + free_irqs(ndev); }

something I don't get: irqs are allocated in mlx5_vdpa_dev_add why are they not freed in mlx5_vdpa_dev_del?

That is a good point. I will try to find out. I also don't get why free_irq is called in the vdpa dev .free op instead of mlx5_vdpa_dev_del. Maybe I can change that in a different refactoring.

as it is I have no idea whether e.g. ndev can change between these two call sites. that would make the check pointless.

...

...
this is what's creating all this mess.

Not quite: mlx5_vdpa_dev_del (which is a .dev_del of for struct vdpa_mgmtdev_ops) doesn't get called on shutdown. At least that's what I see. Or am I missing something?

and why do we care whether irqs are freed on shutdown?

...

...
...
static const struct auxiliary_device_id mlx5v_id_table[] = {

2.41.0

Dragos Tatulea

31 Jul 31 Jul

7:15 a.m.

On Thu, 2023-07-27 at 12:28 -0400, Michael S. Tsirkin wrote:

...

On Thu, Jul 27, 2023 at 04:02:16PM +0000, Dragos Tatulea wrote:

...
On Wed, 2023-07-26 at 15:26 -0400, Michael S. Tsirkin wrote:

...
On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea wrote:

...
The ndev was accessed on shutdown without a check if it actually exists. This triggered the crash pasted below. This patch simply adds a check before using ndev.

BUG: kernel NULL pointer dereference, address: 0000000000000300 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0- rc2_for_upstream_min_debug_2023_07_17_15_05 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0- gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286 RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: 0000000000000017 RDX: 0000000000000001 RSI: 0000000000000212 RDI: ffff888109d28000 RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: ffff888109d28000 R13: ffff888109d28080 R14: 00000000fee1dead R15: 0000000000000000 FS: 00007f4969e0be40(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000300 CR3: 00000001051cd006 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die+0x20/0x60 ? page_fault_oops+0x14c/0x3c0 ? exc_page_fault+0x75/0x140 ? asm_exc_page_fault+0x22/0x30 ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] device_shutdown+0x13e/0x1e0 kernel_restart+0x36/0x90 __do_sys_reboot+0x141/0x210 ? vfs_writev+0xcd/0x140 ? handle_mm_fault+0x161/0x260 ? do_writev+0x6b/0x110 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f496990fb56 RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: 00000000000000a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f496990fb56 RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: 00007fffc7bddf10 R14: 0000000000000000 R15: 00007fffc7bde2b8 </TASK> CR2: 0000000000000300 ---[ end trace 0000000000000000 ]---

Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing") Signed-off-by: Dragos Tatulea dtatulea@nvidia.com

drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 9138ef2fb2c8..e2e7ebd71798 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct auxiliary_device *auxdev) mgtdev = auxiliary_get_drvdata(auxdev); ndev = mgtdev->ndev; - free_irqs(ndev); + if (ndev) + free_irqs(ndev); }

something I don't get: irqs are allocated in mlx5_vdpa_dev_add why are they not freed in mlx5_vdpa_dev_del?

That is a good point. I will try to find out. I also don't get why free_irq is called in the vdpa dev .free op instead of mlx5_vdpa_dev_del. Maybe I can change that in a different refactoring.

as it is I have no idea whether e.g. ndev can change between these two call sites. that would make the check pointless.

...
...
this is what's creating all this mess.

Not quite: mlx5_vdpa_dev_del (which is a .dev_del of for struct vdpa_mgmtdev_ops) doesn't get called on shutdown. At least that's what I see. Or am I missing something?

and why do we care whether irqs are freed on shutdown?

Had to ask around a bit to find out the answer: there can be issues with kexec IRQ allocation on some platforms. It is documented here [0] for mlx5_core.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...

Thanks, Dragos

Michael S. Tsirkin

9:08 a.m.

On Mon, Jul 31, 2023 at 07:15:31AM +0000, Dragos Tatulea wrote:

...

On Thu, 2023-07-27 at 12:28 -0400, Michael S. Tsirkin wrote:

...
On Thu, Jul 27, 2023 at 04:02:16PM +0000, Dragos Tatulea wrote:

...
On Wed, 2023-07-26 at 15:26 -0400, Michael S. Tsirkin wrote:

...
On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea wrote:

...
The ndev was accessed on shutdown without a check if it actually exists. This triggered the crash pasted below. This patch simply adds a check before using ndev.

BUG: kernel NULL pointer dereference, address: 0000000000000300 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0- rc2_for_upstream_min_debug_2023_07_17_15_05 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0- gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286 RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: 0000000000000017 RDX: 0000000000000001 RSI: 0000000000000212 RDI: ffff888109d28000 RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: ffff888109d28000 R13: ffff888109d28080 R14: 00000000fee1dead R15: 0000000000000000 FS: 00007f4969e0be40(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000300 CR3: 00000001051cd006 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die+0x20/0x60 ? page_fault_oops+0x14c/0x3c0 ? exc_page_fault+0x75/0x140 ? asm_exc_page_fault+0x22/0x30 ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] device_shutdown+0x13e/0x1e0 kernel_restart+0x36/0x90 __do_sys_reboot+0x141/0x210 ? vfs_writev+0xcd/0x140 ? handle_mm_fault+0x161/0x260 ? do_writev+0x6b/0x110 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f496990fb56 RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: 00000000000000a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f496990fb56 RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: 00007fffc7bddf10 R14: 0000000000000000 R15: 00007fffc7bde2b8 </TASK> CR2: 0000000000000300 ---[ end trace 0000000000000000 ]---

Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing") Signed-off-by: Dragos Tatulea dtatulea@nvidia.com

drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 9138ef2fb2c8..e2e7ebd71798 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct auxiliary_device *auxdev) mgtdev = auxiliary_get_drvdata(auxdev); ndev = mgtdev->ndev; - free_irqs(ndev); + if (ndev) + free_irqs(ndev); }

something I don't get: irqs are allocated in mlx5_vdpa_dev_add why are they not freed in mlx5_vdpa_dev_del?

That is a good point. I will try to find out. I also don't get why free_irq is called in the vdpa dev .free op instead of mlx5_vdpa_dev_del. Maybe I can change that in a different refactoring.

as it is I have no idea whether e.g. ndev can change between these two call sites. that would make the check pointless.

...
...
this is what's creating all this mess.

Not quite: mlx5_vdpa_dev_del (which is a .dev_del of for struct vdpa_mgmtdev_ops) doesn't get called on shutdown. At least that's what I see. Or am I missing something?

and why do we care whether irqs are freed on shutdown?

Had to ask around a bit to find out the answer: there can be issues with kexec IRQ allocation on some platforms. It is documented here [0] for mlx5_core.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...

Thanks, Dragos

It's quite weird. * Some platforms requiring freeing the IRQ's in the shutdown * flow. If they aren't freed they can't be allocated after * kexec. There is no need to cleanup the mlx5_core software * contexts.

but most drivers don't have a shutdown callback how do they work then? do you know which platforms these are?

I don't really know much about why shutdown callback is even necessary. I guess this is to detect shutdown and do a faster cleanup than the slow, graceful removal, just cleaning hardware resources?

-- MST

Jason Wang

1 Aug 1 Aug

3:59 a.m.

On Mon, Jul 31, 2023 at 5:08 PM Michael S. Tsirkin mst@redhat.com wrote:

...

On Mon, Jul 31, 2023 at 07:15:31AM +0000, Dragos Tatulea wrote:

...
On Thu, 2023-07-27 at 12:28 -0400, Michael S. Tsirkin wrote:

...
On Thu, Jul 27, 2023 at 04:02:16PM +0000, Dragos Tatulea wrote:

...
On Wed, 2023-07-26 at 15:26 -0400, Michael S. Tsirkin wrote:

...
On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea wrote:

...
The ndev was accessed on shutdown without a check if it actually exists. This triggered the crash pasted below. This patch simply adds a check before using ndev.

BUG: kernel NULL pointer dereference, address: 0000000000000300 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0- rc2_for_upstream_min_debug_2023_07_17_15_05 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0- gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286 RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: 0000000000000017 RDX: 0000000000000001 RSI: 0000000000000212 RDI: ffff888109d28000 RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: ffff888109d28000 R13: ffff888109d28080 R14: 00000000fee1dead R15: 0000000000000000 FS: 00007f4969e0be40(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000300 CR3: 00000001051cd006 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace:

<TASK> ? __die+0x20/0x60 ? page_fault_oops+0x14c/0x3c0 ? exc_page_fault+0x75/0x140 ? asm_exc_page_fault+0x22/0x30 ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] device_shutdown+0x13e/0x1e0 kernel_restart+0x36/0x90 __do_sys_reboot+0x141/0x210 ? vfs_writev+0xcd/0x140 ? handle_mm_fault+0x161/0x260 ? do_writev+0x6b/0x110 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f496990fb56 RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: 00000000000000a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f496990fb56 RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: 00007fffc7bddf10 R14: 0000000000000000 R15: 00007fffc7bde2b8 </TASK> CR2: 0000000000000300 ---[ end trace 0000000000000000 ]---

Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing") Signed-off-by: Dragos Tatulea dtatulea@nvidia.com

drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 9138ef2fb2c8..e2e7ebd71798 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct auxiliary_device *auxdev) mgtdev = auxiliary_get_drvdata(auxdev); ndev = mgtdev->ndev;
  free_irqs(ndev);
  if (ndev)
          free_irqs(ndev);
}
something I don't get: irqs are allocated in mlx5_vdpa_dev_add why are they not freed in mlx5_vdpa_dev_del?
That is a good point. I will try to find out. I also don't get why free_irq is called in the vdpa dev .free op instead of mlx5_vdpa_dev_del. Maybe I can change that in a different refactoring.
as it is I have no idea whether e.g. ndev can change between these two call sites. that would make the check pointless.

...
...
this is what's creating all this mess.

Not quite: mlx5_vdpa_dev_del (which is a .dev_del of for struct vdpa_mgmtdev_ops) doesn't get called on shutdown. At least that's what I see. Or am I missing something?

and why do we care whether irqs are freed on shutdown?
Had to ask around a bit to find out the answer: there can be issues with kexec IRQ allocation on some platforms. It is documented here [0] for mlx5_core.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...

Thanks, Dragos
It's quite weird. * Some platforms requiring freeing the IRQ's in the shutdown * flow. If they aren't freed they can't be allocated after * kexec. There is no need to cleanup the mlx5_core software * contexts.

but most drivers don't have a shutdown callback how do they work then? do you know which platforms these are?

There used to be bzs that requires virtio drivers to add a shutdown to fix kexec:

https://bugzilla.redhat.com/show_bug.cgi?id=2108406

Thanks

...

I don't really know much about why shutdown callback is even necessary. I guess this is to detect shutdown and do a faster cleanup than the slow, graceful removal, just cleaning hardware resources?

-- MST

Dragos Tatulea

8:17 a.m.

On Tue, 2023-08-01 at 11:59 +0800, Jason Wang wrote:

...

On Mon, Jul 31, 2023 at 5:08 PM Michael S. Tsirkin mst@redhat.com wrote:

...
On Mon, Jul 31, 2023 at 07:15:31AM +0000, Dragos Tatulea wrote:

...
On Thu, 2023-07-27 at 12:28 -0400, Michael S. Tsirkin wrote:

...
On Thu, Jul 27, 2023 at 04:02:16PM +0000, Dragos Tatulea wrote:

...
On Wed, 2023-07-26 at 15:26 -0400, Michael S. Tsirkin wrote:

...
On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea wrote: > The ndev was accessed on shutdown without a check if it actually > exists. > This triggered the crash pasted below. This patch simply adds a > check > before using ndev. > > BUG: kernel NULL pointer dereference, address: 0000000000000300 > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > PGD 0 P4D 0 > Oops: 0000 [#1] SMP > CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0- > rc2_for_upstream_min_debug_2023_07_17_15_05 #1 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel- > 1.13.0-0- > gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 > RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] > RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286 > RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: 0000000000000017 > RDX: 0000000000000001 RSI: 0000000000000212 RDI: ffff888109d28000 > RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: 0000000000000001 > R10: 0000000000000000 R11: 0000000000000000 R12: ffff888109d28000 > R13: ffff888109d28080 R14: 00000000fee1dead R15: 0000000000000000 > FS: 00007f4969e0be40(0000) GS:ffff88852c800000(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000300 CR3: 00000001051cd006 CR4: 0000000000370eb0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > <TASK> > ? __die+0x20/0x60 > ? page_fault_oops+0x14c/0x3c0 > ? exc_page_fault+0x75/0x140 > ? asm_exc_page_fault+0x22/0x30 > ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] > device_shutdown+0x13e/0x1e0 > kernel_restart+0x36/0x90 > __do_sys_reboot+0x141/0x210 > ? vfs_writev+0xcd/0x140 > ? handle_mm_fault+0x161/0x260 > ? do_writev+0x6b/0x110 > do_syscall_64+0x3d/0x90 > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > RIP: 0033:0x7f496990fb56 > RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: > 00000000000000a9 > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f496990fb56 > RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead > RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 > R13: 00007fffc7bddf10 R14: 0000000000000000 R15: 00007fffc7bde2b8 > </TASK> > CR2: 0000000000000300 > ---[ end trace 0000000000000000 ]--- > > Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing") > Signed-off-by: Dragos Tatulea dtatulea@nvidia.com > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index 9138ef2fb2c8..e2e7ebd71798 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct > auxiliary_device > *auxdev) > mgtdev = auxiliary_get_drvdata(auxdev); > ndev = mgtdev->ndev; > > - free_irqs(ndev); > + if (ndev) > + free_irqs(ndev); > } >

something I don't get: irqs are allocated in mlx5_vdpa_dev_add why are they not freed in mlx5_vdpa_dev_del?

That is a good point. I will try to find out. I also don't get why free_irq is called in the vdpa dev .free op instead of mlx5_vdpa_dev_del. Maybe I can change that in a different refactoring.

as it is I have no idea whether e.g. ndev can change between these two call sites. that would make the check pointless.

...
...
this is what's creating all this mess.

Not quite: mlx5_vdpa_dev_del (which is a .dev_del of for struct vdpa_mgmtdev_ops) doesn't get called on shutdown. At least that's what I see. Or am I missing something?

and why do we care whether irqs are freed on shutdown?

Had to ask around a bit to find out the answer: there can be issues with kexec IRQ allocation on some platforms. It is documented here [0] for mlx5_core.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...

Thanks, Dragos

It's quite weird. * Some platforms requiring freeing the IRQ's in the shutdown * flow. If they aren't freed they can't be allocated after * kexec. There is no need to cleanup the mlx5_core software * contexts.

but most drivers don't have a shutdown callback how do they work then? do you know which platforms these are?

I don't. x86_64 is not one of them though. I will do some more digging ...

...

There used to be bzs that requires virtio drivers to add a shutdown to fix kexec:

https://bugzilla.redhat.com/show_bug.cgi?id=2108406

I don't have access to this. What is it about?

Thanks, Dragos

...

Thanks

...
I don't really know much about why shutdown callback is even necessary. I guess this is to detect shutdown and do a faster cleanup than the slow, graceful removal, just cleaning hardware resources?

-- MST

Jason Wang

2 Aug 2 Aug

2:51 a.m.

On Tue, Aug 1, 2023 at 4:17 PM Dragos Tatulea dtatulea@nvidia.com wrote:

...

On Tue, 2023-08-01 at 11:59 +0800, Jason Wang wrote:

...
On Mon, Jul 31, 2023 at 5:08 PM Michael S. Tsirkin mst@redhat.com wrote:

...
On Mon, Jul 31, 2023 at 07:15:31AM +0000, Dragos Tatulea wrote:

...
On Thu, 2023-07-27 at 12:28 -0400, Michael S. Tsirkin wrote:

...
On Thu, Jul 27, 2023 at 04:02:16PM +0000, Dragos Tatulea wrote:

...
On Wed, 2023-07-26 at 15:26 -0400, Michael S. Tsirkin wrote: > On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea wrote: > > The ndev was accessed on shutdown without a check if it actually > > exists. > > This triggered the crash pasted below. This patch simply adds a > > check > > before using ndev. > > > > BUG: kernel NULL pointer dereference, address: 0000000000000300 > > #PF: supervisor read access in kernel mode > > #PF: error_code(0x0000) - not-present page > > PGD 0 P4D 0 > > Oops: 0000 [#1] SMP > > CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0- > > rc2_for_upstream_min_debug_2023_07_17_15_05 #1 > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel- > > 1.13.0-0- > > gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 > > RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] > > RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286 > > RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: 0000000000000017 > > RDX: 0000000000000001 RSI: 0000000000000212 RDI: ffff888109d28000 > > RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: 0000000000000001 > > R10: 0000000000000000 R11: 0000000000000000 R12: ffff888109d28000 > > R13: ffff888109d28080 R14: 00000000fee1dead R15: 0000000000000000 > > FS: 00007f4969e0be40(0000) GS:ffff88852c800000(0000) > > knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 0000000000000300 CR3: 00000001051cd006 CR4: 0000000000370eb0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > Call Trace: > > <TASK> > > ? __die+0x20/0x60 > > ? page_fault_oops+0x14c/0x3c0 > > ? exc_page_fault+0x75/0x140 > > ? asm_exc_page_fault+0x22/0x30 > > ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] > > device_shutdown+0x13e/0x1e0 > > kernel_restart+0x36/0x90 > > __do_sys_reboot+0x141/0x210 > > ? vfs_writev+0xcd/0x140 > > ? handle_mm_fault+0x161/0x260 > > ? do_writev+0x6b/0x110 > > do_syscall_64+0x3d/0x90 > > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > > RIP: 0033:0x7f496990fb56 > > RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: > > 00000000000000a9 > > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f496990fb56 > > RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead > > RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: 0000000000000000 > > R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 > > R13: 00007fffc7bddf10 R14: 0000000000000000 R15: 00007fffc7bde2b8 > > </TASK> > > CR2: 0000000000000300 > > ---[ end trace 0000000000000000 ]--- > > > > Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing") > > Signed-off-by: Dragos Tatulea dtatulea@nvidia.com > > --- > > drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > index 9138ef2fb2c8..e2e7ebd71798 100644 > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct > > auxiliary_device > > *auxdev) > > mgtdev = auxiliary_get_drvdata(auxdev); > > ndev = mgtdev->ndev; > > > > - free_irqs(ndev); > > + if (ndev) > > + free_irqs(ndev); > > } > > > > something I don't get: > irqs are allocated in mlx5_vdpa_dev_add > why are they not freed in mlx5_vdpa_dev_del? > That is a good point. I will try to find out. I also don't get why free_irq is called in the vdpa dev .free op instead of mlx5_vdpa_dev_del. Maybe I can change that in a different refactoring.

as it is I have no idea whether e.g. ndev can change between these two call sites. that would make the check pointless.

...
> this is what's creating all this mess. > > Not quite: mlx5_vdpa_dev_del (which is a .dev_del of for struct vdpa_mgmtdev_ops) doesn't get called on shutdown. At least that's what I see. Or am I missing something?

and why do we care whether irqs are freed on shutdown?

Had to ask around a bit to find out the answer: there can be issues with kexec IRQ allocation on some platforms. It is documented here [0] for mlx5_core.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...

Thanks, Dragos

It's quite weird. * Some platforms requiring freeing the IRQ's in the shutdown * flow. If they aren't freed they can't be allocated after * kexec. There is no need to cleanup the mlx5_core software * contexts.

but most drivers don't have a shutdown callback how do they work then? do you know which platforms these are?

I don't. x86_64 is not one of them though. I will do some more digging ...

...
There used to be bzs that requires virtio drivers to add a shutdown to fix kexec:

https://bugzilla.redhat.com/show_bug.cgi?id=2108406

I don't have access to this. What is it about?

This bug might be more accurate:

https://bugzilla.redhat.com/show_bug.cgi?id=1820521

It's about the kexec guys (cced relevant people) wanting to add a shutdown method for virito to fix potential kexec issues.

Thanks

...

Thanks, Dragos

...
Thanks

...
I don't really know much about why shutdown callback is even necessary. I guess this is to detect shutdown and do a faster cleanup than the slow, graceful removal, just cleaning hardware resources?

-- MST

Dragos Tatulea

7:56 a.m.

On Wed, 2023-08-02 at 10:51 +0800, Jason Wang wrote:

...

On Tue, Aug 1, 2023 at 4:17 PM Dragos Tatulea dtatulea@nvidia.com wrote:

...
On Tue, 2023-08-01 at 11:59 +0800, Jason Wang wrote:

...
On Mon, Jul 31, 2023 at 5:08 PM Michael S. Tsirkin mst@redhat.com wrote:

...
On Mon, Jul 31, 2023 at 07:15:31AM +0000, Dragos Tatulea wrote:

...
On Thu, 2023-07-27 at 12:28 -0400, Michael S. Tsirkin wrote:

...
On Thu, Jul 27, 2023 at 04:02:16PM +0000, Dragos Tatulea wrote: > On Wed, 2023-07-26 at 15:26 -0400, Michael S. Tsirkin wrote: > > On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea wrote: > > > The ndev was accessed on shutdown without a check if it > > > actually > > > exists. > > > This triggered the crash pasted below. This patch simply adds > > > a > > > check > > > before using ndev. > > > > > > BUG: kernel NULL pointer dereference, address: > > > 0000000000000300 > > > #PF: supervisor read access in kernel mode > > > #PF: error_code(0x0000) - not-present page > > > PGD 0 P4D 0 > > > Oops: 0000 [#1] SMP > > > CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0- > > > rc2_for_upstream_min_debug_2023_07_17_15_05 #1 > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel- > > > 1.13.0-0- > > > gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 > > > RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] > > > RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286 > > > RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: > > > 0000000000000017 > > > RDX: 0000000000000001 RSI: 0000000000000212 RDI: > > > ffff888109d28000 > > > RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: > > > 0000000000000001 > > > R10: 0000000000000000 R11: 0000000000000000 R12: > > > ffff888109d28000 > > > R13: ffff888109d28080 R14: 00000000fee1dead R15: > > > 0000000000000000 > > > FS: 00007f4969e0be40(0000) GS:ffff88852c800000(0000) > > > knlGS:0000000000000000 > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > CR2: 0000000000000300 CR3: 00000001051cd006 CR4: > > > 0000000000370eb0 > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > 0000000000000000 > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > 0000000000000400 > > > Call Trace: > > > <TASK> > > > ? __die+0x20/0x60 > > > ? page_fault_oops+0x14c/0x3c0 > > > ? exc_page_fault+0x75/0x140 > > > ? asm_exc_page_fault+0x22/0x30 > > > ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] > > > device_shutdown+0x13e/0x1e0 > > > kernel_restart+0x36/0x90 > > > __do_sys_reboot+0x141/0x210 > > > ? vfs_writev+0xcd/0x140 > > > ? handle_mm_fault+0x161/0x260 > > > ? do_writev+0x6b/0x110 > > > do_syscall_64+0x3d/0x90 > > > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > > > RIP: 0033:0x7f496990fb56 > > > RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: > > > 00000000000000a9 > > > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: > > > 00007f496990fb56 > > > RDX: 0000000001234567 RSI: 0000000028121969 RDI: > > > fffffffffee1dead > > > RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: > > > 0000000000000000 > > > R10: 0000000000000000 R11: 0000000000000206 R12: > > > 0000000000000000 > > > R13: 00007fffc7bddf10 R14: 0000000000000000 R15: > > > 00007fffc7bde2b8 > > > </TASK> > > > CR2: 0000000000000300 > > > ---[ end trace 0000000000000000 ]--- > > > > > > Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing") > > > Signed-off-by: Dragos Tatulea dtatulea@nvidia.com > > > --- > > > drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > index 9138ef2fb2c8..e2e7ebd71798 100644 > > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct > > > auxiliary_device > > > *auxdev) > > > mgtdev = auxiliary_get_drvdata(auxdev); > > > ndev = mgtdev->ndev; > > > > > > - free_irqs(ndev); > > > + if (ndev) > > > + free_irqs(ndev); > > > } > > > > > > > something I don't get: > > irqs are allocated in mlx5_vdpa_dev_add > > why are they not freed in mlx5_vdpa_dev_del? > > > That is a good point. I will try to find out. I also don't get why > free_irq > is > called in the vdpa dev .free op instead of mlx5_vdpa_dev_del. > Maybe I > can > change > that in a different refactoring.

as it is I have no idea whether e.g. ndev can change between these two call sites. that would make the check pointless.

> > this is what's creating all this mess. > > > > > Not quite: mlx5_vdpa_dev_del (which is a .dev_del of for struct > vdpa_mgmtdev_ops) doesn't get called on shutdown. At least that's > what > I > see. Or > am I missing something?

and why do we care whether irqs are freed on shutdown?

Had to ask around a bit to find out the answer: there can be issues with kexec IRQ allocation on some platforms. It is documented here [0] for mlx5_core.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...

Thanks, Dragos

It's quite weird. * Some platforms requiring freeing the IRQ's in the shutdown * flow. If they aren't freed they can't be allocated after * kexec. There is no need to cleanup the mlx5_core software * contexts.

but most drivers don't have a shutdown callback how do they work then? do you know which platforms these are?

I don't. x86_64 is not one of them though. I will do some more digging ...

Turns out that this fix (releasing the irqs on .shutdown on mlx5_core) was required for PPC arch but only for certain mainframe systems. That's all the info I could find.

...

...
...
There used to be bzs that requires virtio drivers to add a shutdown to fix kexec:

https://bugzilla.redhat.com/show_bug.cgi?id=2108406

I don't have access to this. What is it about?

This bug might be more accurate:

https://bugzilla.redhat.com/show_bug.cgi?id=1820521

It's about the kexec guys (cced relevant people) wanting to add a shutdown method for virito to fix potential kexec issues.

Thanks

...
Thanks, Dragos

...
Thanks

...
I don't really know much about why shutdown callback is even necessary. I guess this is to detect shutdown and do a faster cleanup than the slow, graceful removal, just cleaning hardware resources?

.shutdown could be removed in mlx5_vdpa. But I notice that mlx5_core's .shutdown kicks in from pci_device_shutdown to clean the irqs. So the irqs will still be freed but as a side effect. Which is not good.

Thanks, Dragos

Dragos Tatulea

3 Aug 3 Aug

3:02 p.m.

On Wed, 2023-08-02 at 09:56 +0200, Dragos Tatulea wrote:

...

On Wed, 2023-08-02 at 10:51 +0800, Jason Wang wrote:

...
On Tue, Aug 1, 2023 at 4:17 PM Dragos Tatulea dtatulea@nvidia.com wrote:

...
On Tue, 2023-08-01 at 11:59 +0800, Jason Wang wrote:

...
On Mon, Jul 31, 2023 at 5:08 PM Michael S. Tsirkin mst@redhat.com wrote:

...
On Mon, Jul 31, 2023 at 07:15:31AM +0000, Dragos Tatulea wrote:

...
On Thu, 2023-07-27 at 12:28 -0400, Michael S. Tsirkin wrote: > On Thu, Jul 27, 2023 at 04:02:16PM +0000, Dragos Tatulea wrote: > > On Wed, 2023-07-26 at 15:26 -0400, Michael S. Tsirkin wrote: > > > On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea > > > wrote: > > > > The ndev was accessed on shutdown without a check if it > > > > actually > > > > exists. > > > > This triggered the crash pasted below. This patch simply > > > > adds > > > > a > > > > check > > > > before using ndev. > > > > > > > > BUG: kernel NULL pointer dereference, address: > > > > 0000000000000300 > > > > #PF: supervisor read access in kernel mode > > > > #PF: error_code(0x0000) - not-present page > > > > PGD 0 P4D 0 > > > > Oops: 0000 [#1] SMP > > > > CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0- > > > > rc2_for_upstream_min_debug_2023_07_17_15_05 #1 > > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > > > > rel- > > > > 1.13.0-0- > > > > gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 > > > > RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] > > > > RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286 > > > > RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: > > > > 0000000000000017 > > > > RDX: 0000000000000001 RSI: 0000000000000212 RDI: > > > > ffff888109d28000 > > > > RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: > > > > 0000000000000001 > > > > R10: 0000000000000000 R11: 0000000000000000 R12: > > > > ffff888109d28000 > > > > R13: ffff888109d28080 R14: 00000000fee1dead R15: > > > > 0000000000000000 > > > > FS: 00007f4969e0be40(0000) GS:ffff88852c800000(0000) > > > > knlGS:0000000000000000 > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > CR2: 0000000000000300 CR3: 00000001051cd006 CR4: > > > > 0000000000370eb0 > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > 0000000000000000 > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > 0000000000000400 > > > > Call Trace: > > > > <TASK> > > > > ? __die+0x20/0x60 > > > > ? page_fault_oops+0x14c/0x3c0 > > > > ? exc_page_fault+0x75/0x140 > > > > ? asm_exc_page_fault+0x22/0x30 > > > > ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] > > > > device_shutdown+0x13e/0x1e0 > > > > kernel_restart+0x36/0x90 > > > > __do_sys_reboot+0x141/0x210 > > > > ? vfs_writev+0xcd/0x140 > > > > ? handle_mm_fault+0x161/0x260 > > > > ? do_writev+0x6b/0x110 > > > > do_syscall_64+0x3d/0x90 > > > > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > > > > RIP: 0033:0x7f496990fb56 > > > > RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: > > > > 00000000000000a9 > > > > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: > > > > 00007f496990fb56 > > > > RDX: 0000000001234567 RSI: 0000000028121969 RDI: > > > > fffffffffee1dead > > > > RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: > > > > 0000000000000000 > > > > R10: 0000000000000000 R11: 0000000000000206 R12: > > > > 0000000000000000 > > > > R13: 00007fffc7bddf10 R14: 0000000000000000 R15: > > > > 00007fffc7bde2b8 > > > > </TASK> > > > > CR2: 0000000000000300 > > > > ---[ end trace 0000000000000000 ]--- > > > > > > > > Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt > > > > bypassing") > > > > Signed-off-by: Dragos Tatulea dtatulea@nvidia.com > > > > --- > > > > drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++- > > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > > index 9138ef2fb2c8..e2e7ebd71798 100644 > > > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > > @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct > > > > auxiliary_device > > > > *auxdev) > > > > mgtdev = auxiliary_get_drvdata(auxdev); > > > > ndev = mgtdev->ndev; > > > > > > > > - free_irqs(ndev); > > > > + if (ndev) > > > > + free_irqs(ndev); > > > > } > > > > > > > > > > something I don't get: > > > irqs are allocated in mlx5_vdpa_dev_add > > > why are they not freed in mlx5_vdpa_dev_del? > > > > > That is a good point. I will try to find out. I also don't get > > why > > free_irq > > is > > called in the vdpa dev .free op instead of mlx5_vdpa_dev_del. > > Maybe I > > can > > change > > that in a different refactoring. > > as it is I have no idea whether e.g. ndev can change > between these two call sites. that would make the check > pointless. > > > > this is what's creating all this mess. > > > > > > > > Not quite: mlx5_vdpa_dev_del (which is a .dev_del of for struct > > vdpa_mgmtdev_ops) doesn't get called on shutdown. At least > > that's > > what > > I > > see. Or > > am I missing something? > > and why do we care whether irqs are freed on shutdown? > Had to ask around a bit to find out the answer: there can be issues with kexec IRQ allocation on some platforms. It is documented here [0] for mlx5_core.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...

Thanks, Dragos

It's quite weird. * Some platforms requiring freeing the IRQ's in the shutdown * flow. If they aren't freed they can't be allocated after * kexec. There is no need to cleanup the mlx5_core software * contexts.

but most drivers don't have a shutdown callback how do they work then? do you know which platforms these are?

I don't. x86_64 is not one of them though. I will do some more digging ...

Turns out that this fix (releasing the irqs on .shutdown on mlx5_core) was required for PPC arch but only for certain mainframe systems. That's all the info I could find.

I will send a v2 for this patch that removes the shutdown op. The irqs will be released by the mlx5_core shutdown handler which is responsible for the VF.

Thanks, Dragos

...

...
...
...
There used to be bzs that requires virtio drivers to add a shutdown to fix kexec:

https://bugzilla.redhat.com/show_bug.cgi?id=2108406

I don't have access to this. What is it about?

This bug might be more accurate:

https://bugzilla.redhat.com/show_bug.cgi?id=1820521

It's about the kexec guys (cced relevant people) wanting to add a shutdown method for virito to fix potential kexec issues.

Thanks

...
Thanks, Dragos

...
Thanks

...
I don't really know much about why shutdown callback is even necessary. I guess this is to detect shutdown and do a faster cleanup than the slow, graceful removal, just cleaning hardware resources?

.shutdown could be removed in mlx5_vdpa. But I notice that mlx5_core's .shutdown kicks in from pci_device_shutdown to clean the irqs. So the irqs will still be freed but as a side effect. Which is not good.

Thanks, Dragos

Michael S. Tsirkin

3:12 p.m.

On Thu, Aug 03, 2023 at 03:02:59PM +0000, Dragos Tatulea wrote:

...

On Wed, 2023-08-02 at 09:56 +0200, Dragos Tatulea wrote:

...
On Wed, 2023-08-02 at 10:51 +0800, Jason Wang wrote:

...
On Tue, Aug 1, 2023 at 4:17 PM Dragos Tatulea dtatulea@nvidia.com wrote:

...
On Tue, 2023-08-01 at 11:59 +0800, Jason Wang wrote:

...
On Mon, Jul 31, 2023 at 5:08 PM Michael S. Tsirkin mst@redhat.com wrote:

...
On Mon, Jul 31, 2023 at 07:15:31AM +0000, Dragos Tatulea wrote: > On Thu, 2023-07-27 at 12:28 -0400, Michael S. Tsirkin wrote: > > On Thu, Jul 27, 2023 at 04:02:16PM +0000, Dragos Tatulea wrote: > > > On Wed, 2023-07-26 at 15:26 -0400, Michael S. Tsirkin wrote: > > > > On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea > > > > wrote: > > > > > The ndev was accessed on shutdown without a check if it > > > > > actually > > > > > exists. > > > > > This triggered the crash pasted below. This patch simply > > > > > adds > > > > > a > > > > > check > > > > > before using ndev. > > > > > > > > > > BUG: kernel NULL pointer dereference, address: > > > > > 0000000000000300 > > > > > #PF: supervisor read access in kernel mode > > > > > #PF: error_code(0x0000) - not-present page > > > > > PGD 0 P4D 0 > > > > > Oops: 0000 [#1] SMP > > > > > CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0- > > > > > rc2_for_upstream_min_debug_2023_07_17_15_05 #1 > > > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > > > > > rel- > > > > > 1.13.0-0- > > > > > gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 > > > > > RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] > > > > > RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286 > > > > > RAX: ffff888103befba0 RBX: ffff888109d28008 RCX: > > > > > 0000000000000017 > > > > > RDX: 0000000000000001 RSI: 0000000000000212 RDI: > > > > > ffff888109d28000 > > > > > RBP: 0000000000000000 R08: 0000000d3a3a3882 R09: > > > > > 0000000000000001 > > > > > R10: 0000000000000000 R11: 0000000000000000 R12: > > > > > ffff888109d28000 > > > > > R13: ffff888109d28080 R14: 00000000fee1dead R15: > > > > > 0000000000000000 > > > > > FS: 00007f4969e0be40(0000) GS:ffff88852c800000(0000) > > > > > knlGS:0000000000000000 > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > > CR2: 0000000000000300 CR3: 00000001051cd006 CR4: > > > > > 0000000000370eb0 > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > > 0000000000000000 > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > > 0000000000000400 > > > > > Call Trace: > > > > > <TASK> > > > > > ? __die+0x20/0x60 > > > > > ? page_fault_oops+0x14c/0x3c0 > > > > > ? exc_page_fault+0x75/0x140 > > > > > ? asm_exc_page_fault+0x22/0x30 > > > > > ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa] > > > > > device_shutdown+0x13e/0x1e0 > > > > > kernel_restart+0x36/0x90 > > > > > __do_sys_reboot+0x141/0x210 > > > > > ? vfs_writev+0xcd/0x140 > > > > > ? handle_mm_fault+0x161/0x260 > > > > > ? do_writev+0x6b/0x110 > > > > > do_syscall_64+0x3d/0x90 > > > > > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > > > > > RIP: 0033:0x7f496990fb56 > > > > > RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX: > > > > > 00000000000000a9 > > > > > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: > > > > > 00007f496990fb56 > > > > > RDX: 0000000001234567 RSI: 0000000028121969 RDI: > > > > > fffffffffee1dead > > > > > RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09: > > > > > 0000000000000000 > > > > > R10: 0000000000000000 R11: 0000000000000206 R12: > > > > > 0000000000000000 > > > > > R13: 00007fffc7bddf10 R14: 0000000000000000 R15: > > > > > 00007fffc7bde2b8 > > > > > </TASK> > > > > > CR2: 0000000000000300 > > > > > ---[ end trace 0000000000000000 ]--- > > > > > > > > > > Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt > > > > > bypassing") > > > > > Signed-off-by: Dragos Tatulea dtatulea@nvidia.com > > > > > --- > > > > > drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++- > > > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > > > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > > > index 9138ef2fb2c8..e2e7ebd71798 100644 > > > > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > > > @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct > > > > > auxiliary_device > > > > > *auxdev) > > > > > mgtdev = auxiliary_get_drvdata(auxdev); > > > > > ndev = mgtdev->ndev; > > > > > > > > > > - free_irqs(ndev); > > > > > + if (ndev) > > > > > + free_irqs(ndev); > > > > > } > > > > > > > > > > > > > something I don't get: > > > > irqs are allocated in mlx5_vdpa_dev_add > > > > why are they not freed in mlx5_vdpa_dev_del? > > > > > > > That is a good point. I will try to find out. I also don't get > > > why > > > free_irq > > > is > > > called in the vdpa dev .free op instead of mlx5_vdpa_dev_del. > > > Maybe I > > > can > > > change > > > that in a different refactoring. > > > > as it is I have no idea whether e.g. ndev can change > > between these two call sites. that would make the check > > pointless. > > > > > > this is what's creating all this mess. > > > > > > > > > > > Not quite: mlx5_vdpa_dev_del (which is a .dev_del of for struct > > > vdpa_mgmtdev_ops) doesn't get called on shutdown. At least > > > that's > > > what > > > I > > > see. Or > > > am I missing something? > > > > and why do we care whether irqs are freed on shutdown? > > > Had to ask around a bit to find out the answer: there can be issues > with > kexec > IRQ allocation on some platforms. It is documented here [0] for > mlx5_core. > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv... > > Thanks, > Dragos

It's quite weird. * Some platforms requiring freeing the IRQ's in the shutdown * flow. If they aren't freed they can't be allocated after * kexec. There is no need to cleanup the mlx5_core software * contexts.

but most drivers don't have a shutdown callback how do they work then? do you know which platforms these are?

I don't. x86_64 is not one of them though. I will do some more digging ...

Turns out that this fix (releasing the irqs on .shutdown on mlx5_core) was required for PPC arch but only for certain mainframe systems. That's all the info I could find.

I will send a v2 for this patch that removes the shutdown op. The irqs will be released by the mlx5_core shutdown handler which is responsible for the VF.

Thanks, Dragos

Certainly seems cleaner. Thanks!

...

...
...
...
...
There used to be bzs that requires virtio drivers to add a shutdown to fix kexec:

https://bugzilla.redhat.com/show_bug.cgi?id=2108406

I don't have access to this. What is it about?

This bug might be more accurate:

https://bugzilla.redhat.com/show_bug.cgi?id=1820521

It's about the kexec guys (cced relevant people) wanting to add a shutdown method for virito to fix potential kexec issues.

Thanks

...
Thanks, Dragos

...
Thanks

...
I don't really know much about why shutdown callback is even necessary. I guess this is to detect shutdown and do a faster cleanup than the slow, graceful removal, just cleaning hardware resources?

.shutdown could be removed in mlx5_vdpa. But I notice that mlx5_core's .shutdown kicks in from pci_device_shutdown to clean the irqs. So the irqs will still be freed but as a side effect. Which is not good.

Thanks, Dragos

890

days inactive

898

days old

linux-stable-mirror@lists.linaro.org

12 comments

participants

tags (0)

participants (4)

Dragos Tatulea
Jason Wang
kernel test robot
Michael S. Tsirkin