[PATCH] mpt3sas: Fix kernel panic observed on soft HBA unplug

List overview All Threads
Download

newer

older

[PATCH 4.19.113 0/5] mmc: Fix some...

[PATCH v2] mm/sparse: Fix kernel...

Sreekanth Reddy

11 Mar 2020 11 Mar '20

10:36 a.m.

Generic protection fault type kernel panic is observed when user performs soft(ordered) HBA unplug operation while IOs are running on drives connected to HBA.

When user performs ordered HBA removal operation then kernel calls PCI device's .remove() call back function where driver is flushing out all the outstanding SCSI IO commands with DID_NO_CONNECT host byte and also un-maps sg buffers allocated for these IO commands. But in the ordered HBA removal case (unlike of real HBA hot unplug) HBA device is still alive and hence HBA hardware is performing the DMA operations to those buffers on the system memory which are already unmapped while flushing out the outstanding SCSI IO commands and this leads to Kernel panic.

Fix: Don't flush out the outstanding IOs from .remove() path in case of ordered HBA removal since HBA will be still alive in this case and it can complete the outstanding IOs. Flush out the outstanding IOs only in case physical HBA hot unplug where their won't be any communication with the HBA.

Cc: stable@vger.kernel.org Signed-off-by: Sreekanth Reddy sreekanth.reddy@broadcom.com --- drivers/scsi/mpt3sas/mpt3sas_scsih.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c index 778d5e6..04a40af 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c @@ -9908,8 +9908,8 @@ static void scsih_remove(struct pci_dev *pdev)

ioc->remove_host = 1;

- mpt3sas_wait_for_commands_to_complete(ioc); - _scsih_flush_running_cmds(ioc); + if (!pci_device_is_present(pdev)) + _scsih_flush_running_cmds(ioc);

_scsih_fw_event_cleanup_queue(ioc);

@@ -9992,8 +9992,8 @@ static void scsih_remove(struct pci_dev *pdev)

ioc->remove_host = 1;

- mpt3sas_wait_for_commands_to_complete(ioc); - _scsih_flush_running_cmds(ioc); + if (!pci_device_is_present(pdev)) + _scsih_flush_running_cmds(ioc);

_scsih_fw_event_cleanup_queue(ioc);

-- 1.8.3.1

Show replies by date

Amit Shah

11 Mar 11 Mar

11:04 a.m.

On Wed, 2020-03-11 at 06:36 -0400, Sreekanth Reddy wrote:

...

Generic protection fault type kernel panic is observed when user performs soft(ordered) HBA unplug operation while IOs are running on drives connected to HBA.

When user performs ordered HBA removal operation then kernel calls PCI device's .remove() call back function where driver is flushing out all the outstanding SCSI IO commands with DID_NO_CONNECT host byte and also un-maps sg buffers allocated for these IO commands. But in the ordered HBA removal case (unlike of real HBA hot unplug) HBA device is still alive and hence HBA hardware is performing the DMA operations to those buffers on the system memory which are already unmapped while flushing out the outstanding SCSI IO commands and this leads to Kernel panic.

Fix: Don't flush out the outstanding IOs from .remove() path in case of ordered HBA removal since HBA will be still alive in this case and it can complete the outstanding IOs. Flush out the outstanding IOs only in case physical HBA hot unplug where their won't be any communication with the HBA.

Can you please point to the commit that introduces the bug?

...

Cc: stable@vger.kernel.org Signed-off-by: Sreekanth Reddy sreekanth.reddy@broadcom.com

drivers/scsi/mpt3sas/mpt3sas_scsih.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c index 778d5e6..04a40af 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c @@ -9908,8 +9908,8 @@ static void scsih_remove(struct pci_dev *pdev) ioc->remove_host = 1;

mpt3sas_wait_for_commands_to_complete(ioc);

_scsih_flush_running_cmds(ioc);
if (!pci_device_is_present(pdev))
_scsih_flush_running_cmds(ioc);
_scsih_fw_event_cleanup_queue(ioc); @@ -9992,8 +9992,8 @@ static void scsih_remove(struct pci_dev *pdev)

Just a note: this function is scsih_shutdown(). Doesn't block application of the patch, though. Just wondering how the patch was created.

...

ioc->remove_host = 1;

mpt3sas_wait_for_commands_to_complete(ioc);

_scsih_flush_running_cmds(ioc);
if (!pci_device_is_present(pdev))
_scsih_flush_running_cmds(ioc);
_scsih_fw_event_cleanup_queue(ioc);

Sreekanth Reddy

11:25 a.m.

On Wed, Mar 11, 2020 at 4:35 PM Amit Shah amit@kernel.org wrote:

...

On Wed, 2020-03-11 at 06:36 -0400, Sreekanth Reddy wrote:

...
Generic protection fault type kernel panic is observed when user performs soft(ordered) HBA unplug operation while IOs are running on drives connected to HBA.

When user performs ordered HBA removal operation then kernel calls PCI device's .remove() call back function where driver is flushing out all the outstanding SCSI IO commands with DID_NO_CONNECT host byte and also un-maps sg buffers allocated for these IO commands. But in the ordered HBA removal case (unlike of real HBA hot unplug) HBA device is still alive and hence HBA hardware is performing the DMA operations to those buffers on the system memory which are already unmapped while flushing out the outstanding SCSI IO commands and this leads to Kernel panic.

Fix: Don't flush out the outstanding IOs from .remove() path in case of ordered HBA removal since HBA will be still alive in this case and it can complete the outstanding IOs. Flush out the outstanding IOs only in case physical HBA hot unplug where their won't be any communication with the HBA.

Can you please point to the commit that introduces the bug?

Sure I will add the commit ID which introduced this bug in the next patch.

...

...
Cc: stable@vger.kernel.org Signed-off-by: Sreekanth Reddy sreekanth.reddy@broadcom.com

drivers/scsi/mpt3sas/mpt3sas_scsih.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c index 778d5e6..04a40af 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c @@ -9908,8 +9908,8 @@ static void scsih_remove(struct pci_dev *pdev)
  ioc->remove_host = 1;
mpt3sas_wait_for_commands_to_complete(ioc);
_scsih_flush_running_cmds(ioc);
if (!pci_device_is_present(pdev))
        _scsih_flush_running_cmds(ioc);

_scsih_fw_event_cleanup_queue(ioc);
@@ -9992,8 +9992,8 @@ static void scsih_remove(struct pci_dev *pdev)
Just a note: this function is scsih_shutdown(). Doesn't block application of the patch, though. Just wondering how the patch was created.

Sorry I didn't get you. Can you please elaborate your query?

...

  ioc->remove_host = 1;

mpt3sas_wait_for_commands_to_complete(ioc);

```
_scsih_flush_running_cmds(ioc);
```

```
if (!pci_device_is_present(pdev))
```

        _scsih_flush_running_cmds(ioc);

_scsih_fw_event_cleanup_queue(ioc);

Sreekanth Reddy

11:49 a.m.

On Wed, Mar 11, 2020 at 4:55 PM Sreekanth Reddy sreekanth.reddy@broadcom.com wrote:

...

On Wed, Mar 11, 2020 at 4:35 PM Amit Shah amit@kernel.org wrote:

...
On Wed, 2020-03-11 at 06:36 -0400, Sreekanth Reddy wrote:

...
Generic protection fault type kernel panic is observed when user performs soft(ordered) HBA unplug operation while IOs are running on drives connected to HBA.

When user performs ordered HBA removal operation then kernel calls PCI device's .remove() call back function where driver is flushing out all the outstanding SCSI IO commands with DID_NO_CONNECT host byte and also un-maps sg buffers allocated for these IO commands. But in the ordered HBA removal case (unlike of real HBA hot unplug) HBA device is still alive and hence HBA hardware is performing the DMA operations to those buffers on the system memory which are already unmapped while flushing out the outstanding SCSI IO commands and this leads to Kernel panic.

Fix: Don't flush out the outstanding IOs from .remove() path in case of ordered HBA removal since HBA will be still alive in this case and it can complete the outstanding IOs. Flush out the outstanding IOs only in case physical HBA hot unplug where their won't be any communication with the HBA.

Can you please point to the commit that introduces the bug?

Sure I will add the commit ID which introduced this bug in the next patch.

...
...
Cc: stable@vger.kernel.org Signed-off-by: Sreekanth Reddy sreekanth.reddy@broadcom.com

drivers/scsi/mpt3sas/mpt3sas_scsih.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c index 778d5e6..04a40af 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c @@ -9908,8 +9908,8 @@ static void scsih_remove(struct pci_dev *pdev)
  ioc->remove_host = 1;
mpt3sas_wait_for_commands_to_complete(ioc);
_scsih_flush_running_cmds(ioc);
if (!pci_device_is_present(pdev))
        _scsih_flush_running_cmds(ioc);

_scsih_fw_event_cleanup_queue(ioc);
@@ -9992,8 +9992,8 @@ static void scsih_remove(struct pci_dev *pdev)
Just a note: this function is scsih_shutdown(). Doesn't block application of the patch, though. Just wondering how the patch was created.

I got your query now, yes this hunk change is in scsih_shutdown() function. I am not sure why scsih_remove name is getting displayed here in this hunk. I have used 'git format-patch' to generate the patch.

...

Sorry I didn't get you. Can you please elaborate your query?

...

  ioc->remove_host = 1;

mpt3sas_wait_for_commands_to_complete(ioc);

```
_scsih_flush_running_cmds(ioc);
```

```
if (!pci_device_is_present(pdev))
```

        _scsih_flush_running_cmds(ioc);

_scsih_fw_event_cleanup_queue(ioc);

Amit Shah

2:48 p.m.

On Wed, 2020-03-11 at 17:19 +0530, Sreekanth Reddy wrote:

...

On Wed, Mar 11, 2020 at 4:55 PM Sreekanth Reddy sreekanth.reddy@broadcom.com wrote:

...
On Wed, Mar 11, 2020 at 4:35 PM Amit Shah amit@kernel.org wrote:

...
On Wed, 2020-03-11 at 06:36 -0400, Sreekanth Reddy wrote:

...
Generic protection fault type kernel panic is observed when user performs soft(ordered) HBA unplug operation while IOs are running on drives connected to HBA.

When user performs ordered HBA removal operation then kernel calls PCI device's .remove() call back function where driver is flushing out all the outstanding SCSI IO commands with DID_NO_CONNECT host byte and also un-maps sg buffers allocated for these IO commands. But in the ordered HBA removal case (unlike of real HBA hot unplug) HBA device is still alive and hence HBA hardware is performing the DMA operations to those buffers on the system memory which are already unmapped while flushing out the outstanding SCSI IO commands and this leads to Kernel panic.

Fix: Don't flush out the outstanding IOs from .remove() path in case of ordered HBA removal since HBA will be still alive in this case and it can complete the outstanding IOs. Flush out the outstanding IOs only in case physical HBA hot unplug where their won't be any communication with the HBA.

Can you please point to the commit that introduces the bug?

Sure I will add the commit ID which introduced this bug in the next patch.

Thanks.

...

...
...
...
Cc: stable@vger.kernel.org Signed-off-by: Sreekanth Reddy sreekanth.reddy@broadcom.com

drivers/scsi/mpt3sas/mpt3sas_scsih.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c index 778d5e6..04a40af 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c @@ -9908,8 +9908,8 @@ static void scsih_remove(struct pci_dev *pdev)
  ioc->remove_host = 1;
mpt3sas_wait_for_commands_to_complete(ioc);
_scsih_flush_running_cmds(ioc);
if (!pci_device_is_present(pdev))
        _scsih_flush_running_cmds(ioc);

_scsih_fw_event_cleanup_queue(ioc);
@@ -9992,8 +9992,8 @@ static void scsih_remove(struct pci_dev *pdev)
Just a note: this function is scsih_shutdown(). Doesn't block application of the patch, though. Just wondering how the patch was created.
I got your query now, yes this hunk change is in scsih_shutdown() function. I am not sure why scsih_remove name is getting displayed here in this hunk. I have used 'git format-patch' to generate the patch.

Thanks. Does the commit description need an update as well? It only talks about remove callback.

...

Sorry I didn't get you. Can you please elaborate your query?

...

  ioc->remove_host = 1;

mpt3sas_wait_for_commands_to_complete(ioc);

```
_scsih_flush_running_cmds(ioc);
```

```
if (!pci_device_is_present(pdev))
```

        _scsih_flush_running_cmds(ioc);

_scsih_fw_event_cleanup_queue(ioc);

Elliott, Robert (Servers)

14 Mar 14 Mar

2:25 a.m.

...

-----Original Message----- From: linux-scsi-owner@vger.kernel.org <linux-scsi- owner@vger.kernel.org> On Behalf Of Sreekanth Reddy Sent: Wednesday, March 11, 2020 6:50 AM To: Amit Shah amit@kernel.org Subject: Re: [PATCH] mpt3sas: Fix kernel panic observed on soft HBA unplug

On Wed, Mar 11, 2020 at 4:55 PM Sreekanth Reddy sreekanth.reddy@broadcom.com wrote:

...
On Wed, Mar 11, 2020 at 4:35 PM Amit Shah amit@kernel.org wrote:

...

...
...
...
@@ -9992,8 +9992,8 @@ static void scsih_remove(struct pci_dev

*pdev)

...
...
Just a note: this function is scsih_shutdown(). Doesn't block application of the patch, though. Just wondering how the patch was created.

I got your query now, yes this hunk change is in scsih_shutdown() function. I am not sure why scsih_remove name is getting displayed here in this hunk. I have used 'git format-patch' to generate the patch.

The scsih_shutdown() function definition is spread over three lines (although it could easily fit on one line), while scsih_remove() was the last function whose definition was on one line. git is apparently not recognizing scsi_shutdown() as a function definition.

Elliott, Robert (Servers)

2:25 a.m.

...

-----Original Message----- From: linux-scsi-owner@vger.kernel.org <linux-scsi- owner@vger.kernel.org> On Behalf Of Sreekanth Reddy Sent: Wednesday, March 11, 2020 5:37 AM To: martin.petersen@oracle.com Cc: linux-scsi@vger.kernel.org; sathya.prakash@broadcom.com; suganath- prabu.subramani@broadcom.com; stable@vger.kernel.org; amit@kernel.org; Sreekanth Reddy sreekanth.reddy@broadcom.com Subject: [PATCH] mpt3sas: Fix kernel panic observed on soft HBA unplug

Generic protection fault type kernel panic is observed when user performs soft(ordered) HBA unplug operation while IOs are running on drives connected to HBA.

When user performs ordered HBA removal operation then kernel calls PCI device's .remove() call back function where driver is flushing out all the outstanding SCSI IO commands with DID_NO_CONNECT host byte and also un-maps sg buffers allocated for these IO commands. But in the ordered HBA removal case (unlike of real HBA hot unplug) HBA device is still alive and hence HBA hardware is performing the DMA operations to those buffers on the system memory which are already unmapped while flushing out the outstanding SCSI IO commands and this leads to Kernel panic.

Fix: Don't flush out the outstanding IOs from .remove() path in case of ordered HBA removal since HBA will be still alive in this case and it can complete the outstanding IOs. Flush out the outstanding IOs only in case physical HBA hot unplug where their won't be any communication with the HBA.

Cc: stable@vger.kernel.org Signed-off-by: Sreekanth Reddy sreekanth.reddy@broadcom.com

drivers/scsi/mpt3sas/mpt3sas_scsih.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c index 778d5e6..04a40af 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c @@ -9908,8 +9908,8 @@ static void scsih_remove(struct pci_dev *pdev)

ioc->remove_host = 1;

mpt3sas_wait_for_commands_to_complete(ioc);

Immediately removing the driver with IOs pending seems dangerous.

That function includes a timeout to avoid hanging forever, which is reasonable (avoid hanging during system shutdown). Perhaps the kernel panic was happening because that function timed out?

Reporting a warning or error and doing special handling might be appropriate if that occurs. That should be rare, though; the normal case should be to cleanly finish any outstanding commands.

...

_scsih_flush_running_cmds(ioc);
if (!pci_device_is_present(pdev))
_scsih_flush_running_cmds(ioc);

If that branch is not taken, then it proceeds to remove the driver with IOs pending. That'll wipe out all sorts of ioc structures and things like interrupt handler code, leaving memory mapped forever (no code left to call scsi_dma_unmap). That might be better than a kernel panic, but still not good.

Sreekanth Reddy

16 Mar 16 Mar

6:15 a.m.

On Sat, Mar 14, 2020 at 7:56 AM Elliott, Robert (Servers) elliott@hpe.com wrote:

...

...
-----Original Message----- From: linux-scsi-owner@vger.kernel.org <linux-scsi- owner@vger.kernel.org> On Behalf Of Sreekanth Reddy Sent: Wednesday, March 11, 2020 5:37 AM To: martin.petersen@oracle.com Cc: linux-scsi@vger.kernel.org; sathya.prakash@broadcom.com; suganath- prabu.subramani@broadcom.com; stable@vger.kernel.org; amit@kernel.org; Sreekanth Reddy sreekanth.reddy@broadcom.com Subject: [PATCH] mpt3sas: Fix kernel panic observed on soft HBA unplug

Generic protection fault type kernel panic is observed when user performs soft(ordered) HBA unplug operation while IOs are running on drives connected to HBA.

When user performs ordered HBA removal operation then kernel calls PCI device's .remove() call back function where driver is flushing out all the outstanding SCSI IO commands with DID_NO_CONNECT host byte and also un-maps sg buffers allocated for these IO commands. But in the ordered HBA removal case (unlike of real HBA hot unplug) HBA device is still alive and hence HBA hardware is performing the DMA operations to those buffers on the system memory which are already unmapped while flushing out the outstanding SCSI IO commands and this leads to Kernel panic.

Fix: Don't flush out the outstanding IOs from .remove() path in case of ordered HBA removal since HBA will be still alive in this case and it can complete the outstanding IOs. Flush out the outstanding IOs only in case physical HBA hot unplug where their won't be any communication with the HBA.

Cc: stable@vger.kernel.org Signed-off-by: Sreekanth Reddy sreekanth.reddy@broadcom.com

drivers/scsi/mpt3sas/mpt3sas_scsih.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c index 778d5e6..04a40af 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c @@ -9908,8 +9908,8 @@ static void scsih_remove(struct pci_dev *pdev)
  ioc->remove_host = 1;
mpt3sas_wait_for_commands_to_complete(ioc);
Immediately removing the driver with IOs pending seems dangerous.

That function includes a timeout to avoid hanging forever, which is reasonable (avoid hanging during system shutdown). Perhaps the kernel panic was happening because that function timed out?

Reporting a warning or error and doing special handling might be appropriate if that occurs. That should be rare, though; the normal case should be to cleanly finish any outstanding commands.

...
_scsih_flush_running_cmds(ioc);
if (!pci_device_is_present(pdev))
        _scsih_flush_running_cmds(ioc);
If that branch is not taken, then it proceeds to remove the driver with IOs pending. That'll wipe out all sorts of ioc structures and things like interrupt handler code, leaving memory mapped forever (no code left to call scsi_dma_unmap). That might be better than a kernel panic, but still not good.

In the unload path driver call sas_remove_host() API before releasing the resources. This sas_remove_host() API waits for all the outstanding IOs to be completed. So here, indirectly driver is waiting for the outstanding IOs to be processed before releasing the HBA resources. So only in the cases where HBA is inaccessible (e.g. HBA unplug case), driver is flushing out the outstanding commands to avoid SCSI error handling over head and can quilkey complete the driver unload operation.

...

Martin K. Petersen

27 Mar 27 Mar

1:50 a.m.

Sreekanth,

...

In the unload path driver call sas_remove_host() API before releasing the resources. This sas_remove_host() API waits for all the outstanding IOs to be completed. So here, indirectly driver is waiting for the outstanding IOs to be processed before releasing the HBA resources. So only in the cases where HBA is inaccessible (e.g. HBA unplug case), driver is flushing out the outstanding commands to avoid SCSI error handling over head and can quilkey complete the driver unload operation.

None of this is clear from the commit description. Please resubmit patch with a new description clarifying why and when it is safe to drop outstanding commands.

-- Martin K. Petersen Oracle Linux Engineering

Sreekanth Reddy

10:35 a.m.

On Fri, Mar 27, 2020 at 7:20 AM Martin K. Petersen martin.petersen@oracle.com wrote:

...

Sreekanth,

...
In the unload path driver call sas_remove_host() API before releasing the resources. This sas_remove_host() API waits for all the outstanding IOs to be completed. So here, indirectly driver is waiting for the outstanding IOs to be processed before releasing the HBA resources. So only in the cases where HBA is inaccessible (e.g. HBA unplug case), driver is flushing out the outstanding commands to avoid SCSI error handling over head and can quilkey complete the driver unload operation.

None of this is clear from the commit description. Please resubmit patch with a new description clarifying why and when it is safe to drop outstanding commands.

Martin,

Posted the patch with updated description. https://patchwork.kernel.org/patch/11462107/

Thanks & Regards, Sreekanth

...

-- Martin K. Petersen Oracle Linux Engineering

2112

days inactive

2128

days old

linux-stable-mirror@lists.linaro.org

9 comments

participants

tags (0)

participants (4)

Amit Shah
Elliott, Robert (Servers)
Martin K. Petersen
Sreekanth Reddy