Hi,
This series fixes a couple of issues reported by Johan in [1]. The first one fixes a deadlock that happens during shutdown and suspend. The second one fixes the driver's PM behavior.
[1] https://lore.kernel.org/mhi/Z1me8iaK7cwgjL92@hovoldconsulting.com
Signed-off-by: Manivannan Sadhasivam manivannan.sadhasivam@linaro.org
---
Manivannan Sadhasivam (2):
      bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock
      bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime_resume()

 drivers/bus/mhi/host/pci_generic.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)
---
base-commit: fc033cf25e612e840e545f8d5ad2edd6ba613ed5
change-id: 20250108-mhi_recovery_fix-a8f37168f91c
Best regards,
From: Manivannan Sadhasivam manivannan.sadhasivam@linaro.org
There are multiple places from which the recovery work gets scheduled asynchronously. Also, there are multiple places where the caller waits synchronously for the recovery to be completed. One such place is the PM shutdown() callback.
If the device is not alive during recovery_work, it will try to reset the device using pci_reset_function(). This function internally takes the device_lock() before resetting the device. If the lock has already been acquired by that point, recovery_work stalls while waiting for it. And if the lock holder is the very caller that is waiting for recovery_work to complete, the result is a deadlock.
This is what happened on the X1E80100 CRD device when the device died before the shutdown() callback. The driver core calls the driver's shutdown() callback while holding the device_lock(), leading to a deadlock.
This deadlock scenario can occur on other paths as well, such as the PM suspend() callback, where the driver core holds the device_lock() before calling the driver's suspend() callback. If recovery_work has already started, it can lead to the same deadlock. This was also observed on the X1E80100 CRD.
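To make the deadlock concrete, the two competing call chains look roughly like this. This is only a sketch: the two wrapper functions below are hypothetical and the exact wait primitive used by the driver callback is simplified, while device_lock(), flush_work() and pci_reset_function() are the real kernel APIs involved.

#include <linux/device.h>
#include <linux/pci.h>
#include <linux/workqueue.h>

/* CPU0: driver core path around ->shutdown()/->suspend(), simplified */
static void driver_core_path(struct pci_dev *pdev, struct work_struct *recovery_work)
{
	device_lock(&pdev->dev);	/* taken by the driver core before the callback */
	flush_work(recovery_work);	/* driver callback waits for recovery_work: blocks on CPU1 */
	device_unlock(&pdev->dev);
}

/* CPU1: recovery_work finds the device dead and tries to reset it */
static void recovery_path(struct pci_dev *pdev)
{
	/* pci_reset_function() takes device_lock() internally: blocks on CPU0 -> deadlock */
	if (pci_reset_function(pdev))
		dev_err(&pdev->dev, "Recovery failed\n");
}

Neither side can make progress once both chains are in flight, which is exactly what switching to pci_try_reset_function() avoids.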
So to fix both issues, use pci_try_reset_function() in recovery_work. This function first checks for the availability of the device_lock() before trying to reset the device. If the lock is available, it will acquire it and reset the device. Otherwise, it will return -EAGAIN. If that happens, recovery_work will fail with the error message "Recovery failed" as not much could be done.
Cc: stable@vger.kernel.org # 5.12
Reported-by: Johan Hovold johan@kernel.org
Closes: https://lore.kernel.org/mhi/Z1me8iaK7cwgjL92@hovoldconsulting.com
Fixes: 7389337f0a78 ("mhi: pci_generic: Add suspend/resume/recovery procedure")
Signed-off-by: Manivannan Sadhasivam manivannan.sadhasivam@linaro.org
---
 drivers/bus/mhi/host/pci_generic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/bus/mhi/host/pci_generic.c b/drivers/bus/mhi/host/pci_generic.c
index 07645ce2119a..e92df380c785 100644
--- a/drivers/bus/mhi/host/pci_generic.c
+++ b/drivers/bus/mhi/host/pci_generic.c
@@ -1040,7 +1040,7 @@ static void mhi_pci_recovery_work(struct work_struct *work)
 err_unprepare:
 	mhi_unprepare_after_power_down(mhi_cntrl);
 err_try_reset:
-	if (pci_reset_function(pdev))
+	if (pci_try_reset_function(pdev))
		dev_err(&pdev->dev, "Recovery failed\n");
 }
 
On Wed, 8 Jan 2025 at 14:39, Manivannan Sadhasivam via B4 Relay devnull+manivannan.sadhasivam.linaro.org@kernel.org wrote:
> From: Manivannan Sadhasivam manivannan.sadhasivam@linaro.org
>
> There are multiple places from which the recovery work gets scheduled asynchronously. Also, there are multiple places where the caller waits synchronously for the recovery to be completed. One such place is the PM shutdown() callback.
>
> If the device is not alive during recovery_work, it will try to reset the device using pci_reset_function(). This function internally takes the device_lock() before resetting the device. If the lock has already been acquired by that point, recovery_work stalls while waiting for it. And if the lock holder is the very caller that is waiting for recovery_work to complete, the result is a deadlock.
>
> This is what happened on the X1E80100 CRD device when the device died before the shutdown() callback. The driver core calls the driver's shutdown() callback while holding the device_lock(), leading to a deadlock.
>
> This deadlock scenario can occur on other paths as well, such as the PM suspend() callback, where the driver core holds the device_lock() before calling the driver's suspend() callback. If recovery_work has already started, it can lead to the same deadlock. This was also observed on the X1E80100 CRD.
>
> So to fix both issues, use pci_try_reset_function() in recovery_work. This function first checks for the availability of the device_lock() before trying to reset the device. If the lock is available, it will acquire it and reset the device. Otherwise, it will return -EAGAIN. If that happens, recovery_work will fail with the error message "Recovery failed" as not much could be done.
>
> Cc: stable@vger.kernel.org # 5.12
> Reported-by: Johan Hovold johan@kernel.org
> Closes: https://lore.kernel.org/mhi/Z1me8iaK7cwgjL92@hovoldconsulting.com
> Fixes: 7389337f0a78 ("mhi: pci_generic: Add suspend/resume/recovery procedure")
> Signed-off-by: Manivannan Sadhasivam manivannan.sadhasivam@linaro.org
Reviewed-by: Loic Poulain loic.poulain@linaro.org
From: Manivannan Sadhasivam manivannan.sadhasivam@linaro.org
Currently, in mhi_pci_runtime_resume(), if the resume fails, recovery_work is started asynchronously and success is returned. But this doesn't align with what the PM core expects, as documented in Documentation/power/runtime_pm.rst:
"Once the subsystem-level resume callback (or the driver resume callback, if invoked directly) has completed successfully, the PM core regards the device as fully operational, which means that the device _must_ be able to complete I/O operations as needed. The runtime PM status of the device is then 'active'."
So the PM core ends up marking the runtime PM status of the device as 'active', even though the device is not able to handle I/O. The same condition more or less applies to system resume as well.
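As an illustration of the consequence, consider a hypothetical consumer of the device. The function below is not part of the driver; it only shows how the PM core's 'active' status propagates to callers.

#include <linux/device.h>
#include <linux/pm_runtime.h>

/* Hypothetical consumer, for illustration only */
static int consumer_do_io(struct device *dev)
{
	int ret;

	/*
	 * If mhi_pci_runtime_resume() returned 0 despite the failed resume,
	 * the device is already marked active, so this succeeds immediately
	 * and the caller goes on to issue I/O to a device that may be dead.
	 */
	ret = pm_runtime_resume_and_get(dev);
	if (ret < 0)
		return ret;

	/* I/O that is expected to work, but cannot on a dead device */

	pm_runtime_put(dev);

	return 0;
}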
So to avoid this ambiguity, try to recover the device synchronously from mhi_pci_runtime_resume() and return the actual error code in the case of recovery failure.
For doing so, move the recovery code to a new __mhi_pci_recovery_work() helper and call it from both mhi_pci_recovery_work() and mhi_pci_runtime_resume(). The former still ignores the return value, while the latter passes it on to the PM core.
Cc: stable@vger.kernel.org # 5.13
Reported-by: Johan Hovold johan@kernel.org
Closes: https://lore.kernel.org/mhi/Z2PbEPYpqFfrLSJi@hovoldconsulting.com
Fixes: d3800c1dce24 ("bus: mhi: pci_generic: Add support for runtime PM")
Signed-off-by: Manivannan Sadhasivam manivannan.sadhasivam@linaro.org
---
 drivers/bus/mhi/host/pci_generic.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/drivers/bus/mhi/host/pci_generic.c b/drivers/bus/mhi/host/pci_generic.c
index e92df380c785..f6de407e077e 100644
--- a/drivers/bus/mhi/host/pci_generic.c
+++ b/drivers/bus/mhi/host/pci_generic.c
@@ -997,10 +997,8 @@ static void mhi_pci_runtime_put(struct mhi_controller *mhi_cntrl)
	pm_runtime_put(mhi_cntrl->cntrl_dev);
 }
 
-static void mhi_pci_recovery_work(struct work_struct *work)
+static int __mhi_pci_recovery_work(struct mhi_pci_device *mhi_pdev)
 {
-	struct mhi_pci_device *mhi_pdev = container_of(work, struct mhi_pci_device,
-						       recovery_work);
	struct mhi_controller *mhi_cntrl = &mhi_pdev->mhi_cntrl;
	struct pci_dev *pdev = to_pci_dev(mhi_cntrl->cntrl_dev);
	int err;
@@ -1035,13 +1033,25 @@ static void mhi_pci_recovery_work(struct work_struct *work)
 
	set_bit(MHI_PCI_DEV_STARTED, &mhi_pdev->status);
	mod_timer(&mhi_pdev->health_check_timer, jiffies + HEALTH_CHECK_PERIOD);
-	return;
+
+	return 0;
 
 err_unprepare:
	mhi_unprepare_after_power_down(mhi_cntrl);
 err_try_reset:
-	if (pci_try_reset_function(pdev))
+	err = pci_try_reset_function(pdev);
+	if (err)
		dev_err(&pdev->dev, "Recovery failed\n");
+
+	return err;
+}
+
+static void mhi_pci_recovery_work(struct work_struct *work)
+{
+	struct mhi_pci_device *mhi_pdev = container_of(work, struct mhi_pci_device,
+						       recovery_work);
+
+	__mhi_pci_recovery_work(mhi_pdev);
 }
 
 static void health_check(struct timer_list *t)
@@ -1400,15 +1410,10 @@ static int __maybe_unused mhi_pci_runtime_resume(struct device *dev)
	return 0;
 
 err_recovery:
-	/* Do not fail to not mess up our PCI device state, the device likely
-	 * lost power (d3cold) and we simply need to reset it from the recovery
-	 * procedure, trigger the recovery asynchronously to prevent system
-	 * suspend exit delaying.
-	 */
-	queue_work(system_long_wq, &mhi_pdev->recovery_work);
+	err = __mhi_pci_recovery_work(mhi_pdev);
	pm_runtime_mark_last_busy(dev);
 
-	return 0;
+	return err;
 }
 
 static int __maybe_unused mhi_pci_suspend(struct device *dev)
On Wed, 8 Jan 2025 at 14:39, Manivannan Sadhasivam via B4 Relay devnull+manivannan.sadhasivam.linaro.org@kernel.org wrote:
> From: Manivannan Sadhasivam manivannan.sadhasivam@linaro.org
>
> Currently, in mhi_pci_runtime_resume(), if the resume fails, recovery_work is started asynchronously and success is returned. But this doesn't align with what the PM core expects, as documented in Documentation/power/runtime_pm.rst:
>
> "Once the subsystem-level resume callback (or the driver resume callback, if invoked directly) has completed successfully, the PM core regards the device as fully operational, which means that the device _must_ be able to complete I/O operations as needed. The runtime PM status of the device is then 'active'."
>
> So the PM core ends up marking the runtime PM status of the device as 'active', even though the device is not able to handle I/O. The same condition more or less applies to system resume as well.
>
> So to avoid this ambiguity, try to recover the device synchronously from mhi_pci_runtime_resume() and return the actual error code in the case of recovery failure.
>
> For doing so, move the recovery code to a new __mhi_pci_recovery_work() helper and call it from both mhi_pci_recovery_work() and mhi_pci_runtime_resume(). The former still ignores the return value, while the latter passes it on to the PM core.
>
> Cc: stable@vger.kernel.org # 5.13
> Reported-by: Johan Hovold johan@kernel.org
> Closes: https://lore.kernel.org/mhi/Z2PbEPYpqFfrLSJi@hovoldconsulting.com
> Fixes: d3800c1dce24 ("bus: mhi: pci_generic: Add support for runtime PM")
> Signed-off-by: Manivannan Sadhasivam manivannan.sadhasivam@linaro.org
Note that it will noticeably impact the user experience on system-wide resume (mhi_pci_resume), because MHI devices usually take a while (a few seconds) to cold boot and reach a ready state (or time out in the worst case). So we may have people complaining about a delayed resume regression on their laptops even if they are not using the MHI device/modem function. Are we ok with that?
Regards, Loic
On Wed, Jan 08, 2025 at 04:19:06PM +0100, Loic Poulain wrote:
> On Wed, 8 Jan 2025 at 14:39, Manivannan Sadhasivam via B4 Relay devnull+manivannan.sadhasivam.linaro.org@kernel.org wrote:
> >
> > From: Manivannan Sadhasivam manivannan.sadhasivam@linaro.org
> >
> > Currently, in mhi_pci_runtime_resume(), if the resume fails, recovery_work is started asynchronously and success is returned. But this doesn't align with what the PM core expects, as documented in Documentation/power/runtime_pm.rst:
> >
> > "Once the subsystem-level resume callback (or the driver resume callback, if invoked directly) has completed successfully, the PM core regards the device as fully operational, which means that the device _must_ be able to complete I/O operations as needed. The runtime PM status of the device is then 'active'."
> >
> > So the PM core ends up marking the runtime PM status of the device as 'active', even though the device is not able to handle I/O. The same condition more or less applies to system resume as well.
> >
> > So to avoid this ambiguity, try to recover the device synchronously from mhi_pci_runtime_resume() and return the actual error code in the case of recovery failure.
> >
> > For doing so, move the recovery code to a new __mhi_pci_recovery_work() helper and call it from both mhi_pci_recovery_work() and mhi_pci_runtime_resume(). The former still ignores the return value, while the latter passes it on to the PM core.
> >
> > Cc: stable@vger.kernel.org # 5.13
> > Reported-by: Johan Hovold johan@kernel.org
> > Closes: https://lore.kernel.org/mhi/Z2PbEPYpqFfrLSJi@hovoldconsulting.com
> > Fixes: d3800c1dce24 ("bus: mhi: pci_generic: Add support for runtime PM")
> > Signed-off-by: Manivannan Sadhasivam manivannan.sadhasivam@linaro.org
>
> Note that it will noticeably impact the user experience on system-wide resume (mhi_pci_resume), because MHI devices usually take a while (a few seconds) to cold boot and reach a ready state (or time out in the worst case). So we may have people complaining about a delayed resume regression on their laptops even if they are not using the MHI device/modem function. Are we ok with that?
Are you saying that the modem will enter D3Cold all the time during system suspend? I think you are referring to x86 host machines here.
If that is the case, we should not be using the mhi_pci_runtime_*() calls in mhi_pci_suspend/resume(). Rather, the MHI stack should be powered down during suspend and powered on during resume.
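A rough sketch of what that could look like, assuming the driver's internal struct mhi_pci_device layout and the standard MHI host APIs; the callback names below are hypothetical and the existing state/error handling is omitted.

#include <linux/device.h>
#include <linux/mhi.h>

static int mhi_pci_suspend_sketch(struct device *dev)
{
	struct mhi_pci_device *mhi_pdev = dev_get_drvdata(dev);
	struct mhi_controller *mhi_cntrl = &mhi_pdev->mhi_cntrl;

	/* Power the MHI stack down gracefully instead of runtime suspending */
	mhi_power_down(mhi_cntrl, true);
	mhi_unprepare_after_power_down(mhi_cntrl);

	return 0;
}

static int mhi_pci_resume_sketch(struct device *dev)
{
	struct mhi_pci_device *mhi_pdev = dev_get_drvdata(dev);
	struct mhi_controller *mhi_cntrl = &mhi_pdev->mhi_cntrl;
	int err;

	/* Bring the MHI stack back up; this is where the cold-boot delay shows up */
	err = mhi_prepare_for_power_up(mhi_cntrl);
	if (err)
		return err;

	err = mhi_sync_power_up(mhi_cntrl);
	if (err)
		mhi_unprepare_after_power_down(mhi_cntrl);

	return err;
}

This would keep the system sleep path independent of the runtime PM path, at the cost of the cold-boot latency Loic mentions above.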
- Mani