From: Brian Kao powenkao@google.com
[ Upstream commit b4bb6daf4ac4d4560044ecdd81e93aa2f6acbb06 ]
When a W-LUN resume fails, its parent devices in the SCSI hierarchy, including the scsi_target, may be runtime suspended. Subsequently, the error handler in ufshcd_recover_pm_error() fails to set the W-LUN device back to active because the parent target is not active. This results in the following errors:
google-ufshcd 3c2d0000.ufs: ufshcd_err_handler started; HBA state eh_fatal; ... ufs_device_wlun 0:0:0:49488: START_STOP failed for power mode: 1, result 40000 ufs_device_wlun 0:0:0:49488: ufshcd_wl_runtime_resume failed: -5 ... ufs_device_wlun 0:0:0:49488: runtime PM trying to activate child device 0:0:0:49488 but parent (target0:0:0) is not active
Address this by:
1. Ensuring the W-LUN's parent scsi_target is runtime resumed before attempting to set the W-LUN to active within ufshcd_recover_pm_error().
2. Explicitly checking for power.runtime_error on the HBA and W-LUN devices before calling pm_runtime_set_active() to clear the error state.
3. Adding pm_runtime_get_sync(hba->dev) in ufshcd_err_handling_prepare() to ensure the HBA itself is active during error recovery, even if a child device resume failed.
These changes ensure the device power states are managed correctly during error recovery.
Signed-off-by: Brian Kao powenkao@google.com Tested-by: Brian Kao powenkao@google.com Reviewed-by: Bart Van Assche bvanassche@acm.org Link: https://patch.msgid.link/20251112063214.1195761-1-powenkao@google.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org ---
LLM Generated explanations, may be completely bogus:
## Analysis Summary
### Problem Description This commit fixes a bug where UFS error recovery fails after a W-LUN (Well Known Logical Unit Number) resume error. When a W-LUN resume fails:
1. The parent devices in the SCSI hierarchy (including `scsi_target`) remain runtime suspended 2. The error handler's `ufshcd_recover_pm_error()` attempts to set the W-LUN device active 3. The PM core rejects this because the parent target is not active, producing the error: ``` runtime PM trying to activate child device 0:0:0:49488 but parent (target0:0:0) is not active ```
### Technical Analysis of the Fix
**Original (buggy) code:** ```c ret = pm_runtime_set_active(&hba->ufs_device_wlun->sdev_gendev); if (ret) ret = pm_runtime_set_active(hba->dev); ```
This incorrectly tries to activate a child device while its parent is suspended - violating the PM hierarchy requirement.
**The fix does three things:** 1. **Adds `pm_runtime_get_sync(hba->dev)` in `ufshcd_err_handling_prepare()`** - Ensures HBA is active during error recovery even if a child device resume failed 2. **Fixes PM hierarchy in `ufshcd_recover_pm_error()`** - Explicitly resumes the parent `scsi_target` before trying to activate the child W-LUN 3. **Checks `power.runtime_error`** - Only clears error state on devices that actually have an error, rather than relying on return values
### Stable Tree Criteria Assessment
| Criterion | Assessment | |-----------|------------| | Fixes real bug | ✅ YES - Error handling completely fails without this | | Obviously correct | ✅ YES - Follows standard PM parent-before-child rules | | Small and contained | ✅ YES - ~35 lines in one file, error handling path only | | No new features | ✅ YES - Pure bugfix | | Tested | ✅ YES - Has Tested-by and Reviewed-by: Bart Van Assche | | Exists in stable | ✅ YES - `ufshcd_recover_pm_error()` introduced in v5.10 | | Dependencies | ✅ NONE - Uses standard, long-existing PM APIs |
### User Impact - **Affected systems**: Android devices, embedded systems with UFS storage - **Severity**: Error handler failure can lead to I/O failures and potential data loss - **Frequency**: Occurs when W-LUN resume fails and error handler tries to recover
### Risk Assessment - **Low regression risk**: Changes are in error handling code path only - **Well-understood fix**: Standard PM hierarchy handling pattern - **Good review**: Bart Van Assche (experienced SCSI/block maintainer) reviewed it
### Conclusion
This is a clear-cut bug fix for UFS error handling. The bug causes the error handler to fail completely when a W-LUN resume error occurs, which can leave the storage subsystem in an unrecoverable state. The fix correctly implements PM hierarchy requirements (parent must be active before child), is well-tested, has expert review, and uses standard APIs that exist in all stable trees since v5.10. The changes are contained to the error handling path with minimal regression risk.
**YES**
drivers/ufs/core/ufshcd.c | 36 ++++++++++++++++++++++++++++-------- 1 file changed, 28 insertions(+), 8 deletions(-)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c index d6a060a724618..ce52c3bafbe8f 100644 --- a/drivers/ufs/core/ufshcd.c +++ b/drivers/ufs/core/ufshcd.c @@ -6498,6 +6498,11 @@ static void ufshcd_clk_scaling_suspend(struct ufs_hba *hba, bool suspend)
static void ufshcd_err_handling_prepare(struct ufs_hba *hba) { + /* + * A WLUN resume failure could potentially lead to the HBA being + * runtime suspended, so take an extra reference on hba->dev. + */ + pm_runtime_get_sync(hba->dev); ufshcd_rpm_get_sync(hba); if (pm_runtime_status_suspended(&hba->ufs_device_wlun->sdev_gendev) || hba->is_sys_suspended) { @@ -6537,6 +6542,7 @@ static void ufshcd_err_handling_unprepare(struct ufs_hba *hba) if (ufshcd_is_clkscaling_supported(hba)) ufshcd_clk_scaling_suspend(hba, false); ufshcd_rpm_put(hba); + pm_runtime_put(hba->dev); }
static inline bool ufshcd_err_handling_should_stop(struct ufs_hba *hba) @@ -6551,28 +6557,42 @@ static inline bool ufshcd_err_handling_should_stop(struct ufs_hba *hba) #ifdef CONFIG_PM static void ufshcd_recover_pm_error(struct ufs_hba *hba) { + struct scsi_target *starget = hba->ufs_device_wlun->sdev_target; struct Scsi_Host *shost = hba->host; struct scsi_device *sdev; struct request_queue *q; - int ret; + bool resume_sdev_queues = false;
hba->is_sys_suspended = false; + /* - * Set RPM status of wlun device to RPM_ACTIVE, - * this also clears its runtime error. + * Ensure the parent's error status is cleared before proceeding + * to the child, as the parent must be active to activate the child. */ - ret = pm_runtime_set_active(&hba->ufs_device_wlun->sdev_gendev); + if (hba->dev->power.runtime_error) { + /* hba->dev has no functional parent thus simplily set RPM_ACTIVE */ + pm_runtime_set_active(hba->dev); + resume_sdev_queues = true; + } + + if (hba->ufs_device_wlun->sdev_gendev.power.runtime_error) { + /* + * starget, parent of wlun, might be suspended if wlun resume failed. + * Make sure parent is resumed before set child (wlun) active. + */ + pm_runtime_get_sync(&starget->dev); + pm_runtime_set_active(&hba->ufs_device_wlun->sdev_gendev); + pm_runtime_put_sync(&starget->dev); + resume_sdev_queues = true; + }
- /* hba device might have a runtime error otherwise */ - if (ret) - ret = pm_runtime_set_active(hba->dev); /* * If wlun device had runtime error, we also need to resume those * consumer scsi devices in case any of them has failed to be * resumed due to supplier runtime resume failure. This is to unblock * blk_queue_enter in case there are bios waiting inside it. */ - if (!ret) { + if (resume_sdev_queues) { shost_for_each_device(sdev, shost) { q = sdev->request_queue; if (q->dev && (q->rpm_status == RPM_SUSPENDED ||