namespace's request queue is frozen and quiesced during error recovering, writeback IO is blocked in bio_queue_enter(), so fsync_bdev() <- del_gendisk() can't move on, and causes IO hang. Removal could be from sysfs, hard unplug or error handling.
Fix this kind of issue by marking controller as DEAD if removal breaks error recovery.
This ways is reasonable too, because controller can't be recovered any more after being removed.
Cc: stable@vger.kernel.org Reported-by: Chunguang Xu brookxu.cn@gmail.com Closes: https://lore.kernel.org/linux-nvme/cover.1685350577.git.chunguang.xu@shopee.... Reported-by: Yi Zhang yi.zhang@redhat.com Signed-off-by: Ming Lei ming.lei@redhat.com --- V2: - patch style fix, as suggested by Christoph - document this handling
drivers/nvme/host/core.c | 9 ++++++++- drivers/nvme/host/nvme.h | 1 + 2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index fdfcf2781c85..1419eb35b47a 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -567,6 +567,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl, }
if (changed) { + ctrl->old_state = ctrl->state; ctrl->state = new_state; wake_up_all(&ctrl->state_wq); } @@ -4054,8 +4055,14 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl) * disconnected. In that case, we won't be able to flush any data while * removing the namespaces' disks; fail all the queues now to avoid * potentially having to clean up the failed sync later. + * + * If this removal happens during error recovering, resetting part + * may not be started, or controller isn't be recovered completely, + * so we have to treat controller as DEAD for avoiding IO hang since + * queues can be left as frozen and quiesced. */ - if (ctrl->state == NVME_CTRL_DEAD) { + if (ctrl->state == NVME_CTRL_DEAD || + ctrl->old_state != NVME_CTRL_LIVE) { nvme_mark_namespaces_dead(ctrl); nvme_unquiesce_io_queues(ctrl); } diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 9a98c14c552a..ce67856d4d4f 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -254,6 +254,7 @@ struct nvme_ctrl { bool comp_seen; bool identified; enum nvme_ctrl_state state; + enum nvme_ctrl_state old_state; spinlock_t lock; struct mutex scan_lock; const struct nvme_ctrl_ops *ops;
On Thu, Jun 29, 2023 at 02:48:18PM +0800, Ming Lei wrote:
@@ -4054,8 +4055,14 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl) * disconnected. In that case, we won't be able to flush any data while * removing the namespaces' disks; fail all the queues now to avoid * potentially having to clean up the failed sync later.
*
* If this removal happens during error recovering, resetting part
* may not be started, or controller isn't be recovered completely,
* so we have to treat controller as DEAD for avoiding IO hang since
*/* queues can be left as frozen and quiesced.
- if (ctrl->state == NVME_CTRL_DEAD) {
- if (ctrl->state == NVME_CTRL_DEAD ||
nvme_mark_namespaces_dead(ctrl); nvme_unquiesce_io_queues(ctrl);ctrl->old_state != NVME_CTRL_LIVE) {
Thanks for the comment and style, but I really still think doing the state check was wrong to start with, and adding a check on the old state makes things significantly worse. Can we try to brainstorm on how do this properly?
I think we need to first figure out how to balance the quiesce/unquiesce calls, the placement of the nvme_mark_namespaces_dead call should be the simple part.
On Thu, Jun 29, 2023 at 09:33:05AM +0200, Christoph Hellwig wrote:
On Thu, Jun 29, 2023 at 02:48:18PM +0800, Ming Lei wrote:
@@ -4054,8 +4055,14 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl) * disconnected. In that case, we won't be able to flush any data while * removing the namespaces' disks; fail all the queues now to avoid * potentially having to clean up the failed sync later.
*
* If this removal happens during error recovering, resetting part
* may not be started, or controller isn't be recovered completely,
* so we have to treat controller as DEAD for avoiding IO hang since
*/* queues can be left as frozen and quiesced.
- if (ctrl->state == NVME_CTRL_DEAD) {
- if (ctrl->state == NVME_CTRL_DEAD ||
nvme_mark_namespaces_dead(ctrl); nvme_unquiesce_io_queues(ctrl);ctrl->old_state != NVME_CTRL_LIVE) {
Thanks for the comment and style, but I really still think doing the state check was wrong to start with, and adding a check on the old state makes things significantly worse. Can we try to brainstorm on how do this properly?
Removal comes during error recovery, and the old state is just for figuring out this situation, and I think it is documented clearly, and same with the change itself.
We need to fix it in one simple way for -stable and downstream, so I'd suggest to apply the simple fix first.
I think we need to first figure out how to balance the quiesce/unquiesce calls, the placement of the nvme_mark_namespaces_dead call should be the simple part.
The root cause is that nvme driver takes 2 stage error recovery(teardown and reset), both two are run from different contexts. The teardown part freezes and quiesces queues, and reset stage does the unfreeze and unquiesce part. Device removal can come any time, maybe before starting the reset, and maybe during resetting, so it is hard to balance the two pair APIs if not calling them in same context. And it has been one long-term issue.
I am working to move freeze to reset handler[1], which can balance freeze API.
For quiesce API, I think it still need to be done unconditionally when removing NSs.
But this kind of work can't be fix candidate for -stable or downstream cause the change is bigger & more complicated.
[1] https://lore.kernel.org/linux-block/5bddeeb5-39d2-7cec-70ac-e3c623a8fca6@gri...
Thanks, Ming
linux-stable-mirror@lists.linaro.org