With the current implementation the following race can happen:

* blk_pre_runtime_suspend() calls blk_freeze_queue_start() and blk_mq_unfreeze_queue().
* blk_queue_enter() calls blk_queue_pm_only() and that function returns true.
* blk_queue_enter() calls blk_pm_request_resume() and that function does not call pm_request_resume() because the queue runtime status is RPM_ACTIVE.
* blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.
Fix this race by changing the queue runtime status into RPM_SUSPENDING before switching q_usage_counter to atomic mode.
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: stable <stable@vger.kernel.org>
Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-pm.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/block/blk-pm.c b/block/blk-pm.c
index b85234d758f7..17bd020268d4 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)

 	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);

+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);
+
 	/*
 	 * Increase the pm_only counter before checking whether any
 	 * non-PM blk_queue_enter() calls are in progress to avoid that any
@@ -89,15 +93,14 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 	/* Switch q_usage_counter back to per-cpu mode. */
 	blk_mq_unfreeze_queue(q);

-	spin_lock_irq(&q->queue_lock);
-	if (ret < 0)
+	if (ret < 0) {
+		spin_lock_irq(&q->queue_lock);
+		q->rpm_status = RPM_ACTIVE;
 		pm_runtime_mark_last_busy(q->dev);
-	else
-		q->rpm_status = RPM_SUSPENDING;
-	spin_unlock_irq(&q->queue_lock);
+		spin_unlock_irq(&q->queue_lock);

-	if (ret)
 		blk_clear_pm_only(q);
+	}

 	return ret;
 }
On Sun, Aug 23, 2020 at 08:06:07PM -0700, Bart Van Assche wrote:
With the current implementation the following race can happen:
- blk_pre_runtime_suspend() calls blk_freeze_queue_start() and blk_mq_unfreeze_queue().
- blk_queue_enter() calls blk_queue_pm_only() and that function returns true.
- blk_queue_enter() calls blk_pm_request_resume() and that function does not call pm_request_resume() because the queue runtime status is RPM_ACTIVE.
- blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.
Fix this race by changing the queue runtime status into RPM_SUSPENDING before switching q_usage_counter to atomic mode.
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: stable <stable@vger.kernel.org>
Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
 block/blk-pm.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/block/blk-pm.c b/block/blk-pm.c
index b85234d758f7..17bd020268d4 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)

 	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);

+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);
+
 	/*
 	 * Increase the pm_only counter before checking whether any
 	 * non-PM blk_queue_enter() calls are in progress to avoid that any
@@ -89,15 +93,14 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 	/* Switch q_usage_counter back to per-cpu mode. */
 	blk_mq_unfreeze_queue(q);

-	spin_lock_irq(&q->queue_lock);
-	if (ret < 0)
+	if (ret < 0) {
+		spin_lock_irq(&q->queue_lock);
+		q->rpm_status = RPM_ACTIVE;
 		pm_runtime_mark_last_busy(q->dev);
-	else
-		q->rpm_status = RPM_SUSPENDING;
-	spin_unlock_irq(&q->queue_lock);
+		spin_unlock_irq(&q->queue_lock);

-	if (ret)
 		blk_clear_pm_only(q);
+	}

 	return ret;
 }
Acked-by: Alan Stern stern@rowland.harvard.edu
Hi Bart,
On Sun, 2020-08-23 at 20:06 -0700, Bart Van Assche wrote:
With the current implementation the following race can happen:
- blk_pre_runtime_suspend() calls blk_freeze_queue_start() and blk_mq_unfreeze_queue().
- blk_queue_enter() calls blk_queue_pm_only() and that function returns true.
- blk_queue_enter() calls blk_pm_request_resume() and that function does not call pm_request_resume() because the queue runtime status is RPM_ACTIVE.
- blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.
Fix this race by changing the queue runtime status into RPM_SUSPENDING before switching q_usage_counter to atomic mode.
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: stable <stable@vger.kernel.org>
Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
 block/blk-pm.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/block/blk-pm.c b/block/blk-pm.c
index b85234d758f7..17bd020268d4 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)

 	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);

+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);
Has below alternative way been considered that RPM_SUSPENDING is set after blk_mq_unfreeze_queue()?
blk_freeze_queue_start(q);
+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);
Otherwise requests can enter queue while rpm_status is RPM_SUSPENDING during a small window, i.e., before blk_set_pm_only() is invoked. This would make the definition of rpm_status ambiguous.
In this way, the racing could be also solved:
- Before blk_freeze_queue_start(), any requests shall be allowed to enter queue
- blk_freeze_queue_start() freezes the queue and blocks all upcoming requests (make them wait_event(q->mq_freeze_wq))
- rpm_status is set as RPM_SUSPENDING
- blk_mq_unfreeze_queue() wakes up q->mq_freeze_wq and then blk_pm_request_resume() can be executed
Thanks,
Stanley Chu
 	/*
 	 * Increase the pm_only counter before checking whether any
 	 * non-PM blk_queue_enter() calls are in progress to avoid that any
@@ -89,15 +93,14 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 	/* Switch q_usage_counter back to per-cpu mode. */
 	blk_mq_unfreeze_queue(q);

-	spin_lock_irq(&q->queue_lock);
-	if (ret < 0)
+	if (ret < 0) {
+		spin_lock_irq(&q->queue_lock);
+		q->rpm_status = RPM_ACTIVE;
 		pm_runtime_mark_last_busy(q->dev);
-	else
-		q->rpm_status = RPM_SUSPENDING;
-	spin_unlock_irq(&q->queue_lock);
+		spin_unlock_irq(&q->queue_lock);

-	if (ret)
 		blk_clear_pm_only(q);
+	}

 	return ret;
 }
Sorry, resend to fix typo.
Hi Bart,
On Sun, 2020-08-23 at 20:06 -0700, Bart Van Assche wrote:
With the current implementation the following race can happen:
- blk_pre_runtime_suspend() calls blk_freeze_queue_start() and blk_mq_unfreeze_queue().
- blk_queue_enter() calls blk_queue_pm_only() and that function returns true.
- blk_queue_enter() calls blk_pm_request_resume() and that function does not call pm_request_resume() because the queue runtime status is RPM_ACTIVE.
- blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.
Fix this race by changing the queue runtime status into RPM_SUSPENDING before switching q_usage_counter to atomic mode.
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: stable <stable@vger.kernel.org>
Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
 block/blk-pm.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/block/blk-pm.c b/block/blk-pm.c
index b85234d758f7..17bd020268d4 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)

 	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);

+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);
Has below alternative way been considered that RPM_SUSPENDING is set after blk_freeze_queue_start()?
blk_freeze_queue_start(q);
+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);
Otherwise requests can enter queue while rpm_status is RPM_SUSPENDING during a small window, i.e., before blk_set_pm_only() is invoked. This would make the definition of rpm_status ambiguous.
In this way, the racing could be also solved:
- Before blk_freeze_queue_start(), any requests shall be allowed to enter queue
- blk_freeze_queue_start() freezes the queue and blocks all upcoming requests (make them wait_event(q->mq_freeze_wq))
- rpm_status is set as RPM_SUSPENDING
- blk_mq_unfreeze_queue() wakes up q->mq_freeze_wq and then blk_pm_request_resume() can be executed
Thanks,
Stanley Chu
 	/*
 	 * Increase the pm_only counter before checking whether any
 	 * non-PM blk_queue_enter() calls are in progress to avoid that any
@@ -89,15 +93,14 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 	/* Switch q_usage_counter back to per-cpu mode. */
 	blk_mq_unfreeze_queue(q);

-	spin_lock_irq(&q->queue_lock);
-	if (ret < 0)
+	if (ret < 0) {
+		spin_lock_irq(&q->queue_lock);
+		q->rpm_status = RPM_ACTIVE;
 		pm_runtime_mark_last_busy(q->dev);
-	else
-		q->rpm_status = RPM_SUSPENDING;
-	spin_unlock_irq(&q->queue_lock);
+		spin_unlock_irq(&q->queue_lock);

-	if (ret)
 		blk_clear_pm_only(q);
+	}

 	return ret;
 }
On Tue, Aug 25, 2020 at 05:11:21PM +0800, Stanley Chu wrote:
Sorry, resend to fix typo.
Hi Bart,
On Sun, 2020-08-23 at 20:06 -0700, Bart Van Assche wrote:
With the current implementation the following race can happen:
- blk_pre_runtime_suspend() calls blk_freeze_queue_start() and blk_mq_unfreeze_queue().
- blk_queue_enter() calls blk_queue_pm_only() and that function returns true.
- blk_queue_enter() calls blk_pm_request_resume() and that function does not call pm_request_resume() because the queue runtime status is RPM_ACTIVE.
- blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.
Fix this race by changing the queue runtime status into RPM_SUSPENDING before switching q_usage_counter to atomic mode.
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: stable <stable@vger.kernel.org>
Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
 block/blk-pm.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/block/blk-pm.c b/block/blk-pm.c
index b85234d758f7..17bd020268d4 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)

 	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);

+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);
Has below alternative way been considered that RPM_SUSPENDING is set after blk_freeze_queue_start()?
blk_freeze_queue_start(q);
+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);
Otherwise requests can enter queue while rpm_status is RPM_SUSPENDING during a small window, i.e., before blk_set_pm_only() is invoked. This would make the definition of rpm_status ambiguous.
In this way, the racing could be also solved:
- Before blk_freeze_queue_start(), any requests shall be allowed to enter queue
- blk_freeze_queue_start() freezes the queue and blocks all upcoming requests (make them wait_event(q->mq_freeze_wq))
- rpm_status is set as RPM_SUSPENDING
- blk_mq_unfreeze_queue() wakes up q->mq_freeze_wq and then blk_pm_request_resume() can be executed
A very similar question arises concerning the transition from RPM_SUSPENDING to RPM_SUSPENDED. I am not convinced that the existing synchronization is sufficient. However, this may not matter because...
A related question concerns the BLK_MQ_REQ_PREEMPT flag. If it is set then the request is allowed while rpm_status is RPM_SUSPENDING. But in fact, the only requests which should be allowed at that time are those created by the lower-layer driver as part of its runtime-suspend handling; all other requests should be deferred. The BLK_MQ_REQ_PREEMPT flag doesn't seem like the right way to achieve this. Should we be using a different flag?
Alan Stern
On 2020-08-25 11:24, Alan Stern wrote:
A related question concerns the BLK_MQ_REQ_PREEMPT flag. If it is set then the request is allowed while rpm_status is RPM_SUSPENDING. But in fact, the only requests which should be allowed at that time are those created by the lower-layer driver as part of its runtime-suspend handling; all other requests should be deferred. The BLK_MQ_REQ_PREEMPT flag doesn't seem like the right way to achieve this. Should we be using a different flag?
I think there is already a flag that is used to mark power management commands, namely RQF_PM.
My understanding is that the BLK_MQ_REQ_PREEMPT/RQF_PREEMPT flags are used for the following purposes:

* In the SCSI core, scsi_execute() sets the BLK_MQ_PREEMPT flag. As a result, SCSI commands submitted with scsi_execute() are processed in the SDEV_QUIESCE SCSI device state. Since scsi_execute() is only used to submit other than medium access commands, in the SDEV_QUIESCE state medium access commands are paused and commands that do not access the storage medium (e.g. START/STOP commands) are processed.
* In the IDE-driver, for making one request preempt another request. From an old IDE commit (not sure if this is still relevant):

+ * If action is ide_preempt, then the rq is queued at the head of
+ * the request queue, displacing the currently-being-processed
+ * request and this function returns immediately without waiting
+ * for the new rq to be completed. This is VERY DANGEROUS, and is
+ * intended for careful use by the ATAPI tape/cdrom driver code.
Bart.
On Tue, Aug 25, 2020 at 03:22:03PM -0700, Bart Van Assche wrote:
On 2020-08-25 11:24, Alan Stern wrote:
A related question concerns the BLK_MQ_REQ_PREEMPT flag. If it is set then the request is allowed while rpm_status is RPM_SUSPENDING. But in fact, the only requests which should be allowed at that time are those created by the lower-layer driver as part of its runtime-suspend handling; all other requests should be deferred. The BLK_MQ_REQ_PREEMPT flag doesn't seem like the right way to achieve this. Should we be using a different flag?
I think there is already a flag that is used to mark power management commands, namely RQF_PM.
My understanding is that the BLK_MQ_REQ_PREEMPT/RQF_PREEMPT flags are used for the following purposes:
- In the SCSI core, scsi_execute() sets the BLK_MQ_PREEMPT flag. As a result, SCSI commands submitted with scsi_execute() are processed in the SDEV_QUIESCE SCSI device state. Since scsi_execute() is only used to submit other than medium access commands, in the SDEV_QUIESCE state medium access commands are paused and commands that do not access the storage medium (e.g. START/STOP commands) are processed.
- In the IDE-driver, for making one request preempt another request. From an old IDE commit (not sure if this is still relevant):
+ * If action is ide_preempt, then the rq is queued at the head of
+ * the request queue, displacing the currently-being-processed
+ * request and this function returns immediately without waiting
+ * for the new rq to be completed. This is VERY DANGEROUS, and is
+ * intended for careful use by the ATAPI tape/cdrom driver code.
Ah, perfect. So in blk_queue_enter(), pm should be defined in terms of RQF_PM rather than BLK_MQ_REQ_PREEMPT.
The difficulty is that the flags argument is the wrong type; RQF_PM is defined as req_flags_t, not blk_mq_req_flags_t. It is associated with a particular request after the request has been created, so after blk_queue_enter() has been called.
How can we solve this?
Alan Stern
On 2020-08-25 18:51, Alan Stern wrote:
Ah, perfect. So in blk_queue_enter(), pm should be defined in terms of RQF_PM rather than BLK_MQ_REQ_PREEMPT.
The difficulty is that the flags argument is the wrong type; RQF_PM is defined as req_flags_t, not blk_mq_req_flags_t. It is associated with a particular request after the request has been created, so after blk_queue_enter() has been called.
How can we solve this?
The current code looks a bit weird because my focus when modifying the PM code has been on not breaking any existing code.
scsi_device_quiesce() relies on blk_queue_enter() processing all PREEMPT requests. A difficulty is that scsi_device_quiesce() is used for two separate purposes:

* Runtime power management.
* SCSI domain validation. See e.g. https://lwn.net/Articles/75917/.
I think that modifying blk_queue_enter() such that it only accepts PM requests will require splitting scsi_device_quiesce() into two functions: one function that is used by the runtime power management code and another function that is used by the SCSI domain validation code. This may require introducing new SCSI device states. If new SCSI device states are introduced, that should be done without modifying the state that is reported to user space. See also sdev_states[] and show_state_field in scsi_sysfs.c.
Thanks,
Bart.
On Wed, Aug 26, 2020 at 08:35:42PM -0700, Bart Van Assche wrote:
On 2020-08-25 18:51, Alan Stern wrote:
Ah, perfect. So in blk_queue_enter(), pm should be defined in terms of RQF_PM rather than BLK_MQ_REQ_PREEMPT.
The difficulty is that the flags argument is the wrong type; RQF_PM is defined as req_flags_t, not blk_mq_req_flags_t. It is associated with a particular request after the request has been created, so after blk_queue_enter() has been called.
How can we solve this?
The current code looks a bit weird because my focus when modifying the PM code has been on not breaking any existing code.
scsi_device_quiesce() relies on blk_queue_enter() processing all PREEMPT requests. A difficulty is that scsi_device_quiesce() is used for two separate purposes:
- Runtime power management.
- SCSI domain validation. See e.g. https://lwn.net/Articles/75917/.
I think that modifying blk_queue_enter() such that it only accepts PM requests will require splitting scsi_device_quiesce() into two functions: one function that is used by the runtime power management code and another function that is used by the SCSI domain validation code. This may require introducing new SCSI device states. If new SCSI device states are introduced, that should be done without modifying the state that is reported to user space. See also sdev_states[] and show_state_field in scsi_sysfs.c.
It may not need to be that complicated. What about something like this?
Alan Stern
Index: usb-devel/block/blk-core.c
===================================================================
--- usb-devel.orig/block/blk-core.c
+++ usb-devel/block/blk-core.c
@@ -420,11 +420,11 @@ EXPORT_SYMBOL(blk_cleanup_queue);
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
- * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
+ * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PM
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-	const bool pm = flags & BLK_MQ_REQ_PREEMPT;
+	const bool pm = flags & BLK_MQ_REQ_PM;

 	while (true) {
 		bool success = false;
@@ -626,7 +626,8 @@ struct request *blk_get_request(struct r
 	struct request *req;

 	WARN_ON_ONCE(op & REQ_NOWAIT);
-	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
+	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT |
+			       BLK_MQ_REQ_PM));

 	req = blk_mq_alloc_request(q, op, flags);
 	if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
Index: usb-devel/drivers/scsi/scsi_lib.c
===================================================================
--- usb-devel.orig/drivers/scsi/scsi_lib.c
+++ usb-devel/drivers/scsi/scsi_lib.c
@@ -245,11 +245,15 @@ int __scsi_execute(struct scsi_device *s
 {
 	struct request *req;
 	struct scsi_request *rq;
+	blk_mq_req_flags_t mq_req_flags;
 	int ret = DRIVER_ERROR << 24;

+	mq_req_flags = BLK_MQ_REQ_PREEMPT;
+	if (rq_flags & RQF_PM)
+		mq_req_flags |= BLK_MQ_REQ_PM;
 	req = blk_get_request(sdev->request_queue,
 			data_direction == DMA_TO_DEVICE ?
-			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
+			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, mq_req_flags);
 	if (IS_ERR(req))
 		return ret;
 	rq = scsi_req(req);
Index: usb-devel/include/linux/blk-mq.h
===================================================================
--- usb-devel.orig/include/linux/blk-mq.h
+++ usb-devel/include/linux/blk-mq.h
@@ -435,6 +435,8 @@ enum {
 	BLK_MQ_REQ_RESERVED	= (__force blk_mq_req_flags_t)(1 << 1),
 	/* set RQF_PREEMPT */
 	BLK_MQ_REQ_PREEMPT	= (__force blk_mq_req_flags_t)(1 << 3),
+	/* used for power management */
+	BLK_MQ_REQ_PM		= (__force blk_mq_req_flags_t)(1 << 4),
 };

 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
On 2020-08-27 13:33, Alan Stern wrote:
It may not need to be that complicated. What about something like this?
Index: usb-devel/block/blk-core.c
===================================================================
--- usb-devel.orig/block/blk-core.c
+++ usb-devel/block/blk-core.c
@@ -420,11 +420,11 @@ EXPORT_SYMBOL(blk_cleanup_queue);
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
- * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
+ * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PM
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-	const bool pm = flags & BLK_MQ_REQ_PREEMPT;
+	const bool pm = flags & BLK_MQ_REQ_PM;

 	while (true) {
 		bool success = false;
@@ -626,7 +626,8 @@ struct request *blk_get_request(struct r
 	struct request *req;

 	WARN_ON_ONCE(op & REQ_NOWAIT);
-	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
+	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT |
+			       BLK_MQ_REQ_PM));

 	req = blk_mq_alloc_request(q, op, flags);
 	if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
Index: usb-devel/drivers/scsi/scsi_lib.c
===================================================================
--- usb-devel.orig/drivers/scsi/scsi_lib.c
+++ usb-devel/drivers/scsi/scsi_lib.c
@@ -245,11 +245,15 @@ int __scsi_execute(struct scsi_device *s
 {
 	struct request *req;
 	struct scsi_request *rq;
+	blk_mq_req_flags_t mq_req_flags;
 	int ret = DRIVER_ERROR << 24;

+	mq_req_flags = BLK_MQ_REQ_PREEMPT;
+	if (rq_flags & RQF_PM)
+		mq_req_flags |= BLK_MQ_REQ_PM;
 	req = blk_get_request(sdev->request_queue,
 			data_direction == DMA_TO_DEVICE ?
-			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
+			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, mq_req_flags);
 	if (IS_ERR(req))
 		return ret;
 	rq = scsi_req(req);
Index: usb-devel/include/linux/blk-mq.h
===================================================================
--- usb-devel.orig/include/linux/blk-mq.h
+++ usb-devel/include/linux/blk-mq.h
@@ -435,6 +435,8 @@ enum {
 	BLK_MQ_REQ_RESERVED	= (__force blk_mq_req_flags_t)(1 << 1),
 	/* set RQF_PREEMPT */
 	BLK_MQ_REQ_PREEMPT	= (__force blk_mq_req_flags_t)(1 << 3),
+	/* used for power management */
+	BLK_MQ_REQ_PM		= (__force blk_mq_req_flags_t)(1 << 4),
 };

 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
I think this patch will break SCSI domain validation. The SCSI domain validation code calls scsi_device_quiesce() and that function in turn calls blk_set_pm_only(). The SCSI domain validation code submits SCSI commands with the BLK_MQ_REQ_PREEMPT flag. Since the above code postpones such requests while blk_set_pm_only() is in effect, I think the above patch will cause the SCSI domain validation code to deadlock.
Bart.
On Thu, Aug 27, 2020 at 08:27:49PM -0700, Bart Van Assche wrote:
On 2020-08-27 13:33, Alan Stern wrote:
It may not need to be that complicated. What about something like this?
I think this patch will break SCSI domain validation. The SCSI domain validation code calls scsi_device_quiesce() and that function in turn calls blk_set_pm_only(). The SCSI domain validation code submits SCSI commands with the BLK_MQ_REQ_PREEMPT flag. Since the above code postpones such requests while blk_set_pm_only() is in effect, I think the above patch will cause the SCSI domain validation code to deadlock.
Yes, you're right.
There may be an even simpler solution: Ensure that SCSI domain validation is mutually exclusive with runtime PM. It's already mutually exclusive with system PM, so this makes sense.
What do you think of the patch below?
Alan Stern
Index: usb-devel/drivers/scsi/scsi_transport_spi.c
===================================================================
--- usb-devel.orig/drivers/scsi/scsi_transport_spi.c
+++ usb-devel/drivers/scsi/scsi_transport_spi.c
@@ -1001,7 +1001,7 @@ spi_dv_device(struct scsi_device *sdev)
 	 * Because this function and the power management code both call
 	 * scsi_device_quiesce(), it is not safe to perform domain validation
 	 * while suspend or resume is in progress. Hence the
-	 * lock/unlock_system_sleep() calls.
+	 * lock/unlock_system_sleep() and scsi_autopm_get/put_device() calls.
 	 */
 	lock_system_sleep();

@@ -1018,10 +1018,13 @@ spi_dv_device(struct scsi_device *sdev)
 	if (unlikely(!buffer))
 		goto out_put;

+	if (scsi_autopm_get_device(sdev))
+		goto out_free;
+
 	/* We need to verify that the actual device will quiesce; the
 	 * later target quiesce is just a nice to have */
 	if (unlikely(scsi_device_quiesce(sdev)))
-		goto out_free;
+		goto out_autopm_put;

 	scsi_target_quiesce(starget);

@@ -1041,6 +1044,8 @@ spi_dv_device(struct scsi_device *sdev)

 	spi_initial_dv(starget) = 1;

+ out_autopm_put:
+	scsi_autopm_put_device(sdev);
 out_free:
 	kfree(buffer);
 out_put:
On 2020-08-28 08:37, Alan Stern wrote:
On Thu, Aug 27, 2020 at 08:27:49PM -0700, Bart Van Assche wrote:
On 2020-08-27 13:33, Alan Stern wrote:
It may not need to be that complicated. What about something like this?
I think this patch will break SCSI domain validation. The SCSI domain validation code calls scsi_device_quiesce() and that function in turn calls blk_set_pm_only(). The SCSI domain validation code submits SCSI commands with the BLK_MQ_REQ_PREEMPT flag. Since the above code postpones such requests while blk_set_pm_only() is in effect, I think the above patch will cause the SCSI domain validation code to deadlock.
Yes, you're right.
There may be an even simpler solution: Ensure that SCSI domain validation is mutually exclusive with runtime PM. It's already mutually exclusive with system PM, so this makes sense.
What do you think of the patch below?
Alan Stern
Index: usb-devel/drivers/scsi/scsi_transport_spi.c
===================================================================
--- usb-devel.orig/drivers/scsi/scsi_transport_spi.c
+++ usb-devel/drivers/scsi/scsi_transport_spi.c
@@ -1001,7 +1001,7 @@ spi_dv_device(struct scsi_device *sdev)
 	 * Because this function and the power management code both call
 	 * scsi_device_quiesce(), it is not safe to perform domain validation
 	 * while suspend or resume is in progress. Hence the
-	 * lock/unlock_system_sleep() calls.
+	 * lock/unlock_system_sleep() and scsi_autopm_get/put_device() calls.
 	 */
 	lock_system_sleep();

@@ -1018,10 +1018,13 @@ spi_dv_device(struct scsi_device *sdev)
 	if (unlikely(!buffer))
 		goto out_put;

+	if (scsi_autopm_get_device(sdev))
+		goto out_free;
+
 	/* We need to verify that the actual device will quiesce; the
 	 * later target quiesce is just a nice to have */
 	if (unlikely(scsi_device_quiesce(sdev)))
-		goto out_free;
+		goto out_autopm_put;

 	scsi_target_quiesce(starget);

@@ -1041,6 +1044,8 @@ spi_dv_device(struct scsi_device *sdev)

 	spi_initial_dv(starget) = 1;

+ out_autopm_put:
+	scsi_autopm_put_device(sdev);
 out_free:
 	kfree(buffer);
 out_put:
Hi Alan,
I think this is only a part of the solution. scsi_target_quiesce() invokes scsi_device_quiesce() and that function in turn calls blk_set_pm_only(). So I think that the above is not sufficient to fix the deadlock mentioned in my previous email.
Maybe it is possible to fix this by creating a new request queue and by submitting the DV SCSI commands to that new request queue. There may be better solutions.
Bart.
On Fri, Aug 28, 2020 at 05:51:03PM -0700, Bart Van Assche wrote:
On 2020-08-28 08:37, Alan Stern wrote:
On Thu, Aug 27, 2020 at 08:27:49PM -0700, Bart Van Assche wrote:
On 2020-08-27 13:33, Alan Stern wrote:
It may not need to be that complicated. What about something like this?
I think this patch will break SCSI domain validation. The SCSI domain validation code calls scsi_device_quiesce() and that function in turn calls blk_set_pm_only(). The SCSI domain validation code submits SCSI commands with the BLK_MQ_REQ_PREEMPT flag. Since the above code postpones such requests while blk_set_pm_only() is in effect, I think the above patch will cause the SCSI domain validation code to deadlock.
Yes, you're right.
There may be an even simpler solution: Ensure that SCSI domain validation is mutually exclusive with runtime PM. It's already mutually exclusive with system PM, so this makes sense.
What do you think of the patch below?
Alan Stern
Index: usb-devel/drivers/scsi/scsi_transport_spi.c
===================================================================
--- usb-devel.orig/drivers/scsi/scsi_transport_spi.c
+++ usb-devel/drivers/scsi/scsi_transport_spi.c
@@ -1001,7 +1001,7 @@ spi_dv_device(struct scsi_device *sdev)
 	 * Because this function and the power management code both call
 	 * scsi_device_quiesce(), it is not safe to perform domain validation
 	 * while suspend or resume is in progress. Hence the
-	 * lock/unlock_system_sleep() calls.
+	 * lock/unlock_system_sleep() and scsi_autopm_get/put_device() calls.
 	 */
 	lock_system_sleep();

@@ -1018,10 +1018,13 @@ spi_dv_device(struct scsi_device *sdev)
 	if (unlikely(!buffer))
 		goto out_put;

+	if (scsi_autopm_get_device(sdev))
+		goto out_free;
+
 	/* We need to verify that the actual device will quiesce; the
 	 * later target quiesce is just a nice to have */
 	if (unlikely(scsi_device_quiesce(sdev)))
-		goto out_free;
+		goto out_autopm_put;

 	scsi_target_quiesce(starget);

@@ -1041,6 +1044,8 @@ spi_dv_device(struct scsi_device *sdev)

 	spi_initial_dv(starget) = 1;

+ out_autopm_put:
+	scsi_autopm_put_device(sdev);
 out_free:
 	kfree(buffer);
 out_put:
Hi Alan,
I think this is only a part of the solution. scsi_target_quiesce() invokes scsi_device_quiesce() and that function in turn calls blk_set_pm_only(). So I think that the above is not sufficient to fix the deadlock mentioned in my previous email.
Sorry, it sounds like you misinterpreted my preceding email. I meant to suggest that the patch above should be considered _instead of_ the patch that introduced BLK_MQ_REQ_PM. So blk_queue_enter() would remain unchanged, using the BLK_MQ_REQ_PREEMPT flag to decide whether or not to postpone new requests. Thus the deadlock you're concerned about would not arise.
Maybe it is possible to fix this by creating a new request queue and by submitting the DV SCSI commands to that new request queue. There may be better solutions.
I don't think that is necessary. After all, quiescing is quiescing, whether it is done for runtime power management, domain validation, or anything else. What we need to avoid is one thread trying to keep a device quiescent at the same time that another thread is re-activating it.
To some extent the SCSI core addresses this by allowing only one thread to put a device into the SDEV_QUIESCE state at a time. Making domain validation mutually exclusive with runtime suspend is my attempt to prevent any odd corner-case interactions from cropping up.
Alan Stern
On 2020-08-28 18:12, Alan Stern wrote:
On Fri, Aug 28, 2020 at 05:51:03PM -0700, Bart Van Assche wrote:
On 2020-08-28 08:37, Alan Stern wrote:
There may be an even simpler solution: Ensure that SCSI domain validation is mutually exclusive with runtime PM. It's already mutually exclusive with system PM, so this makes sense.
What do you think of the patch below?
Alan Stern
Index: usb-devel/drivers/scsi/scsi_transport_spi.c
===================================================================
--- usb-devel.orig/drivers/scsi/scsi_transport_spi.c
+++ usb-devel/drivers/scsi/scsi_transport_spi.c
@@ -1001,7 +1001,7 @@ spi_dv_device(struct scsi_device *sdev)
 	 * Because this function and the power management code both call
 	 * scsi_device_quiesce(), it is not safe to perform domain validation
 	 * while suspend or resume is in progress. Hence the
-	 * lock/unlock_system_sleep() calls.
+	 * lock/unlock_system_sleep() and scsi_autopm_get/put_device() calls.
 	 */
 	lock_system_sleep();

@@ -1018,10 +1018,13 @@ spi_dv_device(struct scsi_device *sdev)
 	if (unlikely(!buffer))
 		goto out_put;

+	if (scsi_autopm_get_device(sdev))
+		goto out_free;
+
 	/* We need to verify that the actual device will quiesce; the
 	 * later target quiesce is just a nice to have */
 	if (unlikely(scsi_device_quiesce(sdev)))
-		goto out_free;
+		goto out_autopm_put;

 	scsi_target_quiesce(starget);

@@ -1041,6 +1044,8 @@ spi_dv_device(struct scsi_device *sdev)

 	spi_initial_dv(starget) = 1;

+ out_autopm_put:
+	scsi_autopm_put_device(sdev);
 out_free:
 	kfree(buffer);
 out_put:
Hi Alan,
I think this is only a part of the solution. scsi_target_quiesce() invokes scsi_device_quiesce() and that function in turn calls blk_set_pm_only(). So I think that the above is not sufficient to fix the deadlock mentioned in my previous email.
Sorry, it sounds like you misinterpreted my preceding email. I meant to suggest that the patch above should be considered _instead of_ the patch that introduced BLK_MQ_REQ_PM. So blk_queue_enter() would remain unchanged, using the BLK_MQ_REQ_PREEMPT flag to decide whether or not to postpone new requests. Thus the deadlock you're concerned about would not arise.
Hi Alan,
Thanks for the clarification. I agree that the above patch should serialize SCSI DV and runtime PM. I'm fine with the above patch.
Thanks,
Bart.
On 2020-08-25 02:11, Stanley Chu wrote:
diff --git a/block/blk-pm.c b/block/blk-pm.c
index b85234d758f7..17bd020268d4 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
 
+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);
Has the alternative below been considered, i.e., setting RPM_SUSPENDING only after blk_freeze_queue_start()?

	blk_freeze_queue_start(q);

+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);

Otherwise requests can enter the queue while rpm_status is RPM_SUSPENDING during a small window, i.e., before blk_set_pm_only() is invoked. This would make the definition of rpm_status ambiguous.
This way, the race would also be solved:

- Before blk_freeze_queue_start(), any request is allowed to enter the queue.
- blk_freeze_queue_start() freezes the queue and blocks all upcoming requests (makes them wait in wait_event(q->mq_freeze_wq)).
- rpm_status is set to RPM_SUSPENDING.
- blk_mq_unfreeze_queue() wakes up q->mq_freeze_wq, after which blk_pm_request_resume() can be executed.
Hi Stanley,
I prefer the order from the patch. I think it is important to change q->rpm_status into RPM_SUSPENDING before blk_queue_enter() calls blk_queue_pm_only(). Otherwise it could happen that blk_queue_enter() calls blk_pm_request_resume() while q->rpm_status == RPM_ACTIVE, resulting in blk_queue_enter() not resuming a queue although that queue should be resumed.
Thanks,
Bart.
Hi Bart,
On Tue, 2020-08-25 at 19:58 -0700, Bart Van Assche wrote:
On 2020-08-25 02:11, Stanley Chu wrote:
diff --git a/block/blk-pm.c b/block/blk-pm.c
index b85234d758f7..17bd020268d4 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
 
+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);
Has the alternative below been considered, i.e., setting RPM_SUSPENDING only after blk_freeze_queue_start()?

	blk_freeze_queue_start(q);

+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);

Otherwise requests can enter the queue while rpm_status is RPM_SUSPENDING during a small window, i.e., before blk_set_pm_only() is invoked. This would make the definition of rpm_status ambiguous.
This way, the race would also be solved:

- Before blk_freeze_queue_start(), any request is allowed to enter the queue.
- blk_freeze_queue_start() freezes the queue and blocks all upcoming requests (makes them wait in wait_event(q->mq_freeze_wq)).
- rpm_status is set to RPM_SUSPENDING.
- blk_mq_unfreeze_queue() wakes up q->mq_freeze_wq, after which blk_pm_request_resume() can be executed.
Hi Stanley,
I prefer the order from the patch. I think it is important to change q->rpm_status into RPM_SUSPENDING before blk_queue_enter() calls blk_queue_pm_only(). Otherwise it could happen that blk_queue_enter() calls blk_pm_request_resume() while q->rpm_status == RPM_ACTIVE, resulting in blk_queue_enter() not resuming a queue although that queue should be resumed.
I see. Thanks for the explanation.
With the order used in the patch, it is guaranteed that blk_pm_request_resume() always triggers the runtime-resume flow during blk_queue_enter() once blk_set_pm_only() has been called by the concurrent suspend flow, i.e., blk_pre_runtime_suspend().
However, rpm_status is then ambiguous while it is RPM_SUSPENDING, even though that does not appear to cause any obvious problems.
Acked-By: Stanley Chu stanley.chu@mediatek.com