Normal request completion can happen before or while BLK_EH_RESET_TIMER is being handled, and this race may cause the request to never be completed, since the driver's .timeout() may keep returning BLK_EH_RESET_TIMER.
This issue can't be fixed completely by the driver, since normal completion can happen between .timeout() returning and BLK_EH_RESET_TIMER being handled.
This patch fixes the race by introducing the rq state MQ_RQ_COMPLETE_IN_RESET, and by reading/writing the rq's state while holding the queue lock. The lock could actually be per-request, but it isn't worth introducing a new lock for such an unusual event.
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
This is another way to fix this long-standing issue, and this solution turns out to be much simpler.
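For illustration only, here is a minimal userspace model (pthreads; all names are made up, this is not kernel code) of the check-under-lock idea: whichever of the two paths runs first, the request ends up completed exactly once, and the timer is only re-armed when no completion has happened.

/*
 * Minimal userspace model of the completion vs BLK_EH_RESET_TIMER race
 * fix: both paths take a lock and agree on a per-request state, so a
 * completion that races with the timeout handler is recorded as
 * COMPLETE_IN_RESET and the timeout path completes the request instead
 * of re-arming the timer.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

enum rq_state { IN_FLIGHT, COMPLETE, COMPLETE_IN_RESET };

struct mock_rq {
	pthread_mutex_t lock;
	enum rq_state state;
	bool aborted;		/* stands in for aborted_gstate */
};

/* Normal completion, possibly racing with the timeout handler. */
static void *complete_path(void *arg)
{
	struct mock_rq *rq = arg;

	pthread_mutex_lock(&rq->lock);
	if (!rq->aborted)
		rq->state = COMPLETE;		/* fast path */
	else
		rq->state = COMPLETE_IN_RESET;	/* timeout path owns it now */
	pthread_mutex_unlock(&rq->lock);
	return NULL;
}

/* Timeout handler after the driver asked for the timer to be reset. */
static void *timeout_path(void *arg)
{
	struct mock_rq *rq = arg;

	pthread_mutex_lock(&rq->lock);
	if (rq->state == COMPLETE_IN_RESET) {
		rq->state = COMPLETE;	/* complete instead of re-arming */
		printf("completed from timeout path\n");
	} else if (rq->state == IN_FLIGHT) {
		rq->aborted = false;	/* clear abort mark, re-arm timer */
		printf("timer re-armed\n");
	}
	pthread_mutex_unlock(&rq->lock);
	return NULL;
}

int main(void)
{
	struct mock_rq rq = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.state = IN_FLIGHT,
		.aborted = true,	/* timeout code has marked it aborted */
	};
	pthread_t a, b;

	pthread_create(&a, NULL, complete_path, &rq);
	pthread_create(&b, NULL, timeout_path, &rq);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("final state: %d\n", rq.state);
	return 0;
}

Built with "cc -pthread", it prints either "timer re-armed" or "completed from timeout path" depending on scheduling, but the completion is never lost in either interleaving.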
 block/blk-mq.c | 44 +++++++++++++++++++++++++++++++++++++++-----
 block/blk-mq.h |  1 +
 2 files changed, 40 insertions(+), 5 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0dc9e341c2a7..12e8850e3905 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -630,10 +630,27 @@ void blk_mq_complete_request(struct request *rq)
 	 * However, that would complicate paths which want to synchronize
 	 * against us. Let stay in sync with the issue path so that
 	 * hctx_lock() covers both issue and completion paths.
+	 *
+	 * Cover complete vs BLK_EH_RESET_TIMER race in slow path with
+	 * helding queue lock.
 	 */
 	hctx_lock(hctx, &srcu_idx);
 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 		__blk_mq_complete_request(rq);
+	else {
+		unsigned long flags;
+		bool need_complete = false;
+
+		spin_lock_irqsave(q->queue_lock, flags);
+		if (!blk_mq_rq_aborted_gstate(rq))
+			need_complete = true;
+		else
+			blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_RESET);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+
+		if (need_complete)
+			__blk_mq_complete_request(rq);
+	}
 	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -814,24 +831,41 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
+	unsigned long flags;
 
 	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
 
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
+again:
 	switch (ret) {
 	case BLK_EH_HANDLED:
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
 		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
+		 * The normal completion may happen during handling the
+		 * timeout, or even after returning from .timeout(), so
+		 * once the request has been completed, we can't reset
+		 * timer any more since this request may be handled as
+		 * BLK_EH_RESET_TIMER in next timeout handling too, and
+		 * it has to be completed in this situation.
+		 *
+		 * Holding the queue lock to cover read/write rq's
+		 * aborted_gstate and normal state, so the race can be
+		 * avoided completely.
 		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
-		blk_add_timer(req);
+		spin_lock_irqsave(req->q->queue_lock, flags);
+		if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
+			blk_mq_rq_update_aborted_gstate(req, 0);
+			blk_add_timer(req);
+		} else {
+			blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
+			ret = BLK_EH_HANDLED;
+			goto again;
+		}
+		spin_unlock_irqrestore(req->q->queue_lock, flags);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..6dc242fc785a 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -35,6 +35,7 @@ enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
 	MQ_RQ_COMPLETE		= 2,
+	MQ_RQ_COMPLETE_IN_RESET	= 3,
 
 	MQ_RQ_STATE_BITS	= 2,
 	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
Hello, Ming.
On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote: ...
spin_lock_irqsave(req->q->queue_lock, flags);
if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
Nothing prevents the above blk_add_timer() racing against the next recycle instance of the request, so this still leaves a small race window.
Thanks.
On Wed, Apr 11, 2018 at 02:30:07PM -0700, Tejun Heo wrote:
Hello, Ming.
On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote: ...
spin_lock_irqsave(req->q->queue_lock, flags);
if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
Nothing prevents the above blk_add_timer() racing against the next recycle instance of the request, so this still leaves a small race window.
OK.
But this small race window can be avoided by running blk_add_timer(req) before blk_mq_rq_update_aborted_gstate(req, 0), can't it?
Hello,
On Thu, Apr 12, 2018 at 06:43:45AM +0800, Ming Lei wrote:
On Wed, Apr 11, 2018 at 02:30:07PM -0700, Tejun Heo wrote:
Hello, Ming.
On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote: ...
spin_lock_irqsave(req->q->queue_lock, flags);
if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
Nothing prevents the above blk_add_timer() racing against the next recycle instance of the request, so this still leaves a small race window.
OK.
But this small race window can be avoided by running blk_add_timer(req) before blk_mq_rq_update_aborted_gstate(req, 0), can't it?
Not really because aborted_gstate right now doesn't have any memory barrier around it, so nothing ensures blk_add_timer() actually appears before. We can either add the matching barriers in aborted_gstate update and when it's read in the normal completion path, or we can wait for the update to be visible everywhere by waiting for rcu grace period (because the reader is rcu protected).
Thanks.
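To look at the ordering concern in isolation, a userspace analogy (C11 atomics; hypothetical names, not the blk-mq code) of the release/acquire pairing being described might look like this:

/*
 * Userspace analogy of the missing barrier pairing: without
 * release/acquire ordering, the store that re-arms the timer may become
 * visible after the store that clears the "aborted" mark, so a
 * completion could observe the cleared mark while the timer is not yet
 * re-armed. All names below are made up for illustration.
 */
#include <stdatomic.h>
#include <stdbool.h>

struct mock_rq {
	_Atomic bool timer_armed;
	_Atomic unsigned long aborted_gstate;
};

/* Timeout side: re-arm first, then publish the cleared abort mark. */
static void reset_timer(struct mock_rq *rq)
{
	atomic_store_explicit(&rq->timer_armed, true, memory_order_relaxed);
	/* release pairs with the acquire on the completion side below */
	atomic_store_explicit(&rq->aborted_gstate, 0, memory_order_release);
}

/* Completion side: only trust timer_armed after seeing the cleared mark. */
static bool completion_sees_armed_timer(struct mock_rq *rq)
{
	if (atomic_load_explicit(&rq->aborted_gstate,
				 memory_order_acquire) != 0)
		return false;	/* still marked aborted */
	return atomic_load_explicit(&rq->timer_armed, memory_order_relaxed);
}

int main(void)
{
	struct mock_rq rq = { .timer_armed = false, .aborted_gstate = 1 };

	reset_timer(&rq);
	return completion_sees_armed_timer(&rq) ? 0 : 1;
}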
On Wed, Apr 11, 2018 at 03:47:12PM -0700, Tejun Heo wrote:
Hello,
On Thu, Apr 12, 2018 at 06:43:45AM +0800, Ming Lei wrote:
On Wed, Apr 11, 2018 at 02:30:07PM -0700, Tejun Heo wrote:
Hello, Ming.
On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote: ...
spin_lock_irqsave(req->q->queue_lock, flags);
if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
Nothing prevents the above blk_add_timer() racing against the next recycle instance of the request, so this still leaves a small race window.
OK.
But this small race window can be avoided by running blk_add_timer(req) before blk_mq_rq_update_aborted_gstate(req, 0), can't it?
Not really because aborted_gstate right now doesn't have any memory barrier around it, so nothing ensures blk_add_timer() actually appears before. We can either add the matching barriers in aborted_gstate update and when it's read in the normal completion path, or we can wait for the update to be visible everywhere by waiting for rcu grace period (because the reader is rcu protected).
Seems not necessary.
Suppose it is out of order: the only side-effect is that the newly recycled request is timed out a bit late, which I think we can survive, right?
But it needs to be documented.
-- Ming
On Thu, Apr 12, 2018 at 07:05:13AM +0800, Ming Lei wrote:
Not really because aborted_gstate right now doesn't have any memory barrier around it, so nothing ensures blk_add_timer() actually appears before. We can either add the matching barriers in aborted_gstate update and when it's read in the normal completion path, or we can wait for the update to be visible everywhere by waiting for rcu grace period (because the reader is rcu protected).
Seems not necessary.
Suppose it is out of order: the only side-effect is that the newly recycled request is timed out a bit late, which I think we can survive, right?
It at least can mess up the timeout duration for the next recycle instance because there can be two competing blk_add_timer() instances. I'm not sure whether there can be other consequences. When ownership isn't clear, it becomes really difficult to reason about these things and can lead to subtle failures. I think it'd be best to always establish who owns what.
Thanks.
On Thu, Apr 12, 2018 at 06:57:12AM -0700, Tejun Heo wrote:
On Thu, Apr 12, 2018 at 07:05:13AM +0800, Ming Lei wrote:
Not really because aborted_gstate right now doesn't have any memory barrier around it, so nothing ensures blk_add_timer() actually appears before. We can either add the matching barriers in aborted_gstate update and when it's read in the normal completion path, or we can wait for the update to be visible everywhere by waiting for rcu grace period (because the reader is rcu protected).
Seems not necessary.
Suppose it is out of order: the only side-effect is that the newly recycled request is timed out a bit late, which I think we can survive, right?
It at least can mess up the timeout duration for the next recycle instance because there can be two competing blk_add_timer() instances. I'm not sure whether there can be other consequences. When ownership isn't clear, it becomes really difficult to reason about these things and can lead to subtle failures. I think it'd be best to always establish who owns what.
Please see the code of blk_add_timer() for blk-mq:
	blk_rq_set_deadline(req, jiffies + req->timeout);
	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;

	if (!timer_pending(&q->timeout) ||
	    time_before(expiry, q->timeout.expires))
		mod_timer(&q->timeout, expiry);
If this rq is recycled, blk_add_timer() only touches rq->deadline and the RQF_MQ_TIMEOUT_EXPIRED flag, so the only effect is that the timeout may be handled a bit late, but the timeout monitoring won't be lost.
And this shouldn't be difficult to avoid: as you mentioned, synchronize_rcu() can be added between blk_add_timer() and resetting the aborted gstate to close that window.
thanks, Ming
On Thu, 2018-04-12 at 04:55 +0800, Ming Lei wrote:
+again:
 	switch (ret) {
 	case BLK_EH_HANDLED:
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
[ ... ]
spin_lock_irqsave(req->q->queue_lock, flags);
if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
} else {
blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
ret = BLK_EH_HANDLED;
goto again;
}
spin_unlock_irqrestore(req->q->queue_lock, flags);
Does the above chunk introduce a backwards goto from inside a region around which a spinlock is held to outside that region? Can such a goto result in anything other than a deadlock?
Thanks,
Bart.
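For reference, a stripped-down sketch (plain C with pthreads; hypothetical names, not the actual patch) of the control flow being questioned, showing how the backwards goto lets the function return with the lock still held:

/*
 * The backwards goto jumps from inside the locked region back above it;
 * the EH_HANDLED case then breaks out of the switch, and the function
 * returns without ever releasing the lock.
 */
#include <pthread.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

enum eh { EH_HANDLED, EH_RESET_TIMER };

static void timed_out(enum eh ret, int completed_in_reset)
{
again:
	switch (ret) {
	case EH_HANDLED:
		/* complete the request */
		break;			/* leaves with queue_lock held! */
	case EH_RESET_TIMER:
		pthread_mutex_lock(&queue_lock);
		if (completed_in_reset) {
			ret = EH_HANDLED;
			goto again;	/* jumps out of the locked region */
		}
		pthread_mutex_unlock(&queue_lock);
		break;
	}
	/* returning here with queue_lock held blocks the next locker forever */
}

int main(void)
{
	timed_out(EH_RESET_TIMER, 1);	/* takes the goto path */
	/* any later pthread_mutex_lock(&queue_lock) would now never return */
	return 0;
}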
On Wed, Apr 11, 2018 at 10:49:51PM +0000, Bart Van Assche wrote:
On Thu, 2018-04-12 at 04:55 +0800, Ming Lei wrote:
+again:
 	switch (ret) {
 	case BLK_EH_HANDLED:
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
[ ... ]
spin_lock_irqsave(req->q->queue_lock, flags);
if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
} else {
blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
ret = BLK_EH_HANDLED;
goto again;
}
spin_unlock_irqrestore(req->q->queue_lock, flags);
Does the above chunk introduce a backwards goto from inside a region around which a spinlock is held to outside that region? Can such a goto result in anything other than a deadlock?
Yes, it is being fixed in my local V2, :-)