On Mon, Aug 17, 2020 at 12:15:39PM +0200, Christoph Hellwig wrote:
On Mon, Aug 17, 2020 at 06:01:15PM +0800, Ming Lei wrote:
The SCHED_RESTART code path is relied on to re-run the queue for dispatching requests in hctx->dispatch. Meanwhile, the SCHED_RESTART flag is checked when adding requests to hctx->dispatch.
Memory barriers have to be used to order the following two pairs of operations:
- adding requests to hctx->dispatch and checking SCHED_RESTART in
blk_mq_dispatch_rq_list()
- clearing SCHED_RESTART and checking if there is request in hctx->dispatch
in blk_mq_sched_restart().
Without the added memory barriers, either:
- blk_mq_sched_restart() may miss requests added to hctx->dispatch, while
blk_mq_dispatch_rq_list() observes SCHED_RESTART and does not run the queue on the dispatch side
or
- blk_mq_dispatch_rq_list() still sees SCHED_RESTART and does not run the queue
on the dispatch side, while the check for requests in hctx->dispatch in blk_mq_sched_restart() misses them.
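The two pairs of operations above form the classic "store buffering" pattern: each side writes one location and then reads the other. As a rough illustration only, here is a minimal user-space sketch in C11 atomics. It is not the kernel patch: struct hctx_model, hctx_model_init(), dispatch_side() and restart_side() are hypothetical names, a counter stands in for the hctx->dispatch list, and atomic_thread_fence(memory_order_seq_cst) is assumed to play the role of the kernel's smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two pieces of hctx state discussed above. */
struct hctx_model {
	atomic_int  dispatch_count;  /* models "requests in hctx->dispatch" */
	atomic_bool sched_restart;   /* models the SCHED_RESTART flag */
};

/* Start with an empty dispatch list and SCHED_RESTART already set. */
static void hctx_model_init(struct hctx_model *h)
{
	atomic_init(&h->dispatch_count, 0);
	atomic_init(&h->sched_restart, true);
}

/*
 * Dispatch side: add a request, then check the flag.
 * Returns true if this side has to run the queue itself
 * (nobody is going to restart it on our behalf).
 */
static bool dispatch_side(struct hctx_model *h)
{
	atomic_fetch_add_explicit(&h->dispatch_count, 1, memory_order_relaxed);

	/* Full barrier: order the store above against the load below. */
	atomic_thread_fence(memory_order_seq_cst);

	return !atomic_load_explicit(&h->sched_restart, memory_order_relaxed);
}

/*
 * Restart side: clear the flag, then check for pending requests.
 * Returns true if it found requests and would run the queue.
 */
static bool restart_side(struct hctx_model *h)
{
	if (!atomic_load_explicit(&h->sched_restart, memory_order_relaxed))
		return false;

	atomic_store_explicit(&h->sched_restart, false, memory_order_relaxed);

	/* Full barrier: pairs with the fence in dispatch_side(). */
	atomic_thread_fence(memory_order_seq_cst);

	return atomic_load_explicit(&h->dispatch_count, memory_order_relaxed) > 0;
}
```

With both fences in place, the two sides cannot both read the stale value when run concurrently, so at least one of them returns true and the queue gets run. If either fence is dropped, the store and the following load may be reordered, both sides may return false, and the request stays stuck in the dispatch list, which would match the hang described here.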
An IO hang in the ltp/fs_fill test was reported by the kernel test robot:
https://lkml.org/lkml/2020/7/26/77
It turns out to be caused by the out-of-order operations described above, and the IO hang can no longer be observed after applying this patch.
Cc: Bart Van Assche bvanassche@acm.org
Cc: Christoph Hellwig hch@lst.de
Cc: David Jeffery djeffery@redhat.com
Reported-by: kernel test robot rong.a.chen@intel.com
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei ming.lei@redhat.com
Can you add a Fixes: tag so that the commit gets backported?
Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Thanks,
Ming