On Tue, 2018-04-10 at 14:54 -0700, tj@kernel.org wrote:
Ah, yeah, I was moving it out of add_timer but forgot to actually add it to the issue path. Fixed patch below.
BTW, no matter what we do w/ the request handover between normal and timeout paths, we'd need something similar. Otherwise, while we can reduce the window, we can't get rid of it.
(+Martin Steigerwald)
Hello Tejun,
Thank you for having shared this patch. It looks interesting to me. What I know about the blk-mq timeout handling is as follows:
* Nobody who has reviewed the blk-mq timeout handling code with this patch applied has reported any shortcomings for that code.
* However, several people have reported kernel crashes that disappear when the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts memory upon timeout" (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html) and also to a "RIP: scsi_times_out+0x17" crash during boot (https://bugzilla.kernel.org/show_bug.cgi?id=199077).
So we have the choice between two approaches:
(1) Apply the patch from your previous e-mail and root-cause and fix the crashes referred to above.
(2) Apply a patch that makes the crashes reported against v4.16 disappear and remove the atomic instructions introduced by such a patch at a later time.
Since crashes have been reported for kernel v4.16, I think we should follow approach (2). That will remove the time pressure from root-causing and fixing the crashes reported for the NVMeOF initiator and SCSI initiator drivers.
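For readers unfamiliar with what "atomic instructions" buys in approach (2), the following is a minimal userspace sketch, not taken from either patch and not the kernel's code. It assumes a single atomic state word per request; the names request_sketch, req_start, req_try_complete and req_try_timeout are hypothetical. The idea is that the completion path and the timeout path both try to move the request out of the in-flight state with a compare-and-swap, so only one of them can win ownership.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request states; these are not the kernel's definitions. */
enum req_state { REQ_IDLE, REQ_IN_FLIGHT, REQ_COMPLETE, REQ_TIMED_OUT };

/* Hypothetical stand-in for a request: one atomic state word. */
struct request_sketch {
	_Atomic int state;
};

/* Issue path: mark the request as in flight. */
static void req_start(struct request_sketch *rq)
{
	atomic_store(&rq->state, REQ_IN_FLIGHT);
}

/*
 * Try to take ownership of an in-flight request. The compare-and-swap
 * succeeds for at most one caller, so the completion and timeout paths
 * cannot both believe they own the request.
 */
static bool req_claim(struct request_sketch *rq, enum req_state new_state)
{
	int expected = REQ_IN_FLIGHT;

	return atomic_compare_exchange_strong(&rq->state, &expected, new_state);
}

/* Normal completion path. */
static bool req_try_complete(struct request_sketch *rq)
{
	return req_claim(rq, REQ_COMPLETE);
}

/* Timeout path. */
static bool req_try_timeout(struct request_sketch *rq)
{
	return req_claim(rq, REQ_TIMED_OUT);
}
```

The cost is one atomic read-modify-write per request in the completion path, which is the overhead that approach (2) proposes to accept now and remove later.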
Thanks,
Bart.