Bart Van Assche - 11.04.18, 14:50:
On Tue, 2018-04-10 at 14:54 -0700, tj@kernel.org wrote:
Ah, yeah, I was moving it out of add_timer but forgot to actually add it to the issue path. Fixed patch below.
BTW, no matter what we do w/ the request handover between normal and timeout paths, we'd need something similar. Otherwise, while we can reduce the window, we can't get rid of it.
(+Martin Steigerwald)
[…]
Thank you for having shared this patch. It looks interesting to me. What I know about the blk-mq timeout handling is as follows:
- Nobody who has reviewed the blk-mq timeout handling code with this
patch applied has reported any shortcomings for that code.
- However, several people have reported kernel crashes that disappear
when the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts memory upon timeout" (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html) and also to a "RIP: scsi_times_out+0x17" crash during boot (https://bugzilla.kernel.org/show_bug.cgi?id=199077).
Yes, with the three patches:
- '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a request.mbox'
- '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into rcu_read_{lock,unlock}().mbox'
- '[PATCH v4] blk-mq_Fix race conditions in request timeout handling.mbox'
the occasional hangs on some boots / resumes from hibernation appear to be gone.
Also it appears that the error loading SMART data issue is gone as well (see my bug report). However, it is still too early to say for sure. I think I need at least 2-3 days of additional testing with this kernel to be sufficiently sure about it.
However… I could also test another patch, but from reading the rest of this thread so far I have no clear idea whether to try one of the new patches and, if so, which one, and whether to add it on top of some of the patches I have already applied or to use it as a replacement for them.
So while doing a training this week and next I can apply a patch now and then, but I won't have much time to read the complete discussion to figure out what to apply.
Personally, as a stable kernel has been released with those issues, I think it's good to fix them soon. On the other hand, it may take quite some time until popular distros carry 4.16 for regular users. And I have no idea how frequent the reported issues are, i.e. how many users would be affected.
Thanks,