On Sat, Dec 08, 2018 at 12:18:53PM -0500, Theodore Y. Ts'o wrote:
On Sat, Dec 08, 2018 at 12:56:29PM +0100, Greg KH wrote:
A nice step forward would have been if someone could have at least _told_ the stable maintainer (i.e. me) that there was such a serious bug out there. That didn't happen here and I only found out about it accidentally by happening to talk to a developer who was on the bugzilla thread at a totally random meeting last Wednesday.
There was also not an email thread that I could find once I found out about the issue. By that time the bug was fixed and all I could do was wait for it to hit Linus's tree (and even then, I had to wait for the fix to the fix...) If I had known about it earlier, I would have reverted the change that caused this.
So to be fair, the window between when we *knew* which change required reverting and the fix actually being available was very narrow. For most of the 3-4 weeks when we were trying to track it down --- and the bug had been present in Linus's tree since 4.19-rc1(!) --- we had no idea exactly how big the problem was.
If you want to know about these sorts of things early --- at the moment I and others at $WORK have been trying to track down a problem on a 4.14.x kernel which has symptoms that look ***eerily*** similar to Bugzilla #201685. There was another bug causing mysterious file system corruptions that may also be related that was noticed on an Ubuntu 4.13.x kernel which forced another team to fall back to a 4.4 kernel. Both of these have caused file system corruptions that resulted in customer visible disruptions. Ming Lei has now said that there is a theoretical bug which he now believes might be present in blk-mq starting in 4.11.
To make life even more annoying, starting in 4.14.63, disabling blk-mq is no longer even an *option* for virtio-scsi thanks to commit b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity"), which was backported to 4.14 as of 70b522f163bbb32. We might try reverting that commit and then disabling blk-mq to see if it makes the problem go away. But the problem happens very rarely --- maybe once a week across a population of 2500 or so VMs, so it would take a long time before we could be certain that any change would fix it in the absence of a detailed root cause analysis or a clean repro that can be run in a test environment.
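As a rough illustration of why that takes so long: the sketch below assumes the corruptions arrive as a Poisson process at about one per week across the whole fleet (the Poisson model, the 95% threshold, and the 10% canary fraction are illustrative assumptions, not measurements from this thread). It computes how many consecutive failure-free weeks are needed before "the change did nothing" becomes unlikely.

```python
import math

def weeks_without_failure(rate_per_week, fleet_fraction=1.0, confidence=0.95):
    """Failure-free weeks needed to reject "still broken" at the given confidence.

    Assumes a Poisson process (an assumption for illustration): if the bug is
    still present, P(zero failures in t weeks) = exp(-rate * fraction * t).
    We solve for the t at which that probability drops to 1 - confidence.
    """
    effective_rate = rate_per_week * fleet_fraction
    return math.log(1.0 / (1.0 - confidence)) / effective_rate

# Change rolled out to the entire fleet (about 1 corruption per week).
full_fleet_weeks = weeks_without_failure(1.0)

# Change rolled out to a hypothetical 10% canary group only.
canary_weeks = weeks_without_failure(1.0, fleet_fraction=0.1)

print(f"full fleet: {full_fleet_weeks:.1f} weeks, 10% canary: {canary_weeks:.1f} weeks")
```

Under those assumptions, even a fleet-wide rollout needs about three quiet weeks, and a canary group a tenth of that size needs roughly ten times longer.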
So now you know --- but it's not clear it's going to be helpful. Commit b5b6e8c8d3b4 was fixing another bug, so reverting it isn't necessarily the right thing, especially since we can't yet prove it's the cause of the problem. It was "interesting" that we forced virtio-scsi to use blk-mq in the middle of a LTS kernel series, though.
Yes, this was all very helpful. Thank you for the information, I appreciate it.
And I will watch out for these issues now. It's a bit sad that these are showing up in 4.14, but it seems that distros are only now starting to really use that kernel version (or at least are only now starting to report things from it), as it is a year old. Oh well, can't do much about that, I am more worried about the 4.19 issues like Laura was talking about as that is the "canary" we need to watch out for more.
P.P.P.S. If I were king, I'd be asking for a huge number of kunit tests for block-mq to be developed, and then running them under a Thread Sanitizer.
Isn't that what xfs and fio are? Aren't we running this all the time and reporting those issues? How did this bug not show up on those tests, is it just because they didn't run long enough?
Because of those test suites, I was thinking that the block and filesystem paths were one of the more well-tested things we had at the moment, is this not true?
thanks,
greg k-h