On Sat, Dec 08, 2018 at 12:56:29PM +0100, Greg KH wrote:
> A nice step forward would have been if someone could have at least _told_ the stable maintainer (i.e. me) that there was such a serious bug out there. That didn't happen here and I only found out about it accidentally by happening to talk to a developer who was on the bugzilla thread at a totally random meeting last Wednesday.
> There was also not an email thread that I could find once I found out about the issue. By that time the bug was fixed and all I could do was wait for it to hit Linus's tree (and even then, I had to wait for the fix to the fix...) If I had known about it earlier, I would have reverted the change that caused this.
So to be fair, the window between when we *knew* which change needed reverting and when the fix actually became available was very narrow. For most of the 3-4 weeks when we were trying to track it down --- and the bug had been present in Linus's tree since 4.19-rc1(!) --- we had no idea exactly how big the problem was.
If you want to know about these sorts of things early --- at the moment I and others at $WORK have been trying to track down a problem on a 4.14.x kernel which has symptoms that look ***eerily*** similar to Bugzilla #201685. There was another bug causing mysterious file system corruptions, possibly related, that was noticed on an Ubuntu 4.13.x kernel and forced another team to fall back to a 4.4 kernel. Both of these have caused file system corruptions that resulted in customer visible disruptions. Ming Lei has now said that there is a theoretical bug which he believes might be present in blk-mq starting in 4.11.
To make life even more annoying, starting in 4.14.63, disabling blk-mq is no longer even an *option* for virtio-scsi thanks to commit b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity"), which was backported to 4.14 as of 70b522f163bbb32. We might try reverting that commit and then disabling blk-mq to see if it makes the problem go away. But the problem happens very rarely --- maybe once a week across a population of 2500 or so VMs --- so it would take a long time before we could be certain that any change had fixed it, in the absence of a detailed root cause analysis or a clean repro that can be run in a test environment.
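As a rough back-of-envelope for why a rare failure takes so long to rule out, here is a minimal sketch. It assumes failures arrive as a Poisson process at the baseline rate of about one per week across the whole fleet. The Poisson model, the function name, the confidence levels, and the fleet fractions are all assumptions for illustration, not measurements. It computes how many failure-free weeks you would need before "no failures" is unlikely to be plain luck.

```python
import math


def weeks_without_failure(baseline_per_week, confidence, fleet_fraction=1.0):
    """Failure-free weeks needed before 'nothing changed' becomes implausible.

    Illustrative assumption: failures follow a Poisson process. If the change
    did nothing, P(zero failures in t weeks) = exp(-rate * t), so we solve
    exp(-rate * t) <= 1 - confidence for t.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if baseline_per_week <= 0.0 or not 0.0 < fleet_fraction <= 1.0:
        raise ValueError("rate must be positive and fleet_fraction in (0, 1]")
    # Only the VMs running the candidate change contribute observations.
    rate = baseline_per_week * fleet_fraction
    return -math.log(1.0 - confidence) / rate


if __name__ == "__main__":
    for confidence in (0.95, 0.99):
        for fraction in (1.0, 0.1):
            weeks = weeks_without_failure(1.0, confidence, fraction)
            print(f"confidence={confidence:.2f} fleet={fraction:.0%}: "
                  f"{weeks:.1f} failure-free weeks")
```

Under these assumptions, even switching the entire fleet at once needs roughly three quiet weeks for 95% confidence, and trying the change on only a tenth of the VMs stretches that by a factor of ten.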
So now you know --- but it's not clear it's going to be helpful. Commit b5b6e8c8d3b4 was fixing another bug, so reverting it isn't necessarily the right thing, especially since we can't yet prove it's the cause of the problem. It was "interesting" that we forced virtio-scsi to use blk-mq in the middle of an LTS kernel series, though.
> I would start by looking at how we at least notify people of major issues like this. Yes it was complex and originally blamed on both btrfs and ext4 changes, and it was dependent on using a brand-new .config file which no kernel developers use (and it seems no distro uses either, which protected Fedora and others at the least!)
Ubuntu's bleeding edge kernel uses that config, so that's where we got a lot of the initial reports of bug #201685. At first it wasn't even obvious whether it was a kernel<->userspace versioning issue (a la the dm userspace gotcha a month or two ago). And I never even heard that btrfs was being blamed. That was probably on a different thread that I didn't see? I wish I had, since for the first 2-3 weeks all of the reports I saw were from ext4 users. And because it was so easy to get false negative and false positive reports, one user bisected it to a change in the middle of the RCU pull in 4.19-rc1, and another claimed that after reverting all ext4 changes between 4.18 and 4.19, the problem went away. Both conclusions, ultimately, were false of course.
So before we had a root cause and a clean reproduction that *developers* could actually use, if you had seen the early reports, would you have wanted to revert the RCU pull for the 4.19 merge window? Or the ext4 pull? Unfortunately, there are no easy solutions here.
> There will always be bugs and exceptions and personally I think that the rarity of this one was such that it is a rare event and adding the requirement that I have to maintain more than one set of stable trees for longer isn't going to happen (yeah, I know you said you didn't expect that, but I know others mentioned it to me...)
> So I don't know what to say here other than please tell me about major issues like this and don't rely on me getting lucky and hearing about it on my own.
Well, now you know about one of the issues that I'm trying to debug. It's not at all clear how actionable that information happens to be, though. I didn't bug you about it for that reason.
- Ted
P.S. The fact that Jens is planning on ripping out the legacy block I/O path in 4.21, and forcing everyone to use blk-mq, is not filling me with a lot of joy and gladness. I understand why he's doing it; maintaining two code paths is not easy. But apparently there was another discard bug recently that would have been found if blktests were being run more frequently by developers, so I'm not feeling very trusting of the block layer at the moment, especially since people invariably blame the file system code first.
P.P.S. Sorry if it sounds like I'm grumpy; it's probably because I am.
P.P.P.S. If I were king, I'd be asking for a huge number of kunit tests for blk-mq to be developed, and then running them under a Thread Sanitizer.