The latest file system corruption issue (nominally fixed by ffe81d45322c ("blk-mq: fix corruption with direct issue"), and later by c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list")) brought a lot of rightfully concerned users asking about release schedules. 4.18 went EOL on Nov 21 and Fedora rebased to 4.19.3 on Nov 23. When the issue started getting visibility, users were left with the choice of running known-EOL 4.18.x kernels or running a 4.19 series that could corrupt their data. Admittedly, the risk of running the EOL kernel was pretty low given how recent it was, but it's still not a great look to tell people to run something marked EOL.
I'm wondering if there's anything we can do to make things easier on kernel consumers. Bugs will certainly happen, but it really makes it hard to push the "always run the latest stable" narrative if there isn't a good fallback when things go seriously wrong. I don't actually have a great proposal for a solution here other than retroactively bringing back 4.18 (which I don't think Greg would like), but I figured I should at least bring it up.
Thanks, Laura
Hi Laura,
On Fri, Dec 07, 2018 at 04:33:10PM -0800, Laura Abbott wrote:
> The latest file system corruption issue (nominally fixed by ffe81d45322c ("blk-mq: fix corruption with direct issue"), and later by c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list")) brought a lot of rightfully concerned users asking about release schedules. 4.18 went EOL on Nov 21 and Fedora rebased to 4.19.3 on Nov 23. When the issue started getting visibility, users were left with the choice of running known-EOL 4.18.x kernels or running a 4.19 series that could corrupt their data. Admittedly, the risk of running the EOL kernel was pretty low given how recent it was, but it's still not a great look to tell people to run something marked EOL.
> I'm wondering if there's anything we can do to make things easier on kernel consumers. Bugs will certainly happen, but it really makes it hard to push the "always run the latest stable" narrative if there isn't a good fallback when things go seriously wrong. I don't actually have a great proposal for a solution here other than retroactively bringing back 4.18 (which I don't think Greg would like), but I figured I should at least bring it up.
This type of problem may happen once in a while but fortunately is extremely rare, so I guess it can be addressed with unusual methods.
For my use cases, I always make sure that the last two LTS branches work fine. Since there's some good maintenance overlap between LTS branches, I can quickly switch to 4.14.x (or even 4.9.x) if this happens. In our products we make sure that our toolchain is built with support for the previous kernel as well, "just in case". We've never switched back and probably never will, but at least it helps us a lot when comparing strange behaviours between two kernels.
I think that if your distro is functionally and technically compatible with the previous LTS branch, it could be an acceptable escape for users who are concerned about their data and their security at the same time. After all, previous LTS branches are there for those who can't upgrade. In my opinion this situation perfectly qualifies.
But it requires some preparation, as I mentioned. It might be that some components in the distro rely on features from the very latest kernels. At the very least it would deserve a bit of inspection to find out whether such dependencies exist, and/or what would be lost in such a fallback, so that users can be warned.
Just my two cents, Willy
On Fri, Dec 07, 2018 at 04:33:10PM -0800, Laura Abbott wrote:
> The latest file system corruption issue (nominally fixed by ffe81d45322c ("blk-mq: fix corruption with direct issue"), and later by c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list")) brought a lot of rightfully concerned users asking about release schedules. 4.18 went EOL on Nov 21 and Fedora rebased to 4.19.3 on Nov 23. When the issue started getting visibility, users were left with the choice of running known-EOL 4.18.x kernels or running a 4.19 series that could corrupt their data. Admittedly, the risk of running the EOL kernel was pretty low given how recent it was, but it's still not a great look to tell people to run something marked EOL.
> I'm wondering if there's anything we can do to make things easier on kernel consumers. Bugs will certainly happen, but it really makes it hard to push the "always run the latest stable" narrative if there isn't a good fallback when things go seriously wrong. I don't actually have a great proposal for a solution here other than retroactively bringing back 4.18 (which I don't think Greg would like), but I figured I should at least bring it up.
A nice step forward would have been if someone could have at least _told_ the stable maintainer (i.e. me) that there was such a serious bug out there. That didn't happen here and I only found out about it accidentally by happening to talk to a developer who was on the bugzilla thread at a totally random meeting last Wednesday.
There was also not an email thread that I could find once I found out about the issue. By that time the bug was fixed and all I could do was wait for it to hit Linus's tree (and even then, I had to wait for the fix to the fix...) If I had known about it earlier, I would have reverted the change that caused this.
I would start by looking at how we at least notify people of major issues like this. Yes, it was complex and originally blamed on both btrfs and ext4 changes, and it was dependent on using a brand-new .config file which no kernel developers use (and it seems no distro uses either, which protected Fedora and others, at least!)
There will always be bugs and exceptions, and personally I think this one was rare enough that adding a requirement for me to maintain more than one set of stable trees for longer isn't going to happen (yeah, I know you said you didn't expect that, but I know others mentioned it to me...)
So I don't know what to say here other than please tell me about major issues like this and don't rely on me getting lucky and hearing about it on my own.
thanks,
greg k-h
On Sat, Dec 08, 2018 at 12:56:29PM +0100, Greg KH wrote:
> A nice step forward would have been if someone could have at least _told_ the stable maintainer (i.e. me) that there was such a serious bug out there. That didn't happen here and I only found out about it accidentally by happening to talk to a developer who was on the bugzilla thread at a totally random meeting last Wednesday.
> There was also not an email thread that I could find once I found out about the issue. By that time the bug was fixed and all I could do was wait for it to hit Linus's tree (and even then, I had to wait for the fix to the fix...) If I had known about it earlier, I would have reverted the change that caused this.
So to be fair, the window between when we *knew* which change needed to be reverted and when the fix was actually available was very narrow. For most of the 3-4 weeks when we were trying to track it down --- and the bug had been present in Linus's tree since 4.19-rc1(!) --- we had no idea exactly how big the problem was.
If you want to know about these sorts of things early: at the moment I and others at $WORK have been trying to track down a problem on a 4.14.x kernel which has symptoms that look ***eerily*** similar to Bugzilla #201685. There was another bug causing mysterious file system corruptions, possibly related, that was noticed on an Ubuntu 4.13.x kernel and forced another team to fall back to a 4.4 kernel. Both of these have caused file system corruptions that resulted in customer-visible disruptions. Ming Lei has now said that there is a theoretical bug which he believes might be present in blk-mq starting in 4.11.
To make life even more annoying, starting in 4.14.63, disabling blk-mq is no longer even an *option* for virtio-scsi thanks to commit b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity"), which was backported to 4.14 as of 70b522f163bbb32. We might try reverting that commit and then disabling blk-mq to see if it makes the problem go away. But the problem happens very rarely --- maybe once a week across a population of 2500 or so VMs --- so it would take a long time before we could be certain that any change would fix it, in the absence of a detailed root cause analysis or a clean repro that can be run in a test environment.
So now you know --- but it's not clear it's going to be helpful. Commit b5b6e8c8d3b4 was fixing another bug, so reverting it isn't necessarily the right thing, especially since we can't yet prove it's the cause of the problem. It was "interesting" that we forced virtio-scsi to use blk-mq in the middle of an LTS kernel series, though.
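For completeness, here is the kind of quick userspace check I have in mind for telling whether a given disk is really on blk-mq regardless of what scsi_mod.use_blk_mq claims. This is only a minimal sketch; the sysfs paths are what I'd expect on kernels of this era, so treat them as assumptions rather than something I've verified on every version:

#include <stdio.h>
#include <unistd.h>

/*
 * Print the scsi_mod.use_blk_mq setting and whether the named disk
 * (default "sda") has a blk-mq sysfs directory, i.e. is being driven
 * through blk-mq no matter what the module parameter says.
 */
int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "sda";
	char path[256];
	int flag = '?';
	FILE *f = fopen("/sys/module/scsi_mod/parameters/use_blk_mq", "r");

	if (f) {
		flag = fgetc(f);
		fclose(f);
	}
	printf("scsi_mod.use_blk_mq = %c\n", flag);

	snprintf(path, sizeof(path), "/sys/block/%s/mq", dev);
	printf("%s is%s using blk-mq\n", dev,
	       access(path, F_OK) == 0 ? "" : " not");
	return 0;
}

Nothing fancy, but it avoids arguing from the module parameter alone.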
> I would start by looking at how we at least notify people of major issues like this. Yes, it was complex and originally blamed on both btrfs and ext4 changes, and it was dependent on using a brand-new .config file which no kernel developers use (and it seems no distro uses either, which protected Fedora and others, at least!)
Ubuntu's bleeding-edge kernel uses the config, so that's where we got a lot of reports of bug #201685 initially. At first it wasn't even obvious whether it was a kernel<->userspace versioning issue (a la the dm userspace gotcha a month or two ago). And I never even heard that btrfs was being blamed. That was probably on a different thread that I didn't see? I wish I had, since for the first 2-3 weeks all of the reports I saw were from ext4 users, and because it was so easy to get false negative and false positive reports, one user bisected it to a change in the middle of the RCU pull in 4.19-rc1, and another claimed that after reverting all ext4 changes between 4.18 and 4.19 the problem went away. Both conclusions, of course, ultimately turned out to be false.
So before we had a root cause, and a clean reproduction that *developers* could actually use, if you had seen the early reports, would you have wanted to revert the RCU pull for the 4.19 merge window? Or the ext4 pull? Unfortunately, there are no easy solutions here.
> There will always be bugs and exceptions, and personally I think this one was rare enough that adding a requirement for me to maintain more than one set of stable trees for longer isn't going to happen (yeah, I know you said you didn't expect that, but I know others mentioned it to me...)
> So I don't know what to say here other than please tell me about major issues like this and don't rely on me getting lucky and hearing about it on my own.
Well, now you know about one of the issues that I'm trying to debug. It's not at all clear how actionable that information happens to be, though. I didn't bug you about it for that reason.
- Ted
P.S. The fact that Jens is planning on ripping out the legacy block I/O path in 4.21, and forcing everyone to use blk-mq, is not filling me with a lot of joy and gladness. I understand why he's doing it; maintaining two code paths is not easy. But apparently there was another discard bug recently that would have been found if blktests were being run more frequently by developers, so I'm not feeling very trusting of the block layer at the moment, especially since people invariably blame the file system code first.
P.P.S. Sorry if it sounds like I'm grumpy; it's probably because I am.
P.P.P.S. If I were king, I'd be asking for a huge number of kunit tests for blk-mq to be developed, and then for them to be run under a thread sanitizer.
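To make that concrete, here is the sort of skeleton I'm imagining. It is only a sketch of a KUnit-style test module, assuming the standard kunit/test.h API; the suite and case names are made up, and the body is a placeholder for tests that would actually exercise a small tag set, queue setup, and out-of-order completions:

#include <kunit/test.h>
#include <linux/module.h>

/* Placeholder test body: a real test would build a small blk-mq tag
 * set and queue, drive requests through it, and check that nothing is
 * reordered or lost while a thread sanitizer watches the locking. */
static void blk_mq_direct_issue_test(struct kunit *test)
{
	KUNIT_EXPECT_EQ(test, 1 + 1, 2);
}

static struct kunit_case blk_mq_test_cases[] = {
	KUNIT_CASE(blk_mq_direct_issue_test),
	{}
};

static struct kunit_suite blk_mq_test_suite = {
	.name = "blk-mq-example",
	.test_cases = blk_mq_test_cases,
};
kunit_test_suite(blk_mq_test_suite);

MODULE_LICENSE("GPL");

The test bodies are obviously the hard part; my point is that the harness side is cheap.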
On Sat, Dec 08, 2018 at 12:18:53PM -0500, Theodore Y. Ts'o wrote:
> On Sat, Dec 08, 2018 at 12:56:29PM +0100, Greg KH wrote:
> > A nice step forward would have been if someone could have at least _told_ the stable maintainer (i.e. me) that there was such a serious bug out there. That didn't happen here and I only found out about it accidentally by happening to talk to a developer who was on the bugzilla thread at a totally random meeting last Wednesday.
> > There was also not an email thread that I could find once I found out about the issue. By that time the bug was fixed and all I could do was wait for it to hit Linus's tree (and even then, I had to wait for the fix to the fix...) If I had known about it earlier, I would have reverted the change that caused this.
> So to be fair, the window between when we *knew* which change needed to be reverted and when the fix was actually available was very narrow. For most of the 3-4 weeks when we were trying to track it down --- and the bug had been present in Linus's tree since 4.19-rc1(!) --- we had no idea exactly how big the problem was.
> If you want to know about these sorts of things early: at the moment I and others at $WORK have been trying to track down a problem on a 4.14.x kernel which has symptoms that look ***eerily*** similar to Bugzilla #201685. There was another bug causing mysterious file system corruptions, possibly related, that was noticed on an Ubuntu 4.13.x kernel and forced another team to fall back to a 4.4 kernel. Both of these have caused file system corruptions that resulted in customer-visible disruptions. Ming Lei has now said that there is a theoretical bug which he believes might be present in blk-mq starting in 4.11.
> To make life even more annoying, starting in 4.14.63, disabling blk-mq is no longer even an *option* for virtio-scsi thanks to commit b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity"), which was backported to 4.14 as of 70b522f163bbb32. We might try reverting that commit and then disabling blk-mq to see if it makes the problem go away. But the problem happens very rarely --- maybe once a week across a population of 2500 or so VMs --- so it would take a long time before we could be certain that any change would fix it, in the absence of a detailed root cause analysis or a clean repro that can be run in a test environment.
> So now you know --- but it's not clear it's going to be helpful. Commit b5b6e8c8d3b4 was fixing another bug, so reverting it isn't necessarily the right thing, especially since we can't yet prove it's the cause of the problem. It was "interesting" that we forced virtio-scsi to use blk-mq in the middle of an LTS kernel series, though.
Yes, this was all very helpful; thank you for the information, I appreciate it.
And I will watch out for these issues now. It's a bit sad that these are showing up in 4.14, but it seems that distros are only now starting to really use that kernel version (or at least are only now starting to report things from it), as it is a year old. Oh well, we can't do much about that. I am more worried about the 4.19 issues like the ones Laura was talking about, as that is the "canary" we need to watch out for more.
> P.P.P.S. If I were king, I'd be asking for a huge number of kunit tests for blk-mq to be developed, and then for them to be run under a thread sanitizer.
Isn't that what xfstests and fio are? Aren't we running those all the time and reporting the issues they find? How did this bug not show up in those tests? Is it just because they didn't run long enough?
Because of those test suites, I was thinking that the block and filesystem paths were among the better-tested things we have at the moment. Is this not true?
thanks,
greg k-h