On Fri, Dec 07, 2018 at 04:33:10PM -0800, Laura Abbott wrote:
> The latest file system corruption issue (Nominally fixed by ffe81d45322c ("blk-mq: fix corruption with direct issue") later fixed by c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list")) brought a lot of rightfully concerned users asking about release schedules. 4.18 went EOL on Nov 21 and Fedora rebased to 4.19.3 on Nov 23. When the issue started getting visibility, users were left with the option of running known EOL 4.18.x kernels or running a 4.19 series that could corrupt their data. Admittedly, the risk of running the EOL kernel was pretty low given how recent it was, but it's still not a great look to tell people to run something marked EOL.
> I'm wondering if there's anything we can do to make things easier on kernel consumers. Bugs will certainly happen but it really makes it hard to push the "always run the latest stable" narrative if there isn't a good fallback when things go seriously wrong. I don't actually have a great proposal for a solution here other than retroactively bringing back 4.18 (which I don't think Greg would like) but I figured I should at least bring it up.
A nice step forward would have been if someone could have at least _told_ the stable maintainer (i.e. me) that there was such a serious bug out there. That didn't happen here, and I only found out about it by accident: at a totally random meeting last Wednesday I happened to talk to a developer who was on the bugzilla thread.
There was also not an email thread that I could find once I found out about the issue. By that time the bug was fixed and all I could do was wait for it to hit Linus's tree (and even then, I had to wait for the fix to the fix...) If I had known about it earlier, I would have reverted the change that caused this.
I would start by looking at how we at least notify people of major issues like this. Yes, it was complex and originally blamed on both btrfs and ext4 changes, and it was dependent on using a brand-new .config file which no kernel developers use (and it seems no distro uses either, which protected Fedora and others at the least!)
There will always be bugs and exceptions, and personally I think this one was a rare enough event that adding the requirement that I maintain more than one set of stable trees for longer isn't going to happen (yeah, I know you said you didn't expect that, but I know others mentioned it to me...)
So I don't know what to say here other than please tell me about major issues like this and don't rely on me getting lucky and hearing about it on my own.
thanks,
greg k-h