On Fri, Nov 30, 2018 at 09:22:03AM +0100, Greg KH wrote:
On Fri, Nov 30, 2018 at 09:40:19AM +1100, Dave Chinner wrote:
On Thu, Nov 29, 2018 at 01:47:56PM +0100, Greg KH wrote:
On Thu, Nov 29, 2018 at 11:14:59PM +1100, Dave Chinner wrote:
Cherry picking only one of the 50-odd patches we've committed into late 4.19 and 4.20 kernels to fix the problems we've found really seems like asking for trouble. If you're going to back port random data corruption fixes, then you need to spend a *lot* of time validating that it doesn't make things worse than they already are...
Any reason why we can't take the 50-odd patches in their entirety? It sounds like 4.19 isn't fully fixed, but 4.20-rc1 is? If so, what do you recommend we do to make 4.19 work properly?
You could pull all the fixes, but then you have a QA problem. Basically, we have multiple badly broken syscalls (FICLONERANGE, FIDEDUPERANGE and copy_file_range), and even 4.20-rc4 isn't fully fixed.
There were ~5 critical dedupe/clone data corruption fixes for XFS that went into 4.19-rc8.
Have any of those been tagged for stable?
None, because I have no confidence that the stable process will do the necessary QA to validate that such a significant backport is regression and data corruption free. The backport needs to be done as a complete series when we've finished the upstream work because we can't test isolated patches adequately because fsx will fall over due to all the unfixed problems and not exercise the fixes that were backported.
Further, we just had a regression reported in one of the commits that the autosel bot has selected for automatic backports. It has been uncovered by overlay which appears to do some unique things with the piece of crap that is do_splice_direct(). And Darrick just commented on #xfs that he's just noticed more bugs with FICLONERANGE and overlay.
IOWs, we're still finding broken stuff in this code and we are fixing it as fast as we can - we're still putting out fires. We most certainly don't need the added pressure of having you guys create more spot fires by breaking stable kernels with largely untested partial backports and having users exposed to whacky new data corruption issues.
So, no, it isn't tagged for stable kernels because "commit into mainline" != "this should be backported immediately". Backports of these fixes are going to be done largely as a function of time and resources, of which we have zero available right now. Doing backports right now is premature and ill-advised because we haven't finished finding and fixing all the bugs and regressions in this code.
Right now the XFS developers don't have the time or resources available to validate stable backports are correct and regression free because we are focussed on ensuring the upstream fixes we've already made (and are still writing) are solid and reliable.
Ok, that's fine, so users of XFS should wait until the 4.20 release before relying on it? :)
Ok, Greg, that's *out of line*.
I should throw the CoC at you because I find that comment offensive, condescending, belittling, denigrating and insulting. Your smug and superior "I know what is right for you" attitude is completely inappropriate, and a little smiley face does not make it acceptable.
If you think your comment is funny, you've badly misjudged how much effort I've put into this (100-hour weeks for over a month now), how close I'm flying to burn out (again!), and how pissed off I am about this whole scenario.
We ended up here because we *trusted* that other people had implemented and tested their APIs and code properly before it got merged. We've been severely burnt, and we've been left to clean up the mess made by other people by ourselves.
Instead of thanks, what we get is a "we know better" attitude and jokes implying our work is crap and we don't care about our users. That's just plain *insulting*. If anyone is looking for a demonstration of everything that is wrong with the Linux kernel development culture, then they don't need to look any further.
I understand your reluctance to want to backport anything, but it really feels like you are not even allowing for fixes that are "obviously right" to be backported either, even after they pass testing. Which isn't ok for your users.
It's worse for our users if we introduce regressions into stable kernels, which is exactly what this "obviously right" auto-backport would have done.
-Dave.