On Sat, Mar 22, 2025 at 11:54 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
On Sat, 22 Mar 2025 at 05:17, Yafang Shao <laoar.shao@gmail.com> wrote:
At this point, XFS large folios appear to be unreliable in the 6.1.y stable kernel.
I suspect it's a bad idea to start using large folios on stable kernels.
It seems that way. Since the 6.1.y stable branch continues to enable XFS large folios after the page cache corruption issue was resolved, we considered it safe to keep the feature enabled. As a result, we did not revert the problematic commit after applying this patch series.
Even with the page cache corruption fix, 6.1 is old enough that I don't know what other fixes have happened since.
It's not like the large folio code has been _hugely_ problematic, but there have definitely been various small fixes related to it, and maybe some of them have missed stable.
So I think stable should revert the "turn on large folios" in general.
I will send a revert of commit 6795801366da ('xfs: Support large folios') to the 6.1.y stable branch.
That said:
We would appreciate any suggestions, such as adding debug messages to the kernel source code, to help us diagnose the root cause.
I think the first thing to do - if you can - is to make sure that a much more *current* kernel actually is ok.
Without a consistent reproducer it's going to be hard to really bisect things, but the first step should be to make sure it's not some new kind of issue that happens to be unique to what you do.
By "current" I don't necessarily mean "very latest" - 6.14 is going to be released this weekend - but certainly something much more recent than 6.1-stable.
Because while the stable trees obviously collect modern fixes, subtler issues can easily fall through if people don't realize how important a particular fix was. Sometimes the "obvious cleanup patches" end up fixing things unintentionally just by making the code more straightforward and correcting something in the process.
Without any real clues outside of "corruption", it's hard to even guess whether it's core MM or VFS code, or some XFS-specific thing. There has been large folio work in all three areas.
This issue is particularly challenging to diagnose because there are no warnings in the kernel log, and the kernel continues to function perfectly fine even after the application core dump occurs.
So I suspect that unless somebody has something in mind, "bisect it" to at least partially narrow it down would be the only thing to do. Bisecting to one particular commit is obviously the best scenario, but even narrowing it down to "the issue still happens in 6.12, but is gone in 6.13" might help give people more of a place to start looking.
Thank you for your suggestion. I will give it a try, though it might take some time since we haven’t yet found a reliable way to reproduce the issue.
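
For what it's worth, the coarse "still happens in 6.12, but is gone in 6.13" style of narrowing described above can be sketched as a binary search over an ordered list of kernel releases, repeating an unreliable reproducer several times at each step. This is only an illustration under stated assumptions: `first_good_version`, `reproduces`, the `trials` count, and the version list are all hypothetical names and values, not anything taken from our setup. It also assumes the oldest kernel in the list shows the corruption and the newest does not.

```python
from typing import Callable, Sequence


def first_good_version(
    versions: Sequence[str],
    reproduces: Callable[[str], bool],
    trials: int = 20,
) -> str:
    """Return the oldest version in which the issue no longer shows up.

    `versions` is ordered oldest to newest. versions[0] is assumed to be
    bad and versions[-1] is assumed to be good; neither endpoint is tested.
    `reproduces(v)` is a hypothetical callback that boots/tests kernel `v`
    once and returns True if the corruption was observed.
    """
    if len(versions) < 2:
        raise ValueError("need at least one bad and one good version")

    def is_bad(version: str) -> bool:
        # One hit is enough to call a version bad. A version is only called
        # good after `trials` clean runs, so "good" is a probabilistic verdict.
        return any(reproduces(version) for _ in range(trials))

    lo, hi = 0, len(versions) - 1  # lo: known bad, hi: known good
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            lo = mid
        else:
            hi = mid
    return versions[hi]


if __name__ == "__main__":
    # Made-up example data: pretend the issue exists up to and including 6.12.
    releases = ["6.1", "6.6", "6.12", "6.13", "6.14"]
    broken = {"6.1", "6.6", "6.12"}
    print(first_good_version(releases, lambda v: v in broken, trials=3))
```

With a reproducer that only fires occasionally, `trials` has to be large enough that a run of clean results is actually convincing; otherwise a bad kernel can be misjudged as good and the search converges on the wrong release.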