On 2017-11-09 12:04 PM, Linus Torvalds wrote:
On Thu, Nov 9, 2017 at 11:51 AM, Patrick McLean <chutzpah@gentoo.org> wrote:
We do have CONFIG_GCC_PLUGIN_STRUCTLEAK and CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL enabled on these boxes as well as CONFIG_GCC_PLUGIN_RANDSTRUCT as you pointed out before.
It might be worth just verifying without RANDSTRUCT in particular.
And most obviously: if there is some module or part of the kernel that got compiled with a different seed for the randstruct hashing, that will break in nasty nasty ways. Your out-of-kernel module is the obvious suspect for something like that, but honestly, it could be some missing build dependency, or simply a missing special case in the plugin itself, a missing __no_randomize_layout, or any number of things.
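To illustrate the seed-mismatch failure mode described above, here is a minimal C sketch. It does not use the gcc plugin at all: the two hand-written struct definitions merely stand in for what a single source definition might look like after being shuffled under two different randstruct seeds. All names (conn_seed_a, conn_seed_b, init_conn, module_read_refcount, layouts_agree) are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout as seen by code built with "seed A". */
struct conn_seed_a {
    unsigned long refcount;
    unsigned long flags;
    unsigned long state;
};

/* Same logical fields as seen by code built with "seed B":
 * identical members, shuffled into a different order. */
struct conn_seed_b {
    unsigned long state;
    unsigned long refcount;
    unsigned long flags;
};

/* "Kernel" side: fills in the object using layout A. */
void init_conn(struct conn_seed_a *c)
{
    assert(c != NULL);
    c->refcount = 1;
    c->flags = 0xdeadUL;
    c->state = 7;
}

/* "Module" side: reads refcount at the offset layout B expects.
 * memcpy is used so the mismatched read is well-defined C. */
unsigned long module_read_refcount(const void *obj)
{
    unsigned long v;
    memcpy(&v, (const char *)obj + offsetof(struct conn_seed_b, refcount),
           sizeof v);
    return v;
}

/* Returns 1 only if both builds agree on where refcount lives. */
int layouts_agree(void)
{
    return offsetof(struct conn_seed_a, refcount) ==
           offsetof(struct conn_seed_b, refcount);
}
```

In this sketch, an object initialized through layout A and then read through layout B's offset yields the flags value instead of the refcount, with no compiler warning on either side. That is the kind of silent corruption a mismatched seed could produce, and it need not show up in the code that was built with the wrong seed.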
We will check our fork against the in-kernel cp210x driver to make sure we didn't miss anything, but it seems odd we would be hitting the issue so consistently in the NFS code path, rather than somewhere in the USB, serial, or GPIO paths.
So since you seem to be able to reproduce this _reasonably_ easily, it's definitely worth checking that it still reproduces even without the gcc plugins.
I haven't been able to reproduce it with RANDSTRUCT disabled (and structleak enabled). I will keep trying for a little while more, but evidence seems to be pointing to that.
Something must have changed since 4.13.8 to trigger this though. This did not crop up at all until we tried 4.13.11, where we saw it pretty quickly. We have a pretty large number of machines running 4.13.6 with RANDSTRUCT enabled and running the same workload with many more clients, and have not seen this bug at all.