On Thu, Nov 9, 2017 at 11:51 AM, Patrick McLean chutzpah@gentoo.org wrote:
We do have CONFIG_GCC_PLUGIN_STRUCTLEAK and CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL enabled on these boxes as well as CONFIG_GCC_PLUGIN_RANDSTRUCT as you pointed out before.
It might be worth just verifying without RANDSTRUCT in particular.
That case has probably not gotten a huge amount of testing. As Al points out, it can cause absolutely horrendous cache access pattern changes, but it might also be triggering some corruption in case there's a problem with the plugin, or with some piece of kernel code that gets confused by it.
And most obviously: if there is some module or part of the kernel that got compiled with a different seed for the randstruct hashing, that will break in nasty nasty ways. Your out-of-kernel module is the obvious suspect for something like that, but honestly, it could be some missing build dependency, or simply a missing special case in the plugin itself a missing __no_randomize_layout or any number of things.
We've hit gcc bugs many times before - and the plugins are just new opportunities to hit cases that have gotten a lot less testing than the "normal" code flow has.
The structleak plugin is much less likely to be a problem (simply because it's a much simpler plugin), but hey, something being NULL when it shouldn't possibly be might be a stray "leak initialization".
So since you seem to be able to reproduce this _reasonably_ easily, it's definitely worth checking that it still reproduces even without the gcc plugins.
Just to narrow it down a bit.
Linus