On Fri, Dec 29, 2017 at 9:32 AM, Dave Hansen dave.hansen@intel.com wrote:
From the various oopses, it looks like this happens when getting a double fault while trying to go idle. The CPU is probably trying to return from the double fault, but it didn't do anything useful in the fault handler, so it just continues faulting. The NMI watchdog can still get an oops out of it, though.
Hmm. Which oops are you looking at? The ones I see in the bugzilla don't seem to have anything interesting in them.
[ Oh. I think I see the one you think of in the gentoo bug report ]
There do seem to be a lot of odd double faults that don't make progress.
And that in turn indicates that it may be about ESPFIX64 - all other double fault cases should cause a fault printout, but ESPFIX64 has a magical silent "turn double fault into a fake #GP fault".
Maybe that one triggers over and over again?
Couple more things:
MCORE2 seems to get one oddball compiler flag (-march=core2):
cflags-$(CONFIG_MCORE2) += \
        $(call cc-option,-march=core2,$(call cc-option,-mtune=generic))
It would be interesting to see if replacing the above "$(call" with:
$(call cc-option,-mtune=generic)
makes the problem go away the same way as changing the .config option.
Definitely.
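To spell out what that nested cc-option line ends up choosing, here is a rough Python model of it. This is not kernel code: the names cc_option, compiler_accepts, mcore2_cflags and proposed_cflags are made up for illustration, and the probe command is only an approximation of what kbuild's cc-option does (try the flag on an empty C file, use the fallback if the compiler rejects it).

```python
import subprocess
import tempfile
from typing import Callable


def compiler_accepts(flag: str, cc: str = "gcc") -> bool:
    """Return True if `cc` compiles an empty C file with `flag`.

    Assumption: this approximates the kbuild cc-option probe; the real
    macro passes additional kernel flags that are left out here.
    """
    with tempfile.TemporaryDirectory() as tmp:
        cmd = [cc, "-Werror", flag, "-c", "-x", "c", "/dev/null",
               "-o", f"{tmp}/probe.o"]
        try:
            result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL)
        except OSError:
            # Compiler not installed or not executable.
            return False
        return result.returncode == 0


def cc_option(flag: str, fallback: str = "",
              accepts: Callable[[str], bool] = compiler_accepts) -> str:
    """Model of $(call cc-option,flag,fallback)."""
    return flag if accepts(flag) else fallback


def mcore2_cflags(accepts: Callable[[str], bool] = compiler_accepts) -> str:
    """Current MCORE2 line: -march=core2, else -mtune=generic, else nothing."""
    fallback = cc_option("-mtune=generic", accepts=accepts)
    return cc_option("-march=core2", fallback, accepts=accepts)


def proposed_cflags(accepts: Callable[[str], bool] = compiler_accepts) -> str:
    """Suggested experiment: only ever ask for -mtune=generic."""
    return cc_option("-mtune=generic", accepts=accepts)
```

With any compiler that knows -march=core2, the current line always picks it, so the experiment is really "core2 code generation versus generic tuning" with everything else in the .config held the same.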
The MCORE2 config option also sets CONFIG_X86_P6_NOP, which overrides the normal X86_64 NOPs, if I'm reading that code correctly.
Only for the ASM_NOPx nops, as far as I can tell. The actual alternative NOP rewriting seems to pick the nops based on machine, not on config options.
And I don't see anybody who actually uses the ASM_NOPx defines except for arch/x86/kernel/kprobes/opt.c, which uses ASM_NOP5.
Am I missing something? We actually have a lot of lines in arch/x86/include/asm/nops.h that set the ASM_NOPx values to the proper things, but then they are never used. We have that special "ASM_NOP5_ATOMIC" define that we are so careful about, but again, it's actually never used as far as I can tell.
Maybe there's some magic token concatenation use that I'm missing in my trivial grep, but it does seem to be dead code.
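For anybody who wants to double-check the grep, a slightly less trivial search might look like the sketch below. It is only an illustration: find_nop_uses and the two regexes are hypothetical names, the skip list just excludes the header that defines the macros, and the token-paste pattern only catches the obvious "ASM_NOP##n" spelling, so a clean result would still not be proof that the defines are dead.

```python
import re
from pathlib import Path

# Direct uses such as ASM_NOP5 or ASM_NOP5_ATOMIC.
DIRECT_USE = re.compile(r"\bASM_NOP\d+(?:_ATOMIC)?\b")
# Token concatenation such as ASM_NOP##n, which a plain grep for
# "ASM_NOP5" would not find.
TOKEN_PASTE = re.compile(r"\bASM_NOP\s*##")

SOURCE_SUFFIXES = {".c", ".h", ".S"}


def find_nop_uses(root, skip=("arch/x86/include/asm/nops.h",)):
    """Return (path, line number, kind, text) for each ASM_NOPx use.

    `skip` lists files, relative to `root`, that only define the
    macros and should not count as users.
    """
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if path.suffix not in SOURCE_SUFFIXES or not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if rel in skip:
            continue
        text = path.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), 1):
            if TOKEN_PASTE.search(line):
                hits.append((rel, lineno, "paste", line.strip()))
            elif DIRECT_USE.search(line):
                hits.append((rel, lineno, "direct", line.strip()))
    return hits
```

Run against a kernel tree as find_nop_uses("path/to/linux"); any "paste" hit would be the kind of magic use that the plain grep misses.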
But double-checking that "-march=core2" case is definitely worth looking into. Especially since there are clear indications that it's gcc version-dependent anyway. Alexander?
Linus