On Wed, Feb 16, 2022 at 10:16 AM Dave Hansen dave.hansen@intel.com wrote:
On 2/16/22 02:05, Greg KH wrote:
How was this tested, and what do the maintainers of this subsystem think? And will you be around to fix the bugs in this when they are found?
This has been trivial to reproduce, I've used a small repro which I've put here: https://gist.github.com/bgaff/9f8cbfc8dd22e60f9492e4f0aff8f04f , I also was able to reproduce this using the protection_keys self tests on a 11th Gen Core i5-1135G7. I'm happy to commit to addressing any bugs that may appear. I'll see what the maintainers say, but there is also a smaller fix that just involves using this_cpu_read() in switch_fpu_finish() for this specific issue, although that approach isn't as clean.
Can you add the test to the in-kernel tests so that we make sure it is fixed and never comes back?
It would be great if Brian could confirm this. But, I'm 99% sure that this can be reproduced in the vm/protection_keys.c selftest, if you run it for long enough.
Hi Dave, Yes, this is reproduced by the protection keys selfs tests. If your kernel was compiled in a way which caches current_task when read via this_cpu_read_stable(), then when switching from a kernel thread to a user thread you will observe this behavior, so the only situation where it's timing related is waiting for that switch from a kernel to user thread. If your kernel is compiled in a way which does not cache the value of current_task then you will never observe this behavior. My understanding is that this is perfectly valid for the compiler to produce that code.
The symptom here is corruption of the PKRU register. I created *lots* of bugs like this during protection keys development so the selftest keeps a shadow copy of the register to specifically watch for corruption.
It's _plausible_ that no one ever ran the pkey selftests with a clang-compiled kernel for long enough to hit this issue.
For ChromeOS we use clang and when I tested a vanilla 5.10 kernel _built with clang_ it also reproduced, so I suspect you're right that the self tests just haven't been run against clang built kernels all that often.
How would you and Greg KH like to proceed with this? I'm happy to help however I can.
Brian