Is there a plan for fixing this for real? I'm wondering if there is a sane weakening of this feature that still allows things like kexec.
I'm pretty sure kexec can be fixed. I had it working at one point and am currently revalidating it. The issue, though, was that kexec only worked within the guest, not on the physical host. I suspect that is related to supervisor pages needing to be mapped before SMAP is enabled (based on what I'd seen in the selftests and unit tests). I was also blindly turning on the bits without checking for support when I tried this, so that could have been the issue too.
I think most of the changes for just blindly enabling the bits were in relocate_kernel, secondary_startup_64, and startup_32.
So I have a naive fix for kexec which has only been tested to work under KVM. When tested on a physical host, it did not boot when SMAP or UMIP were set. Undoubtedly it's not the correct way to do this, as it skips CPU feature identification, opting instead for blindly setting the bits. The physical host I tested this on does not have UMIP so that's likely why it failed to boot when UMIP gets set blindly. Within kvm-unit-tests, the test for SMAP maps memory as supervisor pages before enabling SMAP. I suspect this is why setting SMAP blindly causes the physical host not to boot.
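To illustrate what checking for support before setting the bits might look like, here is a minimal sketch in plain C. It is not the actual patch: the helper name cr4_bits_from_cpuid7() is made up for illustration, and the real code in relocate_kernel, secondary_startup_64, and startup_32 would have to do the equivalent in assembly around a cpuid instruction. Only the bit positions (CPUID leaf 7, subleaf 0, and the matching CR4 bits) are taken from the architecture.

```c
#include <stdint.h>

/* Feature flags reported by CPUID.(EAX=7, ECX=0). */
#define CPUID7_EBX_SMEP (1u << 7)
#define CPUID7_EBX_SMAP (1u << 20)
#define CPUID7_ECX_UMIP (1u << 2)

/* Corresponding CR4 control bits. */
#define CR4_UMIP (1u << 11)
#define CR4_SMEP (1u << 20)
#define CR4_SMAP (1u << 21)

/*
 * Hypothetical helper: given the EBX and ECX values returned by
 * CPUID leaf 7 (subleaf 0), return only the CR4 bits that the CPU
 * actually supports, so unsupported bits are never set blindly.
 */
uint32_t cr4_bits_from_cpuid7(uint32_t ebx, uint32_t ecx)
{
    uint32_t cr4 = 0;

    if (ebx & CPUID7_EBX_SMEP)
        cr4 |= CR4_SMEP;
    if (ebx & CPUID7_EBX_SMAP)
        cr4 |= CR4_SMAP;
    if (ecx & CPUID7_ECX_UMIP)
        cr4 |= CR4_UMIP;

    return cr4;
}
```

On a host without UMIP, as in the test above, this would return a mask with the UMIP bit clear, so the early boot code would not attempt to set a reserved CR4 bit.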
Within trampoline_32bit_src(), if I add more instructions I get an error about "attempt to move .org backwards". As I understand it, that means the code has grown past the fixed offset set by a following .org directive, so only so many instructions fit in each of those functions.
My suspicion is that someone with more knowledge of this area has a good idea on how best to handle this. Feedback would be much appreciated.
There's no SMEP or SMAP in real mode, and real mode has basically no security mitigations at all.
We'd thought about the switch to real mode being a case where we'd want to drop pinning. However, we weren't sure how much weaker, if at all, it makes this protection.
Unless someone knows, I'll probably need to do some digging into what an exploit might look like that tries switching to real mode and switching back as a way around this protection.
TL;DR We probably shouldn't use the switch to real mode as a trigger to drop pinning.
This protection assumes that the attacker is at the point where they have the ability to write a payload for a ROP/JOP attack and gain control of execution.
For this case, where we are going to switch to real mode, we need to add an assumption that the attacker has a write primitive that allows them to write part of their payload to memory that will be addressable within 16-bit mode.
If the attacker has this write primitive, the attack becomes: write the payloads; in the first stage, switch to real mode; then run stage two in real mode, via JOP or just plain machine code (since we don't have to worry about NX there), to set up protected mode and jump back into the kernel with protections disabled.
PCID is an odd case. I see no good reason to pin it, and pinning PCID on prevents use of 32-bit mode.
Maybe it makes sense to default to the values we have, but allow host userspace to overwrite the allowed values, in case some other guest OS wants to do something that Linux doesn't with PCID or other bits.
In the next version of this patchset I've made it so that the default allowed values are WP, SMEP, SMAP, and UMIP. However, a write to the allowed MSR from the host VMM (QEMU) can change which bits are allowed.
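As a rough sketch of the semantics described here (not the patchset's code), the allowed/pinned relationship could be modeled like this. All names (struct cr_pin_state, cr_pin_init() and so on) are hypothetical, and the real interface is an MSR handled by KVM rather than C function calls. The sketch also assumes that a host change to the allowed mask does not retroactively unpin bits.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_WP    (1ull << 16)
#define CR4_UMIP  (1ull << 11)
#define CR4_PCIDE (1ull << 17)
#define CR4_SMEP  (1ull << 20)
#define CR4_SMAP  (1ull << 21)

/* Hypothetical per-vCPU state: which bits may be pinned, and which are. */
struct cr_pin_state {
    uint64_t cr0_allowed, cr4_allowed;
    uint64_t cr0_pinned, cr4_pinned;
};

/* Defaults from the thread: WP, SMEP, SMAP and UMIP may be pinned. */
void cr_pin_init(struct cr_pin_state *s)
{
    s->cr0_allowed = CR0_WP;
    s->cr4_allowed = CR4_SMEP | CR4_SMAP | CR4_UMIP;
    s->cr0_pinned = 0;
    s->cr4_pinned = 0;
}

/* Models the host VMM (e.g. QEMU) overwriting the allowed masks. */
void cr_pin_host_set_allowed(struct cr_pin_state *s, uint64_t cr0, uint64_t cr4)
{
    s->cr0_allowed = cr0;
    s->cr4_allowed = cr4;
}

/* Models a guest pin request; rejected if it names a bit that is not allowed. */
bool cr_pin_guest_request(struct cr_pin_state *s, uint64_t cr0, uint64_t cr4)
{
    if ((cr0 & ~s->cr0_allowed) || (cr4 & ~s->cr4_allowed))
        return false;
    s->cr0_pinned |= cr0;
    s->cr4_pinned |= cr4;
    return true;
}

/* A guest CR write is acceptable only if every pinned bit stays set. */
bool cr_pin_write_ok(const struct cr_pin_state *s, uint64_t cr0_new, uint64_t cr4_new)
{
    return (cr0_new & s->cr0_pinned) == s->cr0_pinned &&
           (cr4_new & s->cr4_pinned) == s->cr4_pinned;
}
```

With the defaults, a guest request to pin PCIDE would be refused, while unpinned bits such as PCIDE can still be toggled freely. A host that wants to let some other guest OS pin it would widen the allowed mask first.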