On 11/9/21 10:58 AM, Brian Geffon wrote:
Hi Andy,
On Tue, Nov 9, 2021 at 9:58 AM Andy Lutomirski luto@kernel.org wrote:
Here's an excerpt from an old email that I, perhaps unwisely, sent to Dave but not to a public list:
static inline void write_pkru(u32 pkru) { struct pkru_state *pk;
if (!boot_cpu_has(X86_FEATURE_OSPKE)) return; pk = get_xsave_addr(¤t->thread.fpu.state.xsave,
XFEATURE_PKRU);
/* * The PKRU value in xstate needs to be in sync with the value
that is * written to the CPU. The FPU restore on return to userland would * otherwise load the previous value again. */ fpregs_lock(); if (pk) pk->pkru = pkru;
^^^ else we just write to the PKRU register but leave XINUSE[PKRU] clear on return to usermode? That seems... unwise.
__write_pkru(pkru); fpregs_unlock();
}
I bet you're hitting exactly this bug. The fix ended up being a whole series of patches, but the gist of it is that the write_pkru() slow path needs to set the xfeature bit in the xsave buffer and then do the write. It should be possible to make a little patch to do just this in a couple lines of code.
I think you've got the right idea, the following patch does seem to fix the problem on this CPU, this is based on 5.13. It seems the changes to asm/pgtable.h were not enough, I also had to modify fpu/internal.h to get it working properly.
From e5e184d68ac6ca93c3cd2cc88d61af3260d1c014 Mon Sep 17 00:00:00 2001
From: Brian Geffon bgeffon@google.com Date: Tue, 9 Nov 2021 17:08:25 +0000 Subject: [PATCH] x86/fpu: Set XFEATURE_PKRU when writing to pkru
On kernels prior to 5.14 the write_pkru path could end up writing to the pkru register without updating the corresponding state.
Signed-off-by: Brian Geffon bgeffon@google.com
arch/x86/include/asm/fpu/internal.h | 22 ++++++++++------------ arch/x86/include/asm/pgtable.h | 5 +++-- 2 files changed, 13 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 16bf4d4a8159..ed2ce7d1afeb 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -564,18 +564,16 @@ static inline void switch_fpu_finish(struct fpu *new_fpu) * PKRU state is switched eagerly because it needs to be valid before we * return to userland e.g. for a copy_to_user() operation. */
if (!(current->flags & PF_KTHREAD)) {
/*
* If the PKRU bit in xsave.header.xfeatures is not set,
* then the PKRU component was in init state, which means
* XRSTOR will set PKRU to 0. If the bit is not set then
* get_xsave_addr() will return NULL because the PKRU value
* in memory is not valid. This means pkru_val has to be
* set to 0 and not to init_pkru_value.
*/
pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
pkru_val = pk ? pk->pkru : 0;
}
/*
* If the PKRU bit in xsave.header.xfeatures is not set,
* then the PKRU component was in init state, which means
* XRSTOR will set PKRU to 0. If the bit is not set then
* get_xsave_addr() will return NULL because the PKRU value
* in memory is not valid. This means pkru_val has to be
* set to 0 and not to init_pkru_value.
*/
pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
pkru_val = pk ? pk->pkru : 0; __write_pkru(pkru_val);
}
This hunk doesn't make any sense to me. new_fpu should be for 'current', and if 'current' is a PF_KTHREAD, it shouldn't be using PKRU.
Why does this hunk matter?
index b1099f2d9800..d00fc2df4cfe 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -137,18 +137,19 @@ static inline u32 read_pkru(void) static inline void write_pkru(u32 pkru) { struct pkru_state *pk;
struct fpu *fpu = ¤t->thread.fpu; if (!boot_cpu_has(X86_FEATURE_OSPKE)) return;
pk = get_xsave_addr(¤t->thread.fpu.state.xsave, XFEATURE_PKRU);
/* * The PKRU value in xstate needs to be in sync with the value that is * written to the CPU. The FPU restore on return to userland would * otherwise load the previous value again. */
fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_PKRU;
This needs to be inside fpregs_lock(). This task can get preempted at any time and the xfeatures bit is not stable.
fpregs_lock();
pk = get_xsave_addr(&fpu->state.xsave, XFEATURE_PKRU); if (pk) pk->pkru = pkru; __write_pkru(pkru);
I still don't think this quite matches up with the symptoms that you are seeing. This fix would ensure that we don't forget to update the XSAVE buffer when it is in the init state. But, if we did that, we would see PKRU *going* to the init state: all 0's. What you were seeing instead was it going from 0x55555550 to 0x55555554, not 0x0.
I don't doubt that this makes the symptoms away, I just don't think this really explains what's going on thoroughly enough.