Currently, Linux probes for X86_BUG_NULL_SEL unconditionally which makes it unsafe to migrate in a virtualised environment as the properties across the migration pool might differ.
To be specific, the case which goes wrong is:
1. Zen1 (or earlier) and Zen2 (or later) in a migration pool 2. Linux boots on Zen2, probes and finds the absence of X86_BUG_NULL_SEL 3. Linux is then migrated to Zen1
Linux is now running on a X86_BUG_NULL_SEL-impacted CPU while believing that the bug is fixed.
The only way to address the problem is to fully trust the "no longer affected" CPUID bit when virtualised, because in the above case it would be clear deliberately to indicate the fact "you might migrate to somewhere which has this behaviour".
Zen3 adds the NullSelectorClearsBase bit to indicate that loading a NULL segment selector zeroes the base and limit fields, as well as just attributes. Zen2 also has this behaviour but doesn't have the NSCB bit.
Signed-off-by: Jane Malalane jane.malalane@citrix.com --- CC: x86@kernel.org CC: Thomas Gleixner tglx@linutronix.de CC: Ingo Molnar mingo@redhat.com CC: Borislav Petkov bp@alien8.de CC: "H. Peter Anvin" hpa@zytor.com CC: Pu Wen puwen@hygon.cn CC: Paolo Bonzini pbonzini@redhat.com CC: Sean Christopherson seanjc@google.com CC: Peter Zijlstra peterz@infradead.org CC: Andrew Cooper andrew.cooper3@citrix.com CC: Yazen Ghannam Yazen.Ghannam@amd.com CC: Brijesh Singh brijesh.singh@amd.com CC: Huang Rui ray.huang@amd.com CC: Andy Lutomirski luto@kernel.org CC: Kim Phillips kim.phillips@amd.com CC: stable@vger.kernel.org
v2: * Deliberately not __init. early_init_*() not __init functions * Fixed whitespace error flagged by scripts/checkpatch.pl --- arch/x86/kernel/cpu/amd.c | 22 ++++++++++++++++++++++ arch/x86/kernel/cpu/common.c | 8 +++----- arch/x86/kernel/cpu/cpu.h | 1 + arch/x86/kernel/cpu/hygon.c | 22 ++++++++++++++++++++++ 4 files changed, 48 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index 2131af9f2fa2..1abfb0ae1f74 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -625,6 +625,7 @@ static void early_init_amd(struct cpuinfo_x86 *c) { u64 value; u32 dummy; + bool nscb = false;
early_init_amd_mc(c);
@@ -650,6 +651,27 @@ static void early_init_amd(struct cpuinfo_x86 *c) if (c->x86_power & BIT(14)) set_cpu_cap(c, X86_FEATURE_RAPL);
+ /* + * Zen1 and earlier CPUs don't clear segment base/limits when + * loading a NULL selector. This has been designated + * X86_BUG_NULL_SEG. + * + * Zen3 CPUs advertise Null Selector Clears Base in CPUID. + * Zen2 CPUs also have this behaviour, but no CPUID bit. + * + * A hypervisor may sythesize the bit, but may also hide it + * for migration safety, so we must not probe for model + * specific behaviour when virtualised. + */ + if (c->extended_cpuid_level >= 0x80000021 && cpuid_eax(0x80000021) & BIT(6)) + nscb = true; + + if (!cpu_has(c, X86_FEATURE_HYPERVISOR) && !nscb && c->x86 == 0x17) + nscb = check_null_seg_clears_base(c); + + if (!nscb) + set_cpu_bug(c, X86_BUG_NULL_SEG); + #ifdef CONFIG_X86_64 set_cpu_cap(c, X86_FEATURE_SYSCALL32); #else diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 0f8885949e8c..2ca4afb97247 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1395,7 +1395,7 @@ void __init early_cpu_init(void) early_identify_cpu(&boot_cpu_data); }
-static void detect_null_seg_behavior(struct cpuinfo_x86 *c) +bool check_null_seg_clears_base(struct cpuinfo_x86 *c) { #ifdef CONFIG_X86_64 /* @@ -1418,10 +1418,10 @@ static void detect_null_seg_behavior(struct cpuinfo_x86 *c) wrmsrl(MSR_FS_BASE, 1); loadsegment(fs, 0); rdmsrl(MSR_FS_BASE, tmp); - if (tmp != 0) - set_cpu_bug(c, X86_BUG_NULL_SEG); wrmsrl(MSR_FS_BASE, old_base); + return tmp == 0; #endif + return true; }
static void generic_identify(struct cpuinfo_x86 *c) @@ -1457,8 +1457,6 @@ static void generic_identify(struct cpuinfo_x86 *c)
get_model_name(c); /* Default name */
- detect_null_seg_behavior(c); - /* * ESPFIX is a strange bug. All real CPUs have it. Paravirt * systems that run Linux at CPL > 0 may or may not have the diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h index 95521302630d..ad88bce508fa 100644 --- a/arch/x86/kernel/cpu/cpu.h +++ b/arch/x86/kernel/cpu/cpu.h @@ -75,6 +75,7 @@ extern int detect_extended_topology_early(struct cpuinfo_x86 *c); extern int detect_extended_topology(struct cpuinfo_x86 *c); extern int detect_ht_early(struct cpuinfo_x86 *c); extern void detect_ht(struct cpuinfo_x86 *c); +extern bool check_null_seg_clears_base(struct cpuinfo_x86 *c);
unsigned int aperfmperf_get_khz(int cpu);
diff --git a/arch/x86/kernel/cpu/hygon.c b/arch/x86/kernel/cpu/hygon.c index 6d50136f7ab9..49bdb55efe52 100644 --- a/arch/x86/kernel/cpu/hygon.c +++ b/arch/x86/kernel/cpu/hygon.c @@ -240,6 +240,7 @@ static void bsp_init_hygon(struct cpuinfo_x86 *c) static void early_init_hygon(struct cpuinfo_x86 *c) { u32 dummy; + bool nscb = false;
early_init_hygon_mc(c);
@@ -264,6 +265,27 @@ static void early_init_hygon(struct cpuinfo_x86 *c) if (c->x86_power & BIT(14)) set_cpu_cap(c, X86_FEATURE_RAPL);
+ /* + * Zen1 and earlier CPUs don't clear segment base/limits when + * loading a NULL selector. This has been designated + * X86_BUG_NULL_SEG. + * + * Zen3 CPUs advertise Null Selector Clears Base in CPUID. + * Zen2 CPUs also have this behaviour, but no CPUID bit. + * + * A hypervisor may sythesize the bit, but may also hide it + * for migration safety, so we must not probe for model + * specific behaviour when virtualised. + */ + if (c->extended_cpuid_level >= 0x80000021 && cpuid_eax(0x80000021) & BIT(6)) + nscb = true; + + if (!cpu_has(c, X86_FEATURE_HYPERVISOR) && !nscb && c->x86 == 0x18) + nscb = check_null_seg_clears_base(c); + + if (!nscb) + set_cpu_bug(c, X86_BUG_NULL_SEG); + #ifdef CONFIG_X86_64 set_cpu_cap(c, X86_FEATURE_SYSCALL32); #endif
On Wed, Oct 13, 2021 at 03:22:30PM +0100, Jane Malalane wrote:
Currently, Linux probes for X86_BUG_NULL_SEL unconditionally which makes it unsafe to migrate in a virtualised environment as the properties across the migration pool might differ.
To be specific, the case which goes wrong is:
- Zen1 (or earlier) and Zen2 (or later) in a migration pool
- Linux boots on Zen2, probes and finds the absence of X86_BUG_NULL_SEL
- Linux is then migrated to Zen1
Linux is now running on a X86_BUG_NULL_SEL-impacted CPU while believing that the bug is fixed.
The only way to address the problem is to fully trust the "no longer affected" CPUID bit when virtualised, because in the above case it would be clear deliberately to indicate the fact "you might migrate to somewhere which has this behaviour".
Zen3 adds the NullSelectorClearsBase bit to indicate that loading a NULL segment selector zeroes the base and limit fields, as well as just attributes. Zen2 also has this behaviour but doesn't have the NSCB bit.
Signed-off-by: Jane Malalane jane.malalane@citrix.com
CC: x86@kernel.org CC: Thomas Gleixner tglx@linutronix.de CC: Ingo Molnar mingo@redhat.com CC: Borislav Petkov bp@alien8.de CC: "H. Peter Anvin" hpa@zytor.com CC: Pu Wen puwen@hygon.cn CC: Paolo Bonzini pbonzini@redhat.com CC: Sean Christopherson seanjc@google.com CC: Peter Zijlstra peterz@infradead.org CC: Andrew Cooper andrew.cooper3@citrix.com CC: Yazen Ghannam Yazen.Ghannam@amd.com CC: Brijesh Singh brijesh.singh@amd.com CC: Huang Rui ray.huang@amd.com CC: Andy Lutomirski luto@kernel.org CC: Kim Phillips kim.phillips@amd.com CC: stable@vger.kernel.org
These need to go above the --- line, otherwise they are cut off when the patch is applied and you will loose the cc: stable@ tag.
thanks,
greg k-h
On Wed, Oct 13, 2021 at 03:22:30PM +0100, Jane Malalane wrote:
@@ -650,6 +651,27 @@ static void early_init_amd(struct cpuinfo_x86 *c) if (c->x86_power & BIT(14)) set_cpu_cap(c, X86_FEATURE_RAPL);
- /*
* Zen1 and earlier CPUs don't clear segment base/limits when
* loading a NULL selector. This has been designated
* X86_BUG_NULL_SEG.
*
* Zen3 CPUs advertise Null Selector Clears Base in CPUID.
* Zen2 CPUs also have this behaviour, but no CPUID bit.
*
* A hypervisor may sythesize the bit, but may also hide it
* for migration safety, so we must not probe for model
* specific behaviour when virtualised.
*/
- if (c->extended_cpuid_level >= 0x80000021 && cpuid_eax(0x80000021) & BIT(6))
nscb = true;
- if (!cpu_has(c, X86_FEATURE_HYPERVISOR) && !nscb && c->x86 == 0x17)
nscb = check_null_seg_clears_base(c);
- if (!nscb)
set_cpu_bug(c, X86_BUG_NULL_SEG);
#ifdef CONFIG_X86_64 set_cpu_cap(c, X86_FEATURE_SYSCALL32); #else
Can we do something like this?
It is carved out into a separate function which you can simply call from early_init_amd() and early_init_hygon().
I guess you can put that function in arch/x86/kernel/cpu/common.c or so.
Then, you should put the comments right over the code like I've done below so that one can follow what's going on with each particular check.
I've also flipped the logic a bit and it is simpler this way.
Totally untested of course.
static void early_probe_null_seg_clearing_base(struct cpuinfo_x86 *c) { /* * A hypervisor may sythesize the bit, but may also hide it * for migration safety, so do not probe for model-specific * behaviour when virtualised. */ if (cpu_has(c, X86_FEATURE_HYPERVISOR)) return;
/* Zen3 CPUs advertise Null Selector Clears Base in CPUID. */ if (c->extended_cpuid_level >= 0x80000021 && cpuid_eax(0x80000021) & BIT(6)) return;
/* Zen2 CPUs also have this behaviour, but no CPUID bit. */ if (c->x86 == 0x17 && check_null_seg_clears_base(c)) return;
/* All the remaining ones are affected */ set_cpu_bug(c, X86_BUG_NULL_SEG); }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 0f8885949e8c..2ca4afb97247 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1395,7 +1395,7 @@ void __init early_cpu_init(void) early_identify_cpu(&boot_cpu_data); } -static void detect_null_seg_behavior(struct cpuinfo_x86 *c) +bool check_null_seg_clears_base(struct cpuinfo_x86 *c) { #ifdef CONFIG_X86_64 /* @@ -1418,10 +1418,10 @@ static void detect_null_seg_behavior(struct cpuinfo_x86 *c) wrmsrl(MSR_FS_BASE, 1); loadsegment(fs, 0); rdmsrl(MSR_FS_BASE, tmp);
- if (tmp != 0)
wrmsrl(MSR_FS_BASE, old_base);set_cpu_bug(c, X86_BUG_NULL_SEG);
- return tmp == 0;
#endif
- return true;
} static void generic_identify(struct cpuinfo_x86 *c) @@ -1457,8 +1457,6 @@ static void generic_identify(struct cpuinfo_x86 *c) get_model_name(c); /* Default name */
- detect_null_seg_behavior(c);
- /*
- ESPFIX is a strange bug. All real CPUs have it. Paravirt
- systems that run Linux at CPL > 0 may or may not have the
So this function is called on all x86 CPUs. Are you sure others besides AMD and Hygon do not have the same issue?
IOW, I wouldn't remove that call here.
But then this is the identify() phase in the boot process and you've moved it to early_identify() by putting it in the ->c_early_init() function pointer on AMD and Hygon.
Is there any particular reasoning for this or can that detection remain in ->c_identify()?
Because if this null seg behavior detection should happen on all CPUs - and I think it should, because, well, it has been that way until now - then the vendor specific identification minus what detect_null_seg_behavior() does should run first and then after ->c_identify() is done, you should do something like:
if (!cpu_has_bug(c, X86_BUG_NULL_SEG)) { if (!check_null_seg_clears_base(c)) set_cpu_bug(c, X86_BUG_NULL_SEG); }
so that it still takes place on all CPUs.
I.e., you should split the detection.
I hope I'm making sense ...
Ah, btw, that @c parameter to detect_null_seg_behavior() is unused - you should remove it in a pre-patch.
Thx.
AFAIK no Intel CPU has ever had that behavior, and always cleared the segments; I don't Intel has any plans of supporting such a CPUID bit (although I'd certainly be willing to take such a request back to the CPU teams on request.)
That being said, this sounds like an ideal use for the hypervisor CPU feature flag. Maybe we should consider a migration hypervisor flag too to explicitly tell the kernel not to rely on hardware probing that breaks migration in general.
Now, with a CPUID but being introduced, the right thing would be to use the CPUID bit as a feature instead of using a bug flag, and add whitelisting in the vendor-specific code as applicable.
On October 18, 2021 11:17:30 AM PDT, Borislav Petkov bp@alien8.de wrote:
On Wed, Oct 13, 2021 at 03:22:30PM +0100, Jane Malalane wrote:
@@ -650,6 +651,27 @@ static void early_init_amd(struct cpuinfo_x86 *c) if (c->x86_power & BIT(14)) set_cpu_cap(c, X86_FEATURE_RAPL);
- /*
* Zen1 and earlier CPUs don't clear segment base/limits when
* loading a NULL selector. This has been designated
* X86_BUG_NULL_SEG.
*
* Zen3 CPUs advertise Null Selector Clears Base in CPUID.
* Zen2 CPUs also have this behaviour, but no CPUID bit.
*
* A hypervisor may sythesize the bit, but may also hide it
* for migration safety, so we must not probe for model
* specific behaviour when virtualised.
*/
- if (c->extended_cpuid_level >= 0x80000021 && cpuid_eax(0x80000021) & BIT(6))
nscb = true;
- if (!cpu_has(c, X86_FEATURE_HYPERVISOR) && !nscb && c->x86 == 0x17)
nscb = check_null_seg_clears_base(c);
- if (!nscb)
set_cpu_bug(c, X86_BUG_NULL_SEG);
#ifdef CONFIG_X86_64 set_cpu_cap(c, X86_FEATURE_SYSCALL32); #else
Can we do something like this?
It is carved out into a separate function which you can simply call from early_init_amd() and early_init_hygon().
I guess you can put that function in arch/x86/kernel/cpu/common.c or so.
Then, you should put the comments right over the code like I've done below so that one can follow what's going on with each particular check.
I've also flipped the logic a bit and it is simpler this way.
Totally untested of course.
static void early_probe_null_seg_clearing_base(struct cpuinfo_x86 *c) { /* * A hypervisor may sythesize the bit, but may also hide it * for migration safety, so do not probe for model-specific * behaviour when virtualised. */ if (cpu_has(c, X86_FEATURE_HYPERVISOR)) return;
/* Zen3 CPUs advertise Null Selector Clears Base in CPUID. */ if (c->extended_cpuid_level >= 0x80000021 && cpuid_eax(0x80000021) & BIT(6)) return;
/* Zen2 CPUs also have this behaviour, but no CPUID bit. */ if (c->x86 == 0x17 && check_null_seg_clears_base(c)) return;
/* All the remaining ones are affected */ set_cpu_bug(c, X86_BUG_NULL_SEG); }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 0f8885949e8c..2ca4afb97247 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1395,7 +1395,7 @@ void __init early_cpu_init(void) early_identify_cpu(&boot_cpu_data); } -static void detect_null_seg_behavior(struct cpuinfo_x86 *c) +bool check_null_seg_clears_base(struct cpuinfo_x86 *c) { #ifdef CONFIG_X86_64 /* @@ -1418,10 +1418,10 @@ static void detect_null_seg_behavior(struct cpuinfo_x86 *c) wrmsrl(MSR_FS_BASE, 1); loadsegment(fs, 0); rdmsrl(MSR_FS_BASE, tmp);
- if (tmp != 0)
wrmsrl(MSR_FS_BASE, old_base);set_cpu_bug(c, X86_BUG_NULL_SEG);
- return tmp == 0;
#endif
- return true;
} static void generic_identify(struct cpuinfo_x86 *c) @@ -1457,8 +1457,6 @@ static void generic_identify(struct cpuinfo_x86 *c) get_model_name(c); /* Default name */
- detect_null_seg_behavior(c);
- /*
- ESPFIX is a strange bug. All real CPUs have it. Paravirt
- systems that run Linux at CPL > 0 may or may not have the
So this function is called on all x86 CPUs. Are you sure others besides AMD and Hygon do not have the same issue?
IOW, I wouldn't remove that call here.
But then this is the identify() phase in the boot process and you've moved it to early_identify() by putting it in the ->c_early_init() function pointer on AMD and Hygon.
Is there any particular reasoning for this or can that detection remain in ->c_identify()?
Because if this null seg behavior detection should happen on all CPUs - and I think it should, because, well, it has been that way until now - then the vendor specific identification minus what detect_null_seg_behavior() does should run first and then after ->c_identify() is done, you should do something like:
if (!cpu_has_bug(c, X86_BUG_NULL_SEG)) { if (!check_null_seg_clears_base(c)) set_cpu_bug(c, X86_BUG_NULL_SEG); }
so that it still takes place on all CPUs.
I.e., you should split the detection.
I hope I'm making sense ...
Ah, btw, that @c parameter to detect_null_seg_behavior() is unused - you should remove it in a pre-patch.
Thx.
On Mon, Oct 18, 2021 at 12:10:20PM -0700, H. Peter Anvin wrote:
AFAIK no Intel CPU has ever had that behavior, and always cleared the segments; I don't Intel has any plans of supporting such a CPUID bit (although I'd certainly be willing to take such a request back to the CPU teams on request.)
No need - we can always set or clear a flag on Intel, depending on what we do.
That being said, this sounds like an ideal use for the hypervisor CPU feature flag.
Yap, it uses it.
Maybe we should consider a migration hypervisor flag too to explicitly tell the kernel not to rely on hardware probing that breaks migration in general.
Meh, migration-specific flag calls for all kinds of nasty when each HV would want different things to happen in the guest, for migration. And then the patch flood will come. I mean, we already do a bunch of X86_FEATURE_HYPERVISOR all over the place and apparently it is enough here too...
Now, with a CPUID but being introduced, the right thing would be to use the CPUID bit as a feature instead of using a bug flag, and add whitelisting in the vendor-specific code as applicable.
I guess we can flip all that logic checking X86_BUG_NULL_SEG... it sounds like a lot of churn to me, though and I don't see a pressing need for it unless someone is bored and wants to do some kernel patching exercises but whatever...
On Mon, Oct 18, 2021, Borislav Petkov wrote:
On Wed, Oct 13, 2021 at 03:22:30PM +0100, Jane Malalane wrote: Totally untested of course.
static void early_probe_null_seg_clearing_base(struct cpuinfo_x86 *c) { /* * A hypervisor may sythesize the bit, but may also hide it * for migration safety, so do not probe for model-specific * behaviour when virtualised. */ if (cpu_has(c, X86_FEATURE_HYPERVISOR))
This isn't correct. When running as a guest, the intended behavior is to fully trust the CPUID.0x80000021 bit. If bit 6 is set, yay, the hypervisor has told the kernel that it will only ever run on hardware without the bug. If bit 6 is clear and HYPERVISOR is true, then the FMS crud can't be trusted because the kernel _may_ run on affected hardware in the future even if the current underlying hardware is not affected.
return;
/* Zen3 CPUs advertise Null Selector Clears Base in CPUID. */ if (c->extended_cpuid_level >= 0x80000021 && cpuid_eax(0x80000021) & BIT(6)) return;
/* Zen2 CPUs also have this behaviour, but no CPUID bit. */ if (c->x86 == 0x17 && check_null_seg_clears_base(c)) return;
/* All the remaining ones are affected */ set_cpu_bug(c, X86_BUG_NULL_SEG); }
...
@@ -1457,8 +1457,6 @@ static void generic_identify(struct cpuinfo_x86 *c) get_model_name(c); /* Default name */
- detect_null_seg_behavior(c);
- /*
- ESPFIX is a strange bug. All real CPUs have it. Paravirt
- systems that run Linux at CPL > 0 may or may not have the
So this function is called on all x86 CPUs. Are you sure others besides AMD and Hygon do not have the same issue?
IOW, I wouldn't remove that call here.
I agree. If the argument for this patch is that the kernel can be migrated to older hardware, then it stands to reason that the kernel could also be migrated to a different CPU vendor entirely. E.g. start on Intel, migrate to Zen1, kaboom.
On Mon, Oct 18, 2021 at 07:29:41PM +0000, Sean Christopherson wrote:
This isn't correct. When running as a guest, the intended behavior is to fully trust the CPUID.0x80000021 bit.
Really? Because I'm coming from an SEV-SNP mail thread where we don't trust the HV at all and we even hand in a CPUID page into the guest...
:-P
If bit 6 is set, yay, the hypervisor has told the kernel that it will only ever run on hardware without the bug. If bit 6 is clear and HYPERVISOR is true, then the FMS crud can't be trusted because the kernel _may_ run on affected hardware in the future even if the current underlying hardware is not affected.
Ok, I see, then the CPUID check needs to go first, makes sense.
I agree. If the argument for this patch is that the kernel can be migrated to older hardware, then it stands to reason that the kernel could also be migrated to a different CPU vendor entirely. E.g. start on Intel, migrate to Zen1, kaboom.
Migration across vendors? Really, that works?
I'll believe it only when I see it with my own eyes.
:-)
On Mon, Oct 18, 2021, Borislav Petkov wrote:
On Mon, Oct 18, 2021 at 07:29:41PM +0000, Sean Christopherson wrote:
I agree. If the argument for this patch is that the kernel can be migrated to older hardware, then it stands to reason that the kernel could also be migrated to a different CPU vendor entirely. E.g. start on Intel, migrate to Zen1, kaboom.
Migration across vendors? Really, that works?
I'll believe it only when I see it with my own eyes.
There are plenty of caveats, but it is feasible. KVM even has a few patches that came about specifically to support cross-vendor migration, e.g. commit adc2a23734ac ("KVM: nSVM: improve SYSENTER emulation on AMD").
On 18/10/2021 21:05, Sean Christopherson wrote:
On Mon, Oct 18, 2021, Borislav Petkov wrote:
On Mon, Oct 18, 2021 at 07:29:41PM +0000, Sean Christopherson wrote:
I agree. If the argument for this patch is that the kernel can be migrated to older hardware, then it stands to reason that the kernel could also be migrated to a different CPU vendor entirely. E.g. start on Intel, migrate to Zen1, kaboom.
Migration across vendors? Really, that works?
I'll believe it only when I see it with my own eyes.
There are plenty of caveats, but it is feasible. KVM even has a few patches that came about specifically to support cross-vendor migration, e.g. commit adc2a23734ac ("KVM: nSVM: improve SYSENTER emulation on AMD").
http://developer.amd.com/wordpress/media/2012/10/CrossVendorMigration.pdf
Yeah - it really did work back in the day. For Xen PV guests, it still largely works today, because the most obvious vendor specifics were already abstracted away in the PV ABI.
Of course, none of this has remotely survived the speculation apocalypse. A cross-vendor VM has 0 chance of getting speculation safety working.
~Andrew
On 18/10/2021 19:17, Borislav Petkov wrote:
On Wed, Oct 13, 2021 at 03:22:30PM +0100, Jane Malalane wrote:
@@ -650,6 +651,27 @@ static void early_init_amd(struct cpuinfo_x86 *c) if (c->x86_power & BIT(14)) set_cpu_cap(c, X86_FEATURE_RAPL);
- /*
* Zen1 and earlier CPUs don't clear segment base/limits when
* loading a NULL selector. This has been designated
* X86_BUG_NULL_SEG.
*
* Zen3 CPUs advertise Null Selector Clears Base in CPUID.
* Zen2 CPUs also have this behaviour, but no CPUID bit.
*
* A hypervisor may sythesize the bit, but may also hide it
* for migration safety, so we must not probe for model
* specific behaviour when virtualised.
*/
- if (c->extended_cpuid_level >= 0x80000021 && cpuid_eax(0x80000021) & BIT(6))
nscb = true;
- if (!cpu_has(c, X86_FEATURE_HYPERVISOR) && !nscb && c->x86 == 0x17)
nscb = check_null_seg_clears_base(c);
- if (!nscb)
set_cpu_bug(c, X86_BUG_NULL_SEG);
#ifdef CONFIG_X86_64 set_cpu_cap(c, X86_FEATURE_SYSCALL32); #else
Can we do something like this?
It is carved out into a separate function which you can simply call from early_init_amd() and early_init_hygon().
I guess you can put that function in arch/x86/kernel/cpu/common.c or so.
Then, you should put the comments right over the code like I've done below so that one can follow what's going on with each particular check.
I've also flipped the logic a bit and it is simpler this way.
:) Sadly...
Totally untested of course.
static void early_probe_null_seg_clearing_base(struct cpuinfo_x86 *c) { /* * A hypervisor may sythesize the bit, but may also hide it * for migration safety, so do not probe for model-specific * behaviour when virtualised. */ if (cpu_has(c, X86_FEATURE_HYPERVISOR)) return;
/* Zen3 CPUs advertise Null Selector Clears Base in CPUID. */ if (c->extended_cpuid_level >= 0x80000021 && cpuid_eax(0x80000021) & BIT(6)) return;
/* Zen2 CPUs also have this behaviour, but no CPUID bit. */ if (c->x86 == 0x17 && check_null_seg_clears_base(c)) return;
... this is 0x18 for Hygon, and ...
/* All the remaining ones are affected */ set_cpu_bug(c, X86_BUG_NULL_SEG);
... hypervisor && !ncsb still needs to set BUG_NULL_SEG, so you really can't exit early.
}
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 0f8885949e8c..2ca4afb97247 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1395,7 +1395,7 @@ void __init early_cpu_init(void) early_identify_cpu(&boot_cpu_data); } -static void detect_null_seg_behavior(struct cpuinfo_x86 *c) +bool check_null_seg_clears_base(struct cpuinfo_x86 *c) { #ifdef CONFIG_X86_64 /* @@ -1418,10 +1418,10 @@ static void detect_null_seg_behavior(struct cpuinfo_x86 *c) wrmsrl(MSR_FS_BASE, 1); loadsegment(fs, 0); rdmsrl(MSR_FS_BASE, tmp);
- if (tmp != 0)
wrmsrl(MSR_FS_BASE, old_base);set_cpu_bug(c, X86_BUG_NULL_SEG);
- return tmp == 0;
#endif
- return true;
} static void generic_identify(struct cpuinfo_x86 *c) @@ -1457,8 +1457,6 @@ static void generic_identify(struct cpuinfo_x86 *c) get_model_name(c); /* Default name */
- detect_null_seg_behavior(c);
- /*
- ESPFIX is a strange bug. All real CPUs have it. Paravirt
- systems that run Linux at CPL > 0 may or may not have the
So this function is called on all x86 CPUs. Are you sure others besides AMD and Hygon do not have the same issue?
No other CPU vendors are known to have this issue. (And by "issue", even this is complicated. Back in the 32bit days, it was a plausible perf improvement, but it backfired massively for AMD64 where there was a possibility/expectation to use NULL segments.)
Andy only put the check in unilaterally just in case, and even that was fine-ish until AMD went and fixed it silently in Zen2.
IOW, I wouldn't remove that call here.
But then this is the identify() phase in the boot process and you've moved it to early_identify() by putting it in the ->c_early_init() function pointer on AMD and Hygon.
Is there any particular reasoning for this or can that detection remain in ->c_identify()?
No particular reason. It needs resolving before init gets created, but any time before that is fine.
Because if this null seg behavior detection should happen on all CPUs - and I think it should, because, well, it has been that way until now - then the vendor specific identification minus what detect_null_seg_behavior() does should run first and then after ->c_identify() is done, you should do something like:
if (!cpu_has_bug(c, X86_BUG_NULL_SEG)) { if (!check_null_seg_clears_base(c)) set_cpu_bug(c, X86_BUG_NULL_SEG); }
so that it still takes place on all CPUs.
This would only really work for boot cpu and setup_force_cap(), because no CPU is going to have X86_BUG_NULL_SEG set by default, but this still misses the point of the bugfix which is "check_null_seg_clears_base() must not be called when cpu_has_hypervisor".
In practice, the BSP is good enough. The behaviour predates the K8, which was the first CPU where it became observable without SMM/PUSHALL/etc, and quite possibly goes back to the dawn of time, and you can't mix a Zen1 and Zen2 in a 2-socket system.
I.e., you should split the detection.
I hope I'm making sense ...
Ah, btw, that @c parameter to detect_null_seg_behavior() is unused - you should remove it in a pre-patch.
It is made unused by this patch, so can't be pulled out earlier, but should be adjusted.
~Andrew
On Mon, Oct 18, 2021 at 09:06:07PM +0100, Andrew Cooper wrote:
... this is 0x18 for Hygon, and ...
Sure, whatever :)
/* All the remaining ones are affected */ set_cpu_bug(c, X86_BUG_NULL_SEG);
... hypervisor && !ncsb still needs to set BUG_NULL_SEG, so you really can't exit early.
Yeah, we had a session on IRC, we came up with this rough version, more polishing tomorrow:
static void early_probe_null_seg_clearing_base(struct cpuinfo_x86 *c) {
/* Zen3 CPUs advertise Null Selector Clears Base in CPUID. */ if (c->extended_cpuid_level >= 0x80000021 && cpuid_eax(0x80000021) & BIT(6)) return;
/* * CPUID bit above wasn't set. If this kernel is still running as a HV guest, * then the HV has decided not to advertize that CPUID bit for whatever reason. * For example, one member of the migration pool might be vulnerable. * Which means, the bug is present: set the BUG flag and return. */ if (cpu_has(c, X86_FEATURE_HYPERVISOR)) { set_cpu_bug(c, X86_BUG_NULL_SEG); return; }
/* Zen2 CPUs also have this behaviour, but no CPUID bit. 0x18 for Hygon. */ if ((c->x86 == 0x17 || c->x86 == 0x18) && check_null_seg_clears_base(c)) return;
/* All the remaining ones are affected */ set_cpu_bug(c, X86_BUG_NULL_SEG); }
So I really want to have those comments explaining each step in the complex check because we will forget why this crazy dance is being done and as I said in a previous thread, we're not all virtualizers. :)
No other CPU vendors are known to have this issue.
How do you know? Or should there be a comment along the lines of "Cooper says that..."
:-)
(And by "issue", even this is complicated. Back in the 32bit days, it was a plausible perf improvement, but it backfired massively for AMD64 where there was a possibility/expectation to use NULL segments.)
Andy only put the check in unilaterally just in case, and even that was fine-ish until AMD went and fixed it silently in Zen2.
Yeah, there's the context switch overhead too but that's for another thread.
Because if this null seg behavior detection should happen on all CPUs - and I think it should, because, well, it has been that way until now - then the vendor specific identification minus what detect_null_seg_behavior() does should run first and then after ->c_identify() is done, you should do something like:
if (!cpu_has_bug(c, X86_BUG_NULL_SEG)) { if (!check_null_seg_clears_base(c)) set_cpu_bug(c, X86_BUG_NULL_SEG); }
so that it still takes place on all CPUs.
This would only really work for boot cpu and setup_force_cap(), because no CPU is going to have X86_BUG_NULL_SEG set by default, but this still misses the point of the bugfix which is "check_null_seg_clears_base() must not be called when cpu_has_hypervisor".
In practice, the BSP is good enough. The behaviour predates the K8, which was the first CPU where it became observable without SMM/PUSHALL/etc, and quite possibly goes back to the dawn of time, and you can't mix a Zen1 and Zen2 in a 2-socket system.
Oh, I didn't express myself properly "should happen on all CPUs" was supposed to mean, if this detection should happen on all vendors like it does now. Not BSP vs AP.
It is made unused by this patch, so can't be pulled out earlier, but should be adjusted.
Right.
linux-stable-mirror@lists.linaro.org