Hi,
As part of the mitigation for the iTLB multihit vulnerability, KVM creates a worker thread in the KVM_CREATE_VM ioctl(). This thread calls cgroup_attach_task_all(), which takes cgroup_threadgroup_rwsem for writing and may incur 100ms+ latency since upstream commit 6a010a49b63ac8465851a79185d8deff966f8e1a.

However, if the CPU is not vulnerable to iTLB multihit, one can simply disable the mitigation (and the worker thread creation) with the newly added KVM module parameter nx_huge_pages=never. This avoids the issue altogether.

While there's an alternative solution for this issue already supported in 6.1-stable (i.e. cgroup's favordynmods), disabling the mitigation in KVM is probably preferable if the workload is not impacted by dynamic cgroup operations: one doesn't need to weigh the favordynmods trade-off, the thread creation code path is avoided in KVM_CREATE_VM, and you avoid creating a thread that does nothing.
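For reference, a minimal usage sketch (assuming KVM is built as the usual kvm module, so the parameter is reachable on the kernel command line or via the standard module-parameter sysfs path; "never" must be set before any VM is created):

  modprobe kvm nx_huge_pages=never                 # at module load time
  # or pass kvm.nx_huge_pages=never on the kernel command line
  cat /sys/module/kvm/parameters/nx_huge_pages     # should then report "never"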
Tests performed:
- Measured KVM_CREATE_VM latency and confirmed it goes down to less than 1ms
- We've been performing latency measurements internally w/ this parameter for some weeks now
Christophe JAILLET (1):
  KVM: x86/mmu: Use kstrtobool() instead of strtobool()

Sean Christopherson (1):
  KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages

 arch/x86/kvm/mmu/mmu.c | 42 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)
From: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Commit 11b36fe7d4500c8ef73677c087f302fd713101c2 upstream.
strtobool() is the same as kstrtobool(). However, the latter is more used within the kernel.
In order to remove strtobool() and slightly simplify kstrtox.h, switch to the other function name.
While at it, include the corresponding header file (<linux/kstrtox.h>)
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/670882aa04dbdd171b46d3b20ffab87158454616.167368913...
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Luiz Capitulino <luizcap@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index beca03556379..c089242008b3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -42,6 +42,7 @@
 #include <linux/uaccess.h>
 #include <linux/hash.h>
 #include <linux/kern_levels.h>
+#include <linux/kstrtox.h>
 #include <linux/kthread.h>

 #include <asm/page.h>
@@ -6667,7 +6668,7 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 		new_val = 1;
 	else if (sysfs_streq(val, "auto"))
 		new_val = get_nx_auto_mode();
-	else if (strtobool(val, &new_val) < 0)
+	else if (kstrtobool(val, &new_val) < 0)
 		return -EINVAL;

 	__set_nx_huge_pages(new_val);
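As a quick illustration of the values set_nx_huge_pages() accepts after this change (a sketch; the sysfs path assumes kvm is loaded as a module): the named modes are matched explicitly by the code above, and anything else falls through to kstrtobool(), which parses ordinary boolean strings.

  echo off   > /sys/module/kvm/parameters/nx_huge_pages   # named mode
  echo force > /sys/module/kvm/parameters/nx_huge_pages   # named mode
  echo auto  > /sys/module/kvm/parameters/nx_huge_pages   # named mode
  echo 1     > /sys/module/kvm/parameters/nx_huge_pages   # boolean, handled by kstrtobool()
  echo n     > /sys/module/kvm/parameters/nx_huge_pages   # boolean, handled by kstrtobool()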
On Fri, Sep 01, 2023, Luiz Capitulino wrote:
> From: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
>
> Commit 11b36fe7d4500c8ef73677c087f302fd713101c2 upstream.
>
> strtobool() is the same as kstrtobool(). However, the latter is more used within the kernel.
>
> In order to remove strtobool() and slightly simplify kstrtox.h, switch to the other function name.
>
> While at it, include the corresponding header file (<linux/kstrtox.h>)
>
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
> Link: https://lore.kernel.org/r/670882aa04dbdd171b46d3b20ffab87158454616.167368913...
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Luiz Capitulino <luizcap@amazon.com>

Acked-by: Sean Christopherson <seanjc@google.com>
From: Sean Christopherson <seanjc@google.com>
Commit 0b210faf337314e4bc88e796218bc70c72a51209 upstream.
[ Resolved a small conflict in arch/x86/kvm/mmu/mmu.c::kvm_mmu_post_init_vm() which is due to kvm_nx_lpage_recovery_worker() being renamed in upstream commit 55c510e26ab6181c132327a8b90c864e6193ce27 ]
Add a "never" option to the nx_huge_pages module param to allow userspace to do a one-way hard disabling of the mitigation, and don't create the per-VM recovery threads when the mitigation is hard disabled. Letting userspace pinky swear that userspace doesn't want to enable NX mitigation (without reloading KVM) allows certain use cases to avoid the latency problems associated with spawning a kthread for each VM.
E.g. in FaaS use cases, the guest kernel is trusted and the host may create 100+ VMs per logical CPU, which can result in 100ms+ latencies when a burst of VMs is created.
Reported-by: Li RongQing <lirongqing@baidu.com>
Closes: https://lore.kernel.org/all/1679555884-32544-1-git-send-email-lirongqing@bai...
Cc: Yong He <zhuangel570@gmail.com>
Cc: Robert Hoo <robert.hoo.linux@gmail.com>
Cc: Kai Huang <kai.huang@intel.com>
Reviewed-by: Robert Hoo <robert.hoo.linux@gmail.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Luiz Capitulino <luizcap@amazon.com>
Reviewed-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/r/20230602005859.784190-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Luiz Capitulino <luizcap@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c | 41 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c089242008b3..7a6df4b62c1b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -56,6 +56,8 @@

 extern bool itlb_multihit_kvm_mitigation;

+static bool nx_hugepage_mitigation_hard_disabled;
+
 int __read_mostly nx_huge_pages = -1;
 static uint __read_mostly nx_huge_pages_recovery_period_ms;
 #ifdef CONFIG_PREEMPT_RT
@@ -65,12 +67,13 @@ static uint __read_mostly nx_huge_pages_recovery_ratio = 0;
 static uint __read_mostly nx_huge_pages_recovery_ratio = 60;
 #endif

+static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp);
 static int set_nx_huge_pages(const char *val, const struct kernel_param *kp);
 static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel_param *kp);

 static const struct kernel_param_ops nx_huge_pages_ops = {
 	.set = set_nx_huge_pages,
-	.get = param_get_bool,
+	.get = get_nx_huge_pages,
 };

 static const struct kernel_param_ops nx_huge_pages_recovery_param_ops = {
@@ -6645,6 +6648,14 @@ static void mmu_destroy_caches(void)
 	kmem_cache_destroy(mmu_page_header_cache);
 }

+static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp)
+{
+	if (nx_hugepage_mitigation_hard_disabled)
+		return sprintf(buffer, "never\n");
+
+	return param_get_bool(buffer, kp);
+}
+
 static bool get_nx_auto_mode(void)
 {
 	/* Return true when CPU has the bug, and mitigations are ON */
@@ -6661,15 +6672,29 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 	bool old_val = nx_huge_pages;
 	bool new_val;

+	if (nx_hugepage_mitigation_hard_disabled)
+		return -EPERM;
+
 	/* In "auto" mode deploy workaround only if CPU has the bug. */
-	if (sysfs_streq(val, "off"))
+	if (sysfs_streq(val, "off")) {
 		new_val = 0;
-	else if (sysfs_streq(val, "force"))
+	} else if (sysfs_streq(val, "force")) {
 		new_val = 1;
-	else if (sysfs_streq(val, "auto"))
+	} else if (sysfs_streq(val, "auto")) {
 		new_val = get_nx_auto_mode();
-	else if (kstrtobool(val, &new_val) < 0)
+	} else if (sysfs_streq(val, "never")) {
+		new_val = 0;
+
+		mutex_lock(&kvm_lock);
+		if (!list_empty(&vm_list)) {
+			mutex_unlock(&kvm_lock);
+			return -EBUSY;
+		}
+		nx_hugepage_mitigation_hard_disabled = true;
+		mutex_unlock(&kvm_lock);
+	} else if (kstrtobool(val, &new_val) < 0) {
 		return -EINVAL;
+	}

 	__set_nx_huge_pages(new_val);

@@ -6800,6 +6825,9 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
 	uint old_period, new_period;
 	int err;

+	if (nx_hugepage_mitigation_hard_disabled)
+		return -EPERM;
+
 	was_recovery_enabled = calc_nx_huge_pages_recovery_period(&old_period);

 	err = param_set_uint(val, kp);
@@ -6923,6 +6951,9 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
 {
 	int err;

+	if (nx_hugepage_mitigation_hard_disabled)
+		return 0;
+
 	err = kvm_vm_create_worker_thread(kvm, kvm_nx_lpage_recovery_worker, 0,
 					  "kvm-nx-lpage-recovery",
 					  &kvm->arch.nx_lpage_recovery_thread);
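To illustrate the semantics the hunks above implement, a sketch (assuming the standard module-parameter sysfs path for the kvm module):

  # the hard-disable is only accepted while no VMs exist, otherwise the write fails with EBUSY
  echo never > /sys/module/kvm/parameters/nx_huge_pages

  # once hard disabled, the setting is sticky: later writes fail with EPERM ...
  echo force > /sys/module/kvm/parameters/nx_huge_pages    # fails with EPERM

  # ... and reads report the hard-disabled state
  cat /sys/module/kvm/parameters/nx_huge_pages             # prints "never"

With the mitigation hard disabled, kvm_mmu_post_init_vm() returns early and no kvm-nx-lpage-recovery worker thread is created for new VMs.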
On Fri, Sep 01, 2023, Luiz Capitulino wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Commit 0b210faf337314e4bc88e796218bc70c72a51209 upstream.
>
> [ Resolved a small conflict in arch/x86/kvm/mmu/mmu.c::kvm_mmu_post_init_vm() which is due to kvm_nx_lpage_recovery_worker() being renamed in upstream commit 55c510e26ab6181c132327a8b90c864e6193ce27 ]
>
> Add a "never" option to the nx_huge_pages module param to allow userspace to do a one-way hard disabling of the mitigation, and don't create the per-VM recovery threads when the mitigation is hard disabled. Letting userspace pinky swear that userspace doesn't want to enable NX mitigation (without reloading KVM) allows certain use cases to avoid the latency problems associated with spawning a kthread for each VM.
>
> E.g. in FaaS use cases, the guest kernel is trusted and the host may create 100+ VMs per logical CPU, which can result in 100ms+ latencies when a burst of VMs is created.
>
> Reported-by: Li RongQing <lirongqing@baidu.com>
> Closes: https://lore.kernel.org/all/1679555884-32544-1-git-send-email-lirongqing@bai...
> Cc: Yong He <zhuangel570@gmail.com>
> Cc: Robert Hoo <robert.hoo.linux@gmail.com>
> Cc: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Robert Hoo <robert.hoo.linux@gmail.com>
> Acked-by: Kai Huang <kai.huang@intel.com>
> Tested-by: Luiz Capitulino <luizcap@amazon.com>
> Reviewed-by: Li RongQing <lirongqing@baidu.com>
> Link: https://lore.kernel.org/r/20230602005859.784190-1-seanjc@google.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Luiz Capitulino <luizcap@amazon.com>

Acked-by: Sean Christopherson <seanjc@google.com>
On Fri, Sep 01, 2023 at 06:34:51PM +0000, Luiz Capitulino wrote:
> Hi,
>
> As part of the mitigation for the iTLB multihit vulnerability, KVM creates a worker thread in the KVM_CREATE_VM ioctl(). This thread calls cgroup_attach_task_all(), which takes cgroup_threadgroup_rwsem for writing and may incur 100ms+ latency since upstream commit 6a010a49b63ac8465851a79185d8deff966f8e1a.
>
> However, if the CPU is not vulnerable to iTLB multihit, one can simply disable the mitigation (and the worker thread creation) with the newly added KVM module parameter nx_huge_pages=never. This avoids the issue altogether.
>
> While there's an alternative solution for this issue already supported in 6.1-stable (i.e. cgroup's favordynmods), disabling the mitigation in KVM is probably preferable if the workload is not impacted by dynamic cgroup operations: one doesn't need to weigh the favordynmods trade-off, the thread creation code path is avoided in KVM_CREATE_VM, and you avoid creating a thread that does nothing.
>
> Tests performed:
> - Measured KVM_CREATE_VM latency and confirmed it goes down to less than 1ms
> - We've been performing latency measurements internally w/ this parameter for some weeks now
What about the 6.4.y kernel for these changes? Anyone moving from 6.1 to 6.4 will have a regression, right?
Or you can wait a week or so for 6.4.y to go end-of-life, your choice :)
thanks,
greg k-h
On 2023-09-02 03:27, Greg KH wrote:
> On Fri, Sep 01, 2023 at 06:34:51PM +0000, Luiz Capitulino wrote:
>> Hi,
>>
>> As part of the mitigation for the iTLB multihit vulnerability, KVM creates a worker thread in the KVM_CREATE_VM ioctl(). This thread calls cgroup_attach_task_all(), which takes cgroup_threadgroup_rwsem for writing and may incur 100ms+ latency since upstream commit 6a010a49b63ac8465851a79185d8deff966f8e1a.
>>
>> However, if the CPU is not vulnerable to iTLB multihit, one can simply disable the mitigation (and the worker thread creation) with the newly added KVM module parameter nx_huge_pages=never. This avoids the issue altogether.
>>
>> While there's an alternative solution for this issue already supported in 6.1-stable (i.e. cgroup's favordynmods), disabling the mitigation in KVM is probably preferable if the workload is not impacted by dynamic cgroup operations: one doesn't need to weigh the favordynmods trade-off, the thread creation code path is avoided in KVM_CREATE_VM, and you avoid creating a thread that does nothing.
>>
>> Tests performed:
>> - Measured KVM_CREATE_VM latency and confirmed it goes down to less than 1ms
>> - We've been performing latency measurements internally w/ this parameter for some weeks now
>
> What about the 6.4.y kernel for these changes? Anyone moving from 6.1 to 6.4 will have a regression, right?
>
> Or you can wait a week or so for 6.4.y to go end-of-life, your choice :)
I can do this backport for 6.4.y if that's better for stable users. Will submit the patches next week.
- Luiz
> thanks,
>
> greg k-h
On Fri, Sep 01, 2023 at 06:34:51PM +0000, Luiz Capitulino wrote:
> Hi,
>
> As part of the mitigation for the iTLB multihit vulnerability, KVM creates a worker thread in the KVM_CREATE_VM ioctl(). This thread calls cgroup_attach_task_all(), which takes cgroup_threadgroup_rwsem for writing and may incur 100ms+ latency since upstream commit 6a010a49b63ac8465851a79185d8deff966f8e1a.
>
> However, if the CPU is not vulnerable to iTLB multihit, one can simply disable the mitigation (and the worker thread creation) with the newly added KVM module parameter nx_huge_pages=never. This avoids the issue altogether.
>
> While there's an alternative solution for this issue already supported in 6.1-stable (i.e. cgroup's favordynmods), disabling the mitigation in KVM is probably preferable if the workload is not impacted by dynamic cgroup operations: one doesn't need to weigh the favordynmods trade-off, the thread creation code path is avoided in KVM_CREATE_VM, and you avoid creating a thread that does nothing.
>
> Tests performed:
> - Measured KVM_CREATE_VM latency and confirmed it goes down to less than 1ms
> - We've been performing latency measurements internally w/ this parameter for some weeks now
All now queued up, thanks.
greg k-h