The patch below does not apply to the 3.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e19182c0fff451e3744c1107d98f072e7ca377a0 Mon Sep 17 00:00:00 2001
From: Jeff Mahoney <jeffm(a)suse.com>
Date: Mon, 4 Dec 2017 13:11:45 -0500
Subject: [PATCH] btrfs: fix missing error return in btrfs_drop_snapshot
If btrfs_del_root fails in btrfs_drop_snapshot, we'll pick up the
error but then return 0 anyway due to mixing err and ret.
Fixes: 79787eaab4612 ("btrfs: replace many BUG_ONs with proper error handling")
Cc: <stable(a)vger.kernel.org> # v3.4+
Signed-off-by: Jeff Mahoney <jeffm(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 784d41e95ed9..16e46ee3cd16 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9220,6 +9220,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
ret = btrfs_del_root(trans, fs_info, &root->root_key);
if (ret) {
btrfs_abort_transaction(trans, ret);
+ err = ret;
goto out_end_trans;
}
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e19182c0fff451e3744c1107d98f072e7ca377a0 Mon Sep 17 00:00:00 2001
From: Jeff Mahoney <jeffm(a)suse.com>
Date: Mon, 4 Dec 2017 13:11:45 -0500
Subject: [PATCH] btrfs: fix missing error return in btrfs_drop_snapshot
If btrfs_del_root fails in btrfs_drop_snapshot, we'll pick up the
error but then return 0 anyway due to mixing err and ret.
Fixes: 79787eaab4612 ("btrfs: replace many BUG_ONs with proper error handling")
Cc: <stable(a)vger.kernel.org> # v3.4+
Signed-off-by: Jeff Mahoney <jeffm(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 784d41e95ed9..16e46ee3cd16 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9220,6 +9220,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
ret = btrfs_del_root(trans, fs_info, &root->root_key);
if (ret) {
btrfs_abort_transaction(trans, ret);
+ err = ret;
goto out_end_trans;
}
The patch below does not apply to the 3.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f775b13eedee2f7f3c6fdd4e90fb79090ce5d339 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel(a)redhat.com>
Date: Tue, 14 Nov 2017 16:54:23 -0500
Subject: [PATCH] x86,kvm: move qemu/guest FPU switching out to vcpu_run
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Currently, every time a VCPU is scheduled out, the host kernel will
first save the guest FPU/xstate context, then load the qemu userspace
FPU context, only to then immediately save the qemu userspace FPU
context back to memory. When scheduling in a VCPU, the same extraneous
FPU loads and saves are done.
This could be avoided by moving from a model where the guest FPU is
loaded and stored with preemption disabled, to a model where the
qemu userspace FPU is swapped out for the guest FPU context for
the duration of the KVM_RUN ioctl.
This is done under the VCPU mutex, which is also taken when other
tasks inspect the VCPU FPU context, so the code should already be
safe for this change. That should come as no surprise, given that
s390 already has this optimization.
This can fix a bug where KVM calls get_user_pages while owning the
FPU, and the file system ends up requesting the FPU again:
[258270.527947] __warn+0xcb/0xf0
[258270.527948] warn_slowpath_null+0x1d/0x20
[258270.527951] kernel_fpu_disable+0x3f/0x50
[258270.527953] __kernel_fpu_begin+0x49/0x100
[258270.527955] kernel_fpu_begin+0xe/0x10
[258270.527958] crc32c_pcl_intel_update+0x84/0xb0
[258270.527961] crypto_shash_update+0x3f/0x110
[258270.527968] crc32c+0x63/0x8a [libcrc32c]
[258270.527975] dm_bm_checksum+0x1b/0x20 [dm_persistent_data]
[258270.527978] node_prepare_for_write+0x44/0x70 [dm_persistent_data]
[258270.527985] dm_block_manager_write_callback+0x41/0x50 [dm_persistent_data]
[258270.527988] submit_io+0x170/0x1b0 [dm_bufio]
[258270.527992] __write_dirty_buffer+0x89/0x90 [dm_bufio]
[258270.527994] __make_buffer_clean+0x4f/0x80 [dm_bufio]
[258270.527996] __try_evict_buffer+0x42/0x60 [dm_bufio]
[258270.527998] dm_bufio_shrink_scan+0xc0/0x130 [dm_bufio]
[258270.528002] shrink_slab.part.40+0x1f5/0x420
[258270.528004] shrink_node+0x22c/0x320
[258270.528006] do_try_to_free_pages+0xf5/0x330
[258270.528008] try_to_free_pages+0xe9/0x190
[258270.528009] __alloc_pages_slowpath+0x40f/0xba0
[258270.528011] __alloc_pages_nodemask+0x209/0x260
[258270.528014] alloc_pages_vma+0x1f1/0x250
[258270.528017] do_huge_pmd_anonymous_page+0x123/0x660
[258270.528021] handle_mm_fault+0xfd3/0x1330
[258270.528025] __get_user_pages+0x113/0x640
[258270.528027] get_user_pages+0x4f/0x60
[258270.528063] __gfn_to_pfn_memslot+0x120/0x3f0 [kvm]
[258270.528108] try_async_pf+0x66/0x230 [kvm]
[258270.528135] tdp_page_fault+0x130/0x280 [kvm]
[258270.528149] kvm_mmu_page_fault+0x60/0x120 [kvm]
[258270.528158] handle_ept_violation+0x91/0x170 [kvm_intel]
[258270.528162] vmx_handle_exit+0x1ca/0x1400 [kvm_intel]
No performance changes were detected in quick ping-pong tests on
my 4 socket system, which is expected since an FPU+xstate load is
on the order of 0.1us, while ping-ponging between CPUs is on the
order of 20us, and somewhat noisy.
Cc: stable(a)vger.kernel.org
Signed-off-by: Rik van Riel <riel(a)redhat.com>
Suggested-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
[Fixed a bug where reset_vcpu called put_fpu without preceding load_fpu,
which happened inside from KVM_CREATE_VCPU ioctl. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar(a)redhat.com>
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 977de5fb968b..62527e053ee4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -536,7 +536,20 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_page_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
+ /*
+ * QEMU userspace and the guest each have their own FPU state.
+ * In vcpu_run, we switch between the user and guest FPU contexts.
+ * While running a VCPU, the VCPU thread will have the guest FPU
+ * context.
+ *
+ * Note that while the PKRU state lives inside the fpu registers,
+ * it is switched out separately at VMENTER and VMEXIT time. The
+ * "guest_fpu" state here contains the guest FPU context, with the
+ * host PRKU bits.
+ */
+ struct fpu user_fpu;
struct fpu guest_fpu;
+
u64 xcr0;
u64 guest_supported_xcr0;
u32 guest_xstate_size;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eee8e7faf1af..c8da1680a7d6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2937,7 +2937,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
srcu_read_unlock(&vcpu->kvm->srcu, idx);
pagefault_enable();
kvm_x86_ops->vcpu_put(vcpu);
- kvm_put_guest_fpu(vcpu);
vcpu->arch.last_host_tsc = rdtsc();
}
@@ -5254,13 +5253,10 @@ static void emulator_halt(struct x86_emulate_ctxt *ctxt)
static void emulator_get_fpu(struct x86_emulate_ctxt *ctxt)
{
- preempt_disable();
- kvm_load_guest_fpu(emul_to_vcpu(ctxt));
}
static void emulator_put_fpu(struct x86_emulate_ctxt *ctxt)
{
- preempt_enable();
}
static int emulator_intercept(struct x86_emulate_ctxt *ctxt,
@@ -6952,7 +6948,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
preempt_disable();
kvm_x86_ops->prepare_guest_switch(vcpu);
- kvm_load_guest_fpu(vcpu);
/*
* Disable IRQs before setting IN_GUEST_MODE. Posted interrupt
@@ -7297,12 +7292,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
}
}
+ kvm_load_guest_fpu(vcpu);
+
if (unlikely(vcpu->arch.complete_userspace_io)) {
int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
vcpu->arch.complete_userspace_io = NULL;
r = cui(vcpu);
if (r <= 0)
- goto out;
+ goto out_fpu;
} else
WARN_ON(vcpu->arch.pio.count || vcpu->mmio_needed);
@@ -7311,6 +7308,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
else
r = vcpu_run(vcpu);
+out_fpu:
+ kvm_put_guest_fpu(vcpu);
out:
post_kvm_run_save(vcpu);
kvm_sigset_deactivate(vcpu);
@@ -7704,32 +7703,25 @@ static void fx_init(struct kvm_vcpu *vcpu)
vcpu->arch.cr0 |= X86_CR0_ET;
}
+/* Swap (qemu) user FPU context for the guest FPU context. */
void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
{
- if (vcpu->guest_fpu_loaded)
- return;
-
- /*
- * Restore all possible states in the guest,
- * and assume host would use all available bits.
- * Guest xcr0 would be loaded later.
- */
- vcpu->guest_fpu_loaded = 1;
- __kernel_fpu_begin();
+ preempt_disable();
+ copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
/* PKRU is separately restored in kvm_x86_ops->run. */
__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
~XFEATURE_MASK_PKRU);
+ preempt_enable();
trace_kvm_fpu(1);
}
+/* When vcpu_run ends, restore user space FPU context. */
void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
{
- if (!vcpu->guest_fpu_loaded)
- return;
-
- vcpu->guest_fpu_loaded = 0;
+ preempt_disable();
copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
- __kernel_fpu_end();
+ copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
+ preempt_enable();
++vcpu->stat.fpu_reload;
trace_kvm_fpu(0);
}
@@ -7846,7 +7838,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
* To avoid have the INIT path from kvm_apic_has_events() that be
* called with loaded FPU and does not let userspace fix the state.
*/
- kvm_put_guest_fpu(vcpu);
+ if (init_event)
+ kvm_put_guest_fpu(vcpu);
mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu.state.xsave,
XFEATURE_MASK_BNDREGS);
if (mpx_state_buffer)
@@ -7855,6 +7848,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
XFEATURE_MASK_BNDCSR);
if (mpx_state_buffer)
memset(mpx_state_buffer, 0, sizeof(struct mpx_bndcsr));
+ if (init_event)
+ kvm_load_guest_fpu(vcpu);
}
if (!init_event) {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 893d6d606cd0..6bdd4b9f6611 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -232,7 +232,7 @@ struct kvm_vcpu {
struct mutex mutex;
struct kvm_run *run;
- int guest_fpu_loaded, guest_xcr0_loaded;
+ int guest_xcr0_loaded;
struct swait_queue_head wq;
struct pid __rcu *pid;
int sigset_active;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f775b13eedee2f7f3c6fdd4e90fb79090ce5d339 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel(a)redhat.com>
Date: Tue, 14 Nov 2017 16:54:23 -0500
Subject: [PATCH] x86,kvm: move qemu/guest FPU switching out to vcpu_run
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Currently, every time a VCPU is scheduled out, the host kernel will
first save the guest FPU/xstate context, then load the qemu userspace
FPU context, only to then immediately save the qemu userspace FPU
context back to memory. When scheduling in a VCPU, the same extraneous
FPU loads and saves are done.
This could be avoided by moving from a model where the guest FPU is
loaded and stored with preemption disabled, to a model where the
qemu userspace FPU is swapped out for the guest FPU context for
the duration of the KVM_RUN ioctl.
This is done under the VCPU mutex, which is also taken when other
tasks inspect the VCPU FPU context, so the code should already be
safe for this change. That should come as no surprise, given that
s390 already has this optimization.
This can fix a bug where KVM calls get_user_pages while owning the
FPU, and the file system ends up requesting the FPU again:
[258270.527947] __warn+0xcb/0xf0
[258270.527948] warn_slowpath_null+0x1d/0x20
[258270.527951] kernel_fpu_disable+0x3f/0x50
[258270.527953] __kernel_fpu_begin+0x49/0x100
[258270.527955] kernel_fpu_begin+0xe/0x10
[258270.527958] crc32c_pcl_intel_update+0x84/0xb0
[258270.527961] crypto_shash_update+0x3f/0x110
[258270.527968] crc32c+0x63/0x8a [libcrc32c]
[258270.527975] dm_bm_checksum+0x1b/0x20 [dm_persistent_data]
[258270.527978] node_prepare_for_write+0x44/0x70 [dm_persistent_data]
[258270.527985] dm_block_manager_write_callback+0x41/0x50 [dm_persistent_data]
[258270.527988] submit_io+0x170/0x1b0 [dm_bufio]
[258270.527992] __write_dirty_buffer+0x89/0x90 [dm_bufio]
[258270.527994] __make_buffer_clean+0x4f/0x80 [dm_bufio]
[258270.527996] __try_evict_buffer+0x42/0x60 [dm_bufio]
[258270.527998] dm_bufio_shrink_scan+0xc0/0x130 [dm_bufio]
[258270.528002] shrink_slab.part.40+0x1f5/0x420
[258270.528004] shrink_node+0x22c/0x320
[258270.528006] do_try_to_free_pages+0xf5/0x330
[258270.528008] try_to_free_pages+0xe9/0x190
[258270.528009] __alloc_pages_slowpath+0x40f/0xba0
[258270.528011] __alloc_pages_nodemask+0x209/0x260
[258270.528014] alloc_pages_vma+0x1f1/0x250
[258270.528017] do_huge_pmd_anonymous_page+0x123/0x660
[258270.528021] handle_mm_fault+0xfd3/0x1330
[258270.528025] __get_user_pages+0x113/0x640
[258270.528027] get_user_pages+0x4f/0x60
[258270.528063] __gfn_to_pfn_memslot+0x120/0x3f0 [kvm]
[258270.528108] try_async_pf+0x66/0x230 [kvm]
[258270.528135] tdp_page_fault+0x130/0x280 [kvm]
[258270.528149] kvm_mmu_page_fault+0x60/0x120 [kvm]
[258270.528158] handle_ept_violation+0x91/0x170 [kvm_intel]
[258270.528162] vmx_handle_exit+0x1ca/0x1400 [kvm_intel]
No performance changes were detected in quick ping-pong tests on
my 4 socket system, which is expected since an FPU+xstate load is
on the order of 0.1us, while ping-ponging between CPUs is on the
order of 20us, and somewhat noisy.
Cc: stable(a)vger.kernel.org
Signed-off-by: Rik van Riel <riel(a)redhat.com>
Suggested-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
[Fixed a bug where reset_vcpu called put_fpu without preceding load_fpu,
which happened inside from KVM_CREATE_VCPU ioctl. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar(a)redhat.com>
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 977de5fb968b..62527e053ee4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -536,7 +536,20 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_page_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
+ /*
+ * QEMU userspace and the guest each have their own FPU state.
+ * In vcpu_run, we switch between the user and guest FPU contexts.
+ * While running a VCPU, the VCPU thread will have the guest FPU
+ * context.
+ *
+ * Note that while the PKRU state lives inside the fpu registers,
+ * it is switched out separately at VMENTER and VMEXIT time. The
+ * "guest_fpu" state here contains the guest FPU context, with the
+ * host PRKU bits.
+ */
+ struct fpu user_fpu;
struct fpu guest_fpu;
+
u64 xcr0;
u64 guest_supported_xcr0;
u32 guest_xstate_size;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eee8e7faf1af..c8da1680a7d6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2937,7 +2937,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
srcu_read_unlock(&vcpu->kvm->srcu, idx);
pagefault_enable();
kvm_x86_ops->vcpu_put(vcpu);
- kvm_put_guest_fpu(vcpu);
vcpu->arch.last_host_tsc = rdtsc();
}
@@ -5254,13 +5253,10 @@ static void emulator_halt(struct x86_emulate_ctxt *ctxt)
static void emulator_get_fpu(struct x86_emulate_ctxt *ctxt)
{
- preempt_disable();
- kvm_load_guest_fpu(emul_to_vcpu(ctxt));
}
static void emulator_put_fpu(struct x86_emulate_ctxt *ctxt)
{
- preempt_enable();
}
static int emulator_intercept(struct x86_emulate_ctxt *ctxt,
@@ -6952,7 +6948,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
preempt_disable();
kvm_x86_ops->prepare_guest_switch(vcpu);
- kvm_load_guest_fpu(vcpu);
/*
* Disable IRQs before setting IN_GUEST_MODE. Posted interrupt
@@ -7297,12 +7292,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
}
}
+ kvm_load_guest_fpu(vcpu);
+
if (unlikely(vcpu->arch.complete_userspace_io)) {
int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
vcpu->arch.complete_userspace_io = NULL;
r = cui(vcpu);
if (r <= 0)
- goto out;
+ goto out_fpu;
} else
WARN_ON(vcpu->arch.pio.count || vcpu->mmio_needed);
@@ -7311,6 +7308,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
else
r = vcpu_run(vcpu);
+out_fpu:
+ kvm_put_guest_fpu(vcpu);
out:
post_kvm_run_save(vcpu);
kvm_sigset_deactivate(vcpu);
@@ -7704,32 +7703,25 @@ static void fx_init(struct kvm_vcpu *vcpu)
vcpu->arch.cr0 |= X86_CR0_ET;
}
+/* Swap (qemu) user FPU context for the guest FPU context. */
void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
{
- if (vcpu->guest_fpu_loaded)
- return;
-
- /*
- * Restore all possible states in the guest,
- * and assume host would use all available bits.
- * Guest xcr0 would be loaded later.
- */
- vcpu->guest_fpu_loaded = 1;
- __kernel_fpu_begin();
+ preempt_disable();
+ copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
/* PKRU is separately restored in kvm_x86_ops->run. */
__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
~XFEATURE_MASK_PKRU);
+ preempt_enable();
trace_kvm_fpu(1);
}
+/* When vcpu_run ends, restore user space FPU context. */
void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
{
- if (!vcpu->guest_fpu_loaded)
- return;
-
- vcpu->guest_fpu_loaded = 0;
+ preempt_disable();
copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
- __kernel_fpu_end();
+ copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
+ preempt_enable();
++vcpu->stat.fpu_reload;
trace_kvm_fpu(0);
}
@@ -7846,7 +7838,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
* To avoid have the INIT path from kvm_apic_has_events() that be
* called with loaded FPU and does not let userspace fix the state.
*/
- kvm_put_guest_fpu(vcpu);
+ if (init_event)
+ kvm_put_guest_fpu(vcpu);
mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu.state.xsave,
XFEATURE_MASK_BNDREGS);
if (mpx_state_buffer)
@@ -7855,6 +7848,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
XFEATURE_MASK_BNDCSR);
if (mpx_state_buffer)
memset(mpx_state_buffer, 0, sizeof(struct mpx_bndcsr));
+ if (init_event)
+ kvm_load_guest_fpu(vcpu);
}
if (!init_event) {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 893d6d606cd0..6bdd4b9f6611 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -232,7 +232,7 @@ struct kvm_vcpu {
struct mutex mutex;
struct kvm_run *run;
- int guest_fpu_loaded, guest_xcr0_loaded;
+ int guest_xcr0_loaded;
struct swait_queue_head wq;
struct pid __rcu *pid;
int sigset_active;
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f775b13eedee2f7f3c6fdd4e90fb79090ce5d339 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel(a)redhat.com>
Date: Tue, 14 Nov 2017 16:54:23 -0500
Subject: [PATCH] x86,kvm: move qemu/guest FPU switching out to vcpu_run
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Currently, every time a VCPU is scheduled out, the host kernel will
first save the guest FPU/xstate context, then load the qemu userspace
FPU context, only to then immediately save the qemu userspace FPU
context back to memory. When scheduling in a VCPU, the same extraneous
FPU loads and saves are done.
This could be avoided by moving from a model where the guest FPU is
loaded and stored with preemption disabled, to a model where the
qemu userspace FPU is swapped out for the guest FPU context for
the duration of the KVM_RUN ioctl.
This is done under the VCPU mutex, which is also taken when other
tasks inspect the VCPU FPU context, so the code should already be
safe for this change. That should come as no surprise, given that
s390 already has this optimization.
This can fix a bug where KVM calls get_user_pages while owning the
FPU, and the file system ends up requesting the FPU again:
[258270.527947] __warn+0xcb/0xf0
[258270.527948] warn_slowpath_null+0x1d/0x20
[258270.527951] kernel_fpu_disable+0x3f/0x50
[258270.527953] __kernel_fpu_begin+0x49/0x100
[258270.527955] kernel_fpu_begin+0xe/0x10
[258270.527958] crc32c_pcl_intel_update+0x84/0xb0
[258270.527961] crypto_shash_update+0x3f/0x110
[258270.527968] crc32c+0x63/0x8a [libcrc32c]
[258270.527975] dm_bm_checksum+0x1b/0x20 [dm_persistent_data]
[258270.527978] node_prepare_for_write+0x44/0x70 [dm_persistent_data]
[258270.527985] dm_block_manager_write_callback+0x41/0x50 [dm_persistent_data]
[258270.527988] submit_io+0x170/0x1b0 [dm_bufio]
[258270.527992] __write_dirty_buffer+0x89/0x90 [dm_bufio]
[258270.527994] __make_buffer_clean+0x4f/0x80 [dm_bufio]
[258270.527996] __try_evict_buffer+0x42/0x60 [dm_bufio]
[258270.527998] dm_bufio_shrink_scan+0xc0/0x130 [dm_bufio]
[258270.528002] shrink_slab.part.40+0x1f5/0x420
[258270.528004] shrink_node+0x22c/0x320
[258270.528006] do_try_to_free_pages+0xf5/0x330
[258270.528008] try_to_free_pages+0xe9/0x190
[258270.528009] __alloc_pages_slowpath+0x40f/0xba0
[258270.528011] __alloc_pages_nodemask+0x209/0x260
[258270.528014] alloc_pages_vma+0x1f1/0x250
[258270.528017] do_huge_pmd_anonymous_page+0x123/0x660
[258270.528021] handle_mm_fault+0xfd3/0x1330
[258270.528025] __get_user_pages+0x113/0x640
[258270.528027] get_user_pages+0x4f/0x60
[258270.528063] __gfn_to_pfn_memslot+0x120/0x3f0 [kvm]
[258270.528108] try_async_pf+0x66/0x230 [kvm]
[258270.528135] tdp_page_fault+0x130/0x280 [kvm]
[258270.528149] kvm_mmu_page_fault+0x60/0x120 [kvm]
[258270.528158] handle_ept_violation+0x91/0x170 [kvm_intel]
[258270.528162] vmx_handle_exit+0x1ca/0x1400 [kvm_intel]
No performance changes were detected in quick ping-pong tests on
my 4 socket system, which is expected since an FPU+xstate load is
on the order of 0.1us, while ping-ponging between CPUs is on the
order of 20us, and somewhat noisy.
Cc: stable(a)vger.kernel.org
Signed-off-by: Rik van Riel <riel(a)redhat.com>
Suggested-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
[Fixed a bug where reset_vcpu called put_fpu without preceding load_fpu,
which happened inside from KVM_CREATE_VCPU ioctl. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar(a)redhat.com>
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 977de5fb968b..62527e053ee4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -536,7 +536,20 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_page_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
+ /*
+ * QEMU userspace and the guest each have their own FPU state.
+ * In vcpu_run, we switch between the user and guest FPU contexts.
+ * While running a VCPU, the VCPU thread will have the guest FPU
+ * context.
+ *
+ * Note that while the PKRU state lives inside the fpu registers,
+ * it is switched out separately at VMENTER and VMEXIT time. The
+ * "guest_fpu" state here contains the guest FPU context, with the
+ * host PRKU bits.
+ */
+ struct fpu user_fpu;
struct fpu guest_fpu;
+
u64 xcr0;
u64 guest_supported_xcr0;
u32 guest_xstate_size;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eee8e7faf1af..c8da1680a7d6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2937,7 +2937,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
srcu_read_unlock(&vcpu->kvm->srcu, idx);
pagefault_enable();
kvm_x86_ops->vcpu_put(vcpu);
- kvm_put_guest_fpu(vcpu);
vcpu->arch.last_host_tsc = rdtsc();
}
@@ -5254,13 +5253,10 @@ static void emulator_halt(struct x86_emulate_ctxt *ctxt)
static void emulator_get_fpu(struct x86_emulate_ctxt *ctxt)
{
- preempt_disable();
- kvm_load_guest_fpu(emul_to_vcpu(ctxt));
}
static void emulator_put_fpu(struct x86_emulate_ctxt *ctxt)
{
- preempt_enable();
}
static int emulator_intercept(struct x86_emulate_ctxt *ctxt,
@@ -6952,7 +6948,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
preempt_disable();
kvm_x86_ops->prepare_guest_switch(vcpu);
- kvm_load_guest_fpu(vcpu);
/*
* Disable IRQs before setting IN_GUEST_MODE. Posted interrupt
@@ -7297,12 +7292,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
}
}
+ kvm_load_guest_fpu(vcpu);
+
if (unlikely(vcpu->arch.complete_userspace_io)) {
int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
vcpu->arch.complete_userspace_io = NULL;
r = cui(vcpu);
if (r <= 0)
- goto out;
+ goto out_fpu;
} else
WARN_ON(vcpu->arch.pio.count || vcpu->mmio_needed);
@@ -7311,6 +7308,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
else
r = vcpu_run(vcpu);
+out_fpu:
+ kvm_put_guest_fpu(vcpu);
out:
post_kvm_run_save(vcpu);
kvm_sigset_deactivate(vcpu);
@@ -7704,32 +7703,25 @@ static void fx_init(struct kvm_vcpu *vcpu)
vcpu->arch.cr0 |= X86_CR0_ET;
}
+/* Swap (qemu) user FPU context for the guest FPU context. */
void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
{
- if (vcpu->guest_fpu_loaded)
- return;
-
- /*
- * Restore all possible states in the guest,
- * and assume host would use all available bits.
- * Guest xcr0 would be loaded later.
- */
- vcpu->guest_fpu_loaded = 1;
- __kernel_fpu_begin();
+ preempt_disable();
+ copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
/* PKRU is separately restored in kvm_x86_ops->run. */
__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
~XFEATURE_MASK_PKRU);
+ preempt_enable();
trace_kvm_fpu(1);
}
+/* When vcpu_run ends, restore user space FPU context. */
void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
{
- if (!vcpu->guest_fpu_loaded)
- return;
-
- vcpu->guest_fpu_loaded = 0;
+ preempt_disable();
copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
- __kernel_fpu_end();
+ copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
+ preempt_enable();
++vcpu->stat.fpu_reload;
trace_kvm_fpu(0);
}
@@ -7846,7 +7838,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
* To avoid have the INIT path from kvm_apic_has_events() that be
* called with loaded FPU and does not let userspace fix the state.
*/
- kvm_put_guest_fpu(vcpu);
+ if (init_event)
+ kvm_put_guest_fpu(vcpu);
mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu.state.xsave,
XFEATURE_MASK_BNDREGS);
if (mpx_state_buffer)
@@ -7855,6 +7848,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
XFEATURE_MASK_BNDCSR);
if (mpx_state_buffer)
memset(mpx_state_buffer, 0, sizeof(struct mpx_bndcsr));
+ if (init_event)
+ kvm_load_guest_fpu(vcpu);
}
if (!init_event) {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 893d6d606cd0..6bdd4b9f6611 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -232,7 +232,7 @@ struct kvm_vcpu {
struct mutex mutex;
struct kvm_run *run;
- int guest_fpu_loaded, guest_xcr0_loaded;
+ int guest_xcr0_loaded;
struct swait_queue_head wq;
struct pid __rcu *pid;
int sigset_active;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f775b13eedee2f7f3c6fdd4e90fb79090ce5d339 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel(a)redhat.com>
Date: Tue, 14 Nov 2017 16:54:23 -0500
Subject: [PATCH] x86,kvm: move qemu/guest FPU switching out to vcpu_run
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Currently, every time a VCPU is scheduled out, the host kernel will
first save the guest FPU/xstate context, then load the qemu userspace
FPU context, only to then immediately save the qemu userspace FPU
context back to memory. When scheduling in a VCPU, the same extraneous
FPU loads and saves are done.
This could be avoided by moving from a model where the guest FPU is
loaded and stored with preemption disabled, to a model where the
qemu userspace FPU is swapped out for the guest FPU context for
the duration of the KVM_RUN ioctl.
This is done under the VCPU mutex, which is also taken when other
tasks inspect the VCPU FPU context, so the code should already be
safe for this change. That should come as no surprise, given that
s390 already has this optimization.
This can fix a bug where KVM calls get_user_pages while owning the
FPU, and the file system ends up requesting the FPU again:
[258270.527947] __warn+0xcb/0xf0
[258270.527948] warn_slowpath_null+0x1d/0x20
[258270.527951] kernel_fpu_disable+0x3f/0x50
[258270.527953] __kernel_fpu_begin+0x49/0x100
[258270.527955] kernel_fpu_begin+0xe/0x10
[258270.527958] crc32c_pcl_intel_update+0x84/0xb0
[258270.527961] crypto_shash_update+0x3f/0x110
[258270.527968] crc32c+0x63/0x8a [libcrc32c]
[258270.527975] dm_bm_checksum+0x1b/0x20 [dm_persistent_data]
[258270.527978] node_prepare_for_write+0x44/0x70 [dm_persistent_data]
[258270.527985] dm_block_manager_write_callback+0x41/0x50 [dm_persistent_data]
[258270.527988] submit_io+0x170/0x1b0 [dm_bufio]
[258270.527992] __write_dirty_buffer+0x89/0x90 [dm_bufio]
[258270.527994] __make_buffer_clean+0x4f/0x80 [dm_bufio]
[258270.527996] __try_evict_buffer+0x42/0x60 [dm_bufio]
[258270.527998] dm_bufio_shrink_scan+0xc0/0x130 [dm_bufio]
[258270.528002] shrink_slab.part.40+0x1f5/0x420
[258270.528004] shrink_node+0x22c/0x320
[258270.528006] do_try_to_free_pages+0xf5/0x330
[258270.528008] try_to_free_pages+0xe9/0x190
[258270.528009] __alloc_pages_slowpath+0x40f/0xba0
[258270.528011] __alloc_pages_nodemask+0x209/0x260
[258270.528014] alloc_pages_vma+0x1f1/0x250
[258270.528017] do_huge_pmd_anonymous_page+0x123/0x660
[258270.528021] handle_mm_fault+0xfd3/0x1330
[258270.528025] __get_user_pages+0x113/0x640
[258270.528027] get_user_pages+0x4f/0x60
[258270.528063] __gfn_to_pfn_memslot+0x120/0x3f0 [kvm]
[258270.528108] try_async_pf+0x66/0x230 [kvm]
[258270.528135] tdp_page_fault+0x130/0x280 [kvm]
[258270.528149] kvm_mmu_page_fault+0x60/0x120 [kvm]
[258270.528158] handle_ept_violation+0x91/0x170 [kvm_intel]
[258270.528162] vmx_handle_exit+0x1ca/0x1400 [kvm_intel]
No performance changes were detected in quick ping-pong tests on
my 4 socket system, which is expected since an FPU+xstate load is
on the order of 0.1us, while ping-ponging between CPUs is on the
order of 20us, and somewhat noisy.
Cc: stable(a)vger.kernel.org
Signed-off-by: Rik van Riel <riel(a)redhat.com>
Suggested-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
[Fixed a bug where reset_vcpu called put_fpu without preceding load_fpu,
which happened inside from KVM_CREATE_VCPU ioctl. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar(a)redhat.com>
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 977de5fb968b..62527e053ee4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -536,7 +536,20 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_page_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
+ /*
+ * QEMU userspace and the guest each have their own FPU state.
+ * In vcpu_run, we switch between the user and guest FPU contexts.
+ * While running a VCPU, the VCPU thread will have the guest FPU
+ * context.
+ *
+ * Note that while the PKRU state lives inside the fpu registers,
+ * it is switched out separately at VMENTER and VMEXIT time. The
+ * "guest_fpu" state here contains the guest FPU context, with the
+ * host PRKU bits.
+ */
+ struct fpu user_fpu;
struct fpu guest_fpu;
+
u64 xcr0;
u64 guest_supported_xcr0;
u32 guest_xstate_size;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eee8e7faf1af..c8da1680a7d6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2937,7 +2937,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
srcu_read_unlock(&vcpu->kvm->srcu, idx);
pagefault_enable();
kvm_x86_ops->vcpu_put(vcpu);
- kvm_put_guest_fpu(vcpu);
vcpu->arch.last_host_tsc = rdtsc();
}
@@ -5254,13 +5253,10 @@ static void emulator_halt(struct x86_emulate_ctxt *ctxt)
static void emulator_get_fpu(struct x86_emulate_ctxt *ctxt)
{
- preempt_disable();
- kvm_load_guest_fpu(emul_to_vcpu(ctxt));
}
static void emulator_put_fpu(struct x86_emulate_ctxt *ctxt)
{
- preempt_enable();
}
static int emulator_intercept(struct x86_emulate_ctxt *ctxt,
@@ -6952,7 +6948,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
preempt_disable();
kvm_x86_ops->prepare_guest_switch(vcpu);
- kvm_load_guest_fpu(vcpu);
/*
* Disable IRQs before setting IN_GUEST_MODE. Posted interrupt
@@ -7297,12 +7292,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
}
}
+ kvm_load_guest_fpu(vcpu);
+
if (unlikely(vcpu->arch.complete_userspace_io)) {
int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
vcpu->arch.complete_userspace_io = NULL;
r = cui(vcpu);
if (r <= 0)
- goto out;
+ goto out_fpu;
} else
WARN_ON(vcpu->arch.pio.count || vcpu->mmio_needed);
@@ -7311,6 +7308,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
else
r = vcpu_run(vcpu);
+out_fpu:
+ kvm_put_guest_fpu(vcpu);
out:
post_kvm_run_save(vcpu);
kvm_sigset_deactivate(vcpu);
@@ -7704,32 +7703,25 @@ static void fx_init(struct kvm_vcpu *vcpu)
vcpu->arch.cr0 |= X86_CR0_ET;
}
+/* Swap (qemu) user FPU context for the guest FPU context. */
void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
{
- if (vcpu->guest_fpu_loaded)
- return;
-
- /*
- * Restore all possible states in the guest,
- * and assume host would use all available bits.
- * Guest xcr0 would be loaded later.
- */
- vcpu->guest_fpu_loaded = 1;
- __kernel_fpu_begin();
+ preempt_disable();
+ copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
/* PKRU is separately restored in kvm_x86_ops->run. */
__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
~XFEATURE_MASK_PKRU);
+ preempt_enable();
trace_kvm_fpu(1);
}
+/* When vcpu_run ends, restore user space FPU context. */
void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
{
- if (!vcpu->guest_fpu_loaded)
- return;
-
- vcpu->guest_fpu_loaded = 0;
+ preempt_disable();
copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
- __kernel_fpu_end();
+ copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
+ preempt_enable();
++vcpu->stat.fpu_reload;
trace_kvm_fpu(0);
}
@@ -7846,7 +7838,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
* To avoid have the INIT path from kvm_apic_has_events() that be
* called with loaded FPU and does not let userspace fix the state.
*/
- kvm_put_guest_fpu(vcpu);
+ if (init_event)
+ kvm_put_guest_fpu(vcpu);
mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu.state.xsave,
XFEATURE_MASK_BNDREGS);
if (mpx_state_buffer)
@@ -7855,6 +7848,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
XFEATURE_MASK_BNDCSR);
if (mpx_state_buffer)
memset(mpx_state_buffer, 0, sizeof(struct mpx_bndcsr));
+ if (init_event)
+ kvm_load_guest_fpu(vcpu);
}
if (!init_event) {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 893d6d606cd0..6bdd4b9f6611 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -232,7 +232,7 @@ struct kvm_vcpu {
struct mutex mutex;
struct kvm_run *run;
- int guest_fpu_loaded, guest_xcr0_loaded;
+ int guest_xcr0_loaded;
struct swait_queue_head wq;
struct pid __rcu *pid;
int sigset_active;
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4dca6ea1d9432052afb06baf2e3ae78188a4410b Mon Sep 17 00:00:00 2001
From: Eric Biggers <ebiggers(a)google.com>
Date: Fri, 8 Dec 2017 15:13:27 +0000
Subject: [PATCH] KEYS: add missing permission check for request_key()
destination
When the request_key() syscall is not passed a destination keyring, it
links the requested key (if constructed) into the "default" request-key
keyring. This should require Write permission to the keyring. However,
there is actually no permission check.
This can be abused to add keys to any keyring to which only Search
permission is granted. This is because Search permission allows joining
the keyring. keyctl_set_reqkey_keyring(KEY_REQKEY_DEFL_SESSION_KEYRING)
then will set the default request-key keyring to the session keyring.
Then, request_key() can be used to add keys to the keyring.
Both negatively and positively instantiated keys can be added using this
method. Adding negative keys is trivial. Adding a positive key is a
bit trickier. It requires that either /sbin/request-key positively
instantiates the key, or that another thread adds the key to the process
keyring at just the right time, such that request_key() misses it
initially but then finds it in construct_alloc_key().
Fix this bug by checking for Write permission to the keyring in
construct_get_dest_keyring() when the default keyring is being used.
We don't do the permission check for non-default keyrings because that
was already done by the earlier call to lookup_user_key(). Also,
request_key_and_link() is currently passed a 'struct key *' rather than
a key_ref_t, so the "possessed" bit is unavailable.
We also don't do the permission check for the "requestor keyring", to
continue to support the use case described by commit 8bbf4976b59f
("KEYS: Alter use of key instantiation link-to-keyring argument") where
/sbin/request-key recursively calls request_key() to add keys to the
original requestor's destination keyring. (I don't know of any users
who actually do that, though...)
Fixes: 3e30148c3d52 ("[PATCH] Keys: Make request-key create an authorisation key")
Cc: <stable(a)vger.kernel.org> # v2.6.13+
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
Signed-off-by: David Howells <dhowells(a)redhat.com>
diff --git a/security/keys/request_key.c b/security/keys/request_key.c
index c6880af8b411..114f7408feee 100644
--- a/security/keys/request_key.c
+++ b/security/keys/request_key.c
@@ -251,11 +251,12 @@ static int construct_key(struct key *key, const void *callout_info,
* The keyring selected is returned with an extra reference upon it which the
* caller must release.
*/
-static void construct_get_dest_keyring(struct key **_dest_keyring)
+static int construct_get_dest_keyring(struct key **_dest_keyring)
{
struct request_key_auth *rka;
const struct cred *cred = current_cred();
struct key *dest_keyring = *_dest_keyring, *authkey;
+ int ret;
kenter("%p", dest_keyring);
@@ -264,6 +265,8 @@ static void construct_get_dest_keyring(struct key **_dest_keyring)
/* the caller supplied one */
key_get(dest_keyring);
} else {
+ bool do_perm_check = true;
+
/* use a default keyring; falling through the cases until we
* find one that we actually have */
switch (cred->jit_keyring) {
@@ -278,8 +281,10 @@ static void construct_get_dest_keyring(struct key **_dest_keyring)
dest_keyring =
key_get(rka->dest_keyring);
up_read(&authkey->sem);
- if (dest_keyring)
+ if (dest_keyring) {
+ do_perm_check = false;
break;
+ }
}
case KEY_REQKEY_DEFL_THREAD_KEYRING:
@@ -314,11 +319,29 @@ static void construct_get_dest_keyring(struct key **_dest_keyring)
default:
BUG();
}
+
+ /*
+ * Require Write permission on the keyring. This is essential
+ * because the default keyring may be the session keyring, and
+ * joining a keyring only requires Search permission.
+ *
+ * However, this check is skipped for the "requestor keyring" so
+ * that /sbin/request-key can itself use request_key() to add
+ * keys to the original requestor's destination keyring.
+ */
+ if (dest_keyring && do_perm_check) {
+ ret = key_permission(make_key_ref(dest_keyring, 1),
+ KEY_NEED_WRITE);
+ if (ret) {
+ key_put(dest_keyring);
+ return ret;
+ }
+ }
}
*_dest_keyring = dest_keyring;
kleave(" [dk %d]", key_serial(dest_keyring));
- return;
+ return 0;
}
/*
@@ -444,11 +467,15 @@ static struct key *construct_key_and_link(struct keyring_search_context *ctx,
if (ctx->index_key.type == &key_type_keyring)
return ERR_PTR(-EPERM);
- user = key_user_lookup(current_fsuid());
- if (!user)
- return ERR_PTR(-ENOMEM);
+ ret = construct_get_dest_keyring(&dest_keyring);
+ if (ret)
+ goto error;
- construct_get_dest_keyring(&dest_keyring);
+ user = key_user_lookup(current_fsuid());
+ if (!user) {
+ ret = -ENOMEM;
+ goto error_put_dest_keyring;
+ }
ret = construct_alloc_key(ctx, dest_keyring, flags, user, &key);
key_user_put(user);
@@ -463,7 +490,7 @@ static struct key *construct_key_and_link(struct keyring_search_context *ctx,
} else if (ret == -EINPROGRESS) {
ret = 0;
} else {
- goto couldnt_alloc_key;
+ goto error_put_dest_keyring;
}
key_put(dest_keyring);
@@ -473,8 +500,9 @@ static struct key *construct_key_and_link(struct keyring_search_context *ctx,
construction_failed:
key_negate_and_link(key, key_negative_timeout, NULL, NULL);
key_put(key);
-couldnt_alloc_key:
+error_put_dest_keyring:
key_put(dest_keyring);
+error:
kleave(" = %d", ret);
return ERR_PTR(ret);
}
The patch below does not apply to the 3.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e0058f3a874ebb48b25be7ff79bc3b4e59929f90 Mon Sep 17 00:00:00 2001
From: Eric Biggers <ebiggers(a)google.com>
Date: Fri, 8 Dec 2017 15:13:27 +0000
Subject: [PATCH] ASN.1: fix out-of-bounds read when parsing indefinite length
item
In asn1_ber_decoder(), indefinitely-sized ASN.1 items were being passed
to the action functions before their lengths had been computed, using
the bogus length of 0x80 (ASN1_INDEFINITE_LENGTH). This resulted in
reading data past the end of the input buffer, when given a specially
crafted message.
Fix it by rearranging the code so that the indefinite length is resolved
before the action is called.
This bug was originally found by fuzzing the X.509 parser in userspace
using libFuzzer from the LLVM project.
KASAN report (cleaned up slightly):
BUG: KASAN: slab-out-of-bounds in memcpy ./include/linux/string.h:341 [inline]
BUG: KASAN: slab-out-of-bounds in x509_fabricate_name.constprop.1+0x1a4/0x940 crypto/asymmetric_keys/x509_cert_parser.c:366
Read of size 128 at addr ffff880035dd9eaf by task keyctl/195
CPU: 1 PID: 195 Comm: keyctl Not tainted 4.14.0-09238-g1d3b78bbc6e9 #26
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0xd1/0x175 lib/dump_stack.c:53
print_address_description+0x78/0x260 mm/kasan/report.c:252
kasan_report_error mm/kasan/report.c:351 [inline]
kasan_report+0x23f/0x350 mm/kasan/report.c:409
memcpy+0x1f/0x50 mm/kasan/kasan.c:302
memcpy ./include/linux/string.h:341 [inline]
x509_fabricate_name.constprop.1+0x1a4/0x940 crypto/asymmetric_keys/x509_cert_parser.c:366
asn1_ber_decoder+0xb4a/0x1fd0 lib/asn1_decoder.c:447
x509_cert_parse+0x1c7/0x620 crypto/asymmetric_keys/x509_cert_parser.c:89
x509_key_preparse+0x61/0x750 crypto/asymmetric_keys/x509_public_key.c:174
asymmetric_key_preparse+0xa4/0x150 crypto/asymmetric_keys/asymmetric_type.c:388
key_create_or_update+0x4d4/0x10a0 security/keys/key.c:850
SYSC_add_key security/keys/keyctl.c:122 [inline]
SyS_add_key+0xe8/0x290 security/keys/keyctl.c:62
entry_SYSCALL_64_fastpath+0x1f/0x96
Allocated by task 195:
__do_kmalloc_node mm/slab.c:3675 [inline]
__kmalloc_node+0x47/0x60 mm/slab.c:3682
kvmalloc ./include/linux/mm.h:540 [inline]
SYSC_add_key security/keys/keyctl.c:104 [inline]
SyS_add_key+0x19e/0x290 security/keys/keyctl.c:62
entry_SYSCALL_64_fastpath+0x1f/0x96
Fixes: 42d5ec27f873 ("X.509: Add an ASN.1 decoder")
Reported-by: Alexander Potapenko <glider(a)google.com>
Cc: <stable(a)vger.kernel.org> # v3.7+
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
Signed-off-by: David Howells <dhowells(a)redhat.com>
diff --git a/lib/asn1_decoder.c b/lib/asn1_decoder.c
index 1ef0cec38d78..d77cdfc4b554 100644
--- a/lib/asn1_decoder.c
+++ b/lib/asn1_decoder.c
@@ -313,42 +313,47 @@ int asn1_ber_decoder(const struct asn1_decoder *decoder,
/* Decide how to handle the operation */
switch (op) {
- case ASN1_OP_MATCH_ANY_ACT:
- case ASN1_OP_MATCH_ANY_ACT_OR_SKIP:
- case ASN1_OP_COND_MATCH_ANY_ACT:
- case ASN1_OP_COND_MATCH_ANY_ACT_OR_SKIP:
- ret = actions[machine[pc + 1]](context, hdr, tag, data + dp, len);
- if (ret < 0)
- return ret;
- goto skip_data;
-
- case ASN1_OP_MATCH_ACT:
- case ASN1_OP_MATCH_ACT_OR_SKIP:
- case ASN1_OP_COND_MATCH_ACT_OR_SKIP:
- ret = actions[machine[pc + 2]](context, hdr, tag, data + dp, len);
- if (ret < 0)
- return ret;
- goto skip_data;
-
case ASN1_OP_MATCH:
case ASN1_OP_MATCH_OR_SKIP:
+ case ASN1_OP_MATCH_ACT:
+ case ASN1_OP_MATCH_ACT_OR_SKIP:
case ASN1_OP_MATCH_ANY:
case ASN1_OP_MATCH_ANY_OR_SKIP:
+ case ASN1_OP_MATCH_ANY_ACT:
+ case ASN1_OP_MATCH_ANY_ACT_OR_SKIP:
case ASN1_OP_COND_MATCH_OR_SKIP:
+ case ASN1_OP_COND_MATCH_ACT_OR_SKIP:
case ASN1_OP_COND_MATCH_ANY:
case ASN1_OP_COND_MATCH_ANY_OR_SKIP:
- skip_data:
+ case ASN1_OP_COND_MATCH_ANY_ACT:
+ case ASN1_OP_COND_MATCH_ANY_ACT_OR_SKIP:
+
if (!(flags & FLAG_CONS)) {
if (flags & FLAG_INDEFINITE_LENGTH) {
+ size_t tmp = dp;
+
ret = asn1_find_indefinite_length(
- data, datalen, &dp, &len, &errmsg);
+ data, datalen, &tmp, &len, &errmsg);
if (ret < 0)
goto error;
- } else {
- dp += len;
}
pr_debug("- LEAF: %zu\n", len);
}
+
+ if (op & ASN1_OP_MATCH__ACT) {
+ unsigned char act;
+
+ if (op & ASN1_OP_MATCH__ANY)
+ act = machine[pc + 1];
+ else
+ act = machine[pc + 2];
+ ret = actions[act](context, hdr, tag, data + dp, len);
+ if (ret < 0)
+ return ret;
+ }
+
+ if (!(flags & FLAG_CONS))
+ dp += len;
pc += asn1_op_lengths[op];
goto next_op;