KVM currently updates PC (and the corresponding exception state) using a two phase approach: first by setting a set of flags, then by converting these flags into a state update when the vcpu is about to enter the guest.
However, this creates a disconnect with userspace if the vcpu thread returns there with any exception/PC flag set. In this case, the exposed context is wrong, as userpsace doesn't have access to these flags (they aren't architectural). It also means that these flags are preserved across a reset, which isn't expected.
To solve this problem, force an explicit synchronisation of the exception state on vcpu exit to userspace. As an optimisation for nVHE systems, only perform this when there is something pending.
Reported-by: Zenghui Yu yuzenghui@huawei.com Signed-off-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org # 5.11 --- arch/arm64/include/asm/kvm_asm.h | 1 + arch/arm64/kvm/arm.c | 10 ++++++++++ arch/arm64/kvm/hyp/exception.c | 4 ++-- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 8 ++++++++ 4 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index d5b11037401d..5e9b33cbac51 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -63,6 +63,7 @@ #define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18 #define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19 #define __KVM_HOST_SMCCC_FUNC___pkvm_mark_hyp 20 +#define __KVM_HOST_SMCCC_FUNC___kvm_adjust_pc 21
#ifndef __ASSEMBLY__
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 1cb39c0803a4..d62a7041ebd1 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -897,6 +897,16 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
kvm_sigset_deactivate(vcpu);
+ /* + * In the unlikely event that we are returning to userspace + * with pending exceptions or PC adjustment, commit these + * adjustments in order to give userspace a consistent view of + * the vcpu state. + */ + if (unlikely(vcpu->arch.flags & (KVM_ARM64_PENDING_EXCEPTION | + KVM_ARM64_EXCEPT_MASK))) + kvm_call_hyp(__kvm_adjust_pc, vcpu); + vcpu_put(vcpu); return ret; } diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c index 0812a496725f..11541b94b328 100644 --- a/arch/arm64/kvm/hyp/exception.c +++ b/arch/arm64/kvm/hyp/exception.c @@ -331,8 +331,8 @@ static void kvm_inject_exception(struct kvm_vcpu *vcpu) }
/* - * Adjust the guest PC on entry, depending on flags provided by EL1 - * for the purpose of emulation (MMIO, sysreg) or exception injection. + * Adjust the guest PC (and potentially exception state) depending on + * flags provided by the emulation code. */ void __kvm_adjust_pc(struct kvm_vcpu *vcpu) { diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index f36420a80474..1632f001f4ed 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -28,6 +28,13 @@ static void handle___kvm_vcpu_run(struct kvm_cpu_context *host_ctxt) cpu_reg(host_ctxt, 1) = __kvm_vcpu_run(kern_hyp_va(vcpu)); }
+static void handle___kvm_adjust_pc(struct kvm_cpu_context *host_ctxt) +{ + DECLARE_REG(struct kvm_vcpu *, vcpu, host_ctxt, 1); + + __kvm_adjust_pc(kern_hyp_va(vcpu)); +} + static void handle___kvm_flush_vm_context(struct kvm_cpu_context *host_ctxt) { __kvm_flush_vm_context(); @@ -170,6 +177,7 @@ typedef void (*hcall_t)(struct kvm_cpu_context *);
static const hcall_t host_hcall[] = { HANDLE_FUNC(__kvm_vcpu_run), + HANDLE_FUNC(__kvm_adjust_pc), HANDLE_FUNC(__kvm_flush_vm_context), HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa), HANDLE_FUNC(__kvm_tlb_flush_vmid),
Hi Marc,
On 5/10/21 10:49 AM, Marc Zyngier wrote:
KVM currently updates PC (and the corresponding exception state) using a two phase approach: first by setting a set of flags, then by converting these flags into a state update when the vcpu is about to enter the guest.
However, this creates a disconnect with userspace if the vcpu thread returns there with any exception/PC flag set. In this case, the exposed
The code seems to handle only the KVM_ARM64_PENDING_EXCEPTION flag. Is the "PC flag" a reference to the KVM_ARM64_INCREMENT_PC flag?
context is wrong, as userpsace doesn't have access to these flags
s/userpsace/userspace
(they aren't architectural). It also means that these flags are preserved across a reset, which isn't expected.
To solve this problem, force an explicit synchronisation of the exception state on vcpu exit to userspace. As an optimisation for nVHE systems, only perform this when there is something pending.
Reported-by: Zenghui Yu yuzenghui@huawei.com Signed-off-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org # 5.11
arch/arm64/include/asm/kvm_asm.h | 1 + arch/arm64/kvm/arm.c | 10 ++++++++++ arch/arm64/kvm/hyp/exception.c | 4 ++-- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 8 ++++++++ 4 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index d5b11037401d..5e9b33cbac51 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -63,6 +63,7 @@ #define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18 #define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19 #define __KVM_HOST_SMCCC_FUNC___pkvm_mark_hyp 20 +#define __KVM_HOST_SMCCC_FUNC___kvm_adjust_pc 21 #ifndef __ASSEMBLY__ diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 1cb39c0803a4..d62a7041ebd1 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -897,6 +897,16 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) kvm_sigset_deactivate(vcpu);
- /*
* In the unlikely event that we are returning to userspace
* with pending exceptions or PC adjustment, commit these
I'm going to assume "PC adjustment" means the KVM_ARM64_INCREMENT_PC flag. Please correct me if that's not true, but if that's the case, then the flag isn't handled below.
* adjustments in order to give userspace a consistent view of
* the vcpu state.
*/
- if (unlikely(vcpu->arch.flags & (KVM_ARM64_PENDING_EXCEPTION |
KVM_ARM64_EXCEPT_MASK)))
The condition seems to suggest that it is valid to set KVM_ARM64_EXCEPT_{AA32,AA64}_* without setting KVM_ARM64_PENDING_EXCEPTION, which looks rather odd to me. Is that a valid use of the KVM_ARM64_EXCEPT_MASK bits? If it's not (the existing code always sets the exception type with the KVM_ARM64_PENDING_EXCEPTION), that I was thinking that checking only the KVM_ARM64_PENDING_EXCEPTION flag would make the intention clearer.
Thanks,
Alex
kvm_call_hyp(__kvm_adjust_pc, vcpu);
- vcpu_put(vcpu); return ret;
} diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c index 0812a496725f..11541b94b328 100644 --- a/arch/arm64/kvm/hyp/exception.c +++ b/arch/arm64/kvm/hyp/exception.c @@ -331,8 +331,8 @@ static void kvm_inject_exception(struct kvm_vcpu *vcpu) } /*
- Adjust the guest PC on entry, depending on flags provided by EL1
- for the purpose of emulation (MMIO, sysreg) or exception injection.
- Adjust the guest PC (and potentially exception state) depending on
*/
- flags provided by the emulation code.
void __kvm_adjust_pc(struct kvm_vcpu *vcpu) { diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index f36420a80474..1632f001f4ed 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -28,6 +28,13 @@ static void handle___kvm_vcpu_run(struct kvm_cpu_context *host_ctxt) cpu_reg(host_ctxt, 1) = __kvm_vcpu_run(kern_hyp_va(vcpu)); } +static void handle___kvm_adjust_pc(struct kvm_cpu_context *host_ctxt) +{
- DECLARE_REG(struct kvm_vcpu *, vcpu, host_ctxt, 1);
- __kvm_adjust_pc(kern_hyp_va(vcpu));
+}
static void handle___kvm_flush_vm_context(struct kvm_cpu_context *host_ctxt) { __kvm_flush_vm_context(); @@ -170,6 +177,7 @@ typedef void (*hcall_t)(struct kvm_cpu_context *); static const hcall_t host_hcall[] = { HANDLE_FUNC(__kvm_vcpu_run),
- HANDLE_FUNC(__kvm_adjust_pc), HANDLE_FUNC(__kvm_flush_vm_context), HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa), HANDLE_FUNC(__kvm_tlb_flush_vmid),
On Mon, 10 May 2021 15:55:28 +0100, Alexandru Elisei alexandru.elisei@arm.com wrote:
Hi Marc,
On 5/10/21 10:49 AM, Marc Zyngier wrote:
KVM currently updates PC (and the corresponding exception state) using a two phase approach: first by setting a set of flags, then by converting these flags into a state update when the vcpu is about to enter the guest.
However, this creates a disconnect with userspace if the vcpu thread returns there with any exception/PC flag set. In this case, the exposed
The code seems to handle only the KVM_ARM64_PENDING_EXCEPTION flag. Is the "PC flag" a reference to the KVM_ARM64_INCREMENT_PC flag?
No, it does handle both exception and PC increment, unless I have completely bodged something (entirely possible).
context is wrong, as userpsace doesn't have access to these flags
s/userpsace/userspace
(they aren't architectural). It also means that these flags are preserved across a reset, which isn't expected.
To solve this problem, force an explicit synchronisation of the exception state on vcpu exit to userspace. As an optimisation for nVHE systems, only perform this when there is something pending.
Reported-by: Zenghui Yu yuzenghui@huawei.com Signed-off-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org # 5.11
arch/arm64/include/asm/kvm_asm.h | 1 + arch/arm64/kvm/arm.c | 10 ++++++++++ arch/arm64/kvm/hyp/exception.c | 4 ++-- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 8 ++++++++ 4 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index d5b11037401d..5e9b33cbac51 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -63,6 +63,7 @@ #define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18 #define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19 #define __KVM_HOST_SMCCC_FUNC___pkvm_mark_hyp 20 +#define __KVM_HOST_SMCCC_FUNC___kvm_adjust_pc 21 #ifndef __ASSEMBLY__ diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 1cb39c0803a4..d62a7041ebd1 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -897,6 +897,16 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) kvm_sigset_deactivate(vcpu);
- /*
* In the unlikely event that we are returning to userspace
* with pending exceptions or PC adjustment, commit these
I'm going to assume "PC adjustment" means the KVM_ARM64_INCREMENT_PC flag. Please correct me if that's not true, but if that's the case, then the flag isn't handled below.
* adjustments in order to give userspace a consistent view of
* the vcpu state.
*/
- if (unlikely(vcpu->arch.flags & (KVM_ARM64_PENDING_EXCEPTION |
KVM_ARM64_EXCEPT_MASK)))
The condition seems to suggest that it is valid to set KVM_ARM64_EXCEPT_{AA32,AA64}_* without setting KVM_ARM64_PENDING_EXCEPTION, which looks rather odd to me. Is that a valid use of the KVM_ARM64_EXCEPT_MASK bits? If it's not (the existing code always sets the exception type with the KVM_ARM64_PENDING_EXCEPTION), that I was thinking that checking only the KVM_ARM64_PENDING_EXCEPTION flag would make the intention clearer.
No, you are missing this (subtle) comment in kvm_host.h:
<quote> /* * Overlaps with KVM_ARM64_EXCEPT_MASK on purpose so that it can't be * set together with an exception... */ #define KVM_ARM64_INCREMENT_PC (1 << 9) /* Increment PC */ </quote>
So (KVM_ARM64_PENDING_EXCEPTION | KVM_ARM64_EXCEPT_MASK) checks for *both* an exception and a PC increment.
Thanks,
M.
Hi Marc,
On 5/10/21 4:04 PM, Marc Zyngier wrote:
On Mon, 10 May 2021 15:55:28 +0100, Alexandru Elisei alexandru.elisei@arm.com wrote:
Hi Marc,
On 5/10/21 10:49 AM, Marc Zyngier wrote:
KVM currently updates PC (and the corresponding exception state) using a two phase approach: first by setting a set of flags, then by converting these flags into a state update when the vcpu is about to enter the guest.
However, this creates a disconnect with userspace if the vcpu thread returns there with any exception/PC flag set. In this case, the exposed
The code seems to handle only the KVM_ARM64_PENDING_EXCEPTION flag. Is the "PC flag" a reference to the KVM_ARM64_INCREMENT_PC flag?
No, it does handle both exception and PC increment, unless I have completely bodged something (entirely possible).
The message is correct, my bad.
context is wrong, as userpsace doesn't have access to these flags
s/userpsace/userspace
(they aren't architectural). It also means that these flags are preserved across a reset, which isn't expected.
To solve this problem, force an explicit synchronisation of the exception state on vcpu exit to userspace. As an optimisation for nVHE systems, only perform this when there is something pending.
Reported-by: Zenghui Yu yuzenghui@huawei.com Signed-off-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org # 5.11
arch/arm64/include/asm/kvm_asm.h | 1 + arch/arm64/kvm/arm.c | 10 ++++++++++ arch/arm64/kvm/hyp/exception.c | 4 ++-- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 8 ++++++++ 4 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index d5b11037401d..5e9b33cbac51 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -63,6 +63,7 @@ #define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18 #define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19 #define __KVM_HOST_SMCCC_FUNC___pkvm_mark_hyp 20 +#define __KVM_HOST_SMCCC_FUNC___kvm_adjust_pc 21 #ifndef __ASSEMBLY__ diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 1cb39c0803a4..d62a7041ebd1 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -897,6 +897,16 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) kvm_sigset_deactivate(vcpu);
- /*
* In the unlikely event that we are returning to userspace
* with pending exceptions or PC adjustment, commit these
I'm going to assume "PC adjustment" means the KVM_ARM64_INCREMENT_PC flag. Please correct me if that's not true, but if that's the case, then the flag isn't handled below.
* adjustments in order to give userspace a consistent view of
* the vcpu state.
*/
- if (unlikely(vcpu->arch.flags & (KVM_ARM64_PENDING_EXCEPTION |
KVM_ARM64_EXCEPT_MASK)))
The condition seems to suggest that it is valid to set KVM_ARM64_EXCEPT_{AA32,AA64}_* without setting KVM_ARM64_PENDING_EXCEPTION, which looks rather odd to me. Is that a valid use of the KVM_ARM64_EXCEPT_MASK bits? If it's not (the existing code always sets the exception type with the KVM_ARM64_PENDING_EXCEPTION), that I was thinking that checking only the KVM_ARM64_PENDING_EXCEPTION flag would make the intention clearer.
No, you are missing this (subtle) comment in kvm_host.h:
<quote> /* * Overlaps with KVM_ARM64_EXCEPT_MASK on purpose so that it can't be * set together with an exception... */ #define KVM_ARM64_INCREMENT_PC (1 << 9) /* Increment PC */ </quote>
So (KVM_ARM64_PENDING_EXCEPTION | KVM_ARM64_EXCEPT_MASK) checks for *both* an exception and a PC increment.
Then how about explicitly checking for the KVM_ARM64_PENDING_EXCEPTION and KVM_ARM64_INCREMENT_PC flags, like it's done in __kvm_adjust_pc? That would certainly make the code easier to understand, as it's not immediately obvious that the EXCEPT mask includes the INCREMENT_PC flag.
Thanks,
Alex
On Mon, 10 May 2021 16:14:37 +0100, Alexandru Elisei alexandru.elisei@arm.com wrote:
Hi Marc,
On 5/10/21 4:04 PM, Marc Zyngier wrote:
On Mon, 10 May 2021 15:55:28 +0100, Alexandru Elisei alexandru.elisei@arm.com wrote:
Hi Marc,
On 5/10/21 10:49 AM, Marc Zyngier wrote:
KVM currently updates PC (and the corresponding exception state) using a two phase approach: first by setting a set of flags, then by converting these flags into a state update when the vcpu is about to enter the guest.
However, this creates a disconnect with userspace if the vcpu thread returns there with any exception/PC flag set. In this case, the exposed
The code seems to handle only the KVM_ARM64_PENDING_EXCEPTION flag. Is the "PC flag" a reference to the KVM_ARM64_INCREMENT_PC flag?
No, it does handle both exception and PC increment, unless I have completely bodged something (entirely possible).
The message is correct, my bad.
context is wrong, as userpsace doesn't have access to these flags
s/userpsace/userspace
(they aren't architectural). It also means that these flags are preserved across a reset, which isn't expected.
To solve this problem, force an explicit synchronisation of the exception state on vcpu exit to userspace. As an optimisation for nVHE systems, only perform this when there is something pending.
Reported-by: Zenghui Yu yuzenghui@huawei.com Signed-off-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org # 5.11
arch/arm64/include/asm/kvm_asm.h | 1 + arch/arm64/kvm/arm.c | 10 ++++++++++ arch/arm64/kvm/hyp/exception.c | 4 ++-- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 8 ++++++++ 4 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index d5b11037401d..5e9b33cbac51 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -63,6 +63,7 @@ #define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18 #define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19 #define __KVM_HOST_SMCCC_FUNC___pkvm_mark_hyp 20 +#define __KVM_HOST_SMCCC_FUNC___kvm_adjust_pc 21 #ifndef __ASSEMBLY__ diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 1cb39c0803a4..d62a7041ebd1 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -897,6 +897,16 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) kvm_sigset_deactivate(vcpu);
- /*
* In the unlikely event that we are returning to userspace
* with pending exceptions or PC adjustment, commit these
I'm going to assume "PC adjustment" means the KVM_ARM64_INCREMENT_PC flag. Please correct me if that's not true, but if that's the case, then the flag isn't handled below.
* adjustments in order to give userspace a consistent view of
* the vcpu state.
*/
- if (unlikely(vcpu->arch.flags & (KVM_ARM64_PENDING_EXCEPTION |
KVM_ARM64_EXCEPT_MASK)))
The condition seems to suggest that it is valid to set KVM_ARM64_EXCEPT_{AA32,AA64}_* without setting KVM_ARM64_PENDING_EXCEPTION, which looks rather odd to me. Is that a valid use of the KVM_ARM64_EXCEPT_MASK bits? If it's not (the existing code always sets the exception type with the KVM_ARM64_PENDING_EXCEPTION), that I was thinking that checking only the KVM_ARM64_PENDING_EXCEPTION flag would make the intention clearer.
No, you are missing this (subtle) comment in kvm_host.h:
<quote> /* * Overlaps with KVM_ARM64_EXCEPT_MASK on purpose so that it can't be * set together with an exception... */ #define KVM_ARM64_INCREMENT_PC (1 << 9) /* Increment PC */ </quote>
So (KVM_ARM64_PENDING_EXCEPTION | KVM_ARM64_EXCEPT_MASK) checks for *both* an exception and a PC increment.
Then how about explicitly checking for the KVM_ARM64_PENDING_EXCEPTION and KVM_ARM64_INCREMENT_PC flags, like it's done in __kvm_adjust_pc? That would certainly make the code easier to understand, as it's not immediately obvious that the EXCEPT mask includes the INCREMENT_PC flag.
Fair enough. I'll fix that in v2.
Another thing I wondered about: we now rely on __kvm_adjust_pc() to be preemption safe. That's always the case in nVHE (we're at EL2), but VHE can be preempted at any point. The code we call is preemption safe, but it takes some effort to be convinced of it.
Do you have a good suggestion on how to express this requirement? I could throw a preempt_disable()/enable() at the call for the sake of being in the same context between VHE and nVHE, but that's not strictly necessary for now.
Thanks,
M.
Hi Marc,
On 5/11/21 8:44 AM, Marc Zyngier wrote:
On Mon, 10 May 2021 16:14:37 +0100, Alexandru Elisei alexandru.elisei@arm.com wrote:
Hi Marc,
On 5/10/21 4:04 PM, Marc Zyngier wrote:
On Mon, 10 May 2021 15:55:28 +0100, Alexandru Elisei alexandru.elisei@arm.com wrote:
Hi Marc,
On 5/10/21 10:49 AM, Marc Zyngier wrote:
KVM currently updates PC (and the corresponding exception state) using a two phase approach: first by setting a set of flags, then by converting these flags into a state update when the vcpu is about to enter the guest.
However, this creates a disconnect with userspace if the vcpu thread returns there with any exception/PC flag set. In this case, the exposed
The code seems to handle only the KVM_ARM64_PENDING_EXCEPTION flag. Is the "PC flag" a reference to the KVM_ARM64_INCREMENT_PC flag?
No, it does handle both exception and PC increment, unless I have completely bodged something (entirely possible).
The message is correct, my bad.
context is wrong, as userpsace doesn't have access to these flags
s/userpsace/userspace
(they aren't architectural). It also means that these flags are preserved across a reset, which isn't expected.
To solve this problem, force an explicit synchronisation of the exception state on vcpu exit to userspace. As an optimisation for nVHE systems, only perform this when there is something pending.
Reported-by: Zenghui Yu yuzenghui@huawei.com Signed-off-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org # 5.11
arch/arm64/include/asm/kvm_asm.h | 1 + arch/arm64/kvm/arm.c | 10 ++++++++++ arch/arm64/kvm/hyp/exception.c | 4 ++-- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 8 ++++++++ 4 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index d5b11037401d..5e9b33cbac51 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -63,6 +63,7 @@ #define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18 #define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19 #define __KVM_HOST_SMCCC_FUNC___pkvm_mark_hyp 20 +#define __KVM_HOST_SMCCC_FUNC___kvm_adjust_pc 21 #ifndef __ASSEMBLY__ diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 1cb39c0803a4..d62a7041ebd1 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -897,6 +897,16 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) kvm_sigset_deactivate(vcpu);
- /*
* In the unlikely event that we are returning to userspace
* with pending exceptions or PC adjustment, commit these
I'm going to assume "PC adjustment" means the KVM_ARM64_INCREMENT_PC flag. Please correct me if that's not true, but if that's the case, then the flag isn't handled below.
* adjustments in order to give userspace a consistent view of
* the vcpu state.
*/
- if (unlikely(vcpu->arch.flags & (KVM_ARM64_PENDING_EXCEPTION |
KVM_ARM64_EXCEPT_MASK)))
The condition seems to suggest that it is valid to set KVM_ARM64_EXCEPT_{AA32,AA64}_* without setting KVM_ARM64_PENDING_EXCEPTION, which looks rather odd to me. Is that a valid use of the KVM_ARM64_EXCEPT_MASK bits? If it's not (the existing code always sets the exception type with the KVM_ARM64_PENDING_EXCEPTION), that I was thinking that checking only the KVM_ARM64_PENDING_EXCEPTION flag would make the intention clearer.
No, you are missing this (subtle) comment in kvm_host.h:
<quote> /* * Overlaps with KVM_ARM64_EXCEPT_MASK on purpose so that it can't be * set together with an exception... */ #define KVM_ARM64_INCREMENT_PC (1 << 9) /* Increment PC */ </quote>
So (KVM_ARM64_PENDING_EXCEPTION | KVM_ARM64_EXCEPT_MASK) checks for *both* an exception and a PC increment.
Then how about explicitly checking for the KVM_ARM64_PENDING_EXCEPTION and KVM_ARM64_INCREMENT_PC flags, like it's done in __kvm_adjust_pc? That would certainly make the code easier to understand, as it's not immediately obvious that the EXCEPT mask includes the INCREMENT_PC flag.
Fair enough. I'll fix that in v2.
Another thing I wondered about: we now rely on __kvm_adjust_pc() to be preemption safe. That's always the case in nVHE (we're at EL2), but VHE can be preempted at any point. The code we call is preemption safe, but it takes some effort to be convinced of it.
Do you have a good suggestion on how to express this requirement? I could throw a preempt_disable()/enable() at the call for the sake of being in the same context between VHE and nVHE, but that's not strictly necessary for now.
I've had another look at it, and it does look to me that __kvm_adjust_pc() is safe to call with preemption enabled on VHE systems. I was a bit worried when I saw that it touches some EL1 registers directly, but I then I realized that kvm_arch_vcpu_put/load() will make sure that the values of those registers are correct.
I do agree that it's not immediately obvious that __kvm_adjust_pc() is called with preemption enabled. I'm a bit worried that someone will change something and make the function unsafe from preemption. Since the patch already touches the comment for __kvm_adjust_pc(), maybe mention that it can be called from contexts with preemption enabled and why it's safe to do it? Something along the lines of "The function is called with preemption enabled, and it's safe to do so because even though it touches the hardware EL1 registers in the VHE case, those registers are saved and restored by a vcpu_put/load pair. In the NVHE case, the code is running at EL2 and it cannot be preempted by the host". Or something else that you fancy.
Thanks,
Alex
Hi Marc,
KVM: arm64: Commit pending PC adjustemnts before returning to userspace
s/adjustments/adjustments
On Mon, May 10, 2021 at 10:49 AM Marc Zyngier maz@kernel.org wrote:
KVM currently updates PC (and the corresponding exception state) using a two phase approach: first by setting a set of flags, then by converting these flags into a state update when the vcpu is about to enter the guest.
However, this creates a disconnect with userspace if the vcpu thread returns there with any exception/PC flag set. In this case, the exposed context is wrong, as userpsace doesn't have access to these flags (they aren't architectural). It also means that these flags are preserved across a reset, which isn't expected.
To solve this problem, force an explicit synchronisation of the exception state on vcpu exit to userspace. As an optimisation for nVHE systems, only perform this when there is something pending.
I've tested this with a few nvhe and vhe tests that exercise both __kvm_adjust_pc call paths (__kvm_vcpu_run and kvm_arch_vcpu_ioctl_run), and the tests ran as expected. I'll do the same for v2 when you send it out.
Cheers, /fuad
Hi Fuad,
On Tue, 11 May 2021 09:03:40 +0100, Fuad Tabba tabba@google.com wrote:
Hi Marc,
KVM: arm64: Commit pending PC adjustemnts before returning to userspace
s/adjustments/adjustments
Looks like Gmail refuses to let you mimic my spelling mistakes! :D
On Mon, May 10, 2021 at 10:49 AM Marc Zyngier maz@kernel.org wrote:
KVM currently updates PC (and the corresponding exception state) using a two phase approach: first by setting a set of flags, then by converting these flags into a state update when the vcpu is about to enter the guest.
However, this creates a disconnect with userspace if the vcpu thread returns there with any exception/PC flag set. In this case, the exposed context is wrong, as userpsace doesn't have access to these flags (they aren't architectural). It also means that these flags are preserved across a reset, which isn't expected.
To solve this problem, force an explicit synchronisation of the exception state on vcpu exit to userspace. As an optimisation for nVHE systems, only perform this when there is something pending.
I've tested this with a few nvhe and vhe tests that exercise both __kvm_adjust_pc call paths (__kvm_vcpu_run and kvm_arch_vcpu_ioctl_run), and the tests ran as expected. I'll do the same for v2 when you send it out.
Ah, that's interesting. Do you have tests that actually fail when hitting this bug? Given that this is pretty subtle, it'd be good to have a way to make sure it doesn't crop up again.
Thanks,
M.
Hi Marc,
On Tue, May 11, 2021 at 9:14 AM Marc Zyngier maz@kernel.org wrote:
Hi Fuad,
On Tue, 11 May 2021 09:03:40 +0100, Fuad Tabba tabba@google.com wrote:
Hi Marc,
KVM: arm64: Commit pending PC adjustemnts before returning to userspace
s/adjustments/adjustments
Looks like Gmail refuses to let you mimic my spelling mistakes! :D
On Mon, May 10, 2021 at 10:49 AM Marc Zyngier maz@kernel.org wrote:
KVM currently updates PC (and the corresponding exception state) using a two phase approach: first by setting a set of flags, then by converting these flags into a state update when the vcpu is about to enter the guest.
However, this creates a disconnect with userspace if the vcpu thread returns there with any exception/PC flag set. In this case, the exposed context is wrong, as userpsace doesn't have access to these flags (they aren't architectural). It also means that these flags are preserved across a reset, which isn't expected.
To solve this problem, force an explicit synchronisation of the exception state on vcpu exit to userspace. As an optimisation for nVHE systems, only perform this when there is something pending.
I've tested this with a few nvhe and vhe tests that exercise both __kvm_adjust_pc call paths (__kvm_vcpu_run and kvm_arch_vcpu_ioctl_run), and the tests ran as expected. I'll do the same for v2 when you send it out.
Ah, that's interesting. Do you have tests that actually fail when hitting this bug? Given that this is pretty subtle, it'd be good to have a way to make sure it doesn't crop up again.
Nothing that fails, just code that generates exceptions or emulates instructions at various points. That said, I think it should be straightforward to write a selftest for this. I'll give it a go.
/fuad
Thanks,
M.
-- Without deviation from the norm, progress is not possible.
On Tue, 11 May 2021 09:22:37 +0100, Fuad Tabba tabba@google.com wrote:
Hi Marc,
On Tue, May 11, 2021 at 9:14 AM Marc Zyngier maz@kernel.org wrote:
Hi Fuad,
On Tue, 11 May 2021 09:03:40 +0100, Fuad Tabba tabba@google.com wrote:
Hi Marc,
KVM: arm64: Commit pending PC adjustemnts before returning to userspace
s/adjustments/adjustments
Looks like Gmail refuses to let you mimic my spelling mistakes! :D
On Mon, May 10, 2021 at 10:49 AM Marc Zyngier maz@kernel.org wrote:
KVM currently updates PC (and the corresponding exception state) using a two phase approach: first by setting a set of flags, then by converting these flags into a state update when the vcpu is about to enter the guest.
However, this creates a disconnect with userspace if the vcpu thread returns there with any exception/PC flag set. In this case, the exposed context is wrong, as userpsace doesn't have access to these flags (they aren't architectural). It also means that these flags are preserved across a reset, which isn't expected.
To solve this problem, force an explicit synchronisation of the exception state on vcpu exit to userspace. As an optimisation for nVHE systems, only perform this when there is something pending.
I've tested this with a few nvhe and vhe tests that exercise both __kvm_adjust_pc call paths (__kvm_vcpu_run and kvm_arch_vcpu_ioctl_run), and the tests ran as expected. I'll do the same for v2 when you send it out.
Ah, that's interesting. Do you have tests that actually fail when hitting this bug? Given that this is pretty subtle, it'd be good to have a way to make sure it doesn't crop up again.
Nothing that fails, just code that generates exceptions or emulates instructions at various points. That said, I think it should be straightforward to write a selftest for this. I'll give it a go.
PC adjustment is easy-ish: have a vcpu to hit WFI with no interrupt pending, send the thread a signal to make it exit to userspace, update the PC to another address, and check that the instruction at that address is actually executed.
Exception injection is a lot more difficult: you need to force a vcpu exit to userspace right after having caused an exception to be injected by KVM. I can't think of an easy way to do that other than repeatedly executing an instruction that generates an exception while signalling the thread to force the exit. Ugly.
M.
linux-stable-mirror@lists.linaro.org