[PATCH 1/3] KVM: arm64: Fix S1PTW handling on RO memslots

List overview All Threads
Download

newer

older

+...

[GIT PULL] virtio,vhost,vdpa:...

Marc Zyngier

20 Dec 2022 20 Dec '22

8:09 p.m.

A recent development on the EFI front has resulted in guests having their page tables baked in the firmware binary, and mapped into the IPA space as part as a read-only memslot.

Not only this is legitimate, but it also results in added security, so thumbs up. However, this clashes mildly with our handling of a S1PTW as a write to correctly handle AF/DB updates to the S1 PTs, and results in the guest taking an abort it won't recover from (the PTs mapping the vectors will suffer freom the same problem...).

So clearly our handling is... wrong.

Instead, switch to a two-pronged approach:

- On S1PTW translation fault, handle the fault as a read

- On S1PTW permission fault, handle the fault as a write

This is of no consequence to SW that *writes* to its PTs (the write will trigger a non-S1PTW fault), and SW that uses RO PTs will not use AF/DB anyway, as that'd be wrong.

Only in the case described in c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch") do we end-up with two back-to-back faults (page being evicted and faulted back). I don't think this is a case worth optimising for.

Fixes: c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch") Signed-off-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org --- arch/arm64/include/asm/kvm_emulate.h | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h index 9bdba47f7e14..fd6ad8b21f85 100644 --- a/arch/arm64/include/asm/kvm_emulate.h +++ b/arch/arm64/include/asm/kvm_emulate.h @@ -373,8 +373,26 @@ static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu)

static inline bool kvm_is_write_fault(struct kvm_vcpu *vcpu) { - if (kvm_vcpu_abt_iss1tw(vcpu)) - return true; + if (kvm_vcpu_abt_iss1tw(vcpu)) { + /* + * Only a permission fault on a S1PTW should be + * considered as a write. Otherwise, page tables baked + * in a read-only memslot will result in an exception + * being delivered in the guest. + * + * The drawback is that we end-up fauling twice if the + * guest is using any of HW AF/DB: a translation fault + * to map the page containing the PT (read only at + * first), then a permission fault to allow the flags + * to be set. + */ + switch (kvm_vcpu_trap_get_fault_type(vcpu)) { + case ESR_ELx_FSC_PERM: + return true; + default: + return false; + } + }

if (kvm_vcpu_trap_is_iabt(vcpu)) return false;

-- 2.34.1

Show replies by date

Oliver Upton

20 Dec 20 Dec

9:47 p.m.

Hi Marc,

On Tue, Dec 20, 2022 at 08:09:21PM +0000, Marc Zyngier wrote:

...

A recent development on the EFI front has resulted in guests having their page tables baked in the firmware binary, and mapped into the IPA space as part as a read-only memslot.

as part of a

...

Not only this is legitimate, but it also results in added security, so thumbs up. However, this clashes mildly with our handling of a S1PTW as a write to correctly handle AF/DB updates to the S1 PTs, and results in the guest taking an abort it won't recover from (the PTs mapping the vectors will suffer freom the same problem...).

To be clear, the read-only page tables already have the AF set, right? They certainly must, or else the guest isn't getting far :)

I understand you're trying to describe _why_ we promote S1PTW to a write, but doing it inline with the context of the EFI issue makes it slightly unclear. Could you break these ideas up into two paragraphs and maybe spell out the fault conditions a bit more?

A recent development on the EFI front has resulted in guests having their page tables baked in the firmware binary, and mapped into the IPA space as part of a read-only memslot. Not only is this legitimate, but it also results in added security, so thumbs up.

It is possible to take an S1PTW translation fault if the S1 PTs are unmapped at stage-2. However, KVM unconditionally treats S1PTW as a write to correctly handle hardware AF/DB updates to the S1 PTs. Furthermore, KVM injects a data abort into the guest for S1PTW writes. In the aforementioned case this results in the guest taking an abort it won't recover from, as the S1 PTs mapping the vectors suffer from the same problem.

Dunno, maybe I stink at reading which is why I got confused in the first place.

...

So clearly our handling is... wrong.

Instead, switch to a two-pronged approach:

On S1PTW translation fault, handle the fault as a read

On S1PTW permission fault, handle the fault as a write

This is of no consequence to SW that *writes* to its PTs (the write will trigger a non-S1PTW fault), and SW that uses RO PTs will not use AF/DB anyway, as that'd be wrong.

Only in the case described in c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch") do we end-up with two back-to-back faults (page being evicted and faulted back). I don't think this is a case worth optimising for.

Fixes: c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch") Signed-off-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org

arch/arm64/include/asm/kvm_emulate.h | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h index 9bdba47f7e14..fd6ad8b21f85 100644 --- a/arch/arm64/include/asm/kvm_emulate.h +++ b/arch/arm64/include/asm/kvm_emulate.h @@ -373,8 +373,26 @@ static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu) static inline bool kvm_is_write_fault(struct kvm_vcpu *vcpu) {
if (kvm_vcpu_abt_iss1tw(vcpu))
return true;
if (kvm_vcpu_abt_iss1tw(vcpu)) {
/*
 * Only a permission fault on a S1PTW should be
 * considered as a write. Otherwise, page tables baked
 * in a read-only memslot will result in an exception
 * being delivered in the guest.

Somewhat of a tangent, but:

Aren't we somewhat unaligned with the KVM UAPI by injecting an exception in this case? I know we've been doing it for a while, but it flies in the face of the rules outlined in the KVM_SET_USER_MEMORY_REGION documentation.

...

 * The drawback is that we end-up fauling twice if the

typo: s/fauling/faulting/

...

 * guest is using any of HW AF/DB: a translation fault

 * to map the page containing the PT (read only at

 * first), then a permission fault to allow the flags

```
 * to be set.
```
```
 */
```

switch (kvm_vcpu_trap_get_fault_type(vcpu)) {

```
case ESR_ELx_FSC_PERM:
```
```
	return true;
```
```
default:
```
```
	return false;
```
```
}
```
}

if (kvm_vcpu_trap_is_iabt(vcpu)) return false; -- 2.34.1

Besides the changelog/comment suggestions, the patch looks good to me.

Reviewed-by: Oliver Upton oliver.upton@linux.dev

-- Thanks, Oliver

Marc Zyngier

21 Dec 21 Dec

9:35 a.m.

On Tue, 20 Dec 2022 21:47:36 +0000, Oliver Upton oliver.upton@linux.dev wrote:

...

Hi Marc,

On Tue, Dec 20, 2022 at 08:09:21PM +0000, Marc Zyngier wrote:

...
A recent development on the EFI front has resulted in guests having their page tables baked in the firmware binary, and mapped into the IPA space as part as a read-only memslot.

as part of a

...
Not only this is legitimate, but it also results in added security, so thumbs up. However, this clashes mildly with our handling of a S1PTW as a write to correctly handle AF/DB updates to the S1 PTs, and results in the guest taking an abort it won't recover from (the PTs mapping the vectors will suffer freom the same problem...).

To be clear, the read-only page tables already have the AF set, right? They certainly must, or else the guest isn't getting far :)

Yes, the guest definitely has the AF set in the PT, and is not trying to use the HW-assisted AF (which obviously wouldn't work).

...

I understand you're trying to describe _why_ we promote S1PTW to a write, but doing it inline with the context of the EFI issue makes it slightly unclear. Could you break these ideas up into two paragraphs and maybe spell out the fault conditions a bit more?

A recent development on the EFI front has resulted in guests having their page tables baked in the firmware binary, and mapped into the IPA space as part of a read-only memslot. Not only is this legitimate, but it also results in added security, so thumbs up.

It is possible to take an S1PTW translation fault if the S1 PTs are unmapped at stage-2. However, KVM unconditionally treats S1PTW as a write to correctly handle hardware AF/DB updates to the S1 PTs. Furthermore, KVM injects a data abort into the guest for S1PTW writes. In the aforementioned case this results in the guest taking an abort it won't recover from, as the S1 PTs mapping the vectors suffer from the same problem.

Dunno, maybe I stink at reading which is why I got confused in the first place.

Nothing wrong with you, just that my write-up is indeed sloppy. I'll copy paste the above, thanks!

...

...
So clearly our handling is... wrong.

Instead, switch to a two-pronged approach:

On S1PTW translation fault, handle the fault as a read

On S1PTW permission fault, handle the fault as a write

This is of no consequence to SW that *writes* to its PTs (the write will trigger a non-S1PTW fault), and SW that uses RO PTs will not use AF/DB anyway, as that'd be wrong.

Only in the case described in c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch") do we end-up with two back-to-back faults (page being evicted and faulted back). I don't think this is a case worth optimising for.

Fixes: c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch") Signed-off-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org

arch/arm64/include/asm/kvm_emulate.h | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h index 9bdba47f7e14..fd6ad8b21f85 100644 --- a/arch/arm64/include/asm/kvm_emulate.h +++ b/arch/arm64/include/asm/kvm_emulate.h @@ -373,8 +373,26 @@ static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu) static inline bool kvm_is_write_fault(struct kvm_vcpu *vcpu) {
if (kvm_vcpu_abt_iss1tw(vcpu))
return true;
if (kvm_vcpu_abt_iss1tw(vcpu)) {
/*
 * Only a permission fault on a S1PTW should be
 * considered as a write. Otherwise, page tables baked
 * in a read-only memslot will result in an exception
 * being delivered in the guest.
Somewhat of a tangent, but:

Aren't we somewhat unaligned with the KVM UAPI by injecting an exception in this case? I know we've been doing it for a while, but it flies in the face of the rules outlined in the KVM_SET_USER_MEMORY_REGION documentation.

That's an interesting point, and I certainly haven't considered that for faults introduced by page table walks.

I'm not sure what userspace can do with that though. The problem is that this is a write for which we don't have useful data: although we know it is a page-table walker access, we don't know what it was about to write. The instruction that caused the write is meaningless (it could either be a load, a store, or an instruction fetch). How do you populate the data[] field then?

If anything, this is closer to KVM_EXIT_ARM_NISV, for which we give userspace the full ESR and ask it to sort it out. I doubt it will be able to, but hey, maybe it is worth a shot. This would need to be a different exit reason though, as NISV is explicitly for non-memslot stuff.

In any case, the documentation for KVM_SET_USER_MEMORY_REGION needs to reflect the fact that KVM_EXIT_MMIO cannot represent a fault due to a S1 PTW.

...

...
 * The drawback is that we end-up fauling twice if the
typo: s/fauling/faulting/

...
 * guest is using any of HW AF/DB: a translation fault
 * to map the page containing the PT (read only at
 * first), then a permission fault to allow the flags
 * to be set.
 */
switch (kvm_vcpu_trap_get_fault_type(vcpu)) {
case ESR_ELx_FSC_PERM:
	return true;
default:
	return false;
}
}
if (kvm_vcpu_trap_is_iabt(vcpu)) return false; -- 2.34.1
Besides the changelog/comment suggestions, the patch looks good to me.

Reviewed-by: Oliver Upton oliver.upton@linux.dev

Thanks for the quick review! I'll wait a bit before respinning the series, as I'd like to get closure on the UAPI point you have raised.

Cheers,

-- Without deviation from the norm, progress is not possible.

Oliver Upton

4:50 p.m.

On Wed, Dec 21, 2022 at 09:35:06AM +0000, Marc Zyngier wrote:

[...]

...

...
...
if (kvm_vcpu_abt_iss1tw(vcpu)) {
/*
 * Only a permission fault on a S1PTW should be
 * considered as a write. Otherwise, page tables baked
 * in a read-only memslot will result in an exception
 * being delivered in the guest.
Somewhat of a tangent, but:

Aren't we somewhat unaligned with the KVM UAPI by injecting an exception in this case? I know we've been doing it for a while, but it flies in the face of the rules outlined in the KVM_SET_USER_MEMORY_REGION documentation.
That's an interesting point, and I certainly haven't considered that for faults introduced by page table walks.

I'm not sure what userspace can do with that though. The problem is that this is a write for which we don't have useful data: although we know it is a page-table walker access, we don't know what it was about to write. The instruction that caused the write is meaningless (it could either be a load, a store, or an instruction fetch). How do you populate the data[] field then?

If anything, this is closer to KVM_EXIT_ARM_NISV, for which we give userspace the full ESR and ask it to sort it out. I doubt it will be able to, but hey, maybe it is worth a shot. This would need to be a different exit reason though, as NISV is explicitly for non-memslot stuff.

In any case, the documentation for KVM_SET_USER_MEMORY_REGION needs to reflect the fact that KVM_EXIT_MMIO cannot represent a fault due to a S1 PTW.

Oh I completely agree with you here. I probably should have said before, I think the exit would be useless anyway. Getting the documentation in line with the intended behavior seems to be the best fix.

...

...
...
 * The drawback is that we end-up fauling twice if the
typo: s/fauling/faulting/

...
 * guest is using any of HW AF/DB: a translation fault
 * to map the page containing the PT (read only at
 * first), then a permission fault to allow the flags
 * to be set.
 */
switch (kvm_vcpu_trap_get_fault_type(vcpu)) {
case ESR_ELx_FSC_PERM:
	return true;
default:
	return false;
}
}
if (kvm_vcpu_trap_is_iabt(vcpu)) return false; -- 2.34.1
Besides the changelog/comment suggestions, the patch looks good to me.

Reviewed-by: Oliver Upton oliver.upton@linux.dev
Thanks for the quick review! I'll wait a bit before respinning the series, as I'd like to get closure on the UAPI point you have raised.

I'm satisfied if you are :)

-- Thanks, Oliver

Marc Zyngier

5:53 p.m.

On Wed, 21 Dec 2022 16:50:30 +0000, Oliver Upton oliver.upton@linux.dev wrote:

...

On Wed, Dec 21, 2022 at 09:35:06AM +0000, Marc Zyngier wrote:

[...]

...
...
...
if (kvm_vcpu_abt_iss1tw(vcpu)) {
/*
 * Only a permission fault on a S1PTW should be
 * considered as a write. Otherwise, page tables baked
 * in a read-only memslot will result in an exception
 * being delivered in the guest.
Somewhat of a tangent, but:

Aren't we somewhat unaligned with the KVM UAPI by injecting an exception in this case? I know we've been doing it for a while, but it flies in the face of the rules outlined in the KVM_SET_USER_MEMORY_REGION documentation.
That's an interesting point, and I certainly haven't considered that for faults introduced by page table walks.

I'm not sure what userspace can do with that though. The problem is that this is a write for which we don't have useful data: although we know it is a page-table walker access, we don't know what it was about to write. The instruction that caused the write is meaningless (it could either be a load, a store, or an instruction fetch). How do you populate the data[] field then?

If anything, this is closer to KVM_EXIT_ARM_NISV, for which we give userspace the full ESR and ask it to sort it out. I doubt it will be able to, but hey, maybe it is worth a shot. This would need to be a different exit reason though, as NISV is explicitly for non-memslot stuff.

In any case, the documentation for KVM_SET_USER_MEMORY_REGION needs to reflect the fact that KVM_EXIT_MMIO cannot represent a fault due to a S1 PTW.
Oh I completely agree with you here. I probably should have said before, I think the exit would be useless anyway. Getting the documentation in line with the intended behavior seems to be the best fix.

Right. How about something like this?

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 226b40baffb8..72abd018a618 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -1381,6 +1381,14 @@ It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl. The KVM_SET_MEMORY_REGION does not allow fine grained control over memory allocation and is deprecated.

+Note: On arm64, a write generated by the page-table walker (to update +the Access and Dirty flags, for example) never results in a +KVM_EXIT_MMIO exit when the slot has the KVM_MEM_READONLY flag. This +is because KVM cannot provide the data that would be written by the +page-table walker, making it impossible to emulate the access. +Instead, an abort (data abort if the cause of the page-table update +was a load or a store, instruction abort if it was an instruction +fetch) is injected in the guest.

4.36 KVM_SET_TSS_ADDR ---------------------

Thanks,

-- Without deviation from the norm, progress is not possible.

Oliver Upton

6:26 p.m.

On Wed, Dec 21, 2022 at 05:53:58PM +0000, Marc Zyngier wrote:

...

On Wed, 21 Dec 2022 16:50:30 +0000, Oliver Upton oliver.upton@linux.dev wrote:

...
On Wed, Dec 21, 2022 at 09:35:06AM +0000, Marc Zyngier wrote:

[...]

...
...
...
if (kvm_vcpu_abt_iss1tw(vcpu)) {
/*
 * Only a permission fault on a S1PTW should be
 * considered as a write. Otherwise, page tables baked
 * in a read-only memslot will result in an exception
 * being delivered in the guest.
Somewhat of a tangent, but:

Aren't we somewhat unaligned with the KVM UAPI by injecting an exception in this case? I know we've been doing it for a while, but it flies in the face of the rules outlined in the KVM_SET_USER_MEMORY_REGION documentation.
That's an interesting point, and I certainly haven't considered that for faults introduced by page table walks.

I'm not sure what userspace can do with that though. The problem is that this is a write for which we don't have useful data: although we know it is a page-table walker access, we don't know what it was about to write. The instruction that caused the write is meaningless (it could either be a load, a store, or an instruction fetch). How do you populate the data[] field then?

If anything, this is closer to KVM_EXIT_ARM_NISV, for which we give userspace the full ESR and ask it to sort it out. I doubt it will be able to, but hey, maybe it is worth a shot. This would need to be a different exit reason though, as NISV is explicitly for non-memslot stuff.

In any case, the documentation for KVM_SET_USER_MEMORY_REGION needs to reflect the fact that KVM_EXIT_MMIO cannot represent a fault due to a S1 PTW.
Oh I completely agree with you here. I probably should have said before, I think the exit would be useless anyway. Getting the documentation in line with the intended behavior seems to be the best fix.
Right. How about something like this?

Looks good to me, thanks!

Reviewed-by: Oliver Upton oliver.upton@linux.dev

...

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 226b40baffb8..72abd018a618 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -1381,6 +1381,14 @@ It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl. The KVM_SET_MEMORY_REGION does not allow fine grained control over memory allocation and is deprecated. +Note: On arm64, a write generated by the page-table walker (to update +the Access and Dirty flags, for example) never results in a +KVM_EXIT_MMIO exit when the slot has the KVM_MEM_READONLY flag. This +is because KVM cannot provide the data that would be written by the +page-table walker, making it impossible to emulate the access. +Instead, an abort (data abort if the cause of the page-table update +was a load or a store, instruction abort if it was an instruction +fetch) is injected in the guest. 4.36 KVM_SET_TSS_ADDR

-- Best, Oliver

Ard Biesheuvel

22 Dec 22 Dec

1:01 p.m.

On Tue, 20 Dec 2022 at 21:09, Marc Zyngier maz@kernel.org wrote:

...

A recent development on the EFI front has resulted in guests having their page tables baked in the firmware binary, and mapped into the IPA space as part as a read-only memslot.

Not only this is legitimate, but it also results in added security, so thumbs up. However, this clashes mildly with our handling of a S1PTW as a write to correctly handle AF/DB updates to the S1 PTs, and results in the guest taking an abort it won't recover from (the PTs mapping the vectors will suffer freom the same problem...).

So clearly our handling is... wrong.

Instead, switch to a two-pronged approach:

On S1PTW translation fault, handle the fault as a read

On S1PTW permission fault, handle the fault as a write

This is of no consequence to SW that *writes* to its PTs (the write will trigger a non-S1PTW fault), and SW that uses RO PTs will not use AF/DB anyway, as that'd be wrong.

Only in the case described in c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch") do we end-up with two back-to-back faults (page being evicted and faulted back). I don't think this is a case worth optimising for.

Fixes: c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch") Signed-off-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org

Reviewed-by: Ard Biesheuvel ardb@kernel.org

I have tested this patch on my TX2 with one of the EFI builds in question, and everything works as before (I never observed the issue itself)

Regression-tested-by: Ard Biesheuvel ardb@kernel.org

For the record, the EFI build in question targets QEMU/mach-virt and switches to a set of read-only page tables in emulated NOR flash straight out of reset, so it can create and populate the real page tables with MMU and caches enabled. EFI does not use virtual memory or paging so managing access flags or dirty bits in hardware is unlikely to add any value, and it is not being used at the moment. And given that this is emulated NOR flash, any ordinary write to it tears down the r/o memslot altogether, and kicks the NOR flash emulation in QEMU into programming mode, which is fully based on MMIO emulation and does not use a memslot at all. IOW, even if we could figure out what store the PTW was attempting to do, it is always going to be rejected since the r/o page tables can only be modified by 'programming' the NOR flash sector.

...

arch/arm64/include/asm/kvm_emulate.h | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-)

static inline bool kvm_is_write_fault(struct kvm_vcpu *vcpu) {

```
  if (kvm_vcpu_abt_iss1tw(vcpu))
```
```
          return true;
```

```
  if (kvm_vcpu_abt_iss1tw(vcpu)) {
```
```
          /*
```

           * Only a permission fault on a S1PTW should be

           * considered as a write. Otherwise, page tables baked

           * in a read-only memslot will result in an exception

           * being delivered in the guest.

```
           *
```

           * The drawback is that we end-up fauling twice if the

           * guest is using any of HW AF/DB: a translation fault

           * to map the page containing the PT (read only at

           * first), then a permission fault to allow the flags

```
           * to be set.
```
```
           */
```

          switch (kvm_vcpu_trap_get_fault_type(vcpu)) {

```
          case ESR_ELx_FSC_PERM:
```
```
                  return true;
```
```
          default:
```
```
                  return false;
```
```
          }
```

  }

  if (kvm_vcpu_trap_is_iabt(vcpu))
          return false;

-- 2.34.1

Marc Zyngier

24 Dec 24 Dec

12:18 p.m.

On Thu, 22 Dec 2022 13:01:55 +0000, Ard Biesheuvel ardb@kernel.org wrote:

...

On Tue, 20 Dec 2022 at 21:09, Marc Zyngier maz@kernel.org wrote:

...
A recent development on the EFI front has resulted in guests having their page tables baked in the firmware binary, and mapped into the IPA space as part as a read-only memslot.

Not only this is legitimate, but it also results in added security, so thumbs up. However, this clashes mildly with our handling of a S1PTW as a write to correctly handle AF/DB updates to the S1 PTs, and results in the guest taking an abort it won't recover from (the PTs mapping the vectors will suffer freom the same problem...).

So clearly our handling is... wrong.

Instead, switch to a two-pronged approach:

On S1PTW translation fault, handle the fault as a read

On S1PTW permission fault, handle the fault as a write

This is of no consequence to SW that *writes* to its PTs (the write will trigger a non-S1PTW fault), and SW that uses RO PTs will not use AF/DB anyway, as that'd be wrong.

Only in the case described in c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch") do we end-up with two back-to-back faults (page being evicted and faulted back). I don't think this is a case worth optimising for.

Fixes: c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch") Signed-off-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org

Reviewed-by: Ard Biesheuvel ardb@kernel.org

I have tested this patch on my TX2 with one of the EFI builds in question, and everything works as before (I never observed the issue itself)

If you get the chance, could you try with non-4kB page sizes? Here, I could only reproduce it with 16kB pages. It was firing like clockwork on Cortex-A55 with that.

...

Regression-tested-by: Ard Biesheuvel ardb@kernel.org

For the record, the EFI build in question targets QEMU/mach-virt and switches to a set of read-only page tables in emulated NOR flash straight out of reset, so it can create and populate the real page tables with MMU and caches enabled. EFI does not use virtual memory or paging so managing access flags or dirty bits in hardware is unlikely to add any value, and it is not being used at the moment. And given that this is emulated NOR flash, any ordinary write to it tears down the r/o memslot altogether, and kicks the NOR flash emulation in QEMU into programming mode, which is fully based on MMIO emulation and does not use a memslot at all. IOW, even if we could figure out what store the PTW was attempting to do, it is always going to be rejected since the r/o page tables can only be modified by 'programming' the NOR flash sector.

Indeed, and this would be a pretty dodgy setup anyway.

Thanks for having had a look,

-- Without deviation from the norm, progress is not possible.

Ard Biesheuvel

1:09 p.m.

On Sat, 24 Dec 2022 at 13:19, Marc Zyngier maz@kernel.org wrote:

...

On Thu, 22 Dec 2022 13:01:55 +0000, Ard Biesheuvel ardb@kernel.org wrote:

...
On Tue, 20 Dec 2022 at 21:09, Marc Zyngier maz@kernel.org wrote:

...
A recent development on the EFI front has resulted in guests having their page tables baked in the firmware binary, and mapped into the IPA space as part as a read-only memslot.

Not only this is legitimate, but it also results in added security, so thumbs up. However, this clashes mildly with our handling of a S1PTW as a write to correctly handle AF/DB updates to the S1 PTs, and results in the guest taking an abort it won't recover from (the PTs mapping the vectors will suffer freom the same problem...).

So clearly our handling is... wrong.

Instead, switch to a two-pronged approach:

On S1PTW translation fault, handle the fault as a read

On S1PTW permission fault, handle the fault as a write

This is of no consequence to SW that *writes* to its PTs (the write will trigger a non-S1PTW fault), and SW that uses RO PTs will not use AF/DB anyway, as that'd be wrong.

Only in the case described in c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch") do we end-up with two back-to-back faults (page being evicted and faulted back). I don't think this is a case worth optimising for.

Fixes: c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch") Signed-off-by: Marc Zyngier maz@kernel.org Cc: stable@vger.kernel.org

Reviewed-by: Ard Biesheuvel ardb@kernel.org

I have tested this patch on my TX2 with one of the EFI builds in question, and everything works as before (I never observed the issue itself)

If you get the chance, could you try with non-4kB page sizes? Here, I could only reproduce it with 16kB pages. It was firing like clockwork on Cortex-A55 with that.

I'll try on 64k but I don't have access to a 16k capable machine that runs KVM atm (I'm still enjoying working wifi and GPU etc on my M1 Macbook Air)

...

...
Regression-tested-by: Ard Biesheuvel ardb@kernel.org

For the record, the EFI build in question targets QEMU/mach-virt and switches to a set of read-only page tables in emulated NOR flash straight out of reset, so it can create and populate the real page tables with MMU and caches enabled. EFI does not use virtual memory or paging so managing access flags or dirty bits in hardware is unlikely to add any value, and it is not being used at the moment. And given that this is emulated NOR flash, any ordinary write to it tears down the r/o memslot altogether, and kicks the NOR flash emulation in QEMU into programming mode, which is fully based on MMIO emulation and does not use a memslot at all. IOW, even if we could figure out what store the PTW was attempting to do, it is always going to be rejected since the r/o page tables can only be modified by 'programming' the NOR flash sector.

Indeed, and this would be a pretty dodgy setup anyway.

Thanks for having had a look,
    M.
-- Without deviation from the norm, progress is not possible.

1094

days inactive

1098

days old

linux-stable-mirror@lists.linaro.org

8 comments

participants

tags (0)

participants (3)

Ard Biesheuvel
Marc Zyngier
Oliver Upton