From: Menglong Dong <imagedong(a)tencent.com>
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:
exit:
err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
if (err) {
inet->inet_saddr = inet->inet_rcv_saddr = 0;
goto out_release_sock;
}
Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
called success. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is wired, as the 'bind()' is already failed.
To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().
Therefore, after sys_bind() fail caused by
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
means that it can try to be binded to another port.
The second patch is the selftests for this modification.
The third patch use C99 initializers in test_sock.c.
Changes since v3:
- add the third patch which use C99 initializers in test_sock.c
Changes since v2:
- NULL check for sk->sk_prot->put_port
Changes since v1:
- introduce 'put_port' field for 'struct proto'
- add selftests for it
Menglong Dong (3):
net: bpf: handle return value of
BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
bpf: selftests: add bind retry for post_bind{4, 6}
bpf: selftests: use C99 initializers in test_sock.c
include/net/sock.h | 1 +
net/ipv4/af_inet.c | 2 +
net/ipv4/ping.c | 1 +
net/ipv4/tcp_ipv4.c | 1 +
net/ipv4/udp.c | 1 +
net/ipv6/af_inet6.c | 2 +
net/ipv6/ping.c | 1 +
net/ipv6/tcp_ipv6.c | 1 +
net/ipv6/udp.c | 1 +
tools/testing/selftests/bpf/test_sock.c | 370 ++++++++++++++----------
10 files changed, 233 insertions(+), 148 deletions(-)
--
2.27.0
Thanks a lot for all the review comments and guidance! Hope this
version is in a good state now. :)
(Jing is temporarily leave for family reason, Yang helped work out
this version)
----
v4->v5:
- Directly call fpu core to expand fpstate buffer in kvm_check_cpuid()
and remove duplicated permission check there (Sean)
- Accordingly remove Thomas's reviewed-by as a different wrapper is
introduced now (patch-7)
- Properly queue #NM exception in nested scenario (Sean)
- Verify non-XFD related #NM usage in nested scenario (Sean)
- Hide XFD in kvm_cpu_cap on 32bit host kernels (Sean)
- Use xstate_required_size() in KVM_CAP_XSAVE2 which may be called
before any vcpu is created (Sean/Paolo)
- Replace boot_cpu_has with kvm_cpu_cap_has when disabling RDMSR
interception for xfd_err (Sean)
v3->v4:
- Verify kvm selftest for AMX (Paolo)
- Expand fpstate buffer in kvm_check_cpuid() and improve patch
description (Sean)
- Drop 'preemption' word in #NM interception patch (Sean)
- Remove 'trap_nm' flag. Replace it by: (Sean)
* Trapping #NM according to guest_fpu::xfd when write to xfd is
intercepted.
* Always trapping #NM when xfd write interception is disabled
- Use better name for #NM related functions (Sean)
- Drop '#ifdef CONFIG_X86_64' in __kvm_set_xcr (Sean)
- Update description for KVM_CAP_XSAVE2 and prevent the guest from
using the wrong ioctl (Sean)
- Replace 'xfd_out_of_sync' with a better name (Sean)
v2->v3:
- Trap #NM until write IA32_XFD with a non-zero value (Thomas)
- Revise return value in __xstate_request_perm() (Thomas)
- Revise doc for KVM_GET_SUPPORTED_CPUID (Paolo)
- Add Thomas's reviewed-by on one patch
- Reorder disabling read interception of XFD_ERR patch (Paolo)
- Move disabling r/w interception of XFD from x86.c to vmx.c (Paolo)
- Provide the API doc together with the new KVM_GET_XSAVE2 ioctl (Paolo)
- Make KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) return minimum size of struct
kvm_xsave (4K) (Paolo)
- Request permission at the start of vm_create_with_vcpus() in selftest
- Request permission conditionally when XFD is supported (Paolo)
v1->v2:
- Live migration supported and verified with a selftest
- Rebase to Thomas's new series for guest fpstate reallocation [1]
- Expand fpstate at KVM_SET_CPUID2 instead of when emulating XCR0
and IA32_XFD (Thomas/Paolo)
- Accordingly remove all exit-to-userspace stuff
- Intercept #NM to save guest XFD_ERR and restore host/guest value
at preemption on/off boundary (Thomas)
- Accordingly remove all xfd_err logic in preemption callback and
fpu_swap_kvm_fpstate()
- Reuse KVM_SET_XSAVE to handle both legacy and expanded buffer (Paolo)
- Don't return dynamic bits w/o prctl() in KVM_GET_SUPPORTED_CPUID (Paolo)
- Check guest permissions for dynamic features in CPUID[0xD] instead
of only for AMX at KVM_SET_CPUID (Paolo)
- Remove dynamic bit check for 32-bit guest in __kvm_set_xcr() (Paolo)
- Fix CPUID emulation for 0x1d and 0x1e (Paolo)
- Move "disable interception" to the end of the series (Paolo)
This series brings AMX (Advanced Matrix eXtensions) virtualization support to
KVM. The preparatory series from Thomas [1] is also included.
A large portion of the changes in this series is to deal with eXtended Feature
Disable (XFD) which allows resizing of the fpstate buffer to support
dynamically-enabled XSTATE features with large state component (e.g. 8K for AMX).
There are a lot of simplications when comparing v5 to the original proposal [2]
and the first version [3]. Thanks to Thomas, Paolo and Sean for many good
suggestions.
The support is based on following key changes:
- Guest permissions for dynamically-enabled XSAVE features
Native tasks have to request permission via prctl() before touching
a dynamic-resized XSTATE compoenent. Introduce guest permissions
for the similar purpose. Userspace VMM is expected to request guest
permission only once when the first vCPU is created.
KVM checks guest permission in KVM_SET_CPUID2. Setting XFD in guest
cpuid w/o proper permissions fails this operation. In the meantime,
unpermitted features are also excluded in KVM_GET_SUPPORTED_CPUID.
- Extend fpstate reallocation mechanism to cover guest fpu
Unlike native tasks which have reallocation triggered from #NM
handler, guest fpstate reallocation is requested by KVM when it
identifies the intention on using dynamically-enabled XSAVE
features inside guest.
Extend fpu core to allow KVM request fpstate buffer expansion
for a guest fpu containter.
- Trigger fpstate reallocation in KVM
This could be done either before guest runs or until xfd is updated
in the emulation path. According to discussion [1] we decide to
go the former option in KVM_SET_CPUID2, with fpstate buffer sized
accordingly. This spares a lot of code and also avoid imposing an
ordered restore sequence (XCR0, XFD and XSTATE) to userspace VMM.
- RDMSR/WRMSR emulation for IA32_XFD
Because fpstate expansion is completed in KVM_SET_CPUID2, emulating
r/w access to IA32_XFD simply involves the xfd field in the guest
fpu container. If write and guest fpu is currently active, the
software state (guest_fpstate::xfd and per-cpu xfd cache) is also
updated.
- RDMSR/WRMSR emulation for XFD_ERR
When XFD causes an instruction to generate #NM, XFD_ERR contains
information about which disabled state components are being accessed.
It'd be problematic if the XFD_ERR value generated in guest is
consumed/clobbered by the host before the guest itself doing so.
Intercept #NM exception to save the guest XFD_ERR value when write
IA32_XFD with a non-zero value for 1st time. There is at most one
interception per guest task given a dynamic feature.
RDMSR/WRMSR emulation uses the saved value. The host value (always
ZERO outside of the host #NM handler) is restored before enabling
interrupts. The saved guest value is restored right before entering
the guest (with interrupts disabled).
- Get/set dynamic xfeature state for migration
Introduce new capability (KVM_CAP_XSAVE2) to deal with >4KB fpstate
buffer. Reading this capability returns the size of the current
guest fpstate (e.g. after expansion). Userspace VMM uses a new ioctl
(KVM_GET_XSAVE2) to read guest fpstate from the kernel and reuses
the existing ioctl (KVM_SET_XSAVE) to update guest fpsate to the
kernel. KVM_SET_XSAVE is extended to do properly_sized memdup_user()
based on the guest fpstate.
- Expose related cpuid bits to guest
The last step is to allow exposing XFD, AMX_TILE, AMX_INT8 and
AMX_BF16 in guest cpuid. Adding those bits into kvm_cpu_caps finally
activates all previous logics in this series
- Optimization: disable interception for IA32_XFD
IA32_XFD can be frequently updated by the guest, as it is part of
the task state and swapped in context switch when prev and next have
different XFD setting. Always intercepting WRMSR can easily cause
non-negligible overhead.
Disable r/w emulation for IA32_XFD after intercepting the first
WRMSR(IA32_XFD) with a non-zero value. However MSR passthrough
implies the software state (guest_fpstate::xfd and per-cpu xfd
cache) might be out of sync with MSR. This suggests KVM needs to
re-sync them at VM-exit before preemption is enabled.
Thanks Jun Nakajima and Kevin Tian for the design suggestions when this was
being internally worked on.
[1] https://lore.kernel.org/all/20211214022825.563892248@linutronix.de/
[2] https://www.spinics.net/lists/kvm/msg259015.html
[3] https://lore.kernel.org/lkml/20211208000359.2853257-1-yang.zhong@intel.com/
Thanks,
Yang
----
Guang Zeng (1):
kvm: x86: Add support for getting/setting expanded xstate buffer
Jing Liu (11):
kvm: x86: Fix xstate_required_size() to follow XSTATE alignment rule
kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID
x86/fpu: Make XFD initialization in __fpstate_reset() a function
argument
kvm: x86: Enable dynamic xfeatures at KVM_SET_CPUID2
kvm: x86: Add emulation for IA32_XFD
x86/fpu: Prepare xfd_err in struct fpu_guest
kvm: x86: Intercept #NM for saving IA32_XFD_ERR
kvm: x86: Emulate IA32_XFD_ERR for guest
kvm: x86: Disable RDMSR interception of IA32_XFD_ERR
kvm: x86: Add XCR0 support for Intel AMX
kvm: x86: Add CPUID support for Intel AMX
Kevin Tian (2):
x86/fpu: Provide fpu_update_guest_xfd() for IA32_XFD emulation
kvm: x86: Disable interception for IA32_XFD on demand
Sean Christopherson (1):
x86/fpu: Provide fpu_enable_guest_xfd_features() for KVM
Thomas Gleixner (5):
x86/fpu: Extend fpu_xstate_prctl() with guest permissions
x86/fpu: Prepare guest FPU for dynamically enabled FPU features
x86/fpu: Add guest support to xfd_enable_feature()
x86/fpu: Add uabi_size to guest_fpu
x86/fpu: Provide fpu_sync_guest_vmexit_xfd_state()
Wei Wang (1):
kvm: selftests: Add support for KVM_CAP_XSAVE2
Documentation/virt/kvm/api.rst | 46 +++++-
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/fpu/api.h | 11 ++
arch/x86/include/asm/fpu/types.h | 32 ++++
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/uapi/asm/kvm.h | 16 +-
arch/x86/include/uapi/asm/prctl.h | 26 ++--
arch/x86/kernel/fpu/core.c | 94 ++++++++++-
arch/x86/kernel/fpu/xstate.c | 147 +++++++++++-------
arch/x86/kernel/fpu/xstate.h | 15 +-
arch/x86/kernel/process.c | 2 +
arch/x86/kvm/cpuid.c | 86 +++++++---
arch/x86/kvm/cpuid.h | 2 +
arch/x86/kvm/vmx/vmcs.h | 5 +
arch/x86/kvm/vmx/vmx.c | 68 ++++++++
arch/x86/kvm/vmx/vmx.h | 2 +-
arch/x86/kvm/x86.c | 112 ++++++++++++-
include/uapi/linux/kvm.h | 4 +
tools/arch/x86/include/uapi/asm/kvm.h | 16 +-
tools/include/uapi/linux/kvm.h | 3 +
.../testing/selftests/kvm/include/kvm_util.h | 2 +
.../selftests/kvm/include/x86_64/processor.h | 10 ++
tools/testing/selftests/kvm/lib/kvm_util.c | 32 ++++
.../selftests/kvm/lib/x86_64/processor.c | 67 +++++++-
.../testing/selftests/kvm/x86_64/evmcs_test.c | 2 +-
tools/testing/selftests/kvm/x86_64/smm_test.c | 2 +-
.../testing/selftests/kvm/x86_64/state_test.c | 2 +-
.../kvm/x86_64/vmx_preemption_timer_test.c | 2 +-
28 files changed, 702 insertions(+), 107 deletions(-)
These patches are based on kvm/next, and are also available at:
https://github.com/mdroth/linux/commits/sev-selftests-ucall-rfc1
== BACKGROUND ==
These patches are a prerequisite for adding selftest support for SEV guests
and possibly other confidential computing implementations in the future.
They were motivated by a suggestion Paolo made in response to the initial
SEV selftest RFC:
https://lore.kernel.org/lkml/20211025035833.yqphcnf5u3lk4zgg@amd.com/T/#m95…
Since the changes touch multiple archs and ended up creating a bit more churn
than expected, I thought it would be a good idea to carve this out into a
separate standalone series for reviewers who may be more interested in the
ucall changes than anything SEV-related.
To summarize, x86 relies on a ucall based on using PIO intructions to generate
an exit to userspace and provide the GVA of a dynamically-allocated ucall
struct that resides in guest memory and contains information about how to
handle/interpret the exit. This doesn't work for SEV guests for 3 main reasons:
1) The guest memory is generally encrypted during run-time, so the guest
needs to ensure the ucall struct is allocated in shared memory.
2) The guest page table is also encrypted, so the address would need to be a
GPA instead of a GVA.
3) The guest vCPU register may also be encrypted in the case of
SEV-ES/SEV-SNP, so the approach of examining vCPU register state has
additional requirements such as requiring guest code to implement a #VC
handler that can provide the appropriate registers via a vmgexit.
To address these issues, the SEV selftest RFC1 patchset introduced a set of new
SEV-specific interfaces that closely mirrored the functionality of
ucall()/get_ucall(), but relied on a pre-allocated/static ucall buffer in
shared guest memory so it that guest code could pass messages/state to the host
by simply writing to this pre-arranged shared memory region and then generating
an exit to userspace (via a halt instruction).
Paolo suggested instead implementing support for test/guest-specific ucall
implementations that could be used as an alternative to the default PIO-based
ucall implementations as-needed based on test/guest requirements, while still
allowing for tests to use a common set interfaces like ucall()/get_ucall().
== OVERVIEW ==
This series implements the above functionality by introducing a new ucall_ops
struct that can be used to register a particular ucall implementation as need,
then re-implements x86/arm64/s390x in terms of the ucall_ops.
But for the purposes of introducing a new ucall_ops implementation appropriate
for SEV, there are a couple issues that resulted in the need for some additional
ucall interfaces as well:
a) ucall() doesn't take a pointer to the ucall struct it modifies, so to make
it work in the case of an implementation that relies a pre-allocated ucall
struct in shared guest memory some sort of global lookup functionality
would be needed to locate the appropriate ucall struct for a particular
VM/vcpu combination, and this would need to be made accessible for use by
the guest as well. guests would then need some way of determining what
VM/vcpu identifiers they need to use to do the lookup, which to do reliably
would likely require seeding the guest with those identifiers in advance,
which is possible, but much more easily achievable by simply adding a
ucall() alternative that accepts a pointer to the ucall struct for that
particular VM/vcpu.
b) get_ucall() *does* take a pointer to a ucall struct, but currently zeroes
it out and uses it to copy the guest's ucall struct into. It *could* be
re-purposed to handle the case where the pointer is an actual pointer to
the ucall struct in shared guest memory, but that could cause problems
since callers would need some idea of what the underlying ucall
implementation expects. Ideally the interfaces would be agnostic to the
ucall implementation.
So to address those issues, this series also allows ucall implementations to
optionally be extended to support a set of 'shared' ops that are used in the
following manner:
host:
uc_gva = ucall_shared_alloc()
setup_vm_args(vm, uc_gva)
guest:
ucall_shared(uc_gva, ...)
host:
uget_ucall_shared(uc_gva, ...)
and then implements a new ucall implementation, ucall_ops_halt, based around
these shared interfaces and halt instructions.
While this doesn't really meet the initial goal of re-using the existing
ucall interfaces as-is, the hope is that these *_shared interfaces are
general enough to be re-usable things other than SEV, or at least improve on
code readability over the initial SEV-specific interfaces.
Any review/comments are greatly appreciated!
----------------------------------------------------------------
Michael Roth (10):
kvm: selftests: move base kvm_util.h declarations to kvm_util_base.h
kvm: selftests: move ucall declarations into ucall_common.h
kvm: selftests: introduce ucall_ops for test/arch-specific ucall implementations
kvm: arm64: selftests: use ucall_ops to define default ucall implementation
(COMPILE-TESTED ONLY) kvm: s390: selftests: use ucall_ops to define default ucall implementation
kvm: selftests: add ucall interfaces based around shared memory
kvm: selftests: add ucall_shared ops for PIO
kvm: selftests: introduce ucall implementation based on halt instructions
kvm: selftests: add GUEST_SHARED_* macros for shared ucall implementations
kvm: selftests: add ucall_test to test various ucall functionality
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 5 +-
.../testing/selftests/kvm/include/aarch64/ucall.h | 18 +
tools/testing/selftests/kvm/include/kvm_util.h | 408 +--------------------
.../testing/selftests/kvm/include/kvm_util_base.h | 368 +++++++++++++++++++
tools/testing/selftests/kvm/include/s390x/ucall.h | 18 +
tools/testing/selftests/kvm/include/ucall_common.h | 147 ++++++++
tools/testing/selftests/kvm/include/x86_64/ucall.h | 19 +
tools/testing/selftests/kvm/lib/aarch64/ucall.c | 43 +--
tools/testing/selftests/kvm/lib/s390x/ucall.c | 45 +--
tools/testing/selftests/kvm/lib/ucall_common.c | 133 +++++++
tools/testing/selftests/kvm/lib/x86_64/ucall.c | 82 +++--
tools/testing/selftests/kvm/ucall_test.c | 182 +++++++++
13 files changed, 982 insertions(+), 487 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/aarch64/ucall.h
create mode 100644 tools/testing/selftests/kvm/include/kvm_util_base.h
create mode 100644 tools/testing/selftests/kvm/include/s390x/ucall.h
create mode 100644 tools/testing/selftests/kvm/include/ucall_common.h
create mode 100644 tools/testing/selftests/kvm/include/x86_64/ucall.h
create mode 100644 tools/testing/selftests/kvm/lib/ucall_common.c
create mode 100644 tools/testing/selftests/kvm/ucall_test.c
amt.sh test script will not work because it doesn't have execution
permission. So, it adds execution permission.
Reported-by: Hangbin Liu <liuhangbin(a)gmail.com>
Fixes: c08e8baea78e ("selftests: add amt interface selftest script")
Signed-off-by: Taehee Yoo <ap420073(a)gmail.com>
---
tools/testing/selftests/net/amt.sh | 0
1 file changed, 0 insertions(+), 0 deletions(-)
mode change 100644 => 100755 tools/testing/selftests/net/amt.sh
diff --git a/tools/testing/selftests/net/amt.sh b/tools/testing/selftests/net/amt.sh
old mode 100644
new mode 100755
--
2.17.1
From: Menglong Dong <imagedong(a)tencent.com>
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:
exit:
err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
if (err) {
inet->inet_saddr = inet->inet_rcv_saddr = 0;
goto out_release_sock;
}
Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
called success. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is wired, as the 'bind()' is already failed.
To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().
Therefore, after sys_bind() fail caused by
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
means that it can try to be binded to another port.
The second patch is the selftests for this modification.
Changes since v2:
- NULL check for sk->sk_prot->put_port
Changes since v1:
- introduce 'put_port' field for 'struct proto'
- add selftests for it
Menglong Dong (2):
net: bpf: handle return value of
BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
bpf: selftests: add bind retry for post_bind{4, 6}
include/net/sock.h | 1 +
net/ipv4/af_inet.c | 2 +
net/ipv4/ping.c | 1 +
net/ipv4/tcp_ipv4.c | 1 +
net/ipv4/udp.c | 1 +
net/ipv6/af_inet6.c | 2 +
net/ipv6/ping.c | 1 +
net/ipv6/tcp_ipv6.c | 1 +
net/ipv6/udp.c | 1 +
tools/testing/selftests/bpf/test_sock.c | 166 +++++++++++++++++++++---
10 files changed, 157 insertions(+), 20 deletions(-)
--
2.27.0
Highly appreciate for your review. This version mostly addressed the comments
from Sean. Most comments are adopted except three which are not closed and
need more discussions:
- Move the entire xfd write emulation code to x86.c. Doing so requires
introducing a new kvm_x86_ops callback to disable msr write bitmap.
According to Paolo's earlier comment he prefers to handle it in vmx.c.
- Directly check msr_bitmap in update_exception_bitmap() (for
trapping #NM) and vcpu_enter_guest() (for syncing guest xfd after
vm-exit) instead of introducing an extra flag in the last patch. However,
doing so requires another new kvm_x86_ops callback for checking
msr_bitmap since vcpu_enter_guest() is x86 common code. Having an
extra flag sounds simpler here (at least for the initial AMX support).
It does penalize nested guest with one xfd sync per exit, but it's not
worse than a normal guest which initializes xfd but doesn't run
AMX applications at all. Those could be improved afterwards.
- Disable #NM trap for nested guest. This version still chooses to always
trap #NM (regardless in L1 or L2) as long as xfd write interception is disabled.
In reality #NM is rare if nested guest doesn't intend to run AMX applications
and always-trap is safer than dynamic trap for the basic support in case
of any oversight here.
(Jing is temporarily leave for family reason, Yang helped work out this version)
----
v3->v4:
- Verify kvm selftest for AMX (Paolo)
- Move fpstate buffer expansion from kvm_vcpu_after_set_cpuid () to
kvm_check_cpuid() and improve patch description (Sean)
- Drop 'preemption' word in #NM interception patch (Sean)
- Remove 'trap_nm' flag. Replace it by: (Sean)
* Trapping #NM according to guest_fpu::xfd when write to xfd is
intercepted.
* Always trapping #NM when xfd write interception is disabled
- Use better name for #NM related functions (Sean)
- Drop '#ifdef CONFIG_X86_64' in __kvm_set_xcr (Sean)
- Update description for KVM_CAP_XSAVE2 and prevent the guest from
using the wrong ioctl (Sean)
- Replace 'xfd_out_of_sync' with a better name (Sean)
v2->v3:
- Trap #NM until write IA32_XFD with a non-zero value (Thomas)
- Revise return value in __xstate_request_perm() (Thomas)
- Revise doc for KVM_GET_SUPPORTED_CPUID (Paolo)
- Add Thomas's reviewed-by on one patch
- Reorder disabling read interception of XFD_ERR patch (Paolo)
- Move disabling r/w interception of XFD from x86.c to vmx.c (Paolo)
- Provide the API doc together with the new KVM_GET_XSAVE2 ioctl (Paolo)
- Make KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) return minimum size of struct
kvm_xsave (4K) (Paolo)
- Request permission at the start of vm_create_with_vcpus() in selftest
- Request permission conditionally when XFD is supported (Paolo)
v1->v2:
- Live migration supported and verified with a selftest
- Rebase to Thomas's new series for guest fpstate reallocation [1]
- Expand fpstate at KVM_SET_CPUID2 instead of when emulating XCR0
and IA32_XFD (Thomas/Paolo)
- Accordingly remove all exit-to-userspace stuff
- Intercept #NM to save guest XFD_ERR and restore host/guest value
at preemption on/off boundary (Thomas)
- Accordingly remove all xfd_err logic in preemption callback and
fpu_swap_kvm_fpstate()
- Reuse KVM_SET_XSAVE to handle both legacy and expanded buffer (Paolo)
- Don't return dynamic bits w/o prctl() in KVM_GET_SUPPORTED_CPUID (Paolo)
- Check guest permissions for dynamic features in CPUID[0xD] instead
of only for AMX at KVM_SET_CPUID (Paolo)
- Remove dynamic bit check for 32-bit guest in __kvm_set_xcr() (Paolo)
- Fix CPUID emulation for 0x1d and 0x1e (Paolo)
- Move "disable interception" to the end of the series (Paolo)
This series brings AMX (Advanced Matrix eXtensions) virtualization support
to KVM. The preparatory series from Thomas [1] is also included.
A large portion of the changes in this series is to deal with eXtended
Feature Disable (XFD) which allows resizing of the fpstate buffer to
support dynamically-enabled XSTATE features with large state component
(e.g. 8K for AMX).
There are a lot of simplications when comparing v2/v3 to the original
proposal [2] and the first version [3]. Thanks to Thomas and Paolo for
many good suggestions.
The support is based on following key changes:
- Guest permissions for dynamically-enabled XSAVE features
Native tasks have to request permission via prctl() before touching
a dynamic-resized XSTATE compoenent. Introduce guest permissions
for the similar purpose. Userspace VMM is expected to request guest
permission only once when the first vCPU is created.
KVM checks guest permission in KVM_SET_CPUID2. Setting XFD in guest
cpuid w/o proper permissions fails this operation. In the meantime,
unpermitted features are also excluded in KVM_GET_SUPPORTED_CPUID.
- Extend fpstate reallocation mechanism to cover guest fpu
Unlike native tasks which have reallocation triggered from #NM
handler, guest fpstate reallocation is requested by KVM when it
identifies the intention on using dynamically-enabled XSAVE
features inside guest.
Extend fpu core to allow KVM request fpstate buffer expansion
for a guest fpu containter.
- Trigger fpstate reallocation in KVM
This could be done either statically (before guest runs) or
dynamically (in the emulation path). According to discussion [1]
we decide to statically enable all xfeatures allowed by guest perm
in KVM_SET_CPUID2, with fpstate buffer sized accordingly. This spares
a lot of code and also avoid imposing an ordered restore sequence
(XCR0, XFD and XSTATE) to userspace VMM.
- RDMSR/WRMSR emulation for IA32_XFD
Because fpstate expansion is completed in KVM_SET_CPUID2, emulating
r/w access to IA32_XFD simply involves the xfd field in the guest
fpu container. If write and guest fpu is currently active, the
software state (guest_fpstate::xfd and per-cpu xfd cache) is also
updated.
- RDMSR/WRMSR emulation for XFD_ERR
When XFD causes an instruction to generate #NM, XFD_ERR contains
information about which disabled state components are being accessed.
It'd be problematic if the XFD_ERR value generated in guest is
consumed/clobbered by the host before the guest itself doing so.
Intercept #NM exception to save the guest XFD_ERR value when write
IA32_XFD with a non-zero value for 1st time. There is at most one
interception per guest task given a dynamic feature.
RDMSR/WRMSR emulation uses the saved value. The host value (always
ZERO outside of the host #NM handler) is restored before enabling
preemption. The saved guest value is restored right before entering
the guest (with preemption disabled).
- Get/set dynamic xfeature state for migration
Introduce new capability (KVM_CAP_XSAVE2) to deal with >4KB fpstate
buffer. Reading this capability returns the size of the current
guest fpstate (e.g. after expansion). Userspace VMM uses a new ioctl
(KVM_GET_XSAVE2) to read guest fpstate from the kernel and reuses
the existing ioctl (KVM_SET_XSAVE) to update guest fpsate to the
kernel. KVM_SET_XSAVE is extended to do properly_sized memdup_user()
based on the guest fpstate.
- Expose related cpuid bits to guest
The last step is to allow exposing XFD, AMX_TILE, AMX_INT8 and
AMX_BF16 in guest cpuid. Adding those bits into kvm_cpu_caps finally
activates all previous logics in this series
- Optimization: disable interception for IA32_XFD
IA32_XFD can be frequently updated by the guest, as it is part of
the task state and swapped in context switch when prev and next have
different XFD setting. Always intercepting WRMSR can easily cause
non-negligible overhead.
Disable r/w emulation for IA32_XFD after intercepting the first
WRMSR(IA32_XFD) with a non-zero value. However MSR passthrough
implies the software state (guest_fpstate::xfd and per-cpu xfd
cache) might be out of sync with MSR. This suggests KVM needs to
re-sync them at VM-exit before preemption is enabled.
Thanks Jun Nakajima and Kevin Tian for the design suggestions when this
version is being internally worked on.
[1] https://lore.kernel.org/all/20211214022825.563892248@linutronix.de/
[2] https://www.spinics.net/lists/kvm/msg259015.html
[3] https://lore.kernel.org/lkml/20211208000359.2853257-1-yang.zhong@intel.com/
Thanks,
Yang
---
Guang Zeng (1):
kvm: x86: Add support for getting/setting expanded xstate buffer
Jing Liu (11):
kvm: x86: Fix xstate_required_size() to follow XSTATE alignment rule
kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID
x86/fpu: Make XFD initialization in __fpstate_reset() a function
argument
kvm: x86: Check and enable permitted dynamic xfeatures at
KVM_SET_CPUID2
kvm: x86: Add emulation for IA32_XFD
x86/fpu: Prepare xfd_err in struct fpu_guest
kvm: x86: Intercept #NM for saving IA32_XFD_ERR
kvm: x86: Emulate IA32_XFD_ERR for guest
kvm: x86: Disable RDMSR interception of IA32_XFD_ERR
kvm: x86: Add XCR0 support for Intel AMX
kvm: x86: Add CPUID support for Intel AMX
Kevin Tian (3):
x86/fpu: Provide fpu_update_guest_perm_features() for guest
x86/fpu: Provide fpu_update_guest_xfd() for IA32_XFD emulation
kvm: x86: Disable interception for IA32_XFD on demand
Thomas Gleixner (5):
x86/fpu: Extend fpu_xstate_prctl() with guest permissions
x86/fpu: Prepare guest FPU for dynamically enabled FPU features
x86/fpu: Add guest support to xfd_enable_feature()
x86/fpu: Add uabi_size to guest_fpu
x86/fpu: Provide fpu_sync_guest_vmexit_xfd_state()
Wei Wang (1):
kvm: selftests: Add support for KVM_CAP_XSAVE2
Documentation/virt/kvm/api.rst | 46 +++++-
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/fpu/api.h | 11 ++
arch/x86/include/asm/fpu/types.h | 32 ++++
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/uapi/asm/kvm.h | 16 +-
arch/x86/include/uapi/asm/prctl.h | 26 ++--
arch/x86/kernel/fpu/core.c | 104 ++++++++++++-
arch/x86/kernel/fpu/xstate.c | 147 +++++++++++-------
arch/x86/kernel/fpu/xstate.h | 15 +-
arch/x86/kernel/process.c | 2 +
arch/x86/kvm/cpuid.c | 99 +++++++++---
arch/x86/kvm/vmx/vmcs.h | 5 +
arch/x86/kvm/vmx/vmx.c | 45 +++++-
arch/x86/kvm/vmx/vmx.h | 2 +-
arch/x86/kvm/x86.c | 105 ++++++++++++-
include/uapi/linux/kvm.h | 4 +
tools/arch/x86/include/uapi/asm/kvm.h | 16 +-
tools/include/uapi/linux/kvm.h | 3 +
.../testing/selftests/kvm/include/kvm_util.h | 2 +
.../selftests/kvm/include/x86_64/processor.h | 10 ++
tools/testing/selftests/kvm/lib/kvm_util.c | 32 ++++
.../selftests/kvm/lib/x86_64/processor.c | 67 +++++++-
.../testing/selftests/kvm/x86_64/evmcs_test.c | 2 +-
tools/testing/selftests/kvm/x86_64/smm_test.c | 2 +-
.../testing/selftests/kvm/x86_64/state_test.c | 2 +-
.../kvm/x86_64/vmx_preemption_timer_test.c | 2 +-
27 files changed, 691 insertions(+), 109 deletions(-)
These patches and are also available at:
https://github.com/mdroth/linux/commits/sev-selftests-v2
They are based on top of the recent RFC:
"KVM: selftests: Add support for test-selectable ucall implementations"
https://lore.kernel.org/all/20211210164620.11636-1-michael.roth@amd.com/T/https://github.com/mdroth/linux/commits/sev-selftests-ucall-rfc1
which provides a new ucall implementation that this series relies on.
Those patches were in turn based on kvm/next as of 2021-12-10.
== OVERVIEW ==
This series introduces a set of memory encryption-related parameter/hooks
in the core kselftest library, then uses the hooks to implement a small
library for creating/managing SEV, SEV-ES, and (eventually) SEV-SNP guests.
This library is then used to implement a basic boot/memory test that's run
for variants of SEV/SEV-ES guests.
- Patches 1-8 implement SEV boot tests and should run against existing
kernels
- Patch 9 is a KVM changes that's required to allow SEV-ES/SEV-SNP
guests to boot with an externally generated page table, and is a
host kernel prequisite for the remaining patches in the series.
- Patches 10-13 extend the boot tests to cover SEV-ES
Any review/comments are greatly appreciated!
v2:
- rebased on ucall_ops patchset (which is based on kvm/next 2021-12-10)
- remove SEV-SNP support for now
- provide encryption bitmap as const* to original rather than as a copy
(Mingwei, Paolo)
- drop SEV-specific synchronization helpers in favor of ucall_ops_halt (Paolo)
- don't pass around addresses with c-bit included, add them as-needed via
addr_gpa2raw() (e.g. when adding PTEs, or initializing initial
cr3/vm->pgd) (Paolo)
- rename lib/sev.c functions for better consistency (Krish)
- move more test setup code out of main test function and into
setup_test_common() (Krish)
- suppress compiler warnings due to -Waddress-of-packed-member like kernel
does
- don't require SNP support in minimum firmware version detection (Marc)
- allow SEV device path to be configured via make SEV_PATH= (Marc)
----------------------------------------------------------------
Michael Roth (13):
KVM: selftests: move vm_phy_pages_alloc() earlier in file
KVM: selftests: sparsebit: add const where appropriate
KVM: selftests: add hooks for managing encrypted guest memory
KVM: selftests: handle encryption bits in page tables
KVM: selftests: add support for encrypted vm_vaddr_* allocations
KVM: selftests: ensure ucall_shared_alloc() allocates shared memory
KVM: selftests: add library for creating/interacting with SEV guests
KVM: selftests: add SEV boot tests
KVM: SVM: include CR3 in initial VMSA state for SEV-ES guests
KVM: selftests: account for error code in #VC exception frame
KVM: selftests: add support for creating SEV-ES guests
KVM: selftests: add library for handling SEV-ES-related exits
KVM: selftests: add SEV-ES boot tests
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/svm/svm.c | 19 ++
arch/x86/kvm/vmx/vmx.c | 6 +
arch/x86/kvm/x86.c | 1 +
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 10 +-
.../testing/selftests/kvm/include/kvm_util_base.h | 10 +
tools/testing/selftests/kvm/include/sparsebit.h | 36 +--
tools/testing/selftests/kvm/include/x86_64/sev.h | 44 +++
.../selftests/kvm/include/x86_64/sev_exitlib.h | 14 +
tools/testing/selftests/kvm/include/x86_64/svm.h | 35 +++
.../selftests/kvm/include/x86_64/svm_util.h | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 270 ++++++++++++------
.../testing/selftests/kvm/lib/kvm_util_internal.h | 10 +
tools/testing/selftests/kvm/lib/sparsebit.c | 48 ++--
tools/testing/selftests/kvm/lib/ucall_common.c | 4 +-
tools/testing/selftests/kvm/lib/x86_64/handlers.S | 4 +-
tools/testing/selftests/kvm/lib/x86_64/processor.c | 16 +-
tools/testing/selftests/kvm/lib/x86_64/sev.c | 252 ++++++++++++++++
.../testing/selftests/kvm/lib/x86_64/sev_exitlib.c | 249 ++++++++++++++++
.../selftests/kvm/x86_64/sev_all_boot_test.c | 316 +++++++++++++++++++++
22 files changed, 1215 insertions(+), 133 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/x86_64/sev.h
create mode 100644 tools/testing/selftests/kvm/include/x86_64/sev_exitlib.h
create mode 100644 tools/testing/selftests/kvm/lib/x86_64/sev.c
create mode 100644 tools/testing/selftests/kvm/lib/x86_64/sev_exitlib.c
create mode 100644 tools/testing/selftests/kvm/x86_64/sev_all_boot_test.c
Highly appreciate for your review. We will continue working on
remaining selftest and send out later.
TODO:
- kvm selftest for AMX is still in progress;
----
v2->v3:
- Trap #NM until write IA32_XFD with a non-zero value (Thomas)
- Revise return value in __xstate_request_perm() (Thomas)
- Revise doc for KVM_GET_SUPPORTED_CPUID (Paolo)
- Add Thomas's reviewed-by on one patch
- Reorder disabling read interception of XFD_ERR patch (Paolo)
- Move disabling r/w interception of XFD from x86.c to vmx.c (Paolo)
- Provide the API doc together with the new KVM_GET_XSAVE2 ioctl (Paolo)
- Make KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) return minimum size of struct
kvm_xsave (4K) (Paolo)
- Request permission at the start of vm_create_with_vcpus() in selftest
- Request permission conditionally when XFD is supported (Paolo)
v1->v2:
- Live migration supported and verified with a selftest
- Rebase to Thomas's new series for guest fpstate reallocation [1]
- Expand fpstate at KVM_SET_CPUID2 instead of when emulating XCR0
and IA32_XFD (Thomas/Paolo)
- Accordingly remove all exit-to-userspace stuff
- Intercept #NM to save guest XFD_ERR and restore host/guest value
at preemption on/off boundary (Thomas)
- Accordingly remove all xfd_err logic in preemption callback and
fpu_swap_kvm_fpstate()
- Reuse KVM_SET_XSAVE to handle both legacy and expanded buffer (Paolo)
- Don't return dynamic bits w/o prctl() in KVM_GET_SUPPORTED_CPUID (Paolo)
- Check guest permissions for dynamic features in CPUID[0xD] instead
of only for AMX at KVM_SET_CPUID (Paolo)
- Remove dynamic bit check for 32-bit guest in __kvm_set_xcr() (Paolo)
- Fix CPUID emulation for 0x1d and 0x1e (Paolo)
- Move "disable interception" to the end of the series (Paolo)
This series brings AMX (Advanced Matrix eXtensions) virtualization
support to KVM. The preparatory series from Thomas [1] is also included.
A large portion of the changes in this series is to deal with eXtended
Feature Disable (XFD) which allows resizing of the fpstate buffer to
support dynamically-enabled XSTATE features with large state component
(e.g. 8K for AMX).
There are a lot of simplications when comparing v2/v3 to the original
proposal [2] and the first version [3]. Thanks to Thomas and Paolo for
many good suggestions.
The support is based on following key changes:
- Guest permissions for dynamically-enabled XSAVE features
Native tasks have to request permission via prctl() before touching
a dynamic-resized XSTATE compoenent. Introduce guest permissions
for the similar purpose. Userspace VMM is expected to request guest
permission only once when the first vCPU is created.
KVM checks guest permission in KVM_SET_CPUID2. Setting XFD in guest
cpuid w/o proper permissions fails this operation. In the meantime,
unpermitted features are also excluded in KVM_GET_SUPPORTED_CPUID.
- Extend fpstate reallocation mechanism to cover guest fpu
Unlike native tasks which have reallocation triggered from #NM
handler, guest fpstate reallocation is requested by KVM when it
identifies the intention on using dynamically-enabled XSAVE
features inside guest.
Extend fpu core to allow KVM request fpstate buffer expansion
for a guest fpu containter.
- Trigger fpstate reallocation in KVM
This could be done either statically (before guest runs) or
dynamically (in the emulation path). According to discussion [1]
we decide to statically enable all xfeatures allowed by guest perm
in KVM_SET_CPUID2, with fpstate buffer sized accordingly. This spares
a lot of code and also avoid imposing an ordered restore sequence
(XCR0, XFD and XSTATE) to userspace VMM.
- RDMSR/WRMSR emulation for IA32_XFD
Because fpstate expansion is completed in KVM_SET_CPUID2, emulating
r/w access to IA32_XFD simply involves the xfd field in the guest
fpu container. If write and guest fpu is currently active, the
software state (guest_fpstate::xfd and per-cpu xfd cache) is also
updated.
- RDMSR/WRMSR emulation for XFD_ERR
When XFD causes an instruction to generate #NM, XFD_ERR contains
information about which disabled state components are being accessed.
It'd be problematic if the XFD_ERR value generated in guest is
consumed/clobbered by the host before the guest itself doing so.
Intercept #NM exception to save the guest XFD_ERR value when write
IA32_XFD with a non-zero value for 1st time. There is at most one
interception per guest task given a dynamic feature.
RDMSR/WRMSR emulation uses the saved value. The host value (always
ZERO outside of the host #NM handler) is restored before enabling
preemption. The saved guest value is restored right before entering
the guest (with preemption disabled).
- Get/set dynamic xfeature state for migration
Introduce new capability (KVM_CAP_XSAVE2) to deal with >4KB fpstate
buffer. Reading this capability returns the size of the current
guest fpstate (e.g. after expansion). Userspace VMM uses a new ioctl
(KVM_GET_XSAVE2) to read guest fpstate from the kernel and reuses
the existing ioctl (KVM_SET_XSAVE) to update guest fpsate to the
kernel. KVM_SET_XSAVE is extended to do properly_sized memdup_user()
based on the guest fpstate.
- Expose related cpuid bits to guest
The last step is to allow exposing XFD, AMX_TILE, AMX_INT8 and
AMX_BF16 in guest cpuid. Adding those bits into kvm_cpu_caps finally
activates all previous logics in this series
- Optimization: disable interception for IA32_XFD
IA32_XFD can be frequently updated by the guest, as it is part of
the task state and swapped in context switch when prev and next have
different XFD setting. Always intercepting WRMSR can easily cause
non-negligible overhead.
Disable r/w emulation for IA32_XFD after intercepting the first
WRMSR(IA32_XFD) with a non-zero value. However MSR passthrough
implies the software state (guest_fpstate::xfd and per-cpu xfd
cache) might be out of sync with MSR. This suggests KVM needs to
re-sync them at VM-exit before preemption is enabled.
To verify AMX virtualization overhead on non-AMX usages, we run the
Phoronix kernel build test in the guest w/ and w/o AMX in cpuid. The
result shows no observable difference between two configurations.
Thanks Jun Nakajima and Kevin Tian for the design suggestions when
this version is being internally worked on.
[1] https://lore.kernel.org/all/20211214022825.563892248@linutronix.de/
[2] https://www.spinics.net/lists/kvm/msg259015.html
[3] https://lore.kernel.org/lkml/20211208000359.2853257-1-yang.zhong@intel.com/
Thanks,
Jing
---
Guang Zeng (1):
kvm: x86: Get/set expanded xstate buffer
Jing Liu (13):
kvm: x86: Fix xstate_required_size() to follow XSTATE alignment rule
kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID
kvm: x86: Check permitted dynamic xfeatures at KVM_SET_CPUID2
x86/fpu: Make XFD initialization in __fpstate_reset() a function
argument
kvm: x86: Enable dynamic XSAVE features at KVM_SET_CPUID2
kvm: x86: Add emulation for IA32_XFD
x86/fpu: Prepare xfd_err in struct fpu_guest
kvm: x86: Intercept #NM for saving IA32_XFD_ERR
kvm: x86: Emulate IA32_XFD_ERR for guest
kvm: x86: Disable RDMSR interception of IA32_XFD_ERR
kvm: x86: Add XCR0 support for Intel AMX
kvm: x86: Add CPUID support for Intel AMX
kvm: x86: Disable interception for IA32_XFD on demand
Kevin Tian (2):
x86/fpu: Provide fpu_update_guest_perm_features() for guest
x86/fpu: Provide fpu_update_guest_xfd() for IA32_XFD emulation
Thomas Gleixner (5):
x86/fpu: Extend fpu_xstate_prctl() with guest permissions
x86/fpu: Prepare guest FPU for dynamically enabled FPU features
x86/fpu: Add guest support to xfd_enable_feature()
x86/fpu: Add uabi_size to guest_fpu
x86/fpu: Provide fpu_sync_guest_vmexit_xfd_state()
Wei Wang (1):
kvm: selftests: Add support for KVM_CAP_XSAVE2
Documentation/virt/kvm/api.rst | 46 +++++-
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/fpu/api.h | 11 ++
arch/x86/include/asm/fpu/types.h | 32 ++++
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/uapi/asm/kvm.h | 16 +-
arch/x86/include/uapi/asm/prctl.h | 26 ++--
arch/x86/kernel/fpu/core.c | 104 ++++++++++++-
arch/x86/kernel/fpu/xstate.c | 147 +++++++++++-------
arch/x86/kernel/fpu/xstate.h | 15 +-
arch/x86/kernel/process.c | 2 +
arch/x86/kvm/cpuid.c | 93 ++++++++---
arch/x86/kvm/vmx/vmcs.h | 5 +
arch/x86/kvm/vmx/vmx.c | 32 +++-
arch/x86/kvm/vmx/vmx.h | 2 +-
arch/x86/kvm/x86.c | 102 +++++++++++-
include/uapi/linux/kvm.h | 4 +
tools/arch/x86/include/uapi/asm/kvm.h | 16 +-
tools/include/uapi/linux/kvm.h | 3 +
.../testing/selftests/kvm/include/kvm_util.h | 2 +
.../selftests/kvm/include/x86_64/processor.h | 10 ++
tools/testing/selftests/kvm/lib/kvm_util.c | 32 ++++
.../selftests/kvm/lib/x86_64/processor.c | 67 +++++++-
.../testing/selftests/kvm/x86_64/evmcs_test.c | 2 +-
tools/testing/selftests/kvm/x86_64/smm_test.c | 2 +-
.../testing/selftests/kvm/x86_64/state_test.c | 2 +-
.../kvm/x86_64/vmx_preemption_timer_test.c | 2 +-
27 files changed, 668 insertions(+), 111 deletions(-)
--
2.27.0