Recently, while trying to fix some unit tests, I found a CVE in the SVM nested code.
In the 'shutdown_interception' vmexit handler we call kvm_vcpu_reset.
However, if we are running nested and L1 doesn't intercept shutdown, we still
end up running this function and trigger a bug in it.
The bug is that this function resets 'vcpu->arch.hflags' without properly
leaving nested state. This leaves the vCPU in an inconsistent state, which
later triggers a kernel panic in the SVM code.
The same bug can likely be triggered by sending an INIT via the local APIC to
a vCPU which runs a nested guest.
On VMX we are lucky that the issue can't happen, because VMX always intercepts
triple faults, thus a triple fault in L2 is always redirected to L1.
In addition, VMX's 'handle_triple_fault' doesn't reset the vCPU.
An INIT IPI can't trigger the issue on VMX either, because INIT events are
masked while in VMX mode.
The first 4 patches in this series address the above issue; they were already
posted on the list under the title
'nSVM: fix L0 crash if L2 has shutdown condtion which L1 doesn't intercept'.
I addressed the review feedback and also added a unit test that hits this issue.
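
The core of the fix is to forcibly leave nested mode before performing the
reset. A minimal sketch of the idea, assuming the kvm_leave_nested() helper
added by this series (the exact call site and checks in the patches may
differ):

/* Sketch: in kvm_vcpu_reset(), leave nested mode first so that clearing
 * vcpu->arch.hflags can't leave the vCPU half-way inside L2. */
void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
	kvm_leave_nested(vcpu);	/* new helper introduced by this series */

	/* ... existing reset code, including the hflags reset ... */
}
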
In addition to these patches, I noticed that KVM doesn't honour the SHUTDOWN
intercept bit of L1 on SVM, and I included a fix to do so. It's only for
correctness, as a normal hypervisor should always intercept SHUTDOWN;
a unit test, on the other hand, might want not to.
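
Conceptually, the check looks roughly like the sketch below: before
synthesizing a SHUTDOWN vmexit for L1, consult the intercept bit that L1 set
in vmcb12. vmcb12_is_intercept() and nested_svm_vmexit() are existing KVM
helpers; the exact call site and return values in the patch may differ.

/* Sketch: only reflect a shutdown condition to L1 if L1 actually asked
 * for the SHUTDOWN intercept in vmcb12. */
if (!vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_SHUTDOWN))
	return 0;	/* let L0 handle the shutdown of L2 itself */

svm->vmcb->control.exit_code = SVM_EXIT_SHUTDOWN;
svm->vmcb->control.exit_info_1 = 0;
svm->vmcb->control.exit_info_2 = 0;
return nested_svm_vmexit(svm);
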
I also extended the triple_fault_test selftest to hit this issue.
Finally, I found another security issue: a way for the guest to trigger a
non-rate-limited kernel printk on SVM. The last patch in the series fixes
that.
A unit test I posted to the kvm-unit-tests project hits this issue, so
no selftest was added.
Best regards,
Maxim Levitsky
Maxim Levitsky (9):
KVM: x86: nSVM: leave nested mode on vCPU free
KVM: x86: nSVM: harden svm_free_nested against freeing vmcb02 while
still in use
KVM: x86: add kvm_leave_nested
KVM: x86: forcibly leave nested mode on vCPU reset
KVM: selftests: move idt_entry to header
kvm: selftests: add svm nested shutdown test
KVM: x86: allow L1 to not intercept triple fault
KVM: selftests: add svm part to triple_fault_test
KVM: x86: remove exit_int_info warning in svm_handle_exit
arch/x86/kvm/svm/nested.c | 12 +++-
arch/x86/kvm/svm/svm.c | 10 +--
arch/x86/kvm/vmx/nested.c | 4 +-
arch/x86/kvm/x86.c | 29 ++++++--
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/include/x86_64/processor.h | 13 ++++
.../selftests/kvm/lib/x86_64/processor.c | 13 ----
.../kvm/x86_64/svm_nested_shutdown_test.c | 71 +++++++++++++++++++
.../kvm/x86_64/triple_fault_event_test.c | 71 ++++++++++++++-----
10 files changed, 174 insertions(+), 51 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/svm_nested_shutdown_test.c
--
2.34.3
Hello,
This patch series implements an IOCTL on the pagemap procfs file to get
information about the page table entries (PTEs). The following operations
are supported by this ioctl:
- Get information on whether the pages are soft-dirty, file-mapped, present
  or swapped.
- Clear the soft-dirty PTE bit of the pages.
- Get and clear the soft-dirty PTE bit of the pages atomically.
The soft-dirty PTE bit of memory pages can be read by using the pagemap
procfs file. The soft-dirty PTE bit for the whole memory range of a
process can be cleared by writing to the clear_refs file. There are other
methods to mimic this information entirely in userspace, with poor
performance:
- The mprotect syscall and SIGSEGV handler for bookkeeping
- The userfaultfd syscall with the handler for bookkeeping
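
For reference, here is a minimal sketch (error handling omitted) of the
existing procfs interface mentioned above: bit 55 of a pagemap entry is the
soft-dirty bit, and writing "4" to clear_refs clears the soft-dirty bits of
the whole process.

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Return the soft-dirty bit (bit 55) of the pagemap entry for 'addr'. */
static int is_soft_dirty(void *addr)
{
	uint64_t entry = 0;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	pread(fd, &entry, sizeof(entry),
	      ((uintptr_t)addr / getpagesize()) * sizeof(entry));
	close(fd);
	return (entry >> 55) & 1;
}

/* Clear the soft-dirty bits for the entire process. */
static void clear_soft_dirty_all(void)
{
	int fd = open("/proc/self/clear_refs", O_WRONLY);

	write(fd, "4", 1);	/* "4" => clear only soft-dirty bits */
	close(fd);
}
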
Some benchmarks can be seen here[1]. This series adds capabilities that
weren't available earlier:
- Atomically getting the soft-dirty PTE bit status and clearing it.
- Clearing the soft-dirty PTE bit of only a part of memory.
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and for clearing the soft-dirty bit of all the pages of a process.
We have a use case where we need to track the soft-dirty PTE bit for
only specific pages on demand. We need this tracking and clearing mechanism
for a region of memory while the process is running, in order to emulate
the getWriteWatch() syscall of Windows. This syscall is used by games to
keep track of dirty pages and process only the dirty pages.
Information about whether a page is file-mapped, present or swapped is
required for the CRIU project[2][3]. The addition of the required mask,
any mask, excluded mask and return mask is also required for the CRIU
project[2].
The IOCTL returns the addresses of the pages which match the specified masks.
The page addresses are returned in struct page_region in a compact form.
The max_pages argument is needed to support the use case where the user only
wants to get a specific number of pages, so there is no need to find all the
pages of interest in the range when max_pages is specified. The IOCTL returns
when the maximum number of pages has been found. max_pages is optional. If
max_pages is specified, it must be equal to or greater than vec_size.
This restriction is needed to handle the worst case, in which one page_region
contains info about only one page and cannot be compacted. This is needed to
emulate the Windows getWriteWatch() syscall.
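
To illustrate the intended usage, here is a hypothetical sketch. The real
structure layouts, field names and the ioctl request number are defined in
include/uapi/linux/fs.h by patch 2; everything below is a placeholder for
illustration only.

#include <linux/types.h>

struct page_region {			/* placeholder layout */
	__u64 start;			/* address of first page in the region */
	__u64 len;			/* number of pages in the region */
	__u64 bitmap;			/* flags common to all these pages */
};

struct pagemap_scan_arg {		/* placeholder name and layout */
	__u64 start, len;		/* virtual address range to scan */
	__u64 vec, vec_size;		/* page_region array and its size */
	__u64 max_pages;		/* optional cap on reported pages */
	__u64 required_mask, anyof_mask, excluded_mask, return_mask;
	__u64 flags;			/* get / clear / get-and-clear, etc. */
};

/* n = ioctl(pagemap_fd, PAGEMAP_SCAN_IOCTL, &arg);  -- placeholder request
 * name; the call fills up to vec_size page_region entries. */
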
Some non-dirty pages get marked as dirty because of the kernel's internal
activity (such as VMA merging, since the soft-dirty bit difference isn't
considered when deciding to merge VMAs). The dirty bit of the pages is
stored in the VMA flags and in the per-page flags. If either of these two
bits is set, the page is considered to be soft-dirty. Suppose you have
cleared the soft-dirty bit of half of a VMA, which is done by splitting the
VMA and clearing the soft-dirty bit flag in that half of the VMA and in the
pages in it. Now the kernel may decide to merge the VMAs again, so the half
VMA becomes dirty again. This splitting/merging costs performance. The
application also receives a lot of pages which aren't dirty in reality but
are marked as dirty; performance is lost again here. In addition, sometimes
the user doesn't want newly allocated memory to be marked as dirty. The
PAGEMAP_NO_REUSED_REGIONS flag solves both problems. It is used to not
depend on the soft-dirty flag in the VMA flags, so no VMA splitting and
merging happens; only the soft-dirty bit of the individual pages is
consulted. Thus, when using this flag, there may be a scenario in which
memory regions which have just been created don't look dirty when seen
with the IOCTL, but look dirty when seen via procfs. This seems okay, as
the user of this flag knows the implications of using it.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[3] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (3):
fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit
fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about
PTEs
selftests: vm: add pagemap ioctl tests
fs/proc/task_mmu.c | 400 +++++++++++-
include/uapi/linux/fs.h | 53 ++
tools/include/uapi/linux/fs.h | 53 ++
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 5 +-
tools/testing/selftests/vm/pagemap_ioctl.c | 681 +++++++++++++++++++++
6 files changed, 1160 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
--
2.30.2