As part of the effort to start running KVM selftests nested, this patch series contains several fixes to the dirty_log_test that allow it to run reliably in a nested guest.
I also included a mostly-NOP change to KVM that reverses the order in which the PML log is read, to align more closely with what the hardware does. It should not affect regular users of dirty logging, but it fixes a test-specific assumption in the dirty_log_test dirty-ring mode.
Patch 4 fixes a very rare problem which is hard to reproduce with standard test parameters, but due to some weird timing it actually happened a few times on my machine, which prompted me to investigate it.
The issue can be reproduced reliably (without patch 4 applied) by running the test nested with a very short iteration time and a few iterations in a loop, like this:
while ./dirty_log_test -i 10 -I 1 -M dirty-ring ; do true ; done
Or, even better, it's possible to manually patch the test to not wait at all (effectively setting the iteration time to 0), in which case it fails pretty quickly.
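For example, something like this (untested, purely illustrative) on top of run_test() removes the per-iteration delay entirely:

	-		/* Give the vcpu thread some time to dirty some pages */
	-		usleep(p->interval * 1000);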
Best regards, Maxim Levitsky
Maxim Levitsky (4):
  KVM: VMX: read the PML log in the same order as it was written
  KVM: selftests: dirty_log_test: Limit s390x workaround to s390x
  KVM: selftests: dirty_log_test: run the guest until some dirty ring entries were harvested
  KVM: selftests: dirty_log_test: support multiple write retries
 arch/x86/kvm/vmx/vmx.c                       | 32 +++++---
 arch/x86/kvm/vmx/vmx.h                       |  1 +
 tools/testing/selftests/kvm/dirty_log_test.c | 79 +++++++++++++++++---
 3 files changed, 91 insertions(+), 21 deletions(-)
X86 spec specifies that the CPU writes to the PML log 'backwards' or in other words, it first writes entry 511, then entry 510 and so on, until it writes entry 0, after which the 'PML log full' VM exit happens.
I also confirmed on the bare metal that the CPU indeed writes the entries in this order.
KVM on the other hand, reads the entries in the opposite order, from the last written entry and towards entry 511 and dumps them in this order to the dirty ring.
Usually this doesn't matter, except for one complex nesting case:
KVM retries the instructions that cause MMU faults. This might cause an emulated PML log entry to be visible to L1's hypervisor before the actual memory write has been committed.
This happens when the L0 MMU fault is followed directly by the VM exit to L1, for example due to a pending L1 interrupt or due to the L1's 'PML log full' event.
This problem doesn't have a noticeable real-world impact, because this write retry is not much different from the guest writing to the same page multiple times, which is also not reflected in the dirty log. Users of dirty logging only rely on correct reporting of the clean pages; in other words, they assume that if a page is clean, no writes were committed to it since the moment it was marked clean.

However, KVM has the dirty_log_test selftest, which checks both the clean and the dirty pages against the memory contents, and which can fail if it detects a dirty page whose value at offset 0 (the offset the test writes) is stale.
To avoid failure, the test has a workaround for this specific problem:
The test skips checking the memory that belongs to the last dirty ring entry it has seen, relying on the fact that as long as memory writes are committed in order, only the last entry can belong to a not-yet-committed memory write.

However, since L1's KVM reads the PML log in the opposite direction from the order in which L0 wrote it, the last dirty ring entry often will not be the last entry written by L0.
To fix this, switch the order in which KVM reads the PML log.
Note that this issue is not present on bare metal, because there the update of the A/D bits of a present entry, the PML logging, and the actual memory write are all done by the CPU without any hypervisor intervention or pending interrupt evaluation; thus, once a 'PML log full' VM exit and/or a vCPU kick happens, all memory writes that are in the PML log have already been committed to memory.

The only exception to this rule is when the guest hits a not-present EPT entry, in which case KVM first reads (backwards) the PML log, dumps it to the dirty ring, and *then* sets up an SPTE with the A/D bits set and logs the write to the dirty ring, thus making that entry the last one in the dirty ring.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 32 +++++++++++++++++++++-----------
 arch/x86/kvm/vmx/vmx.h |  1 +
 2 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0f008f5ef6f0..6fb946b58a75 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6211,31 +6211,41 @@ static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u64 *pml_buf;
-	u16 pml_idx;
+	u16 pml_idx, pml_last_written_entry;
+	int i;
 
 	pml_idx = vmcs_read16(GUEST_PML_INDEX);
 
 	/* Do nothing if PML buffer is empty */
-	if (pml_idx == (PML_ENTITY_NUM - 1))
+	if (pml_idx == PML_LAST_ENTRY)
 		return;
+	/*
+	 * PML index always points to the next available PML buffer entity
+	 * unless PML log has just overflowed, in which case PML index will be
+	 * 0xFFFF.
+	 */
+	pml_last_written_entry = (pml_idx >= PML_ENTITY_NUM) ? 0 : pml_idx + 1;
 
-	/* PML index always points to next available PML buffer entity */
-	if (pml_idx >= PML_ENTITY_NUM)
-		pml_idx = 0;
-	else
-		pml_idx++;
-
+	/*
+	 * PML log is written backwards: the CPU first writes the entity 511
+	 * then the entity 510, and so on, until it writes the entity 0 at which
+	 * point the PML log full VM exit happens and the logging stops.
+	 *
+	 * Read the entries in the same order they were written, to ensure that
+	 * the dirty ring is filled in the same order the CPU wrote them.
+	 */
 	pml_buf = page_address(vmx->pml_pg);
-	for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
+
+	for (i = PML_LAST_ENTRY; i >= pml_last_written_entry; i--) {
 		u64 gpa;
 
-		gpa = pml_buf[pml_idx];
+		gpa = pml_buf[i];
 		WARN_ON(gpa & (PAGE_SIZE - 1));
 		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
 	}
 
 	/* reset PML index */
-	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
+	vmcs_write16(GUEST_PML_INDEX, PML_LAST_ENTRY);
 }
 
 static void vmx_dump_sel(char *name, uint32_t sel)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 43f573f6ca46..d14401c8e499 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -331,6 +331,7 @@ struct vcpu_vmx {
 
 	/* Support for PML */
 #define PML_ENTITY_NUM	512
+#define PML_LAST_ENTRY	(PML_ENTITY_NUM - 1)
 	struct page *pml_pg;
 
 	/* apic deadline value in host tsc */
On Wed, Dec 11, 2024, Maxim Levitsky wrote:
X86 spec specifies that the CPU writes to the PML log 'backwards'
SDM, because this is Intel specific.
or in other words, it first writes entry 511, then entry 510 and so on, until it writes entry 0, after which the 'PML log full' VM exit happens.
I also confirmed on the bare metal that the CPU indeed writes the entries in this order.
KVM on the other hand, reads the entries in the opposite order, from the last written entry and towards entry 511 and dumps them in this order to the dirty ring.
Usually this doesn't matter, except for one complex nesting case:
KVM retries the instructions that cause MMU faults. This might cause an emulated PML log entry to be visible to L1's hypervisor before the actual memory write has been committed.
This happens when the L0 MMU fault is followed directly by the VM exit to L1, for example due to a pending L1 interrupt or due to the L1's 'PML log full' event.
Hmm, this is an L0 bug. Exiting to L1 to deliver a pending IRQ in the middle of an instruction is a blatant architectural violation. As discussed in the RSM => SHUTDOWN thread[*], fixing this would require adding a flag to note that the vCPU needs to enter the guest before generating an exit to L1.
Oof. It's probably worse than that. For this case, KVM would need to ensure the original instruction *completed*. That would get really, really ugly. And for something like VSCATTER, where each write can be completed independently, trying to do the right thing for PML would be absurdly complex.
I'm not opposed to fudging around processing the PML log in the "correct" order, because irrespective of this bug, populating the dirty ring in the order in which accesses occurred is probably a good idea.
But, I can't help but wonder why KVM bothers emulating PML. I can appreciate that avoiding exits to L1 would be beneficial, but what use case actually cares about dirty logging performance in L1?
[*] https://lore.kernel.org/all/ZcY_GbqcFXH2pR5E@google.com
This problem doesn't have a noticeable real-world impact, because this write retry is not much different from the guest writing to the same page multiple times, which is also not reflected in the dirty log. Users of dirty logging only rely on correct reporting of the clean pages; in other words, they assume that if a page is clean, no writes were committed to it since the moment it was marked clean.

However, KVM has the dirty_log_test selftest, which checks both the clean and the dirty pages against the memory contents, and which can fail if it detects a dirty page whose value at offset 0 (the offset the test writes) is stale.
To avoid failure, the test has a workaround for this specific problem:
The test skips checking the memory that belongs to the last dirty ring entry it has seen, relying on the fact that as long as memory writes are committed in order, only the last entry can belong to a not-yet-committed memory write.

However, since L1's KVM reads the PML log in the opposite direction from the order in which L0 wrote it, the last dirty ring entry often will not be the last entry written by L0.
To fix this, switch the order in which KVM reads the PML log.
Note that this issue is not present on bare metal, because there the update of the A/D bits of a present entry, the PML logging, and the actual memory write are all done by the CPU without any hypervisor intervention or pending interrupt evaluation; thus, once a 'PML log full' VM exit and/or a vCPU kick happens, all memory writes that are in the PML log have already been committed to memory.

The only exception to this rule is when the guest hits a not-present EPT entry, in which case KVM first reads (backwards) the PML log, dumps it to the dirty ring, and *then* sets up an SPTE with the A/D bits set and logs the write to the dirty ring, thus making that entry the last one in the dirty ring.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>

 arch/x86/kvm/vmx/vmx.c | 32 +++++++++++++++++++++-----------
 arch/x86/kvm/vmx/vmx.h |  1 +
 2 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0f008f5ef6f0..6fb946b58a75 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6211,31 +6211,41 @@ static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u64 *pml_buf;
-	u16 pml_idx;
+	u16 pml_idx, pml_last_written_entry;
+	int i;
 
 	pml_idx = vmcs_read16(GUEST_PML_INDEX);
 
 	/* Do nothing if PML buffer is empty */
-	if (pml_idx == (PML_ENTITY_NUM - 1))
+	if (pml_idx == PML_LAST_ENTRY)
Heh, this is mildly confusing, in that the first entry filled is actually called the "last" entry by KVM. And then below, pml_last_written_entry could point at the first entry.
The best idea I can come up with is PML_HEAD_INDEX and then pml_last_written_entry becomes pml_tail_index. It's not a circular buffer, but I think/hope head/tail terminology would be intuitive for most readers.
E.g. the for-loop becomes:
	for (i = PML_HEAD_INDEX; i >= pml_tail_index; i--) {
		u64 gpa;

		gpa = pml_buf[i];
		WARN_ON(gpa & (PAGE_SIZE - 1));
		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
	}
 		return;
+	/*
+	 * PML index always points to the next available PML buffer entity
+	 * unless PML log has just overflowed, in which case PML index will be
If you don't have a strong preference, I vote to do s/entity/entry and then rename PML_ENTITY_NUM => NR_PML_ENTRIES (or maybe PML_LOG_NR_ENTRIES?). I find the existing "entity" terminology weird and unhelpful, and arguably wrong.
entity - a thing with distinct and independent existence.
The things being consumed are entries in a buffer.
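FWIW, putting the two renames together, the defines could end up looking something like this (illustrative only, exact names up for grabs):

	#define PML_LOG_NR_ENTRIES	512
	#define PML_HEAD_INDEX		(PML_LOG_NR_ENTRIES - 1)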
+	 * 0xFFFF.
+	 */
+	pml_last_written_entry = (pml_idx >= PML_ENTITY_NUM) ? 0 : pml_idx + 1;
 
-	/* PML index always points to next available PML buffer entity */
-	if (pml_idx >= PML_ENTITY_NUM)
-		pml_idx = 0;
-	else
-		pml_idx++;
-
+	/*
+	 * PML log is written backwards: the CPU first writes the entity 511
+	 * then the entity 510, and so on, until it writes the entity 0 at which
+	 * point the PML log full VM exit happens and the logging stops.
This is technically wrong. The PML Full exit only occurs on the next write. E.g. KVM could observe GUEST_PML_INDEX == -1 without ever seeing a PML Full exit.
If the PML index is not in the range 0–511, there is a page-modification log-full event and a VM exit occurs. In this case, the accessed or dirty flag is not set, and the guest-physical access that triggered the event does not occur.
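As an aside, the ordering difference is easy to see with a tiny userspace toy model (illustrative only, not KVM code; the names simply mirror the head/tail suggestion above). Only the head-to-tail read leaves the most recently written GPA as the last entry handed to the dirty ring, which is what the selftest's "skip the last entry" heuristic assumes:

#include <stdio.h>
#include <stdint.h>

#define NR_PML_ENTRIES	512
#define PML_HEAD_INDEX	(NR_PML_ENTRIES - 1)

int main(void)
{
	uint64_t pml_buf[NR_PML_ENTRIES];
	int pml_idx = PML_HEAD_INDEX;
	int tail, i;

	/* Simulate the CPU logging 8 writes, newest write at the lowest index. */
	for (i = 0; i < 8; i++)
		pml_buf[pml_idx--] = 0x1000ull * (i + 1);	/* fake GPAs */
	tail = pml_idx + 1;

	printf("current KVM order (tail -> head, newest first):\n");
	for (i = tail; i < NR_PML_ENTRIES; i++)
		printf("  gpa 0x%llx\n", (unsigned long long)pml_buf[i]);

	printf("patched order (head -> tail, oldest first):\n");
	for (i = PML_HEAD_INDEX; i >= tail; i--)
		printf("  gpa 0x%llx\n", (unsigned long long)pml_buf[i]);

	return 0;
}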
On Wed, 2024-12-11 at 16:44 -0800, Sean Christopherson wrote:
On Wed, Dec 11, 2024, Maxim Levitsky wrote:
X86 spec specifies that the CPU writes to the PML log 'backwards'
SDM, because this is Intel specific.
True.
or in other words, it first writes entry 511, then entry 510 and so on, until it writes entry 0, after which the 'PML log full' VM exit happens.
I also confirmed on the bare metal that the CPU indeed writes the entries in this order.
KVM on the other hand, reads the entries in the opposite order, from the last written entry and towards entry 511 and dumps them in this order to the dirty ring.
Usually this doesn't matter, except for one complex nesting case:
KVM retries the instructions that cause MMU faults. This might cause an emulated PML log entry to be visible to L1's hypervisor before the actual memory write has been committed.
This happens when the L0 MMU fault is followed directly by the VM exit to L1, for example due to a pending L1 interrupt or due to the L1's 'PML log full' event.
Hmm, this is an L0 bug. Exiting to L1 to deliver a pending IRQ in the middle of an instruction is a blatant architectural violation. As discussed in the RSM => SHUTDOWN thread[*], fixing this would require adding a flag to note that the vCPU needs to enter the guest before generating an exit to L1.
Agreed. Note that this is to some extent visible when not nested as well; for example, the workaround that the test already has is for the non-nested case.

One can argue that the dirty ring is not an x86 feature, but I am sure there are other, more complicated cases in which a retried write can be observed by this or other vCPUs in violation of the x86 spec.
Oof. It's probably worse than that. For this case, KVM would need to ensure the original instruction *completed*. That would get really, really ugly. And for something like VSCATTER, where each write can be completed independently, trying to do the right thing for PML would be absurdly complex.
I also agree. Instruction retry is much simpler and safer than emulating the instruction; KVM really can't stop doing this.
I'm not opposed to fudging around processing the PML log in the "correct" order, because irrespective of this bug, populating the dirty ring in the order in which accesses occurred is probably a good idea.
But, I can't help but wonder why KVM bothers emulating PML. I can appreciate that avoiding exits to L1 would be beneficial, but what use case actually cares about dirty logging performance in L1?
It does help performance by a lot, and the emulated implementation is simple.

For example, without PML this test collects about 500 pages on each iteration with default parameters, and about 2400 pages per iteration with PML (after the caches warm up).
[*] https://lore.kernel.org/all/ZcY_GbqcFXH2pR5E@google.com
This problem doesn't have a noticeable real-world impact, because this write retry is not much different from the guest writing to the same page multiple times, which is also not reflected in the dirty log. Users of dirty logging only rely on correct reporting of the clean pages; in other words, they assume that if a page is clean, no writes were committed to it since the moment it was marked clean.

However, KVM has the dirty_log_test selftest, which checks both the clean and the dirty pages against the memory contents, and which can fail if it detects a dirty page whose value at offset 0 (the offset the test writes) is stale.
To avoid failure, the test has a workaround for this specific problem:
The test skips checking the memory that belongs to the last dirty ring entry it has seen, relying on the fact that as long as memory writes are committed in order, only the last entry can belong to a not-yet-committed memory write.

However, since L1's KVM reads the PML log in the opposite direction from the order in which L0 wrote it, the last dirty ring entry often will not be the last entry written by L0.
To fix this, switch the order in which KVM reads the PML log.
Note that this issue is not present on bare metal, because there the update of the A/D bits of a present entry, the PML logging, and the actual memory write are all done by the CPU without any hypervisor intervention or pending interrupt evaluation; thus, once a 'PML log full' VM exit and/or a vCPU kick happens, all memory writes that are in the PML log have already been committed to memory.

The only exception to this rule is when the guest hits a not-present EPT entry, in which case KVM first reads (backwards) the PML log, dumps it to the dirty ring, and *then* sets up an SPTE with the A/D bits set and logs the write to the dirty ring, thus making that entry the last one in the dirty ring.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>

 arch/x86/kvm/vmx/vmx.c | 32 +++++++++++++++++++++-----------
 arch/x86/kvm/vmx/vmx.h |  1 +
 2 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0f008f5ef6f0..6fb946b58a75 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6211,31 +6211,41 @@ static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u64 *pml_buf;
-	u16 pml_idx;
+	u16 pml_idx, pml_last_written_entry;
+	int i;
 
 	pml_idx = vmcs_read16(GUEST_PML_INDEX);
 
 	/* Do nothing if PML buffer is empty */
-	if (pml_idx == (PML_ENTITY_NUM - 1))
+	if (pml_idx == PML_LAST_ENTRY)
Heh, this is mildly confusing, in that the first entry filled is actually called the "last" entry by KVM. And then below, pml_last_written_entry could point at the first entry.
The best idea I can come up with is PML_HEAD_INDEX and then pml_last_written_entry becomes pml_tail_index. It's not a circular buffer, but I think/hope head/tail terminology would be intuitive for most readers.
I agree here. Your proposal does seem better to me, so I'll adopt it in v2.
E.g. the for-loop becomes:
	for (i = PML_HEAD_INDEX; i >= pml_tail_index; i--) {
		u64 gpa;

		gpa = pml_buf[i];
		WARN_ON(gpa & (PAGE_SIZE - 1));
		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
	}
 		return;
+	/*
+	 * PML index always points to the next available PML buffer entity
+	 * unless PML log has just overflowed, in which case PML index will be
If you don't have a strong preference, I vote to do s/entity/entry and then rename PML_ENTITY_NUM => NR_PML_ENTRIES (or maybe PML_LOG_NR_ENTRIES?). I find the existing "entity" terminology weird and unhelpful, and arguably wrong.
I don't mind renaming this.
entity - a thing with distinct and independent existence.
The things being consumed are entries in a buffer.
+	 * 0xFFFF.
+	 */
+	pml_last_written_entry = (pml_idx >= PML_ENTITY_NUM) ? 0 : pml_idx + 1;
 
-	/* PML index always points to next available PML buffer entity */
-	if (pml_idx >= PML_ENTITY_NUM)
-		pml_idx = 0;
-	else
-		pml_idx++;
-
+	/*
+	 * PML log is written backwards: the CPU first writes the entity 511
+	 * then the entity 510, and so on, until it writes the entity 0 at which
+	 * point the PML log full VM exit happens and the logging stops.
This is technically wrong. The PML Full exit only occurs on the next write. E.g. KVM could observe GUEST_PML_INDEX == -1 without ever seeing a PML Full exit.
I skipped over this part of the PRM when I was reading up on how PML works. I will drop this sentence in the next version.
If the PML index is not in the range 0–511, there is a page-modification log-full event and a VM exit occurs. In this case, the accessed or dirty flag is not set, and the guest-physical access that triggered the event does not occur.
Do you have any comments for the rest of the patch series? If not then I'll send v2 of the patch series.
Best regards, Maxim Levitsky
On Thu, Dec 12, 2024, Maxim Levitsky wrote:
On Wed, 2024-12-11 at 16:44 -0800, Sean Christopherson wrote:
But, I can't help but wonder why KVM bothers emulating PML. I can appreciate that avoiding exits to L1 would be beneficial, but what use case actually cares about dirty logging performance in L1?
It does help performance by a lot, and the emulated implementation is simple.
Yeah, it's not a lot of complexity, but it's architecturally flawed. And I get that it helps with performance, I'm just stumped as to the use case for dirty logging in a nested VM in the first place.
Do you have any comments for the rest of the patch series? If not then I'll send v2 of the patch series.
*sigh*
I do. Through no fault of your own. I was trying to figure out a way to ensure the vCPU made meaningful progress, versus just guaranteeing at least one write, and stumbled onto a plethora of flaws and unnecessary complexity in the test.
Can you post this patch as a standalone v2? I'd like to do a more aggressive cleanup of the selftest, but I don't want to hold this up, and there's no hard dependency.
As for the issues I encountered with the selftest:
1. Tracing how many pages have been written for the current iteration with a guest side counter doesn't work without more fixes, because the test doesn't collect all dirty entries for the current iteration. For the dirty ring, this results in a vCPU *starting* an iteration with a full dirty ring, and the test hangs because the guest can't make forward progress until log_mode_collect_dirty_pages() is called.

2. The test presumably doesn't collect all dirty entries because of the weird and unnecessary kick in dirty_ring_collect_dirty_pages(), and all the synchronization that comes with it. The kick is "justified" with a comment saying "This makes sure that hardware PML cache flushed", but there's no reason to do that *if* the test collects dirty pages *after* stopping the vCPU. Which is easy to do while also collecting while the vCPU is running, if the kick+synchronization is eliminated (i.e. it's a self-inflicted wound of sorts).
3. dirty_ring_after_vcpu_run() doesn't honor vcpu_sync_stop_requested, and so every other iteration runs until the ring is full. Testing the "soft full" logic is interesting, but not _that_ interesting, and filling the dirty ring basically ignores the "interval". Fixing this reduces the runtime by a significant amount, especially on nested, at the cost of providing less coverage for the dirty ring with default interval in a nested VM (but if someone cares about testing the dirty ring soft full in a nested VM, they can darn well bump the interval).
4. Fixing the test to collect all dirty entries for the current iteration exposes another flaw. The bitmaps (not dirty ring) start with all bits set. And so the first iteration can see "dirty" pages that have never been written, but only when applying your fix to limit the hack to s390.
5. "iteration" is synched to the guest *after* the vCPU is restarted, i.e. the guest could see a stale iteration if the main thread is delayed.
6. host_bmap_track and all of the weird exemptions for writes from previous iterations goes away if all entries are collected for the current iteration (though a second bitmap is needed to handle the second collection; KVM's "get" of the bitmap clobbers the previous value).
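For the second collection in #6, the accumulation can be as simple as OR-ing each snapshot into a running bitmap, e.g. (rough sketch, not the actual patch; BMAP_WORDS is a placeholder for the bitmap size in longs):

	unsigned long bmap_accum[BMAP_WORDS] = {}, bmap_snapshot[BMAP_WORDS];
	int i;

	/* each "get" clobbers the buffer, so OR the snapshot into the accumulator */
	kvm_vm_get_dirty_log(vcpu->vm, slot, bmap_snapshot);
	for (i = 0; i < BMAP_WORDS; i++)
		bmap_accum[i] |= bmap_snapshot[i];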
I have everything more or less coded up, but I need to split it into patches, write changelogs, and interleave it with your fixes. Hopefully I'll get to that tomorrow.
On Thu, 2024-12-12 at 22:19 -0800, Sean Christopherson wrote:
On Thu, Dec 12, 2024, Maxim Levitsky wrote:
On Wed, 2024-12-11 at 16:44 -0800, Sean Christopherson wrote:
But, I can't help but wonder why KVM bothers emulating PML. I can appreciate that avoiding exits to L1 would be beneficial, but what use case actually cares about dirty logging performance in L1?
It does help performance by a lot, and the emulated implementation is simple.
Yeah, it's not a lot of complexity, but it's architecturally flawed. And I get that it helps with performance, I'm just stumped as to the use case for dirty logging in a nested VM in the first place.
Do you have any comments for the rest of the patch series? If not then I'll send v2 of the patch series.
*sigh*
I do. Through no fault of your own. I was trying to figure out a way to ensure the vCPU made meaningful progress, versus just guaranteeing at least one write, and stumbled onto a plethora of flaws and unnecessary complexity in the test.
Can you post this patch as a standalone v2? I'd like to do a more aggressive cleanup of the selftest, but I don't want to hold this up, and there's no hard dependency.
As for the issues I encountered with the selftest:
Tracing how many pages have been written for the current iteration with a guest side counter doesn't work without more fixes, because the test doesn't collect all dirty entries for the current iteration. For the dirty ring, this results in a vCPU *starting* an iteration with a full dirty ring, and the test hangs because the guest can't make forward progress until log_mode_collect_dirty_pages() is called.

The test presumably doesn't collect all dirty entries because of the weird and unnecessary kick in dirty_ring_collect_dirty_pages(), and all the synchronization that comes with it. The kick is "justified" with a comment saying "This makes sure that hardware PML cache flushed", but there's no reason to do that *if* the test collects dirty pages *after* stopping the vCPU. Which is easy to do while also collecting while the vCPU is running, if the kick+synchronization is eliminated (i.e. it's a self-inflicted wound of sorts).
dirty_ring_after_vcpu_run() doesn't honor vcpu_sync_stop_requested, and so every other iteration runs until the ring is full. Testing the "soft full" logic is interesting, but not _that_ interesting, and filling the dirty ring basically ignores the "interval". Fixing this reduces the runtime by a significant amount, especially on nested, at the cost of providing less coverage for the dirty ring with default interval in a nested VM (but if someone cares about testing the dirty ring soft full in a nested VM, they can darn well bump the interval).
Fixing the test to collect all dirty entries for the current iteration exposes another flaw. The bitmaps (not dirty ring) start with all bits set. And so the first iteration can see "dirty" pages that have never been written, but only when applying your fix to limit the hack to s390.
"iteration" is synched to the guest *after* the vCPU is restarted, i.e. the guest could see a stale iteration if the main thread is delayed.
host_bmap_track and all of the weird exemptions for writes from previous iterations goes away if all entries are collected for the current iteration (though a second bitmap is needed to handle the second collection; KVM's "get" of the bitmap clobbers the previous value).
I have everything more or less coded up, but I need to split it into patches, write changelogs, and interleave it with your fixes. Hopefully I'll get to that tomorrow.
Hi!
I will take a look at your patch series once you post it. I also think that the logic in the test is somewhat broken, but then this also serves as a way to cause as much havoc as possible.
The fact that not all dirty pages are collected is because the ring harvest happens at the same time the guest continues dirtying the pages, adding more entries to the ring, simulating what would happen during real-life migration.
Kicking the guest just before the ring harvest is also IMHO a good thing, as it also simulates the IRQ load that would happen in real use.

We can avoid kicking the guest if it is already stopped due to the dirty ring. In fact, still kicking it delays the kick to the point where we resume the guest and wait for it to stop again before we do the verify step, which often makes it exit for a reason other than the log-full event.

I did this, but it makes the test way less random, and the whole point of this test is to cause as much havoc as possible.

I do think that we don't need to stop the guest during verify for the dirty-ring case; that is probably only needed by the dirty-bitmap part of the test.

I added Peter Xu to CC to hear his opinion about this as well.
Best regards, Maxim Levitsky
On Fri, Dec 13, 2024, Maxim Levitsky wrote:
On Thu, 2024-12-12 at 22:19 -0800, Sean Christopherson wrote:
On Thu, Dec 12, 2024, Maxim Levitsky wrote:
On Wed, 2024-12-11 at 16:44 -0800, Sean Christopherson wrote:
But, I can't help but wonder why KVM bothers emulating PML. I can appreciate that avoiding exits to L1 would be beneficial, but what use case actually cares about dirty logging performance in L1?
It does help performance by a lot, and the emulated implementation is simple.
Yeah, it's not a lot of complexity, but it's architecturally flawed. And I get that it helps with performance, I'm just stumped as to the use case for dirty logging in a nested VM in the first place.
Do you have any comments for the rest of the patch series? If not then I'll send v2 of the patch series.
*sigh*
I do. Through no fault of your own. I was trying to figure out a way to ensure the vCPU made meaningful progress, versus just guaranteeing at least one write, and stumbled onto a plethora of flaws and unnecessary complexity in the test.
Can you post this patch as a standalone v2? I'd like to do a more aggressive cleanup of the selftest, but I don't want to hold this up, and there's no hard dependency.
As for the issues I encountered with the selftest:
Tracing how many pages have been written for the current iteration with a guest side counter doesn't work without more fixes, because the test doesn't collect all dirty entries for the current iteration. For the dirty ring, this results in a vCPU *starting* an iteration with a full dirty ring, and the test hangs because the guest can't make forward progress until log_mode_collect_dirty_pages() is called.

The test presumably doesn't collect all dirty entries because of the weird and unnecessary kick in dirty_ring_collect_dirty_pages(), and all the synchronization that comes with it. The kick is "justified" with a comment saying "This makes sure that hardware PML cache flushed", but there's no reason to do that *if* the test collects dirty pages *after* stopping the vCPU. Which is easy to do while also collecting while the vCPU is running, if the kick+synchronization is eliminated (i.e. it's a self-inflicted wound of sorts).
dirty_ring_after_vcpu_run() doesn't honor vcpu_sync_stop_requested, and so every other iteration runs until the ring is full. Testing the "soft full" logic is interesting, but not _that_ interesting, and filling the dirty ring basically ignores the "interval". Fixing this reduces the runtime by a significant amount, especially on nested, at the cost of providing less coverage for the dirty ring with default interval in a nested VM (but if someone cares about testing the dirty ring soft full in a nested VM, they can darn well bump the interval).
Fixing the test to collect all dirty entries for the current iteration exposes another flaw. The bitmaps (not dirty ring) start with all bits set. And so the first iteration can see "dirty" pages that have never been written, but only when applying your fix to limit the hack to s390.
"iteration" is synched to the guest *after* the vCPU is restarted, i.e. the guest could see a stale iteration if the main thread is delayed.
host_bmap_track and all of the weird exemptions for writes from previous iterations goes away if all entries are collected for the current iteration (though a second bitmap is needed to handle the second collection; KVM's "get" of the bitmap clobbers the previous value).
I have everything more or less coded up, but I need to split it into patches, write changelogs, and interleave it with your fixes. Hopefully I'll get to that tomorrow.
Hi!
I will take a look at your patch series once you post it. I also think that the logic in the test is somewhat broken, but then this also serves as a way to cause as much havoc as possible.
The fact that not all dirty pages are collected is because the ring harvest happens at the same time the guest continues dirtying the pages, adding more entries to the ring, simulating what would happen during real-life migration.
But as above, that behavior is trivially easy to mimic even when collecting all entries simply by playing nice with multiple collections per iteration.
Kicking the guest just before the ring harvest is also IMHO a good thing, as it also simulates the IRQ load that would happen in real use.
I am not at all convinced that's interesting. And *if* it's really truly all that interesting, then the kick should be done for all flavors.
Unless the host is tickless, the vCPU will get interrupts from time to time, at least for any decently large interval. The kick from the test itself adds an absurd amount of complexity for no meaningful test coverage.
We can avoid kicking the guest if it is already stopped due to the dirty ring. In fact, still kicking it delays the kick to the point where we resume the guest and wait for it to stop again before we do the verify step, which often makes it exit for a reason other than the log-full event.

I did this, but it makes the test way less random, and the whole point of this test is to cause as much havoc as possible.
I agree randomness is a good thing for testing, but this is more noise than random/controlled chaos.
E.g. we can do _far_ better for large interval numbers. As is, collecting _once_ per iteration means the vCPU is all but guaranteed to stall on the dirty ring for any decently large interval.

And conversely, not honoring the "stop" request means every other iteration is all but guaranteed to fill the dirty ring, even for small intervals.
I do think that we don't need to stop the guest during verify for the dirty-ring case; that is probably only needed by the dirty-bitmap part of the test.
Not stopping the vCPU would reduce test coverage (which is one of my complaints with not fully harvesting the dirty entries). If KVM misses a dirty log event on iteration X, and the vCPU also writes the same page in iteration X+1, then the test will get a false pass because iteration X+1 will see the page as dirty and think all is well.
The s390x-specific workaround causes the dirty-log mode of the test to dirty all of the guest memory on the first iteration, which is very slow when the test is run nested.
Limit this workaround to s390x.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index aacf80f57439..f60e2aceeae0 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -98,6 +98,7 @@ static void guest_code(void)
 	uint64_t addr;
 	int i;
 
+#ifdef __s390x__
 	/*
 	 * On s390x, all pages of a 1M segment are initially marked as dirty
 	 * when a page of the segment is written to for the very first time.
@@ -108,6 +109,7 @@ static void guest_code(void)
 		addr = guest_test_virt_mem + i * guest_page_size;
 		vcpu_arch_put_guest(*(uint64_t *)addr, READ_ONCE(iteration));
 	}
+#endif
 
 	while (true) {
 		for (i = 0; i < TEST_PAGES_PER_LOOP; i++) {
When the dirty_log_test is run nested, due to the very slow nature of the environment it is possible to reach a situation in which no entries were harvested from the dirty ring, which breaks the test logic.

Detect this case and just let the guest run a bit more, until the test obtains some entries from the dirty ring.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 25 +++++++++++++-------
 1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index f60e2aceeae0..a9428076a681 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -227,19 +227,21 @@ static void clear_log_create_vm_done(struct kvm_vm *vm)
 	vm_enable_cap(vm, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2, manual_caps);
 }
 
-static void dirty_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
+static bool dirty_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
 					  void *bitmap, uint32_t num_pages,
 					  uint32_t *unused)
 {
 	kvm_vm_get_dirty_log(vcpu->vm, slot, bitmap);
+	return true;
 }
 
-static void clear_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
+static bool clear_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
 					  void *bitmap, uint32_t num_pages,
 					  uint32_t *unused)
 {
 	kvm_vm_get_dirty_log(vcpu->vm, slot, bitmap);
 	kvm_vm_clear_dirty_log(vcpu->vm, slot, bitmap, 0, num_pages);
+	return true;
 }
 
 /* Should only be called after a GUEST_SYNC */
@@ -350,7 +352,7 @@ static void dirty_ring_continue_vcpu(void)
 	sem_post(&sem_vcpu_cont);
 }
 
-static void dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
+static bool dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
 					   void *bitmap, uint32_t num_pages,
 					   uint32_t *ring_buf_idx)
 {
@@ -373,6 +375,10 @@ static void dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
 					   slot, bitmap, num_pages,
 					   ring_buf_idx);
 
+	/* Retry if no entries were collected */
+	if (count == 0)
+		return false;
+
 	cleared = kvm_vm_reset_dirty_ring(vcpu->vm);
 
 	/*
@@ -389,6 +395,7 @@ static void dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
 	}
 
 	pr_info("Iteration %ld collected %u pages\n", iteration, count);
+	return true;
 }
 
 static void dirty_ring_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
@@ -424,7 +431,7 @@ struct log_mode {
 	/* Hook when the vm creation is done (before vcpu creation) */
 	void (*create_vm_done)(struct kvm_vm *vm);
 	/* Hook to collect the dirty pages into the bitmap provided */
-	void (*collect_dirty_pages) (struct kvm_vcpu *vcpu, int slot,
+	bool (*collect_dirty_pages)(struct kvm_vcpu *vcpu, int slot,
 				     void *bitmap, uint32_t num_pages,
 				     uint32_t *ring_buf_idx);
 	/* Hook to call when after each vcpu run */
@@ -488,7 +495,7 @@ static void log_mode_create_vm_done(struct kvm_vm *vm)
 		mode->create_vm_done(vm);
 }
 
-static void log_mode_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
+static bool log_mode_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
 					 void *bitmap, uint32_t num_pages,
 					 uint32_t *ring_buf_idx)
 {
@@ -496,7 +503,7 @@ static void log_mode_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
 	TEST_ASSERT(mode->collect_dirty_pages != NULL,
 		    "collect_dirty_pages() is required for any log mode!");
-	mode->collect_dirty_pages(vcpu, slot, bitmap, num_pages, ring_buf_idx);
+	return mode->collect_dirty_pages(vcpu, slot, bitmap, num_pages, ring_buf_idx);
 }
 
 static void log_mode_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
@@ -783,9 +790,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	while (iteration < p->iterations) {
 		/* Give the vcpu thread some time to dirty some pages */
 		usleep(p->interval * 1000);
-		log_mode_collect_dirty_pages(vcpu, TEST_MEM_SLOT_INDEX,
+
+		if (!log_mode_collect_dirty_pages(vcpu, TEST_MEM_SLOT_INDEX,
 					     bmap, host_num_pages,
-					     &ring_buf_idx);
+					     &ring_buf_idx))
+			continue;
 
 		/*
 		 * See vcpu_sync_stop_requested definition for details on why
If dirty_log_test is run nested, it is possible for entries in the emulated PML log to appear before the actual memory write is committed to RAM, due to the way KVM retries memory writes as a response to an MMU fault.

In addition, in some very rare cases the retry can happen more than once, which leads to a test failure because once the write is finally committed, it may carry a very outdated iteration value.
Detect and avoid this case.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 52 +++++++++++++++++++-
 1 file changed, 50 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index a9428076a681..f07126b0205d 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -154,6 +154,7 @@ static atomic_t vcpu_sync_stop_requested;
  * sem_vcpu_stop and before vcpu continues to run.
  */
 static bool dirty_ring_vcpu_ring_full;
+
 /*
  * This is only used for verifying the dirty pages. Dirty ring has a very
  * tricky case when the ring just got full, kvm will do userspace exit due to
@@ -168,7 +169,51 @@ static bool dirty_ring_vcpu_ring_full;
  * dirty gfn we've collected, so that if a mismatch of data found later in the
  * verifying process, we let it pass.
  */
-static uint64_t dirty_ring_last_page;
+static uint64_t dirty_ring_last_page = -1ULL;
+
+/*
+ * In addition to the above, it is possible (especially if this
+ * test is run nested) for the above scenario to repeat multiple times:
+ *
+ * The following can happen:
+ *
+ * - L1 vCPU:        Memory write is logged to PML but not committed.
+ *
+ * - L1 test thread: Ignores the write because its last dirty ring entry
+ *                   Resets the dirty ring which:
+ *                     - Resets the A/D bits in EPT
+ *                     - Issues tlb flush (invept), which is intercepted by L0
+ *
+ * - L0: frees the whole nested ept mmu root as the response to invept,
+ *       and thus ensures that when memory write is retried, it will fault again
+ *
+ * - L1 vCPU:        Same memory write is logged to the PML but not committed again.
+ *
+ * - L1 test thread: Ignores the write because its last dirty ring entry (again)
+ *                   Resets the dirty ring which:
+ *                     - Resets the A/D bits in EPT (again)
+ *                     - Issues tlb flush (again) which is intercepted by L0
+ *
+ * ...
+ *
+ * N times
+ *
+ * - L1 vCPU: Memory write is logged in the PML and then committed.
+ *            Lots of other memory writes are logged and committed.
+ * ...
+ *
+ * - L1 test thread: Sees the memory write along with other memory writes
+ *                   in the dirty ring, and since the write is usually not
+ *                   the last entry in the dirty-ring and has a very outdated
+ *                   iteration, the test fails.
+ *
+ *
+ * Note that this is only possible when the write was the last log entry
+ * write during iteration N-1, thus remember last iteration last log entry
+ * and also don't fail when it is reported in the next iteration, together with
+ * an outdated iteration count.
+ */
+static uint64_t dirty_ring_prev_iteration_last_page;
 
 enum log_mode_t {
 	/* Only use KVM_GET_DIRTY_LOG for logging */
@@ -320,6 +365,8 @@ static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
 	struct kvm_dirty_gfn *cur;
 	uint32_t count = 0;
 
+	dirty_ring_prev_iteration_last_page = dirty_ring_last_page;
+
 	while (true) {
 		cur = &dirty_gfns[*fetch_index % test_dirty_ring_count];
 		if (!dirty_gfn_is_dirtied(cur))
@@ -622,7 +669,8 @@ static void vm_dirty_log_verify(enum vm_guest_mode mode, unsigned long *bmap)
 			 */
 			min_iter = iteration - 1;
 			continue;
-		} else if (page == dirty_ring_last_page) {
+		} else if (page == dirty_ring_last_page ||
+			   page == dirty_ring_prev_iteration_last_page) {
 			/*
 			 * Please refer to comments in
 			 * dirty_ring_last_page.