When the TDP MMU is write-protection GFNs for page table protection (as opposed to for dirty logging, or due to the HVA not being writable), it checks if the SPTE is already write-protected and if so skips modifying the SPTE and the TLB flush.
This behavior is incorrect because the SPTE may be write-protected for dirty logging. This implies that the SPTE could be locklessly be made writable on the next write access, and that vCPUs could still be running with writable SPTEs cached in their TLB.
Fix this by unconditionally setting the SPTE and only skipping the TLB flush if the SPTE was already marked !MMU-writable or !Host-writable, which guarantees the SPTE cannot be locklessly be made writable and no vCPUs are running the writable SPTEs cached in their TLBs.
Technically it would be safe to skip setting the SPTE as well since:
(a) If MMU-writable is set then Host-writable must be cleared and the only way to set Host-writable is to fault the SPTE back in entirely (at which point any unsynced shadow pages reachable by the new SPTE will be synced and MMU-writable can be safetly be set again).
and
(b) MMU-writable is never consulted on its own.
And in fact this is what the shadow MMU does when write-protecting guest page tables. However setting the SPTE unconditionally is much easier to reason about and does not require a huge comment explaining why it is safe.
Fixes: 46044f72c382 ("kvm: x86/mmu: Support write protection for nesting in tdp MMU") Cc: stable@vger.kernel.org Signed-off-by: David Matlack dmatlack@google.com --- arch/x86/kvm/mmu/tdp_mmu.c | 27 ++++++++++++++++++++------- 1 file changed, 20 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 7b1bc816b7c3..462c6de9f944 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1423,14 +1423,16 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm, /* * Removes write access on the last level SPTE mapping this GFN and unsets the * MMU-writable bit to ensure future writes continue to be intercepted. - * Returns true if an SPTE was set and a TLB flush is needed. + * + * Returns true if a TLB flush is needed to ensure no CPU has a writable + * version of the SPTE in its TLB. */ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root, gfn_t gfn, int min_level) { struct tdp_iter iter; u64 new_spte; - bool spte_set = false; + bool flush = false;
BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
@@ -1442,19 +1444,30 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root, !is_last_spte(iter.old_spte, iter.level)) continue;
- if (!is_writable_pte(iter.old_spte)) - break; - new_spte = iter.old_spte & ~(PT_WRITABLE_MASK | shadow_mmu_writable_mask);
tdp_mmu_set_spte(kvm, &iter, new_spte); - spte_set = true; + + /* + * The TLB flush can be skipped if the old SPTE cannot be + * locklessly be made writable, which implies it is already + * write-protected due to being !MMU-writable or !Host-writable. + * This guarantees no CPU currently has a writable version of + * this SPTE in its TLB. + * + * Otherwise the old SPTE was either not write-protected or was + * write-protected but for dirty logging (which does not flush + * TLBs before dropping the MMU lock), so a TLB flush is + * required. + */ + if (spte_can_locklessly_be_made_writable(iter.old_spte)) + flush = true; }
rcu_read_unlock();
- return spte_set; + return flush; }
/*
base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
On Wed, Jan 12, 2022, David Matlack wrote:
When the TDP MMU is write-protection GFNs for page table protection (as opposed to for dirty logging, or due to the HVA not being writable), it checks if the SPTE is already write-protected and if so skips modifying the SPTE and the TLB flush.
This behavior is incorrect because the SPTE may be write-protected for dirty logging. This implies that the SPTE could be locklessly be made writable on the next write access, and that vCPUs could still be running with writable SPTEs cached in their TLB.
Fix this by unconditionally setting the SPTE and only skipping the TLB flush if the SPTE was already marked !MMU-writable or !Host-writable, which guarantees the SPTE cannot be locklessly be made writable and no vCPUs are running the writable SPTEs cached in their TLBs.
Technically it would be safe to skip setting the SPTE as well since:
(a) If MMU-writable is set then Host-writable must be cleared and the only way to set Host-writable is to fault the SPTE back in entirely (at which point any unsynced shadow pages reachable by the new SPTE will be synced and MMU-writable can be safetly be set again).
and
(b) MMU-writable is never consulted on its own.
And in fact this is what the shadow MMU does when write-protecting guest page tables. However setting the SPTE unconditionally is much easier to reason about and does not require a huge comment explaining why it is safe.
I disagree. I looked at the code+comment before reading the full changelog and typed up a response saying the code should be:
if (!is_writable_pte(iter.old_spte) && !spte_can_locklessly_be_made_writable(spte)) break;
Then I went read the changelog and here we are :-)
I find that much more easier to grok, e.g. in plain English: "if the SPTE isn't writable and can't be made writable, there's nothing to do".
Versus "unconditionally clear the writable bits because ???, but only flush if the write was actually necessary", with a slightly opinionated translation :-)
And with that, you don't need to do s/spte_set/flush. Though I would be in favor of a separate patch to do s/spte_set/write_protected here and in the caller, to match kvm_mmu_slot_gfn_write_protect().
On Wed, Jan 12, 2022 at 3:14 PM Sean Christopherson seanjc@google.com wrote:
On Wed, Jan 12, 2022, David Matlack wrote:
When the TDP MMU is write-protection GFNs for page table protection (as opposed to for dirty logging, or due to the HVA not being writable), it checks if the SPTE is already write-protected and if so skips modifying the SPTE and the TLB flush.
This behavior is incorrect because the SPTE may be write-protected for dirty logging. This implies that the SPTE could be locklessly be made writable on the next write access, and that vCPUs could still be running with writable SPTEs cached in their TLB.
Fix this by unconditionally setting the SPTE and only skipping the TLB flush if the SPTE was already marked !MMU-writable or !Host-writable, which guarantees the SPTE cannot be locklessly be made writable and no vCPUs are running the writable SPTEs cached in their TLBs.
Technically it would be safe to skip setting the SPTE as well since:
(a) If MMU-writable is set then Host-writable must be cleared and the only way to set Host-writable is to fault the SPTE back in entirely (at which point any unsynced shadow pages reachable by the new SPTE will be synced and MMU-writable can be safetly be set again).
and
(b) MMU-writable is never consulted on its own.
And in fact this is what the shadow MMU does when write-protecting guest page tables. However setting the SPTE unconditionally is much easier to reason about and does not require a huge comment explaining why it is safe.
I disagree. I looked at the code+comment before reading the full changelog and typed up a response saying the code should be:
if (!is_writable_pte(iter.old_spte) && !spte_can_locklessly_be_made_writable(spte)) break;
Then I went read the changelog and here we are :-)
I find that much more easier to grok, e.g. in plain English: "if the SPTE isn't writable and can't be made writable, there's nothing to do".
Oh interesting. I actually find that confusing because it can easily lead to the MMU-writable bit staying set. Here we are protecting GFNs and we're opting to leave the MMU-writable bit set. It takes a lot of digging to figure out that this is safe because if MMU-writable is set and the SPTE cannot be locklessly be made writable then it implies Host-writable is clear, and Host-writable can't be reset without syncing the all shadow pages reachable by the MMU. Oh and the MMU-writable bit is never consulted on its own (e.g. We never iterate through all SPTEs to find the ones that are !MMU-writable).
Maybe my understanding is horribly off since this all seems unnecessarily convoluted, and the cost of always clearing MMU-writable is just an extra bitwise-OR.
The TLB flush is certainly unnecessary if the SPTE is already !Host-writable, which is what this commit does.
Versus "unconditionally clear the writable bits because ???, but only flush if the write was actually necessary", with a slightly opinionated translation :-)
If MMU-writable is already clear we can definitely break. I had that in a previous version of the patch by checking if iter.old_spte == new_spte but it seemed unnecessary since the guts of tdp_mmu_spte_set() already optimizes for this.
And with that, you don't need to do s/spte_set/flush. Though I would be in favor of a separate patch to do s/spte_set/write_protected here and in the caller, to match kvm_mmu_slot_gfn_write_protect().
I'm not sure write_protected would not be a good variable name because even if we did not write-protect the SPTE (i.e. PT_WRITABLE_MASK was already clear) we may still need a TLB flush to ensure no CPUs have a writable SPTE in their TLB. Perhaps we have different definitions for "write-protecting"?
On Wed, Jan 12, 2022, David Matlack wrote:
On Wed, Jan 12, 2022 at 3:14 PM Sean Christopherson seanjc@google.com wrote:
On Wed, Jan 12, 2022, David Matlack wrote:
When the TDP MMU is write-protection GFNs for page table protection (as opposed to for dirty logging, or due to the HVA not being writable), it checks if the SPTE is already write-protected and if so skips modifying the SPTE and the TLB flush.
This behavior is incorrect because the SPTE may be write-protected for dirty logging. This implies that the SPTE could be locklessly be made writable on the next write access, and that vCPUs could still be running with writable SPTEs cached in their TLB.
Fix this by unconditionally setting the SPTE and only skipping the TLB flush if the SPTE was already marked !MMU-writable or !Host-writable, which guarantees the SPTE cannot be locklessly be made writable and no vCPUs are running the writable SPTEs cached in their TLBs.
Technically it would be safe to skip setting the SPTE as well since:
(a) If MMU-writable is set then Host-writable must be cleared and the only way to set Host-writable is to fault the SPTE back in entirely (at which point any unsynced shadow pages reachable by the new SPTE will be synced and MMU-writable can be safetly be set again).
and
(b) MMU-writable is never consulted on its own.
And in fact this is what the shadow MMU does when write-protecting guest page tables. However setting the SPTE unconditionally is much easier to reason about and does not require a huge comment explaining why it is safe.
I disagree. I looked at the code+comment before reading the full changelog and typed up a response saying the code should be:
if (!is_writable_pte(iter.old_spte) && !spte_can_locklessly_be_made_writable(spte)) break;
Then I went read the changelog and here we are :-)
I find that much more easier to grok, e.g. in plain English: "if the SPTE isn't writable and can't be made writable, there's nothing to do".
Oh interesting. I actually find that confusing because it can easily lead to the MMU-writable bit staying set. Here we are protecting GFNs and we're opting to leave the MMU-writable bit set. It takes a lot of digging to figure out that this is safe because if MMU-writable is set and the SPTE cannot be locklessly be made writable then it implies Host-writable is clear, and Host-writable can't be reset without syncing the all shadow pages reachable by the MMU. Oh and the MMU-writable bit is never consulted on its own (e.g. We never iterate through all SPTEs to find the ones that are !MMU-writable).
Ah, you've missed the other wrinkle: MMU-writable can bet set iff Host-writable is set. In other words, the MMU-writable bit is never left set because it can't be set if spte_can_locklessly_be_made_writable() returns false.
To reduce confusion, we can and probably should do:
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h index a4af2a42695c..bc691ff72cab 100644 --- a/arch/x86/kvm/mmu/spte.h +++ b/arch/x86/kvm/mmu/spte.h @@ -316,8 +316,7 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
static inline bool spte_can_locklessly_be_made_writable(u64 spte) { - return (spte & shadow_host_writable_mask) && - (spte & shadow_mmu_writable_mask); + return (spte & shadow_mmu_writable_mask); }
static inline u64 get_mmio_spte_generation(u64 spte)
Though it'd be nice to have a WARN somewhere to enforce that MMU-Writable isn't set without Host-writable.
We could also rename the helper to is_mmu_writable_spte(), though I'm not sure that's actually better.
Yet another option would be to invert the flag and make it shadow_mmu_pt_protected_mask or something, i.e. make it more explicitly a flag that says "this thing is write-protected for shadowing a page table".
On Wed, Jan 12, 2022 at 4:29 PM Sean Christopherson seanjc@google.com wrote:
On Wed, Jan 12, 2022, David Matlack wrote:
On Wed, Jan 12, 2022 at 3:14 PM Sean Christopherson seanjc@google.com wrote:
On Wed, Jan 12, 2022, David Matlack wrote:
When the TDP MMU is write-protection GFNs for page table protection (as opposed to for dirty logging, or due to the HVA not being writable), it checks if the SPTE is already write-protected and if so skips modifying the SPTE and the TLB flush.
This behavior is incorrect because the SPTE may be write-protected for dirty logging. This implies that the SPTE could be locklessly be made writable on the next write access, and that vCPUs could still be running with writable SPTEs cached in their TLB.
Fix this by unconditionally setting the SPTE and only skipping the TLB flush if the SPTE was already marked !MMU-writable or !Host-writable, which guarantees the SPTE cannot be locklessly be made writable and no vCPUs are running the writable SPTEs cached in their TLBs.
Technically it would be safe to skip setting the SPTE as well since:
(a) If MMU-writable is set then Host-writable must be cleared and the only way to set Host-writable is to fault the SPTE back in entirely (at which point any unsynced shadow pages reachable by the new SPTE will be synced and MMU-writable can be safetly be set again).
and
(b) MMU-writable is never consulted on its own.
And in fact this is what the shadow MMU does when write-protecting guest page tables. However setting the SPTE unconditionally is much easier to reason about and does not require a huge comment explaining why it is safe.
I disagree. I looked at the code+comment before reading the full changelog and typed up a response saying the code should be:
if (!is_writable_pte(iter.old_spte) && !spte_can_locklessly_be_made_writable(spte)) break;
Then I went read the changelog and here we are :-)
I find that much more easier to grok, e.g. in plain English: "if the SPTE isn't writable and can't be made writable, there's nothing to do".
Oh interesting. I actually find that confusing because it can easily lead to the MMU-writable bit staying set. Here we are protecting GFNs and we're opting to leave the MMU-writable bit set. It takes a lot of digging to figure out that this is safe because if MMU-writable is set and the SPTE cannot be locklessly be made writable then it implies Host-writable is clear, and Host-writable can't be reset without syncing the all shadow pages reachable by the MMU. Oh and the MMU-writable bit is never consulted on its own (e.g. We never iterate through all SPTEs to find the ones that are !MMU-writable).
Ah, you've missed the other wrinkle: MMU-writable can bet set iff Host-writable is set. In other words, the MMU-writable bit is never left set because it can't be set if spte_can_locklessly_be_made_writable() returns false.
Ohhh I did miss that and yes that explains it. I'll send another version of this patch that skips setting the SPTE unnecessarily.
To reduce confusion, we can and probably should do:
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h index a4af2a42695c..bc691ff72cab 100644 --- a/arch/x86/kvm/mmu/spte.h +++ b/arch/x86/kvm/mmu/spte.h @@ -316,8 +316,7 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
static inline bool spte_can_locklessly_be_made_writable(u64 spte) {
return (spte & shadow_host_writable_mask) &&
(spte & shadow_mmu_writable_mask);
return (spte & shadow_mmu_writable_mask);
}
static inline u64 get_mmio_spte_generation(u64 spte)
Though it'd be nice to have a WARN somewhere to enforce that MMU-Writable isn't set without Host-writable.
We could also rename the helper to is_mmu_writable_spte(), though I'm not sure that's actually better.
Yet another option would be to invert the flag and make it shadow_mmu_pt_protected_mask or something, i.e. make it more explicitly a flag that says "this thing is write-protected for shadowing a page table".
On Thu, Jan 13, 2022 at 9:04 AM David Matlack dmatlack@google.com wrote:
On Wed, Jan 12, 2022 at 4:29 PM Sean Christopherson seanjc@google.com wrote:
On Wed, Jan 12, 2022, David Matlack wrote:
On Wed, Jan 12, 2022 at 3:14 PM Sean Christopherson seanjc@google.com wrote:
On Wed, Jan 12, 2022, David Matlack wrote:
When the TDP MMU is write-protection GFNs for page table protection (as opposed to for dirty logging, or due to the HVA not being writable), it checks if the SPTE is already write-protected and if so skips modifying the SPTE and the TLB flush.
This behavior is incorrect because the SPTE may be write-protected for dirty logging. This implies that the SPTE could be locklessly be made writable on the next write access, and that vCPUs could still be running with writable SPTEs cached in their TLB.
Fix this by unconditionally setting the SPTE and only skipping the TLB flush if the SPTE was already marked !MMU-writable or !Host-writable, which guarantees the SPTE cannot be locklessly be made writable and no vCPUs are running the writable SPTEs cached in their TLBs.
Technically it would be safe to skip setting the SPTE as well since:
(a) If MMU-writable is set then Host-writable must be cleared and the only way to set Host-writable is to fault the SPTE back in entirely (at which point any unsynced shadow pages reachable by the new SPTE will be synced and MMU-writable can be safetly be set again).
and
(b) MMU-writable is never consulted on its own.
And in fact this is what the shadow MMU does when write-protecting guest page tables. However setting the SPTE unconditionally is much easier to reason about and does not require a huge comment explaining why it is safe.
I disagree. I looked at the code+comment before reading the full changelog and typed up a response saying the code should be:
if (!is_writable_pte(iter.old_spte) && !spte_can_locklessly_be_made_writable(spte)) break;
Then I went read the changelog and here we are :-)
I find that much more easier to grok, e.g. in plain English: "if the SPTE isn't writable and can't be made writable, there's nothing to do".
Oh interesting. I actually find that confusing because it can easily lead to the MMU-writable bit staying set. Here we are protecting GFNs and we're opting to leave the MMU-writable bit set. It takes a lot of digging to figure out that this is safe because if MMU-writable is set and the SPTE cannot be locklessly be made writable then it implies Host-writable is clear, and Host-writable can't be reset without syncing the all shadow pages reachable by the MMU. Oh and the MMU-writable bit is never consulted on its own (e.g. We never iterate through all SPTEs to find the ones that are !MMU-writable).
Ah, you've missed the other wrinkle: MMU-writable can bet set iff Host-writable is set. In other words, the MMU-writable bit is never left set because it can't be set if spte_can_locklessly_be_made_writable() returns false.
The changed_pte notifier looks like it clears Host-writable without clearing MMU-writable. Specifically the call chain:
kvm_mmu_notifier_change_pte() kvm_set_spte_gfn() kvm_tdp_mmu_set_spte_gfn() set_spte_gfn() kvm_mmu_changed_pte_notifier_make_spte()
Is there some guarantee that old_spte is !MMU-writable at this point? If not I could easily change kvm_mmu_changed_pte_notifier_make_spte() to also clear MMU-writable and preserve the invariant.
Ohhh I did miss that and yes that explains it. I'll send another version of this patch that skips setting the SPTE unnecessarily.
To reduce confusion, we can and probably should do:
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h index a4af2a42695c..bc691ff72cab 100644 --- a/arch/x86/kvm/mmu/spte.h +++ b/arch/x86/kvm/mmu/spte.h @@ -316,8 +316,7 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
static inline bool spte_can_locklessly_be_made_writable(u64 spte) {
return (spte & shadow_host_writable_mask) &&
(spte & shadow_mmu_writable_mask);
return (spte & shadow_mmu_writable_mask);
}
static inline u64 get_mmio_spte_generation(u64 spte)
Though it'd be nice to have a WARN somewhere to enforce that MMU-Writable isn't set without Host-writable.
We could also rename the helper to is_mmu_writable_spte(), though I'm not sure that's actually better.
Yet another option would be to invert the flag and make it shadow_mmu_pt_protected_mask or something, i.e. make it more explicitly a flag that says "this thing is write-protected for shadowing a page table".
On Thu, Jan 13, 2022, David Matlack wrote:
On Thu, Jan 13, 2022 at 9:04 AM David Matlack dmatlack@google.com wrote:
On Wed, Jan 12, 2022 at 4:29 PM Sean Christopherson seanjc@google.com wrote:
Oh interesting. I actually find that confusing because it can easily lead to the MMU-writable bit staying set. Here we are protecting GFNs and we're opting to leave the MMU-writable bit set. It takes a lot of digging to figure out that this is safe because if MMU-writable is set and the SPTE cannot be locklessly be made writable then it implies Host-writable is clear, and Host-writable can't be reset without syncing the all shadow pages reachable by the MMU. Oh and the MMU-writable bit is never consulted on its own (e.g. We never iterate through all SPTEs to find the ones that are !MMU-writable).
Ah, you've missed the other wrinkle: MMU-writable can bet set iff Host-writable is set. In other words, the MMU-writable bit is never left set because it can't be set if spte_can_locklessly_be_made_writable() returns false.
The changed_pte notifier looks like it clears Host-writable without clearing MMU-writable. Specifically the call chain:
kvm_mmu_notifier_change_pte() kvm_set_spte_gfn() kvm_tdp_mmu_set_spte_gfn() set_spte_gfn() kvm_mmu_changed_pte_notifier_make_spte()
Is there some guarantee that old_spte is !MMU-writable at this point?
Ugh, I misread that code, multiple times. There's no guarantee, it was likely just missed when MMU-writable was introduced.
Note, you literally cannot hit that code path in current kernels. See commit c13fda237f08 ("KVM: Assert that notifier count is elevated in .change_pte()"). So whatever you do is effectively untestable.
I really want to rip out .change_pte(), but I also don't want to do any performance testing to justify removing the code instead of fixing it proper, so it's hung around as a zombie...
If not I could easily change kvm_mmu_changed_pte_notifier_make_spte() to also clear MMU-writable and preserve the invariant.
Yes, please.
linux-stable-mirror@lists.linaro.org