From: Ross Zwisler <ross.zwisler@linux.intel.com>
commit 097963959594c5eccaba42510f7033f703211bda upstream.
Patch series "Write protect DAX PMDs in *sync path".
Currently dax_mapping_entry_mkclean() fails to clean and write protect the pmd_t of a DAX PMD entry during an *sync operation. This can result in data loss, as detailed in patch 2.
This series is based on Dan's "libnvdimm-pending" branch, which is the current home for Jan's "dax: Page invalidation fixes" series. You can find a working tree here:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pm...
This patch (of 2):
Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a huge page PMD leaf to be found and returned.
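As a rough illustration of how a caller is expected to use the new helper (a hedged sketch only, not the actual dax_mapping_entry_mkclean() change from patch 2 of this series; vma and address are assumed to come from the surrounding rmap walk, inside a void helper):

  pte_t *ptep = NULL;
  pmd_t *pmdp = NULL;
  spinlock_t *ptl;

  if (follow_pte_pmd(vma->vm_mm, address, &ptep, &pmdp, &ptl))
          return;                 /* no leaf is mapped at this address */

  if (pmdp) {
          /* huge page PMD leaf, returned with the PMD lock held */
          pmd_t pmd = pmdp_huge_clear_flush(vma, address, pmdp);

          pmd = pmd_wrprotect(pmd);
          pmd = pmd_mkclean(pmd);
          set_pmd_at(vma->vm_mm, address, pmdp, pmd);
          spin_unlock(ptl);
  } else {
          /* PTE leaf, returned mapped and with the PTE lock held */
          pte_t pte = ptep_clear_flush(vma, address, ptep);

          pte = pte_wrprotect(pte);
          pte = pte_mkclean(pte);
          set_pte_at(vma->vm_mm, address, ptep, pte);
          pte_unmap_unlock(ptep, ptl);
  }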
Link: http://lkml.kernel.org/r/1482272586-21177-2-git-send-email-ross.zwisler@linu...
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 4.9: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 37 ++++++++++++++++++++++++++++++-------
 2 files changed, 32 insertions(+), 7 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7a4c035b187f..81ee5d0b2642 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1269,6 +1269,8 @@ int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
+int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+		pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index c2890dc104d9..2b2cc69ddcce 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3780,8 +3780,8 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-static int __follow_pte(struct mm_struct *mm, unsigned long address,
-		pte_t **ptepp, spinlock_t **ptlp)
+static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+		pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -3798,11 +3798,20 @@ static int __follow_pte(struct mm_struct *mm, unsigned long address,
 
 	pmd = pmd_offset(pud, address);
 	VM_BUG_ON(pmd_trans_huge(*pmd));
-	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
-		goto out;
 
-	/* We cannot handle huge page PFN maps. Luckily they don't exist. */
-	if (pmd_huge(*pmd))
+	if (pmd_huge(*pmd)) {
+		if (!pmdpp)
+			goto out;
+
+		*ptlp = pmd_lock(mm, pmd);
+		if (pmd_huge(*pmd)) {
+			*pmdpp = pmd;
+			return 0;
+		}
+		spin_unlock(*ptlp);
+	}
+
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
 		goto out;
 
 	ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
@@ -3825,9 +3834,23 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address,
 
 	/* (void) is needed to make gcc happy */
 	(void) __cond_lock(*ptlp,
-			   !(res = __follow_pte(mm, address, ptepp, ptlp)));
+			   !(res = __follow_pte_pmd(mm, address, ptepp, NULL,
+						    ptlp)));
+	return res;
+}
+
+int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+		   pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
+{
+	int res;
+
+	/* (void) is needed to make gcc happy */
+	(void) __cond_lock(*ptlp,
+			   !(res = __follow_pte_pmd(mm, address, ptepp, pmdpp,
+						    ptlp)));
 	return res;
 }
+EXPORT_SYMBOL(follow_pte_pmd);
 
 /**
  * follow_pfn - look up PFN at a user virtual address
From: Paolo Bonzini <pbonzini@redhat.com>
commit bd2fae8da794b55bf2ac02632da3a151b10e664c upstream.
In order to convert an HVA to a PFN, KVM usually tries to use the get_user_pages family of functions. This, however, is not possible for VM_IO vmas; in that case, KVM instead uses follow_pfn.
In doing so, however, KVM loses the information on whether the PFN is writable. That is usually not a problem because the main use of VM_IO vmas with KVM is for BARs in PCI device assignment, but it is a bug nonetheless. To fix it, use follow_pte and check pte_write while under the protection of the PTE lock. The information can then be used to fail hva_to_pfn_remapped() or be passed back to the caller via *writable.
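Condensed, the fixed lookup looks roughly like this (a sketch of the approach described above, in the follow_pte_pmd() form available to this 4.9 backport; vma, addr, write_fault and writable are the hva_to_pfn_remapped() parameters, and the fixup_user_fault() retry path is omitted):

  pte_t *ptep;
  spinlock_t *ptl;
  kvm_pfn_t pfn;
  int r;

  r = follow_pte_pmd(vma->vm_mm, addr, &ptep, NULL, &ptl);
  if (r)
          return r;

  if (write_fault && !pte_write(*ptep)) {
          /* a mapping exists but is read-only: fail the write fault */
          pfn = KVM_PFN_ERR_RO_FAULT;
  } else {
          if (writable)
                  *writable = pte_write(*ptep);   /* report the real permission */
          pfn = pte_pfn(*ptep);
  }
  pte_unmap_unlock(ptep, ptl);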
Usage of follow_pfn was introduced in commit add6a0cd1c5b ("KVM: MMU: try to fix up page faults before giving up", 2016-07-05); however, even older versions have the same issue, all the way back to commit 2e2e3738af33 ("KVM: Handle vma regions with no backing page", 2008-07-20), as they also did not check whether the PFN was writable.
Fixes: 2e2e3738af33 ("KVM: Handle vma regions with no backing page")
Reported-by: David Stevens <stevensd@google.com>
Cc: 3pvd@google.com
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[OP: backport to 4.19, adjust follow_pte() -> follow_pte_pmd()]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backport to 4.9: follow_pte_pmd() does not take start or end parameters]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 virt/kvm/kvm_main.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index db859b595dba..2729704f836e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1519,9 +1519,11 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       kvm_pfn_t *p_pfn)
 {
 	unsigned long pfn;
+	pte_t *ptep;
+	spinlock_t *ptl;
 	int r;
 
-	r = follow_pfn(vma, addr, &pfn);
+	r = follow_pte_pmd(vma->vm_mm, addr, &ptep, NULL, &ptl);
 	if (r) {
 		/*
 		 * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
@@ -1536,14 +1538,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 		if (r)
 			return r;
 
-		r = follow_pfn(vma, addr, &pfn);
+		r = follow_pte_pmd(vma->vm_mm, addr, &ptep, NULL, &ptl);
 		if (r)
 			return r;
+	}
 
+	if (write_fault && !pte_write(*ptep)) {
+		pfn = KVM_PFN_ERR_RO_FAULT;
+		goto out;
 	}
 
 	if (writable)
-		*writable = true;
+		*writable = pte_write(*ptep);
+	pfn = pte_pfn(*ptep);
 
 	/*
 	 * Get a reference here because callers of *hva_to_pfn* and
@@ -1558,6 +1565,8 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 */
 	kvm_get_pfn(pfn);
 
+out:
+	pte_unmap_unlock(ptep, ptl);
 	*p_pfn = pfn;
 	return 0;
 }
From: Sean Christopherson <seanjc@google.com>
commit a9545779ee9e9e103648f6f2552e73cfe808d0f4 upstream.
Use kvm_pfn_t, a.k.a. u64, for the local 'pfn' variable when retrieving a so-called "remapped" hva/pfn pair. In theory, the hva could resolve to a pfn in high memory on a 32-bit kernel.
This bug was inadvertently exposed by commit bd2fae8da794 ("KVM: do not assume PTE is writable after follow_pfn"), which added an error PFN value to the mix, causing gcc to complain about overflowing the unsigned long.
  arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function ‘hva_to_pfn_remapped’:
  include/linux/kvm_host.h:89:30: error: conversion from ‘long long unsigned int’ to ‘long unsigned int’ changes value from ‘9218868437227405314’ to ‘2’ [-Werror=overflow]
     89 | #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
        |                              ^
  virt/kvm/kvm_main.c:1935:9: note: in expansion of macro ‘KVM_PFN_ERR_RO_FAULT’
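The warning boils down to a plain integer-width problem; a minimal illustration of the truncation (the literal below only mirrors the value in the gcc output above and is not taken from the kernel headers):

  u64 err = (0x7ffULL << 52) + 2;  /* 9218868437227405314, i.e. KVM_PFN_ERR_RO_FAULT */
  unsigned long narrow = err;      /* truncated to 2 on a 32-bit kernel */
  kvm_pfn_t wide = err;            /* kvm_pfn_t is u64, so the error bits survive */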
Cc: stable@vger.kernel.org
Fixes: add6a0cd1c5b ("KVM: MMU: try to fix up page faults before giving up")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210208201940.1258328-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 virt/kvm/kvm_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2729704f836e..dc7b60c81039 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1518,7 +1518,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       bool write_fault, bool *writable,
 			       kvm_pfn_t *p_pfn)
 {
-	unsigned long pfn;
+	kvm_pfn_t pfn;
 	pte_t *ptep;
 	spinlock_t *ptl;
 	int r;
From: Nicholas Piggin <npiggin@gmail.com>
commit f8be156be163a052a067306417cd0ff679068c97 upstream.
It's possible to create a region which maps valid but non-refcounted pages (e.g., tail pages of non-compound higher-order allocations). These host pages can then be returned by the gfn_to_page, gfn_to_pfn, etc. family of APIs, which take a reference on the page, raising its refcount from 0 to 1. When that reference is dropped, the page is freed incorrectly.
Fix this by taking a reference on a valid page only if its refcount was already non-zero, which indicates it participates in normal refcounting (and can be released with put_page).
This addresses CVE-2021-22543.
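The reason an unconditional reference is dangerous is the caller contract: every pfn handed out by these APIs is eventually released again. A hedged sketch of a typical caller, using existing KVM helpers (kvm and gfn are assumed to be in scope, error handling trimmed):

  kvm_pfn_t pfn = gfn_to_pfn(kvm, gfn);

  if (is_error_noslot_pfn(pfn))
          return -EFAULT;

  /* ... access the guest page ... */

  /*
   * For non-reserved pages this drops the reference with put_page();
   * if the reference taken in hva_to_pfn_remapped() started from a
   * refcount of 0, the release here would underflow it and free the
   * page prematurely.
   */
  kvm_release_pfn_clean(pfn);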
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 virt/kvm/kvm_main.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dc7b60c81039..d9b7001227e3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1513,6 +1513,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
 	return true;
 }
 
+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+	if (kvm_is_reserved_pfn(pfn))
+		return 1;
+	return get_page_unless_zero(pfn_to_page(pfn));
+}
+
 static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       unsigned long addr, bool *async,
 			       bool write_fault, bool *writable,
@@ -1562,13 +1569,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 * Whoever called remap_pfn_range is also going to call e.g.
 	 * unmap_mapping_range before the underlying pages are freed,
 	 * causing a call to our MMU notifier.
+	 *
+	 * Certain IO or PFNMAP mappings can be backed with valid
+	 * struct pages, but be allocated without refcounting e.g.,
+	 * tail pages of non-compound higher order allocations, which
+	 * would then underflow the refcount when the caller does the
+	 * required put_page. Don't allow those pages here.
 	 */
-	kvm_get_pfn(pfn);
+	if (!kvm_try_get_pfn(pfn))
+		r = -EFAULT;
 
 out:
 	pte_unmap_unlock(ptep, ptl);
 	*p_pfn = pfn;
-	return 0;
+
+	return r;
 }
 
 /*