February 2018 - Linux-stable-mirror

[4.9-stable PATCH 06/11] v4l2: disable filesystem-dax mapping support

by Dan Williams

commit b70131de648c2b997d22f4653934438013f407a1 upstream.

V4L2 memory registrations are incompatible with filesystem-dax that
needs the ability to revoke dma access to a mapping at will, or
otherwise allow the kernel to wait for completion of DMA. The
filesystem-dax implementation breaks the traditional solution of
truncate of active file backed mappings since there is no page-cache
page we can orphan to sustain ongoing DMA.

If v4l2 wants to support long lived DMA mappings it needs to arrange to
hold a file lease or use some other mechanism so that the kernel can
coordinate revoking DMA access when the filesystem needs to truncate
mappings.

Link: http://lkml.kernel.org/r/151068940499.7446.12846708245365671207.stgit@dwill…
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reported-by: Jan Kara <jack(a)suse.cz>
Reviewed-by: Jan Kara <jack(a)suse.cz>
Cc: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Doug Ledford <dledford(a)redhat.com>
Cc: Hal Rosenstock <hal.rosenstock(a)gmail.com>
Cc: Inki Dae <inki.dae(a)samsung.com>
Cc: Jason Gunthorpe <jgg(a)mellanox.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: Joonyoung Shim <jy0922.shim(a)samsung.com>
Cc: Kyungmin Park <kyungmin.park(a)samsung.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Sean Hefty <sean.hefty(a)intel.com>
Cc: Seung-Woo Kim <sw0312.kim(a)samsung.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 1db0af6c7f94..b6189a4958c5 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -185,12 +185,13 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
 	dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
 		data, size, dma->nr_pages);
-	err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+	err = get_user_pages_longterm(data & PAGE_MASK, dma->nr_pages,
 			     flags, dma->pages, NULL);
 	if (err != dma->nr_pages) {
 		dma->nr_pages = (err >= 0) ? err : 0;
-		dprintk(1, "get_user_pages: err=%d [%d]\n", err, dma->nr_pages);
+		dprintk(1, "get_user_pages_longterm: err=%d [%d]\n", err,
+			dma->nr_pages);
 		return err < 0 ? err : -EINVAL;
 	}
 	return 0;


[4.9-stable PATCH 05/11] mm: introduce get_user_pages_longterm

by Dan Williams

commit 2bb6d2837083de722bfdc369cb0d76ce188dd9b4 upstream.

Patch series "introduce get_user_pages_longterm()", v2.

Here is a new get_user_pages api for cases where a driver intends to
keep an elevated page count indefinitely. This is distinct from usages
like iov_iter_get_pages where the elevated page counts are transient.
The iov_iter_get_pages cases immediately turn around and submit the
pages to a device driver which will put_page when the i/o operation
completes (under kernel control).

In the longterm case userspace is responsible for dropping the page
reference at some undefined point in the future. This is untenable for
filesystem-dax case where the filesystem is in control of the lifetime
of the block / page and needs reasonable limits on how long it can wait
for pages in a mapping to become idle.

Fixing filesystems to actually wait for dax pages to be idle before
blocks from a truncate/hole-punch operation are repurposed is saved for
a later patch series.

Also, allowing longterm registration of dax mappings is a future patch
series that introduces a "map with lease" semantic where the kernel can
revoke a lease and force userspace to drop its page references.

I have also tagged these for -stable to purposely break cases that
might assume that longterm memory registrations for filesystem-dax
mappings were supported by the kernel. The behavior regression this
policy change implies is one of the reasons we maintain the "dax
enabled. Warning: EXPERIMENTAL, use at your own risk" notification when
mounting a filesystem in dax mode.

It is worth noting the device-dax interface does not suffer the same
constraints since it does not support file space management operations
like hole-punch.

This patch (of 4):

Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow long standing memory registrations against
filesytem-dax vmas. Device-dax vmas do not have this problem and are
explicitly allowed.

This is temporary until a "memory registration with layout-lease"
mechanism can be implemented for the affected sub-systems (RDMA and
V4L2).

[akpm(a)linux-foundation.org: use kcalloc()]
Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwill…
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Suggested-by: Christoph Hellwig <hch(a)lst.de>
Cc: Doug Ledford <dledford(a)redhat.com>
Cc: Hal Rosenstock <hal.rosenstock(a)gmail.com>
Cc: Inki Dae <inki.dae(a)samsung.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Jason Gunthorpe <jgg(a)mellanox.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: Joonyoung Shim <jy0922.shim(a)samsung.com>
Cc: Kyungmin Park <kyungmin.park(a)samsung.com>
Cc: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Sean Hefty <sean.hefty(a)intel.com>
Cc: Seung-Woo Kim <sw0312.kim(a)samsung.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
---
 include/linux/dax.h |  5 ----
 include/linux/fs.h  | 20 ++++++++++++++++
 include/linux/mm.h  | 13 ++++++++++
 mm/gup.c            | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index add6c4bc568f..ed9cf2f5cd06 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -61,11 +61,6 @@ static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb)
-static inline bool vma_is_dax(struct vm_area_struct *vma)
-{
-	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
-}
-
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d705ae084edd..745ea1b2e02c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -18,6 +18,7 @@
 #include <linux/bug.h>
 #include <linux/mutex.h>
 #include <linux/rwsem.h>
+#include <linux/mm_types.h>
 #include <linux/capability.h>
 #include <linux/semaphore.h>
 #include <linux/fiemap.h>
@@ -3033,6 +3034,25 @@ static inline bool io_is_direct(struct file *filp)
 	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
 }
+static inline bool vma_is_dax(struct vm_area_struct *vma)
+{
+	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
+}
+
+static inline bool vma_is_fsdax(struct vm_area_struct *vma)
+{
+	struct inode *inode;
+
+	if (!vma->vm_file)
+		return false;
+	if (!vma_is_dax(vma))
+		return false;
+	inode = file_inode(vma->vm_file);
+	if (inode->i_mode == S_IFCHR)
+		return false; /* device-dax */
+	return true;
+}
+
 static inline int iocb_flags(struct file *file)
 {
 	int res = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2217e2f18247..8e506783631b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1288,6 +1288,19 @@ long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
 		    struct page **pages, unsigned int gup_flags);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 		    struct page **pages, unsigned int gup_flags);
+#ifdef CONFIG_FS_DAX
+long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
+			    unsigned int gup_flags, struct page **pages,
+			    struct vm_area_struct **vmas);
+#else
+static inline long get_user_pages_longterm(unsigned long start,
+		unsigned long nr_pages, unsigned int gup_flags,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+}
+#endif /* CONFIG_FS_DAX */
+
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
diff --git a/mm/gup.c b/mm/gup.c
index c63a0341ae38..6c3b4e822946 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -982,6 +982,70 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
 }
 EXPORT_SYMBOL(get_user_pages);
+#ifdef CONFIG_FS_DAX
+/*
+ * This is the same as get_user_pages() in that it assumes we are
+ * operating on the current task's mm, but it goes further to validate
+ * that the vmas associated with the address range are suitable for
+ * longterm elevated page reference counts. For example, filesystem-dax
+ * mappings are subject to the lifetime enforced by the filesystem and
+ * we need guarantees that longterm users like RDMA and V4L2 only
+ * establish mappings that have a kernel enforced revocation mechanism.
+ *
+ * "longterm" == userspace controlled elevated page count lifetime.
+ * Contrast this to iov_iter_get_pages() usages which are transient.
+ */
+long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
+		unsigned int gup_flags, struct page **pages,
+		struct vm_area_struct **vmas_arg)
+{
+	struct vm_area_struct **vmas = vmas_arg;
+	struct vm_area_struct *vma_prev = NULL;
+	long rc, i;
+
+	if (!pages)
+		return -EINVAL;
+
+	if (!vmas) {
+		vmas = kcalloc(nr_pages, sizeof(struct vm_area_struct *),
+			       GFP_KERNEL);
+		if (!vmas)
+			return -ENOMEM;
+	}
+
+	rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+
+	for (i = 0; i < rc; i++) {
+		struct vm_area_struct *vma = vmas[i];
+
+		if (vma == vma_prev)
+			continue;
+
+		vma_prev = vma;
+
+		if (vma_is_fsdax(vma))
+			break;
+	}
+
+	/*
+	 * Either get_user_pages() failed, or the vma validation
+	 * succeeded, in either case we don't need to put_page() before
+	 * returning.
+	 */
+	if (i >= rc)
+		goto out;
+
+	for (i = 0; i < rc; i++)
+		put_page(pages[i]);
+	rc = -EOPNOTSUPP;
+out:
+	if (vmas != vmas_arg)
+		kfree(vmas);
+	return rc;
+}
+EXPORT_SYMBOL(get_user_pages_longterm);
+#endif /* CONFIG_FS_DAX */
+
 /**
  * populate_vma_page_range() - populate a range of pages in the vma.
  * @vma: target vma
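
The vma validation walk in get_user_pages_longterm() can be modelled in user space as a rough sketch. Everything here is hypothetical scaffolding rather than kernel API: struct mock_vma stands in for struct vm_area_struct, its is_fsdax flag stands in for the result of vma_is_fsdax(), and -1 stands in for -EOPNOTSUPP. The sketch only shows the "check each distinct vma once, reject on the first filesystem-dax vma" logic; it does not pin or release pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vm_area_struct; not a kernel type. */
struct mock_vma {
	bool is_fsdax;	/* what vma_is_fsdax() would report for this vma */
};

/*
 * vmas[i] is the vma backing page i, in the way get_user_pages() fills
 * its vmas array. Returns 0 when a long-term pin would be acceptable and
 * -1 (standing in for -EOPNOTSUPP) when any page lies in a
 * filesystem-dax vma.
 */
static int check_longterm_vmas(struct mock_vma *const *vmas, size_t nr)
{
	const struct mock_vma *prev = NULL;
	size_t i;

	assert(vmas != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		/* Consecutive pages usually share a vma; test each vma once. */
		if (vmas[i] == prev)
			continue;
		prev = vmas[i];
		if (vmas[i]->is_fsdax)
			return -1;
	}
	return 0;
}
```

In the real function a rejected range additionally has every pinned page released with put_page() before the error is returned, which this model leaves out.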


Patch "mm: Fix devm_memremap_pages() collision handling" has been added to the 4.9-stable tree

by gregkh＠linuxfoundation.org

This is a note to let you know that I've just added the patch titled

    mm: Fix devm_memremap_pages() collision handling

to the 4.9-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…

The filename of the patch is:
    mm-fix-devm_memremap_pages-collision-handling.patch
and it can be found in the queue-4.9 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.

From foo@baz Mon Feb 26 20:55:53 CET 2018
From: Dan Williams <dan.j.williams(a)intel.com>
Date: Fri, 23 Feb 2018 14:06:10 -0800
Subject: mm: Fix devm_memremap_pages() collision handling
To: gregkh(a)linuxfoundation.org
Cc:
Message-ID: <151942357089.21775.3486425046348885247.stgit(a)dwillia2-desk3.amr.corp.intel.com>

From: Jan H. Schönherr <jschoenh(a)amazon.de>

commit 77dd66a3c67c93ab401ccc15efff25578be281fd upstream.

If devm_memremap_pages() detects a collision while adding entries
to the radix-tree, we call pgmap_radix_release(). Unfortunately,
the function removes *all* entries for the range -- including the
entries that caused the collision in the first place.

Modify pgmap_radix_release() to take an additional argument to
indicate where to stop, so that only newly added entries are removed
from the tree.

Cc: <stable(a)vger.kernel.org>
Fixes: 9476df7d80df ("mm: introduce find_dev_pagemap()")
Signed-off-by: Jan H. Schönherr <jschoenh(a)amazon.de>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
 kernel/memremap.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -194,7 +194,7 @@ void put_zone_device_page(struct page *p
 }
 EXPORT_SYMBOL(put_zone_device_page);
-static void pgmap_radix_release(struct resource *res)
+static void pgmap_radix_release(struct resource *res, resource_size_t end_key)
 {
 	resource_size_t key, align_start, align_size, align_end;
@@ -203,8 +203,11 @@ static void pgmap_radix_release(struct r
 	align_end = align_start + align_size - 1;
 	mutex_lock(&pgmap_lock);
-	for (key = res->start; key <= res->end; key += SECTION_SIZE)
+	for (key = res->start; key <= res->end; key += SECTION_SIZE) {
+		if (key >= end_key)
+			break;
 		radix_tree_delete(&pgmap_radix, key >> PA_SECTION_SHIFT);
+	}
 	mutex_unlock(&pgmap_lock);
 }
@@ -255,7 +258,7 @@ static void devm_memremap_pages_release(
 	unlock_device_hotplug();
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
-	pgmap_radix_release(res);
+	pgmap_radix_release(res, -1);
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
 			"%s: failed to free all reserved pages\n", __func__);
 }
@@ -289,7 +292,7 @@ struct dev_pagemap *find_dev_pagemap(res
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap)
 {
-	resource_size_t key, align_start, align_size, align_end;
+	resource_size_t key = 0, align_start, align_size, align_end;
 	pgprot_t pgprot = PAGE_KERNEL;
 	struct dev_pagemap *pgmap;
 	struct page_map *page_map;
@@ -392,7 +395,7 @@ void *devm_memremap_pages(struct device
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
  err_pfn_remap:
  err_radix:
-	pgmap_radix_release(res);
+	pgmap_radix_release(res, key);
 	devres_free(page_map);
 	return ERR_PTR(error);
 }

Patches currently in stable-queue which might be from dan.j.williams(a)intel.com are

queue-4.9/mm-fix-devm_memremap_pages-collision-handling.patch
queue-4.9/ib-core-disable-memory-registration-of-filesystem-dax-vmas.patch
queue-4.9/mm-avoid-spurious-bad-pmd-warning-messages.patch
queue-4.9/mm-introduce-get_user_pages_longterm.patch
queue-4.9/mm-fail-get_vaddr_frames-for-filesystem-dax-mappings.patch
queue-4.9/fs-dax.c-fix-inefficiency-in-dax_writeback_mapping_range.patch
queue-4.9/device-dax-implement-split-to-catch-invalid-munmap-attempts.patch
queue-4.9/v4l2-disable-filesystem-dax-mapping-support.patch
queue-4.9/libnvdimm-dax-fix-1gb-aligned-namespaces-vs-physical-misalignment.patch
queue-4.9/x86-entry-64-clear-extra-registers-beyond-syscall-arguments-to-reduce-speculation-attack-surface.patch
queue-4.9/libnvdimm-fix-integer-overflow-static-analysis-warning.patch
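
The rollback problem this patch describes can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel code: a fixed bool array named slots stands in for the pgmap radix tree, keys advance by 1 instead of SECTION_SIZE, and insert_range(), release_range() and insert_or_rollback() are hypothetical names. The point is only that the release loop stops at the key where the collision was found, so the pre-existing entry survives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 16

/* Hypothetical stand-in for the radix tree: slot k is true when key k exists. */
static bool slots[NSLOTS];

/* Add keys start..end. On a collision, store the colliding key and fail. */
static int insert_range(size_t start, size_t end, size_t *key_out)
{
	size_t key;

	assert(start <= end && end < NSLOTS);

	for (key = start; key <= end; key++) {
		*key_out = key;
		if (slots[key])
			return -1;	/* somebody else owns this key */
		slots[key] = true;
	}
	return 0;
}

/* Remove keys start..end, but never touch keys at or beyond end_key. */
static void release_range(size_t start, size_t end, size_t end_key)
{
	size_t key;

	for (key = start; key <= end; key++) {
		if (key >= end_key)
			break;
		slots[key] = false;
	}
}

/* Insert a range and undo only our own insertions if a collision occurs. */
static int insert_or_rollback(size_t start, size_t end)
{
	size_t key = 0;

	if (insert_range(start, end, &key) == 0)
		return 0;
	release_range(start, end, key);
	return -1;
}
```

A full teardown corresponds to calling release_range(start, end, SIZE_MAX), which mirrors the patch passing -1 as end_key from devm_memremap_pages_release().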


[4.9-stable PATCH 10/11] mm: fail get_vaddr_frames() for filesystem-dax mappings

by Dan Williams

commit b7f0554a56f21fb3e636a627450a9add030889be upstream.

Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow V4L2, Exynos, and other frame vector users to create
long standing / irrevocable memory registrations against filesytem-dax
vmas.

[dan.j.williams(a)intel.com: add comment for vma_is_fsdax() check in get_vaddr_frames(), per Jan]
Link: http://lkml.kernel.org/r/151197874035.26211.4061781453123083667.stgit@dwill…
Link: http://lkml.kernel.org/r/151068939985.7446.15684639617389154187.stgit@dwill…
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
Cc: Inki Dae <inki.dae(a)samsung.com>
Cc: Seung-Woo Kim <sw0312.kim(a)samsung.com>
Cc: Joonyoung Shim <jy0922.shim(a)samsung.com>
Cc: Kyungmin Park <kyungmin.park(a)samsung.com>
Cc: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Doug Ledford <dledford(a)redhat.com>
Cc: Hal Rosenstock <hal.rosenstock(a)gmail.com>
Cc: Jason Gunthorpe <jgg(a)mellanox.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Sean Hefty <sean.hefty(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
---
 mm/frame_vector.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index db77dcb38afd..375a103d7a56 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -52,6 +52,18 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 		ret = -EFAULT;
 		goto out;
 	}
+
+	/*
+	 * While get_vaddr_frames() could be used for transient (kernel
+	 * controlled lifetime) pinning of memory pages all current
+	 * users establish long term (userspace controlled lifetime)
+	 * page pinning. Treat get_vaddr_frames() like
+	 * get_user_pages_longterm() and disallow it for filesystem-dax
+	 * mappings.
+	 */
+	if (vma_is_fsdax(vma))
+		return -EOPNOTSUPP;
+
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
 		vec->got_ref = true;
 		vec->is_pfns = false;


[4.9-stable PATCH 01/11] mm: avoid spurious 'bad pmd' warning messages

by Dan Williams

From: Ross Zwisler <ross.zwisler(a)linux.intel.com>

commit d0f0931de936a0a468d7e59284d39581c16d3a73 upstream.

When the pmd_devmap() checks were added by 5c7fb56e5e3f ("mm, dax:
dax-pmd vs thp-pmd vs hugetlbfs-pmd") to add better support for DAX huge
pages, they were all added to the end of if() statements after existing
pmd_trans_huge() checks. So, things like:

  -       if (pmd_trans_huge(*pmd))
  +       if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))

When further checks were added after pmd_trans_unstable() checks by
commit 7267ec008b5c ("mm: postpone page table allocation until we have
page to map") they were also added at the end of the conditional:

  +       if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))

This ordering is fine for pmd_trans_huge(), but doesn't work for
pmd_trans_unstable(). This is because DAX huge pages trip the bad_pmd()
check inside of pmd_none_or_trans_huge_or_clear_bad() (called by
pmd_trans_unstable()), which prints out a warning and returns 1. So, we
do end up doing the right thing, but only after spamming dmesg with
suspicious looking messages:

  mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)

Reorder these checks in a helper so that pmd_devmap() is checked first,
avoiding the error messages, and add a comment explaining why the
ordering is important.

Fixes: commit 7267ec008b5c ("mm: postpone page table allocation until we have page to map")
Link: http://lkml.kernel.org/r/20170522215749.23516-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
Cc: Pawel Lebioda <pawel.lebioda(a)intel.com>
Cc: "Darrick J. Wong" <darrick.wong(a)oracle.com>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Matthew Wilcox <mawilcox(a)microsoft.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: Xiong Zhou <xzhou(a)redhat.com>
Cc: Eryu Guan <eguan(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
---
 mm/memory.c | 40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e2e68767a373..d2db2c4eb0a4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,6 +2848,17 @@ static int __do_fault(struct fault_env *fe, pgoff_t pgoff,
 	return ret;
 }
+/*
+ * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
+ * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
+ * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
+ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
+ */
+static int pmd_devmap_trans_unstable(pmd_t *pmd)
+{
+	return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
+}
+
 static int pte_alloc_one_map(struct fault_env *fe)
 {
 	struct vm_area_struct *vma = fe->vma;
@@ -2871,18 +2882,27 @@ static int pte_alloc_one_map(struct fault_env *fe)
 map_pte:
 	/*
 	 * If a huge pmd materialized under us just retry later. Use
-	 * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
-	 * didn't become pmd_trans_huge under us and then back to pmd_none, as
-	 * a result of MADV_DONTNEED running immediately after a huge pmd fault
-	 * in a different thread of this mm, in turn leading to a misleading
-	 * pmd_trans_huge() retval. All we have to ensure is that it is a
-	 * regular pmd that we can walk with pte_offset_map() and we can do that
-	 * through an atomic read in C, which is what pmd_trans_unstable()
-	 * provides.
+	 * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
+	 * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
+	 * under us and then back to pmd_none, as a result of MADV_DONTNEED
+	 * running immediately after a huge pmd fault in a different thread of
+	 * this mm, in turn leading to a misleading pmd_trans_huge() retval.
+	 * All we have to ensure is that it is a regular pmd that we can walk
+	 * with pte_offset_map() and we can do that through an atomic read in
+	 * C, which is what pmd_trans_unstable() provides.
 	 */
-	if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
+	if (pmd_devmap_trans_unstable(fe->pmd))
 		return VM_FAULT_NOPAGE;
+	/*
+	 * At this point we know that our vmf->pmd points to a page of ptes
+	 * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
+	 * for the duration of the fault. If a racing MADV_DONTNEED runs and
+	 * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
+	 * be valid and we will re-check to make sure the vmf->pte isn't
+	 * pte_none() under vmf->ptl protection when we return to
+	 * alloc_set_pte().
+	 */
 	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
 			&fe->ptl);
 	return 0;
@@ -3456,7 +3476,7 @@ static int handle_pte_fault(struct fault_env *fe)
 		fe->pte = NULL;
 	} else {
 		/* See comment in pte_alloc_one_map() */
-		if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
+		if (pmd_devmap_trans_unstable(fe->pmd))
 			return 0;
 		/*
 		 * A regular pmd is established and it can't morph into a huge
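
Why the operand order matters comes down to short-circuit evaluation of ||. The following user-space sketch models it with hypothetical names: struct mock_pmd, mock_trans_unstable() and the warnings counter are inventions for illustration, with the counter standing in for the "bad pmd" line printed to dmesg. Both orderings return the same truth value for a devmap entry; only the side effect differs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a pmd entry; not the kernel's pmd_t. */
struct mock_pmd {
	bool devmap;	/* entry maps a DAX huge page */
	bool regular;	/* entry points at a normal page of ptes */
};

/* Counts how many "bad pmd" warnings the model would have printed. */
static int warnings;

/* Stand-in for pmd_trans_unstable(): complains about anything not regular. */
static bool mock_trans_unstable(const struct mock_pmd *pmd)
{
	assert(pmd != NULL);

	if (!pmd->regular) {
		warnings++;	/* models the spurious dmesg line */
		return true;
	}
	return false;
}

/* Old ordering: the noisy check runs before the devmap test. */
static bool unstable_then_devmap(const struct mock_pmd *pmd)
{
	return mock_trans_unstable(pmd) || pmd->devmap;
}

/* New ordering: a devmap entry short-circuits and the noisy check is skipped. */
static bool devmap_then_unstable(const struct mock_pmd *pmd)
{
	return pmd->devmap || mock_trans_unstable(pmd);
}
```

For a devmap entry the first helper increments warnings once and the second leaves it untouched, while both return true.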


[4.9-stable PATCH 03/11] libnvdimm: fix integer overflow static analysis warning

by Dan Williams

commit 58738c495e15badd2015e19ff41f1f1ed55200bc upstream.

Dan reports:

    The patch 62232e45f4a2: "libnvdimm: control (ioctl) messages for
    nvdimm_bus and nvdimm devices" from Jun 8, 2015, leads to the
    following static checker warning:

        drivers/nvdimm/bus.c:1018 __nd_ioctl()
        warn: integer overflows 'buf_len'

    From a casual review, this seems like it might be a real bug. On
    the first iteration we load some data into in_env[]. On the second
    iteration we read a use controlled "in_size" from nd_cmd_in_size().
    It can go up to UINT_MAX - 1. A high number means we will fill the
    whole in_env[] buffer. But we potentially keep looping and adding
    more to in_len so now it can be any value.

    It simple enough to change, but it feels weird that we keep looping
    even though in_env is totally full. Shouldn't we just return an
    error if we don't have space for desc->in_num.

We keep looping because the size of the total input is allowed to be
bigger than the 'envelope' which is a subset of the payload that tells
us how much data to expect. For safety explicitly check that buf_len
does not overflow which is what the checker flagged.

Cc: <stable(a)vger.kernel.org>
Fixes: 62232e45f4a2: "libnvdimm: control (ioctl) messages for nvdimm_bus..."
Reported-by: Dan Carpenter <dan.carpenter(a)oracle.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
 drivers/nvdimm/bus.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 0392eb8a0dea..8311a93cabd8 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -812,16 +812,17 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm,
 		int read_only, unsigned int ioctl_cmd, unsigned long arg)
 {
 	struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc;
-	size_t buf_len = 0, in_len = 0, out_len = 0;
 	static char out_env[ND_CMD_MAX_ENVELOPE];
 	static char in_env[ND_CMD_MAX_ENVELOPE];
 	const struct nd_cmd_desc *desc = NULL;
 	unsigned int cmd = _IOC_NR(ioctl_cmd);
 	void __user *p = (void __user *) arg;
 	struct device *dev = &nvdimm_bus->dev;
-	struct nd_cmd_pkg pkg;
 	const char *cmd_name, *dimm_name;
+	u32 in_len = 0, out_len = 0;
 	unsigned long cmd_mask;
+	struct nd_cmd_pkg pkg;
+	u64 buf_len = 0;
 	void *buf;
 	int rc, i;
@@ -882,7 +883,7 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm,
 	}
 	if (cmd == ND_CMD_CALL) {
-		dev_dbg(dev, "%s:%s, idx: %llu, in: %zu, out: %zu, len %zu\n",
+		dev_dbg(dev, "%s:%s, idx: %llu, in: %u, out: %u, len %llu\n",
 				__func__, dimm_name, pkg.nd_command,
 				in_len, out_len, buf_len);
@@ -912,9 +913,9 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm,
 		out_len += out_size;
 	}
-	buf_len = out_len + in_len;
+	buf_len = (u64) out_len + (u64) in_len;
 	if (buf_len > ND_IOCTL_MAX_BUFLEN) {
-		dev_dbg(dev, "%s:%s cmd: %s buf_len: %zu > %d\n", __func__,
+		dev_dbg(dev, "%s:%s cmd: %s buf_len: %llu > %d\n", __func__,
 				dimm_name, cmd_name, buf_len,
 				ND_IOCTL_MAX_BUFLEN);
 		return -EINVAL;
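
The class of bug the checker flagged can be shown in isolation. The sketch below is an illustration, not the driver code: MOCK_MAX_BUFLEN is a hypothetical cap standing in for ND_IOCTL_MAX_BUFLEN, and the two helper names are invented. It contrasts a length sum computed in 32 bits, which can wrap to a small value and slip past the bound check, with the widened sum the patch uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cap for this sketch; the driver uses ND_IOCTL_MAX_BUFLEN. */
#define MOCK_MAX_BUFLEN (4u << 20)

/* Sum computed in 32 bits: it wraps modulo 2^32 before the bound check. */
static bool accepts_with_u32_sum(uint32_t in_len, uint32_t out_len)
{
	uint32_t buf_len = in_len + out_len;

	return buf_len <= MOCK_MAX_BUFLEN;
}

/* Operands widened first, as in the patch: the sum cannot wrap. */
static bool accepts_with_u64_sum(uint32_t in_len, uint32_t out_len)
{
	uint64_t buf_len = (uint64_t)in_len + (uint64_t)out_len;

	return buf_len <= MOCK_MAX_BUFLEN;
}

/* Self-check: a huge in_len wraps to 1 in 32 bits and is wrongly accepted. */
static void overflow_demo(void)
{
	assert(accepts_with_u32_sum(UINT32_MAX, 2));
	assert(!accepts_with_u64_sum(UINT32_MAX, 2));
}
```

Calling overflow_demo() exercises both helpers with in_len = UINT32_MAX and out_len = 2; ordinary small lengths are accepted by either version.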


[4.9-stable PATCH 08/11] libnvdimm, dax: fix 1GB-aligned namespaces vs physical misalignment

by Dan Williams

commit 41fce90f26333c4fa82e8e43b9ace86c4e8a0120 upstream.

The following namespace configuration attempt:

    # ndctl create-namespace -e namespace0.0 -m devdax -a 1G -f
    libndctl: ndctl_dax_enable: dax0.1: failed to enable
      Error: namespace0.0: failed to enable

    failed to reconfigure namespace: No such device or address

...fails when the backing memory range is not physically aligned to 1G:

    # cat /proc/iomem | grep Persistent
    210000000-30fffffff : Persistent Memory (legacy)

In the above example the 4G persistent memory range starts and ends on a
256MB boundary.

We handle this case correctly when needing to handle cases that violate
section alignment (128MB) collisions against "System RAM", and we simply
need to extend that padding/truncation for the 1GB alignment use case.

Cc: <stable(a)vger.kernel.org>
Fixes: 315c562536c4 ("libnvdimm, pfn: add 'align' attribute...")
Reported-and-tested-by: Jane Chu <jane.chu(a)oracle.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
 drivers/nvdimm/pfn_devs.c | 15 ++++++++++++---
 include/linux/kernel.h    |  1 +
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 42abdd2391c9..d6aa59ca68b9 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -563,6 +563,12 @@ static struct vmem_altmap *__nvdimm_setup_pfn(struct nd_pfn *nd_pfn,
 	return altmap;
 }
+static u64 phys_pmem_align_down(struct nd_pfn *nd_pfn, u64 phys)
+{
+	return min_t(u64, PHYS_SECTION_ALIGN_DOWN(phys),
+			ALIGN_DOWN(phys, nd_pfn->align));
+}
+
 static int nd_pfn_init(struct nd_pfn *nd_pfn)
 {
 	u32 dax_label_reserve = is_nd_dax(&nd_pfn->dev) ? SZ_128K : 0;
@@ -618,13 +624,16 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 	start = nsio->res.start;
 	size = PHYS_SECTION_ALIGN_UP(start + size) - start;
 	if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
-				IORES_DESC_NONE) == REGION_MIXED) {
+				IORES_DESC_NONE) == REGION_MIXED
+			|| !IS_ALIGNED(start + resource_size(&nsio->res),
+				nd_pfn->align)) {
 		size = resource_size(&nsio->res);
-		end_trunc = start + size - PHYS_SECTION_ALIGN_DOWN(start + size);
+		end_trunc = start + size - phys_pmem_align_down(nd_pfn,
+				start + size);
 	}
 	if (start_pad + end_trunc)
-		dev_info(&nd_pfn->dev, "%s section collision, truncate %d bytes\n",
+		dev_info(&nd_pfn->dev, "%s alignment collision, truncate %d bytes\n",
 				dev_name(&ndns->dev), start_pad + end_trunc);
 	/*
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index bc6ed52a39b9..61054f12be7c 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -46,6 +46,7 @@
 #define REPEAT_BYTE(x)	((~0ul / 0xff) * (x))
 #define ALIGN(x, a)		__ALIGN_KERNEL((x), (a))
+#define ALIGN_DOWN(x, a)	__ALIGN_KERNEL((x) - ((a) - 1), (a))
 #define __ALIGN_MASK(x, mask)	__ALIGN_KERNEL_MASK((x), (mask))
 #define PTR_ALIGN(p, a)		((typeof(p))ALIGN((unsigned long)(p), (a)))
 #define IS_ALIGNED(x, a)		(((x) & ((typeof(x))(a) - 1)) == 0)
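
The end truncation for the /proc/iomem example in this commit message can be worked through in user space. This is a sketch under assumptions: align_down() and end_trunc() are hypothetical helper names, SECTION_SIZE is fixed at the 128MB figure the message cites, and the alignment is assumed to be a power of two. For powers of two, masking with ~(a - 1) gives the same result as the ALIGN_DOWN() macro the patch adds.

```c
#include <assert.h>
#include <stdint.h>

/* 128MB, the section size cited in the commit message. */
#define SECTION_SIZE (128ULL << 20)

/* Round x down to a multiple of a; a must be a power of two. */
static uint64_t align_down(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return x & ~(a - 1);
}

/*
 * Bytes to trim from the end of [start, start + size) so that the end
 * satisfies both the section alignment and the requested alignment.
 * This models phys_pmem_align_down(): take the lower of the two
 * rounded-down addresses.
 */
static uint64_t end_trunc(uint64_t start, uint64_t size, uint64_t align)
{
	uint64_t end = start + size;
	uint64_t by_section = align_down(end, SECTION_SIZE);
	uint64_t by_align = align_down(end, align);
	uint64_t aligned_end = by_section < by_align ? by_section : by_align;

	return end - aligned_end;
}
```

With start 0x210000000, size 4G and a 1G alignment, the end address 0x310000000 rounds down to 0x300000000, so end_trunc() yields 0x10000000, that is 256MB. This covers only the end of the range; the start padding is computed elsewhere in nd_pfn_init() and is not modelled here.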


[4.9-stable PATCH 07/11] IB/core: disable memory registration of filesystem-dax vmas

by Dan Williams

commit 5f1d43de54164dcfb9bfa542fcc92c1e1a1b6c1d upstream.

Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow RDMA to create long standing memory registrations
against filesytem-dax vmas.

Link: http://lkml.kernel.org/r/151068941011.7446.7766030590347262502.stgit@dwilli…
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reported-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Acked-by: Jason Gunthorpe <jgg(a)mellanox.com>
Acked-by: Doug Ledford <dledford(a)redhat.com>
Cc: Sean Hefty <sean.hefty(a)intel.com>
Cc: Hal Rosenstock <hal.rosenstock(a)gmail.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Inki Dae <inki.dae(a)samsung.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Joonyoung Shim <jy0922.shim(a)samsung.com>
Cc: Kyungmin Park <kyungmin.park(a)samsung.com>
Cc: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Seung-Woo Kim <sw0312.kim(a)samsung.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
---
 drivers/infiniband/core/umem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index c22fde6207d1..8e973a2993a6 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -193,7 +193,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	sg_list_start = umem->sg_head.sgl;
 	while (npages) {
-		ret = get_user_pages(cur_base,
+		ret = get_user_pages_longterm(cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
 				     gup_flags, page_list, vma_list);


[4.9-stable PATCH 02/11] fs/dax.c: fix inefficiency in dax_writeback_mapping_range()

by Dan Williams

From: Jan Kara <jack(a)suse.cz>

commit 1eb643d02b21412e603b42cdd96010a2ac31c05f upstream.

dax_writeback_mapping_range() fails to update iteration index when
searching radix tree for entries needing cache flushing. Thus each
pagevec worth of entries is searched starting from the start which is
inefficient and prone to livelocks. Update index properly.

Link: http://lkml.kernel.org/r/20170619124531.21491-1-jack@suse.cz
Fixes: 9973c98ecfda3 ("dax: add support for fsync/sync")
Signed-off-by: Jan Kara <jack(a)suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
---
 fs/dax.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/dax.c b/fs/dax.c
index 800748f10b3d..71f87d74afe1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -785,6 +785,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 			if (ret < 0)
 				return ret;
 		}
+		start_index = indices[pvec.nr - 1] + 1;
 	}
 	return 0;
 }
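
The one-line fix follows a common batched-lookup pattern: after each batch, resume the search just past the last index returned. A user-space sketch of that pattern follows. It is an illustration only: a sorted array stands in for the radix tree, BATCH stands in for the pagevec size, and lookup_batch() and walk_all() are hypothetical names rather than kernel functions.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BATCH 4	/* hypothetical batch size, standing in for a pagevec */

/*
 * Copy up to BATCH indices that are >= start from the sorted array
 * "tree" into "out". Returns how many were copied.
 */
static size_t lookup_batch(const unsigned long *tree, size_t n,
			   unsigned long start, unsigned long *out)
{
	size_t i, found = 0;

	for (i = 0; i < n && found < BATCH; i++)
		if (tree[i] >= start)
			out[found++] = tree[i];
	return found;
}

/* Visit every entry exactly once; returns the number of entries visited. */
static size_t walk_all(const unsigned long *tree, size_t n)
{
	unsigned long indices[BATCH];
	unsigned long start = 0;
	size_t visited = 0, nr;

	while ((nr = lookup_batch(tree, n, start, indices)) > 0) {
		visited += nr;
		/* This sketch does not handle an index of ULONG_MAX. */
		assert(indices[nr - 1] != ULONG_MAX);
		/* The fix: resume just past the last index seen. */
		start = indices[nr - 1] + 1;
	}
	return visited;
}
```

Without the update to start, every call to lookup_batch() would return the same first batch, which is the rescanning behaviour the commit message describes.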


[4.9-stable PATCH 04/11] device-dax: implement ->split() to catch invalid munmap attempts

by Dan Williams

commit 9702cffdbf2129516db679e4467db81e1cd287da upstream.

Similar to how device-dax enforces that the 'address', 'offset', and
'len' parameters to mmap() be aligned to the device's fundamental
alignment, the same constraints apply to munmap(). Implement ->split()
to fail munmap calls that violate the alignment constraint.

Otherwise, we later fail VM_BUG_ON checks in the unmap_page_range() path
with crash signatures of the form:

    vma ffff8800b60c8a88 start 00007f88c0000000 end 00007f88c0e00000
    next (null) prev (null) mm ffff8800b61150c0
    prot 8000000000000027 anon_vma (null) vm_ops ffffffffa0091240
    pgoff 0 file ffff8800b638ef80 private_data (null)
    flags: 0x380000fb(read|write|shared|mayread|maywrite|mayexec|mayshare|softdirty|mixedmap|hugepage)
    ------------[ cut here ]------------
    kernel BUG at mm/huge_memory.c:2014!
    [..]
    RIP: 0010:__split_huge_pud+0x12a/0x180
    [..]
    Call Trace:
     unmap_page_range+0x245/0xa40
     ? __vma_adjust+0x301/0x990
     unmap_vmas+0x4c/0xa0
     unmap_region+0xae/0x120
     ? __vma_rb_erase+0x11a/0x230
     do_munmap+0x276/0x410
     vm_munmap+0x6a/0xa0
     SyS_munmap+0x1d/0x30

Link: http://lkml.kernel.org/r/151130418681.4029.7118245855057952010.stgit@dwilli…
Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reported-by: Jeff Moyer <jmoyer(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
---
 drivers/dax/dax.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 40be3747724d..473b44c008dd 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -453,9 +453,21 @@ static int dax_dev_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 	return rc;
 }
+static int dax_dev_split(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct file *filp = vma->vm_file;
+	struct dax_dev *dax_dev = filp->private_data;
+	struct dax_region *dax_region = dax_dev->region;
+
+	if (!IS_ALIGNED(addr, dax_region->align))
+		return -EINVAL;
+	return 0;
+}
+
 static const struct vm_operations_struct dax_dev_vm_ops = {
 	.fault = dax_dev_fault,
 	.pmd_fault = dax_dev_pmd_fault,
+	.split = dax_dev_split,
 };
 static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
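
The alignment test performed by dax_dev_split() can be tried against the addresses in the crash signature. The helper below is a hypothetical user-space model named mock_split(), not the driver function, and the 1GB alignment used in the usage note is an assumption: the __split_huge_pud frame suggests a PUD-sized mapping, but the message does not state the device alignment.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Model of the ->split() check: refuse to split a mapping at an address
 * that is not a multiple of the device alignment. The alignment must be
 * a power of two.
 */
static int mock_split(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);

	if (addr & (align - 1))
		return -EINVAL;
	return 0;
}
```

Under the 1GB assumption, the vma start 0x7f88c0000000 passes the check while the end address 0x7f88c0e00000 (14MB past a 1GB boundary) is rejected with -EINVAL; the same end address would be accepted for a 2MB alignment.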

