On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote:
Add tracking of pages that were pinned via FOLL_PIN.
As mentioned in the FOLL_PIN documentation, callers who effectively set FOLL_PIN are required to ultimately free such pages via put_user_page(). The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET for DIO and/or RDMA use".
Pages that have been pinned via FOLL_PIN are identifiable via a new function call:
bool page_dma_pinned(struct page *page);
What to do in response to encountering such a page, is left to later patchsets. There is discussion about this in [1].
This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
This also has a couple of trivial, non-functional change fixes to try_get_compound_head(). That function got moved to the top of the file.
Maybe split that as a separate trivial patch.
This includes the following fix from Ira Weiny:
DAX requires detection of a page crossing to a ref count of 1. Fix this for GUP pages by introducing put_devmap_managed_user_page() which accounts for GUP_PIN_COUNTING_BIAS now used by GUP.
Please do the put_devmap_managed_page() changes in a separate patch, it would be a lot easier to follow, also on that front see comments below.
[1] https://lwn.net/Articles/784574/ "Some slow progress on get_user_pages()"
Suggested-by: Jan Kara jack@suse.cz Suggested-by: Jérôme Glisse jglisse@redhat.com Signed-off-by: Ira Weiny ira.weiny@intel.com Signed-off-by: John Hubbard jhubbard@nvidia.com
include/linux/mm.h | 80 +++++++++++---- include/linux/mmzone.h | 2 + include/linux/page_ref.h | 10 ++ mm/gup.c | 213 +++++++++++++++++++++++++++++++-------- mm/huge_memory.c | 32 +++++- mm/hugetlb.c | 28 ++++- mm/memremap.c | 4 +- mm/vmstat.c | 2 + 8 files changed, 300 insertions(+), 71 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index cdfb6fedb271..03b3600843b7 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -972,9 +972,10 @@ static inline bool is_zone_device_page(const struct page *page) #endif #ifdef CONFIG_DEV_PAGEMAP_OPS -void __put_devmap_managed_page(struct page *page); +void __put_devmap_managed_page(struct page *page, int count); DECLARE_STATIC_KEY_FALSE(devmap_managed_key); -static inline bool put_devmap_managed_page(struct page *page)
+static inline bool page_is_devmap_managed(struct page *page) { if (!static_branch_unlikely(&devmap_managed_key)) return false; @@ -983,7 +984,6 @@ static inline bool put_devmap_managed_page(struct page *page) switch (page->pgmap->type) { case MEMORY_DEVICE_PRIVATE: case MEMORY_DEVICE_FS_DAX:
return true; default: break;__put_devmap_managed_page(page);
@@ -991,6 +991,19 @@ static inline bool put_devmap_managed_page(struct page *page) return false; } +static inline bool put_devmap_managed_page(struct page *page) +{
- bool is_devmap = page_is_devmap_managed(page);
- if (is_devmap) {
int count = page_ref_dec_return(page);
__put_devmap_managed_page(page, count);
- }
- return is_devmap;
+}
I think the __put_devmap_managed_page() should be rename to free_devmap_managed_page() and that the count != 1 case move to this inline function ie:
static inline bool put_devmap_managed_page(struct page *page) { bool is_devmap = page_is_devmap_managed(page);
if (is_devmap) { int count = page_ref_dec_return(page);
/* * If refcount is 1 then page is freed and refcount is stable as nobody * holds a reference on the page. */ if (count == 1) free_devmap_managed_page(page, count); else if (!count) __put_page(page); }
return is_devmap; }
#else /* CONFIG_DEV_PAGEMAP_OPS */ static inline bool put_devmap_managed_page(struct page *page) { @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page) return true; } +__must_check bool user_page_ref_inc(struct page *page);
What about having it as an inline here as it is pretty small.
static inline void put_page(struct page *page) { page = compound_head(page); @@ -1055,31 +1070,56 @@ static inline void put_page(struct page *page) __put_page(page); } -/**
- put_user_page() - release a gup-pinned page
- @page: pointer to page to be released
+/*
- GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
- the page's refcount so that two separate items are tracked: the original page
- reference count, and also a new count of how many get_user_pages() calls were
- made against the page. ("gup-pinned" is another term for the latter).
- With this scheme, get_user_pages() becomes special: such pages are marked
- as distinct from normal pages. As such, the new put_user_page() call (and
- its variants) must be used in order to release gup-pinned pages.
- Choice of value:
- Pages that were pinned via get_user_pages*() must be released via
- either put_user_page(), or one of the put_user_pages*() routines
- below. This is so that eventually, pages that are pinned via
- get_user_pages*() can be separately tracked and uniquely handled. In
- particular, interactions with RDMA and filesystems need special
- handling.
- By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
- counts with respect to get_user_pages() and put_user_page() becomes simpler,
- due to the fact that adding an even power of two to the page refcount has
- the effect of using only the upper N bits, for the code that counts up using
- the bias value. This means that the lower bits are left for the exclusive
- use of the original code that increments and decrements by one (or at least,
- by much smaller values than the bias value).
- put_user_page() and put_page() are not interchangeable, despite this early
- implementation that makes them look the same. put_user_page() calls must
- be perfectly matched up with get_user_page() calls.
- Of course, once the lower bits overflow into the upper bits (and this is
- OK, because subtraction recovers the original values), then visual inspection
- no longer suffices to directly view the separate counts. However, for normal
- applications that don't have huge page reference counts, this won't be an
- issue.
- Locking: the lockless algorithm described in page_cache_get_speculative()
- and page_cache_gup_pin_speculative() provides safe operation for
- get_user_pages and page_mkclean and other calls that race to set up page
*/
- table entries.
-static inline void put_user_page(struct page *page) -{
- put_page(page);
-} +#define GUP_PIN_COUNTING_BIAS (1UL << 10) +void put_user_page(struct page *page); void put_user_pages_dirty_lock(struct page **pages, unsigned long npages, bool make_dirty);
void put_user_pages(struct page **pages, unsigned long npages); +/**
- page_dma_pinned() - report if a page is pinned by a call to pin_user_pages*()
- or pin_longterm_pages*()
- @page: pointer to page to be queried.
- @Return: True, if it is likely that the page has been "dma-pinned".
False, if the page is definitely not dma-pinned.
- */
Maybe add a small comment about wrap around :)
+static inline bool page_dma_pinned(struct page *page) +{
- return (page_ref_count(compound_head(page))) >= GUP_PIN_COUNTING_BIAS;
+}
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) #define SECTION_IN_PAGE_FLAGS #endif
[...]
diff --git a/mm/gup.c b/mm/gup.c index 1aea48427879..c9727e65fad3 100644 --- a/mm/gup.c +++ b/mm/gup.c
[...]
@@ -1930,12 +2028,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr, pgmap = get_dev_pagemap(pfn, pgmap); if (unlikely(!pgmap)) {
undo_dev_pagemap(nr, nr_start, pages);
} SetPageReferenced(page); pages[*nr] = page;undo_dev_pagemap(nr, nr_start, flags, pages); return 0;
get_page(page);
if (flags & FOLL_PIN) {
if (unlikely(!user_page_ref_inc(page))) {
undo_dev_pagemap(nr, nr_start, flags, pages);
return 0;
}
Maybe add a comment about a case that should never happens ie user_page_ref_inc() fails after the second iteration of the loop as it would be broken and a bug to call undo_dev_pagemap() after the first iteration of that loop.
Also i believe that this should never happens as if first iteration succeed than __page_cache_add_speculative() will succeed for all the iterations.
Note that the pgmap case above follows that too ie the call to get_dev_pagemap() can only fail on first iteration of the loop, well i assume you can never have a huge device page that span different pgmap ie different devices (which is a reasonable assumption). So maybe this code needs fixing ie :
pgmap = get_dev_pagemap(pfn, pgmap); if (unlikely(!pgmap)) return 0;
} else
get_page(page);
- (*nr)++; pfn++; } while (addr += PAGE_SIZE, addr != end);
[...]
@@ -2409,7 +2540,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages, unsigned long addr, len, end; int nr = 0, ret = 0;
- if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
- if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))
Maybe add a comments to explain, something like:
/* * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN * * Note that get_user_pages_fast() imply FOLL_GET flag by default but * callers can over-ride this default to pin case by setting FOLL_PIN. */
return -EINVAL;
start = untagged_addr(start) & PAGE_MASK; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 13cc93785006..66bf4c8b88f1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c
[...]
@@ -968,7 +973,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, if (!*pgmap) return ERR_PTR(-EFAULT); page = pfn_to_page(pfn);
- get_page(page);
- if (flags & FOLL_GET)
get_page(page);
- else if (flags & FOLL_PIN)
if (unlikely(!user_page_ref_inc(page)))
page = ERR_PTR(-ENOMEM);
While i agree that user_page_ref_inc() (ie page_cache_add_speculative()) should never fails here as we are holding the pmd lock and thus no one can unmap the pmd and free the page it points to. I believe you should return -EFAULT like for the pgmap and not -ENOMEM as the pgmap should not fail either for the same reason. Thus it would be better to have consistent error. Maybe also add a comments explaining that it should not fail here.
return page; }
[...]
@@ -1100,7 +1115,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, * device mapped pages can only be returned if the * caller will manage the page reference count. */
- if (!(flags & FOLL_GET))
- if (!(flags & (FOLL_GET | FOLL_PIN))) return ERR_PTR(-EEXIST);
Maybe add a comment that FOLL_GET or FOLL_PIN must be set.
pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT; @@ -1108,7 +1123,12 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, if (!*pgmap) return ERR_PTR(-EFAULT); page = pfn_to_page(pfn);
- get_page(page);
- if (flags & FOLL_GET)
get_page(page);
- else if (flags & FOLL_PIN)
if (unlikely(!user_page_ref_inc(page)))
page = ERR_PTR(-ENOMEM);
Same as for follow_devmap_pmd() see above.
return page; } @@ -1522,8 +1542,12 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, skip_mlock: page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT; VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
- if (flags & FOLL_GET) get_page(page);
- else if (flags & FOLL_PIN)
if (unlikely(!user_page_ref_inc(page)))
page = NULL;
This should not fail either as we are holding the pmd lock maybe add a comment. Dunno if we want a WARN() or something to catch this degenerate case, or dump the page.
out: return page; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index b45a95363a84..da335b1cd798 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4462,7 +4462,17 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, same_page: if (pages) { pages[i] = mem_map_offset(page, pfn_offset);
get_page(pages[i]);
if (flags & FOLL_GET)
get_page(pages[i]);
else if (flags & FOLL_PIN)
if (unlikely(!user_page_ref_inc(pages[i]))) {
spin_unlock(ptl);
remainder = 0;
err = -ENOMEM;
WARN_ON_ONCE(1);
break;
}}
user_page_ref_inc() should not fail here either because we hold the ptl, so the WAR_ON_ONCE() is right but maybe add a comment.
if (vmas)
[...]
@@ -5034,8 +5050,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address, pte = huge_ptep_get((pte_t *)pmd); if (pte_present(pte)) { page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
- if (flags & FOLL_GET) get_page(page);
else if (flags & FOLL_PIN)
if (unlikely(!user_page_ref_inc(page))) {
page = NULL;
goto out;
}
This should not fail either (again holding pmd lock), dunno if we want a warn or something to catch this degenerate case.
} else { if (is_hugetlb_entry_migration(pte)) { spin_unlock(ptl);
[...]