Hi Andrew, please consider this series for 4.18.
For maintainability, as ZONE_DEVICE continues to attract new users, it is useful to keep all users consolidated on devm_memremap_pages() as the interface for creating "device pages".
The devm_memremap_pages() implementation was recently reworked to make it more generic for arbitrary users, like the proposed peer-to-peer PCI-E enabling. HMM pre-dated this rework and opted to duplicate devm_memremap_pages() as hmm_devmem_pages_create().
Rework HMM to be a consumer of devm_memremap_pages() directly and fix up the licensing on the exports given the deep dependencies on the mm.
Patches are based on v4.17-rc6, where there are no upstream consumers of the HMM functionality.
---
Dan Williams (5):
      mm, devm_memremap_pages: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
      mm, devm_memremap_pages: handle errors allocating final devres action
      mm, hmm: use devm semantics for hmm_devmem_{add,remove}
      mm, hmm: replace hmm_devmem_pages_create() with devm_memremap_pages()
      mm, hmm: mark hmm_devmem_{add,add_resource} EXPORT_SYMBOL_GPL

 Documentation/vm/hmm.txt |    1 
 include/linux/hmm.h      |    4 -
 include/linux/memremap.h |    1 
 kernel/memremap.c        |   39 +++++-
 mm/hmm.c                 |  297 +++++++---------------------------------------
 5 files changed, 77 insertions(+), 265 deletions(-)
The last step before devm_memremap_pages() returns success is to allocate a release action to tear the entire setup down. However, the result from devm_add_action() is not checked.
Checking the error also means that we need to handle the fact that the percpu_ref may not be killed by the time devm_memremap_pages_release() runs. Add a new state flag for this case.
Cc: stable@vger.kernel.org
Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/memremap.h |    1 +
 kernel/memremap.c        |    8 ++++++--
 2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..44a7ee517513 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -115,6 +115,7 @@ struct dev_pagemap {
 	dev_page_free_t page_free;
 	struct vmem_altmap altmap;
 	bool altmap_valid;
+	bool registered;
 	struct resource res;
 	struct percpu_ref *ref;
 	struct device *dev;
diff --git a/kernel/memremap.c b/kernel/memremap.c
index c614645227a7..30d96be5a965 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -296,7 +296,7 @@ static void devm_memremap_pages_release(void *data)
 	for_each_device_pfn(pfn, pgmap)
 		put_page(pfn_to_page(pfn));
 
-	if (percpu_ref_tryget_live(pgmap->ref)) {
+	if (pgmap->registered && percpu_ref_tryget_live(pgmap->ref)) {
 		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
 		percpu_ref_put(pgmap->ref);
 	}
@@ -418,7 +418,11 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 		percpu_ref_get(pgmap->ref);
 	}
 
-	devm_add_action(dev, devm_memremap_pages_release, pgmap);
+	error = devm_add_action_or_reset(dev, devm_memremap_pages_release,
+			pgmap);
+	if (error)
+		return ERR_PTR(error);
+	pgmap->registered = true;
 
 	return __va(res->start);
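For callers, nothing new is required: devm_memremap_pages() already returns an ERR_PTR() on earlier failures, so the usual check also covers the new error path. A minimal illustrative caller (not a diff against any particular driver):

	void *addr;

	addr = devm_memremap_pages(dev, pgmap);
	if (IS_ERR(addr))
		return PTR_ERR(addr);	/* now also covers a failed devres registration */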
On Mon, 21 May 2018 15:35:24 -0700 Dan Williams dan.j.williams@intel.com wrote:
The last step before devm_memremap_pages() returns success is to allocate a release action to tear the entire setup down. However, the result from devm_add_action() is not checked.
Checking the error also means that we need to handle the fact that the percpu_ref may not be killed by the time devm_memremap_pages_release() runs. Add a new state flag for this case.
Cc: stable@vger.kernel.org
Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Why the cc:stable? The changelog doesn't describe the end-user-visible impact of the bug (it always should, for this reason).
AFAICT we only go wrong when a small GFP_KERNEL allocation fails (basically never happens), with undescribed results :(
[ resend as the last attempt dropped all the cc's ]
On Mon, May 21, 2018 at 4:10 PM, Andrew Morton akpm@linux-foundation.org wrote:
On Mon, 21 May 2018 15:35:24 -0700 Dan Williams dan.j.williams@intel.com wrote:
The last step before devm_memremap_pages() returns success is to allocate a release action to tear the entire setup down. However, the result from devm_add_action() is not checked.
Checking the error also means that we need to handle the fact that the percpu_ref may not be killed by the time devm_memremap_pages_release() runs. Add a new state flag for this case.
Cc: stable@vger.kernel.org
Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Why the cc:stable? The changelog doesn't describe the end-user-visible impact of the bug (it always should, for this reason).
True, I should have included that; one of these years I'll stop making this mistake.
AFAICT we only go wrong when a small GFP_KERNEL allocation fails (basically never happens), with undescribed results :(
Here's a better changelog:
---

The last step before devm_memremap_pages() returns success is to allocate a release action to tear the entire setup down. However, the result from devm_add_action() is not checked.
Checking the error also means that we need to handle the fact that the percpu_ref may not be killed by the time devm_memremap_pages_release() runs. Add a new state flag for this case.
Without this change we could fail to register the teardown of devm_memremap_pages(). The likelihood of hitting this failure is tiny, as small memory allocations almost always succeed. However, the impact of the failure is large: any future reconfiguration, or disable/enable, of an nvdimm namespace will fail forever, because subsequent calls to devm_memremap_pages() will fail to set up the pgmap_radix given the stale entries left behind for the physical address range.
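To spell out the failure sequence (a hand-written illustration of the flow described above, not the in-tree code):

	/* 1st enable: the range is recorded in pgmap_radix */
	devm_memremap_pages(dev, pgmap);
	/* ...but devm_add_action() fails silently, so no release action
	 * is registered for this mapping
	 */

	/* disable: the driver unbinds, yet with no release action the
	 * pgmap_radix entries for the physical range are never removed
	 */

	/* re-enable: the new call collides with the stale radix entries
	 * and fails -- and keeps failing -- forever
	 */
	devm_memremap_pages(dev, pgmap);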
Hey Dan,
On 21/05/18 06:07 PM, Dan Williams wrote:
Without this change we could fail to register the teardown of devm_memremap_pages(). The likelihood of hitting this failure is tiny, as small memory allocations almost always succeed. However, the impact of the failure is large: any future reconfiguration, or disable/enable, of an nvdimm namespace will fail forever, because subsequent calls to devm_memremap_pages() will fail to set up the pgmap_radix given the stale entries left behind for the physical address range.
Sorry, I don't follow this. The change only seems to prevent a warning from occurring in this situation. Won't pgmap_radix_release() still be called regardless of whether this patch is applied?
But it looks to me like this patch doesn't quite solve the issue -- at least when looking at dax/pmem.c: If devm_add_action_or_reset() fails, then dax_pmem_percpu_kill() won't be registered as an action and the percpu_ref will never get killed. Thus, dax_pmem_percpu_release() would not get called and dax_pmem_percpu_exit() will hang waiting for a completion that will never occur. So we probably need to add a kill call somewhere on the failing path...
Logan
On Tue, May 22, 2018 at 9:42 AM, Logan Gunthorpe logang@deltatee.com wrote:
Hey Dan,
On 21/05/18 06:07 PM, Dan Williams wrote:
Without this change we could fail to register the teardown of devm_memremap_pages(). The likelihood of hitting this failure is tiny, as small memory allocations almost always succeed. However, the impact of the failure is large: any future reconfiguration, or disable/enable, of an nvdimm namespace will fail forever, because subsequent calls to devm_memremap_pages() will fail to set up the pgmap_radix given the stale entries left behind for the physical address range.
Sorry, I don't follow this. The change only seems to prevent a warning from occurring in this situation. Won't pgmap_radix_release() still be called regardless of whether this patch is applied?
devm_add_action() does not call the release function; devm_add_action_or_reset() does.
But it looks to me like this patch doesn't quite solve the issue -- at least when looking at dax/pmem.c: If devm_add_action_or_reset() fails, then dax_pmem_percpu_kill() won't be registered as an action and the percpu_ref will never get killed. Thus, dax_pmem_percpu_release() would not get called and dax_pmem_percpu_exit() will hang waiting for a completion that will never occur. So we probably need to add a kill call somewhere on the failing path...
Ah, true, good catch!
We should manually kill in the !registered case. I think this means we need to pass in the custom kill routine, because for the pmem driver it's blk_freeze_queue_start().
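As a sketch of what "pass in the custom kill routine" could look like (the ->kill() callback below is hypothetical, not something in this series):

	error = devm_add_action_or_reset(dev, devm_memremap_pages_release,
			pgmap);
	if (error) {
		/* hypothetical: let the caller-provided routine kill the
		 * ref when the teardown was never registered
		 */
		if (pgmap->kill)
			pgmap->kill(pgmap->ref);
		return ERR_PTR(error);
	}
	pgmap->registered = true;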
On 22/05/18 10:56 AM, Dan Williams wrote:
On Tue, May 22, 2018 at 9:42 AM, Logan Gunthorpe logang@deltatee.com wrote:
Hey Dan,
On 21/05/18 06:07 PM, Dan Williams wrote:
Without this change we could fail to register the teardown of devm_memremap_pages(). The likelihood of hitting this failure is tiny, as small memory allocations almost always succeed. However, the impact of the failure is large: any future reconfiguration, or disable/enable, of an nvdimm namespace will fail forever, because subsequent calls to devm_memremap_pages() will fail to set up the pgmap_radix given the stale entries left behind for the physical address range.
Sorry, I don't follow this. The change only seems to prevent a warning from occurring in this situation. Won't pgmap_radix_release() still be called regardless of whether this patch is applied?
devm_add_action() does not call the release function; devm_add_action_or_reset() does.
Oh, yes. Thanks I see that now.
Ah, true, good catch!
We should manually kill in the !registered case. I think this means we need to pass in the custom kill routine, because for the pmem driver it's blk_freeze_queue_start().
It may be cleaner to just have the caller call the specific kill function if devm_memremap_pages fails... Though, I don't fully understand how the nvdimm pmem driver cleans up the percpu counter.
Logan
On Tue, May 22, 2018 at 10:03 AM, Logan Gunthorpe logang@deltatee.com wrote:
On 22/05/18 10:56 AM, Dan Williams wrote:
On Tue, May 22, 2018 at 9:42 AM, Logan Gunthorpe logang@deltatee.com wrote:
Hey Dan,
On 21/05/18 06:07 PM, Dan Williams wrote:
Without this change we could fail to register the teardown of devm_memremap_pages(). The likelihood of hitting this failure is tiny, as small memory allocations almost always succeed. However, the impact of the failure is large: any future reconfiguration, or disable/enable, of an nvdimm namespace will fail forever, because subsequent calls to devm_memremap_pages() will fail to set up the pgmap_radix given the stale entries left behind for the physical address range.
Sorry, I don't follow this. The change only seems to prevent a warning from occurring in this situation. Won't pgmap_radix_release() still be called regardless of whether this patch is applied?
devm_add_action() does not call the release function; devm_add_action_or_reset() does.
Oh, yes. Thanks I see that now.
Ah, true, good catch!
We should manually kill in the !registered case. I think this means we need to pass in the custom kill routine, because for the pmem driver it's blk_freeze_queue_start().
It may be cleaner to just have the caller call the specific kill function if devm_memremap_pages fails...
As far as I can see, by then it's too late, or we need to expose release details to the caller, which defeats the purpose of devm semantics.
Though, I don't fully understand how the nvdimm pmem driver cleans up the percpu counter.
The dev_pagemap setup for pmem is entirely too subtle, and arguably a layering violation, as it reuses the block layer's q_usage_counter percpu_ref. We arrange for that counter to be shut down before blk_cleanup_queue() does the same.
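Roughly, and paraphrased from memory of drivers/nvdimm/pmem.c (helper names and exact ordering may not match the tree), the arrangement is:

	/* pmem_attach_disk(), abridged */
	pmem->pgmap.ref = &q->q_usage_counter;	/* reuse the block layer ref */

	/* registered first, so it runs last on teardown */
	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
		return -ENOMEM;

	addr = devm_memremap_pages(dev, &pmem->pgmap);

	/* registered last, so the freeze (ref kill) runs before
	 * blk_cleanup_queue() in pmem_release_queue()
	 */
	if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
		return -ENOMEM;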
On 22/05/18 11:25 AM, Dan Williams wrote:
As far as I can see, by then it's too late, or we need to expose release details to the caller, which defeats the purpose of devm semantics.
In the dax/pmem case, I *think* it should be fine...
devm_add_action_or_reset() only calls the action it is passed on failure, not the entire devm chain. In which case, it should drop all the references to the percpu counter. Then, if on an error return from devm_memremap_pages() we call dax_pmem_percpu_kill(), the rest of the devm chain should be called when we return from a failed probe, and it should proceed as usual.

I think dax_pmem_percpu_kill() also must be called on any error return from devm_memremap_pages(), and it currently is not...
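Concretely, the missing piece would land in the dax/pmem probe path, something like this (a sketch, paraphrased rather than a diff against drivers/dax/pmem.c):

	addr = devm_memremap_pages(dev, &dax_pmem->pgmap);
	if (IS_ERR(addr)) {
		/* the kill action below was never registered, so kill the
		 * ref by hand to keep dax_pmem_percpu_exit() from waiting
		 * on a completion that will never arrive
		 */
		dax_pmem_percpu_kill(&dax_pmem->ref);
		return PTR_ERR(addr);
	}

	rc = devm_add_action_or_reset(dev, dax_pmem_percpu_kill,
			&dax_pmem->ref);
	if (rc)
		return rc;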
Logan
On Mon, May 21, 2018 at 03:35:24PM -0700, Dan Williams wrote:
The last step before devm_memremap_pages() returns success is to allocate a release action to tear the entire setup down. However, the result from devm_add_action() is not checked.
Checking the error also means that we need to handle the fact that the percpu_ref may not be killed by the time devm_memremap_pages_release() runs. Add a new state flag for this case.
Looks good (modulo any stable tag issues):
Reviewed-by: Christoph Hellwig hch@lst.de
On Mon, May 21, 2018 at 03:35:14PM -0700, Dan Williams wrote:
Hi Andrew, please consider this series for 4.18.
For maintainability, as ZONE_DEVICE continues to attract new users, it is useful to keep all users consolidated on devm_memremap_pages() as the interface for creating "device pages".
The devm_memremap_pages() implementation was recently reworked to make it more generic for arbitrary users, like the proposed peer-to-peer PCI-E enabling. HMM pre-dated this rework and opted to duplicate devm_memremap_pages() as hmm_devmem_pages_create().
Rework HMM to be a consumer of devm_memremap_pages() directly and fix up the licensing on the exports given the deep dependencies on the mm.
I am on PTO right now, so I won't be able to quickly review it all, but forcing GPL exports is problematic for me now. I would rather have device drivers using "sane" common helpers than creating their own crazy things.
When I am back in a couple of weeks I will review this some more.
Patches are based on v4.17-rc6, where there are no upstream consumers of the HMM functionality.
Dan Williams (5):
      mm, devm_memremap_pages: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
      mm, devm_memremap_pages: handle errors allocating final devres action
      mm, hmm: use devm semantics for hmm_devmem_{add,remove}
      mm, hmm: replace hmm_devmem_pages_create() with devm_memremap_pages()
      mm, hmm: mark hmm_devmem_{add,add_resource} EXPORT_SYMBOL_GPL

 Documentation/vm/hmm.txt |    1 
 include/linux/hmm.h      |    4 -
 include/linux/memremap.h |    1 
 kernel/memremap.c        |   39 +++++-
 mm/hmm.c                 |  297 +++++++---------------------------------------
 5 files changed, 77 insertions(+), 265 deletions(-)
On Wed, May 23, 2018 at 5:10 PM, Jerome Glisse jglisse@redhat.com wrote:
On Mon, May 21, 2018 at 03:35:14PM -0700, Dan Williams wrote:
Hi Andrew, please consider this series for 4.18.
For maintainability, as ZONE_DEVICE continues to attract new users, it is useful to keep all users consolidated on devm_memremap_pages() as the interface for creating "device pages".
The devm_memremap_pages() implementation was recently reworked to make it more generic for arbitrary users, like the proposed peer-to-peer PCI-E enabling. HMM pre-dated this rework and opted to duplicate devm_memremap_pages() as hmm_devmem_pages_create().
Rework HMM to be a consumer of devm_memremap_pages() directly and fix up the licensing on the exports given the deep dependencies on the mm.
I am on PTO right now, so I won't be able to quickly review it all, but forcing GPL exports is problematic for me now. I would rather have device drivers using "sane" common helpers than creating their own crazy things.
Sane drivers that need this level of deep integration with Linux memory management need to be upstream. Otherwise, HMM is an unprecedented departure from the norms of Linux kernel development.
On Wed, May 23, 2018 at 08:18:11PM -0700, Dan Williams wrote:
On Wed, May 23, 2018 at 5:10 PM, Jerome Glisse jglisse@redhat.com wrote:
On Mon, May 21, 2018 at 03:35:14PM -0700, Dan Williams wrote:
Hi Andrew, please consider this series for 4.18.
For maintainability, as ZONE_DEVICE continues to attract new users, it is useful to keep all users consolidated on devm_memremap_pages() as the interface for creating "device pages".
The devm_memremap_pages() implementation was recently reworked to make it more generic for arbitrary users, like the proposed peer-to-peer PCI-E enabling. HMM pre-dated this rework and opted to duplicate devm_memremap_pages() as hmm_devmem_pages_create().
Rework HMM to be a consumer of devm_memremap_pages() directly and fix up the licensing on the exports given the deep dependencies on the mm.
I am on PTO right now, so I won't be able to quickly review it all, but forcing GPL exports is problematic for me now. I would rather have device drivers using "sane" common helpers than creating their own crazy things.
Sane drivers that need this level of deep integration with Linux memory management need to be upstream. Otherwise, HMM is an unprecedented departure from the norms of Linux kernel development.
Agreed. I consider every driver using this a derived work, regardless of the marking, and I'm willing to enforce this.