Changes since v8 [1]: - Rebase on next-20190604 to incorporate the removal of the MHP_MEMBLOCK_API flag and other cleanups from David.
- Move definition of subsection_mask_set() earlier into "mm/sparsemem: Add helpers track active portions of a section at boot" (Oscar)
- Cleanup unnecessary IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) in section_deactivate() in response to a request (declined) to split the pure CONFIG_SPARSEMEM bits from section_{de,}activate(). I submit that the maintenance is less error prone, especially when modifying common logic, if the implementations remain unified. (Oscar)
- Cleanup sparse_add_section() vs sparse_index_init() return code. (Oscar)
- Document ZONE_DEVICE and subsection semantics relative to CONFIG_SPARSEMEM_VMEMMAP in Documentation/vm/memory-model.rst. (Mike)
[1]: https://lore.kernel.org/lkml/155718596657.130019.17139634728875079809.stgit@...
---
The memory hotplug section is an arbitrary / convenient unit for memory hotplug. 'Section-size' units have bled into the user interface ('memblock' sysfs) and can not be changed without breaking existing userspace. The section-size constraint, while mostly benign for typical memory hotplug, has and continues to wreak havoc with 'device-memory' use cases, persistent memory (pmem) in particular. Recall that pmem uses devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a 'struct page' memmap for pmem. However, it does not use the 'bottom half' of memory hotplug, i.e. never marks pmem pages online and never exposes the userspace memblock interface for pmem. This leaves an opening to redress the section-size constraint.
To date, the libnvdimm subsystem has attempted to inject padding to satisfy the internal constraints of arch_add_memory(). Beyond complicating the code, leading to bugs [2], wasting memory, and limiting configuration flexibility, the padding hack is broken when the platform changes this physical memory alignment of pmem from one boot to the next. Device failure (intermittent or permanent) and physical reconfiguration are events that can cause the platform firmware to change the physical placement of pmem on a subsequent boot, and device failure is an everyday event in a data-center.
It turns out that sections are only a hard requirement of the user-facing interface for memory hotplug and with a bit more infrastructure sub-section arch_add_memory() support can be added for kernel internal usages like devm_memremap_pages(). Here is an analysis of the current design assumptions in the current code and how they are addressed in the new implementation:
Current design assumptions:
- Sections that describe boot memory (early sections) are never unplugged / removed.
- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a valid_section() check
- __add_pages() and helper routines assume all operations occur in PAGES_PER_SECTION units.
- The memblock sysfs interface only comprehends full sections
New design assumptions:
- Sections are instrumented with a sub-section bitmask to track (on x86) individual 2MB sub-divisions of a 128MB section.
- Partially populated early sections can be extended with additional sub-sections, and those sub-sections can be removed with arch_remove_memory(). With this in place we no longer lose usable memory capacity to padding.
- pfn_valid() is updated to look deeper than valid_section() to also check the active-sub-section mask. This indication is in the same cacheline as the valid_section() so the performance impact is expected to be negligible. So far the lkp robot has not reported any regressions.
- Outside of the core vmemmap population routines which are replaced, other helper routines like shrink_{zone,pgdat}_span() are updated to handle the smaller granularity. Core memory hotplug routines that deal with online memory are not touched.
- The existing memblock sysfs user api guarantees / assumptions are not touched since this capability is limited to !online !memblock-sysfs-accessible sections.
Meanwhile the issue reports continue to roll in from users that do not understand when and how the 128MB constraint will bite them. The current implementation relied on being able to support at least one misaligned namespace, but that immediately falls over on any moderately complex namespace creation attempt. Beyond the initial problem of 'System RAM' colliding with pmem, and the unsolvable problem of physical alignment changes, Linux is now being exposed to platforms that collide pmem ranges with other pmem ranges by default [3]. In short, devm_memremap_pages() has pushed the venerable section-size constraint past the breaking point, and the simplicity of section-aligned arch_add_memory() is no longer tenable.
These patches are exposed to the kbuild robot on my libnvdimm-pending branch [4], and a preview of the unit test for this functionality is available on the 'subsection-pending' branch of ndctl [5].
[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwil... [3]: https://github.com/pmem/ndctl/issues/76 [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnv... [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c
---
Dan Williams (12): mm/sparsemem: Introduce struct mem_section_usage mm/sparsemem: Add helpers track active portions of a section at boot mm/hotplug: Prepare shrink_{zone,pgdat}_span for sub-section removal mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap() mm/hotplug: Kill is_dev_zone() usage in __remove_pages() mm: Kill is_dev_zone() helper mm/sparsemem: Prepare for sub-section ranges mm/sparsemem: Support sub-section hotplug mm: Document ZONE_DEVICE memory-model implications mm/devm_memremap_pages: Enable sub-section remap libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields libnvdimm/pfn: Stop padding pmem namespaces to section alignment
Documentation/vm/memory-model.rst | 39 ++++ arch/powerpc/include/asm/sparsemem.h | 3 arch/x86/mm/init_64.c | 4 drivers/nvdimm/dax_devs.c | 2 drivers/nvdimm/pfn.h | 15 - drivers/nvdimm/pfn_devs.c | 95 +++------ include/linux/memory_hotplug.h | 7 - include/linux/mm.h | 4 include/linux/mmzone.h | 92 +++++++-- kernel/memremap.c | 61 ++---- mm/memory_hotplug.c | 171 +++++++++------- mm/page_alloc.c | 10 + mm/sparse-vmemmap.c | 21 +- mm/sparse.c | 359 +++++++++++++++++++++++----------- 14 files changed, 534 insertions(+), 349 deletions(-)
At namespace creation time there is the potential for the "expected to be zero" fields of a 'pfn' info-block to be filled with indeterminate data. While the kernel buffer is zeroed on allocation it is immediately overwritten by nd_pfn_validate() filling it with the current contents of the on-media info-block location. For fields like, 'flags' and the 'padding' it potentially means that future implementations can not rely on those fields being zero.
In preparation to stop using the 'start_pad' and 'end_trunc' fields for section alignment, arrange for fields that are not explicitly initialized to be guaranteed zero. Bump the minor version to indicate it is safe to assume the 'padding' and 'flags' are zero. Otherwise, this corruption is expected to benign since all other critical fields are explicitly initialized.
Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem") Cc: stable@vger.kernel.org Signed-off-by: Dan Williams dan.j.williams@intel.com --- drivers/nvdimm/dax_devs.c | 2 +- drivers/nvdimm/pfn.h | 1 + drivers/nvdimm/pfn_devs.c | 18 +++++++++++++++--- 3 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c index 0453f49dc708..326f02ffca81 100644 --- a/drivers/nvdimm/dax_devs.c +++ b/drivers/nvdimm/dax_devs.c @@ -126,7 +126,7 @@ int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns) nvdimm_bus_unlock(&ndns->dev); if (!dax_dev) return -ENOMEM; - pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL); + pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL); nd_pfn->pfn_sb = pfn_sb; rc = nd_pfn_validate(nd_pfn, DAX_SIG); dev_dbg(dev, "dax: %s\n", rc == 0 ? dev_name(dax_dev) : "<none>"); diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h index dde9853453d3..e901e3a3b04c 100644 --- a/drivers/nvdimm/pfn.h +++ b/drivers/nvdimm/pfn.h @@ -36,6 +36,7 @@ struct nd_pfn_sb { __le32 end_trunc; /* minor-version-2 record the base alignment of the mapping */ __le32 align; + /* minor-version-3 guarantee the padding and flags are zero */ u8 padding[4000]; __le64 checksum; }; diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c index 01f40672507f..a2406253eb70 100644 --- a/drivers/nvdimm/pfn_devs.c +++ b/drivers/nvdimm/pfn_devs.c @@ -420,6 +420,15 @@ static int nd_pfn_clear_memmap_errors(struct nd_pfn *nd_pfn) return 0; }
+/** + * nd_pfn_validate - read and validate info-block + * @nd_pfn: fsdax namespace runtime state / properties + * @sig: 'devdax' or 'fsdax' signature + * + * Upon return the info-block buffer contents (->pfn_sb) are + * indeterminate when validation fails, and a coherent info-block + * otherwise. + */ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig) { u64 checksum, offset; @@ -565,7 +574,7 @@ int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns) nvdimm_bus_unlock(&ndns->dev); if (!pfn_dev) return -ENOMEM; - pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL); + pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL); nd_pfn = to_nd_pfn(pfn_dev); nd_pfn->pfn_sb = pfn_sb; rc = nd_pfn_validate(nd_pfn, PFN_SIG); @@ -702,7 +711,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn) u64 checksum; int rc;
- pfn_sb = devm_kzalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL); + pfn_sb = devm_kmalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL); if (!pfn_sb) return -ENOMEM;
@@ -711,11 +720,14 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn) sig = DAX_SIG; else sig = PFN_SIG; + rc = nd_pfn_validate(nd_pfn, sig); if (rc != -ENODEV) return rc;
/* no info block, do init */; + memset(pfn_sb, 0, sizeof(*pfn_sb)); + nd_region = to_nd_region(nd_pfn->dev.parent); if (nd_region->ro) { dev_info(&nd_pfn->dev, @@ -768,7 +780,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn) memcpy(pfn_sb->uuid, nd_pfn->uuid, 16); memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16); pfn_sb->version_major = cpu_to_le16(1); - pfn_sb->version_minor = cpu_to_le16(2); + pfn_sb->version_minor = cpu_to_le16(3); pfn_sb->start_pad = cpu_to_le32(start_pad); pfn_sb->end_trunc = cpu_to_le32(end_trunc); pfn_sb->align = cpu_to_le32(nd_pfn->align);
On Wed, 05 Jun 2019 14:58:58 -0700 Dan Williams dan.j.williams@intel.com wrote:
At namespace creation time there is the potential for the "expected to be zero" fields of a 'pfn' info-block to be filled with indeterminate data. While the kernel buffer is zeroed on allocation it is immediately overwritten by nd_pfn_validate() filling it with the current contents of the on-media info-block location. For fields like, 'flags' and the 'padding' it potentially means that future implementations can not rely on those fields being zero.
In preparation to stop using the 'start_pad' and 'end_trunc' fields for section alignment, arrange for fields that are not explicitly initialized to be guaranteed zero. Bump the minor version to indicate it is safe to assume the 'padding' and 'flags' are zero. Otherwise, this corruption is expected to benign since all other critical fields are explicitly initialized.
Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem") Cc: stable@vger.kernel.org Signed-off-by: Dan Williams dan.j.williams@intel.com
The cc:stable in [11/12] seems odd. Is this independent of the other patches? If so, shouldn't it be a standalone thing which can be prioritized?
On Thu, Jun 6, 2019 at 2:46 PM Andrew Morton akpm@linux-foundation.org wrote:
On Wed, 05 Jun 2019 14:58:58 -0700 Dan Williams dan.j.williams@intel.com wrote:
At namespace creation time there is the potential for the "expected to be zero" fields of a 'pfn' info-block to be filled with indeterminate data. While the kernel buffer is zeroed on allocation it is immediately overwritten by nd_pfn_validate() filling it with the current contents of the on-media info-block location. For fields like, 'flags' and the 'padding' it potentially means that future implementations can not rely on those fields being zero.
In preparation to stop using the 'start_pad' and 'end_trunc' fields for section alignment, arrange for fields that are not explicitly initialized to be guaranteed zero. Bump the minor version to indicate it is safe to assume the 'padding' and 'flags' are zero. Otherwise, this corruption is expected to benign since all other critical fields are explicitly initialized.
Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem") Cc: stable@vger.kernel.org Signed-off-by: Dan Williams dan.j.williams@intel.com
The cc:stable in [11/12] seems odd. Is this independent of the other patches? If so, shouldn't it be a standalone thing which can be prioritized?
The cc: stable is about spreading this new policy to as many kernels as possible not fixing an issue in those kernels. It's not until patch 12 "libnvdimm/pfn: Stop padding pmem namespaces to section alignment" as all previous kernel do initialize all fields.
I'd be ok to drop that cc: stable, my concern is distros that somehow pickup and backport patch 12 and miss patch 11.
On Thu, 6 Jun 2019 15:06:26 -0700 Dan Williams dan.j.williams@intel.com wrote:
On Thu, Jun 6, 2019 at 2:46 PM Andrew Morton akpm@linux-foundation.org wrote:
On Wed, 05 Jun 2019 14:58:58 -0700 Dan Williams dan.j.williams@intel.com wrote:
At namespace creation time there is the potential for the "expected to be zero" fields of a 'pfn' info-block to be filled with indeterminate data. While the kernel buffer is zeroed on allocation it is immediately overwritten by nd_pfn_validate() filling it with the current contents of the on-media info-block location. For fields like, 'flags' and the 'padding' it potentially means that future implementations can not rely on those fields being zero.
In preparation to stop using the 'start_pad' and 'end_trunc' fields for section alignment, arrange for fields that are not explicitly initialized to be guaranteed zero. Bump the minor version to indicate it is safe to assume the 'padding' and 'flags' are zero. Otherwise, this corruption is expected to benign since all other critical fields are explicitly initialized.
Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem") Cc: stable@vger.kernel.org Signed-off-by: Dan Williams dan.j.williams@intel.com
The cc:stable in [11/12] seems odd. Is this independent of the other patches? If so, shouldn't it be a standalone thing which can be prioritized?
The cc: stable is about spreading this new policy to as many kernels as possible not fixing an issue in those kernels. It's not until patch 12 "libnvdimm/pfn: Stop padding pmem namespaces to section alignment" as all previous kernel do initialize all fields.
I'd be ok to drop that cc: stable, my concern is distros that somehow pickup and backport patch 12 and miss patch 11.
Could you please propose a changelog paragraph which explains all this to those who will be considering this patch for backports?
On Fri, Jun 7, 2019 at 12:54 PM Andrew Morton akpm@linux-foundation.org wrote:
On Thu, 6 Jun 2019 15:06:26 -0700 Dan Williams dan.j.williams@intel.com wrote:
On Thu, Jun 6, 2019 at 2:46 PM Andrew Morton akpm@linux-foundation.org wrote:
On Wed, 05 Jun 2019 14:58:58 -0700 Dan Williams dan.j.williams@intel.com wrote:
At namespace creation time there is the potential for the "expected to be zero" fields of a 'pfn' info-block to be filled with indeterminate data. While the kernel buffer is zeroed on allocation it is immediately overwritten by nd_pfn_validate() filling it with the current contents of the on-media info-block location. For fields like, 'flags' and the 'padding' it potentially means that future implementations can not rely on those fields being zero.
In preparation to stop using the 'start_pad' and 'end_trunc' fields for section alignment, arrange for fields that are not explicitly initialized to be guaranteed zero. Bump the minor version to indicate it is safe to assume the 'padding' and 'flags' are zero. Otherwise, this corruption is expected to benign since all other critical fields are explicitly initialized.
Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem") Cc: stable@vger.kernel.org Signed-off-by: Dan Williams dan.j.williams@intel.com
The cc:stable in [11/12] seems odd. Is this independent of the other patches? If so, shouldn't it be a standalone thing which can be prioritized?
The cc: stable is about spreading this new policy to as many kernels as possible not fixing an issue in those kernels. It's not until patch 12 "libnvdimm/pfn: Stop padding pmem namespaces to section alignment" as all previous kernel do initialize all fields.
I'd be ok to drop that cc: stable, my concern is distros that somehow pickup and backport patch 12 and miss patch 11.
Could you please propose a changelog paragraph which explains all this to those who will be considering this patch for backports?
Will do. I'll resend the series with this and the fixups proposed by Oscar.
Dan Williams dan.j.williams@intel.com writes:
At namespace creation time there is the potential for the "expected to be zero" fields of a 'pfn' info-block to be filled with indeterminate data. While the kernel buffer is zeroed on allocation it is immediately overwritten by nd_pfn_validate() filling it with the current contents of the on-media info-block location. For fields like, 'flags' and the 'padding' it potentially means that future implementations can not rely on those fields being zero.
In preparation to stop using the 'start_pad' and 'end_trunc' fields for section alignment, arrange for fields that are not explicitly initialized to be guaranteed zero. Bump the minor version to indicate it is safe to assume the 'padding' and 'flags' are zero. Otherwise, this corruption is expected to benign since all other critical fields are explicitly initialized.
Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem") Cc: stable@vger.kernel.org Signed-off-by: Dan Williams dan.j.williams@intel.com
drivers/nvdimm/dax_devs.c | 2 +- drivers/nvdimm/pfn.h | 1 + drivers/nvdimm/pfn_devs.c | 18 +++++++++++++++--- 3 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c index 0453f49dc708..326f02ffca81 100644 --- a/drivers/nvdimm/dax_devs.c +++ b/drivers/nvdimm/dax_devs.c @@ -126,7 +126,7 @@ int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns) nvdimm_bus_unlock(&ndns->dev); if (!dax_dev) return -ENOMEM;
- pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
- pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL); nd_pfn->pfn_sb = pfn_sb; rc = nd_pfn_validate(nd_pfn, DAX_SIG); dev_dbg(dev, "dax: %s\n", rc == 0 ? dev_name(dax_dev) : "<none>");
diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h index dde9853453d3..e901e3a3b04c 100644 --- a/drivers/nvdimm/pfn.h +++ b/drivers/nvdimm/pfn.h @@ -36,6 +36,7 @@ struct nd_pfn_sb { __le32 end_trunc; /* minor-version-2 record the base alignment of the mapping */ __le32 align;
- /* minor-version-3 guarantee the padding and flags are zero */ u8 padding[4000]; __le64 checksum;
}; diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c index 01f40672507f..a2406253eb70 100644 --- a/drivers/nvdimm/pfn_devs.c +++ b/drivers/nvdimm/pfn_devs.c @@ -420,6 +420,15 @@ static int nd_pfn_clear_memmap_errors(struct nd_pfn *nd_pfn) return 0; } +/**
- nd_pfn_validate - read and validate info-block
- @nd_pfn: fsdax namespace runtime state / properties
- @sig: 'devdax' or 'fsdax' signature
- Upon return the info-block buffer contents (->pfn_sb) are
- indeterminate when validation fails, and a coherent info-block
- otherwise.
- */
int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig) { u64 checksum, offset; @@ -565,7 +574,7 @@ int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns) nvdimm_bus_unlock(&ndns->dev); if (!pfn_dev) return -ENOMEM;
- pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
- pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL); nd_pfn = to_nd_pfn(pfn_dev); nd_pfn->pfn_sb = pfn_sb; rc = nd_pfn_validate(nd_pfn, PFN_SIG);
@@ -702,7 +711,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn) u64 checksum; int rc;
- pfn_sb = devm_kzalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL);
- pfn_sb = devm_kmalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL); if (!pfn_sb) return -ENOMEM;
@@ -711,11 +720,14 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn) sig = DAX_SIG; else sig = PFN_SIG;
- rc = nd_pfn_validate(nd_pfn, sig); if (rc != -ENODEV) return rc;
/* no info block, do init */;
- memset(pfn_sb, 0, sizeof(*pfn_sb));
- nd_region = to_nd_region(nd_pfn->dev.parent); if (nd_region->ro) { dev_info(&nd_pfn->dev,
@@ -768,7 +780,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn) memcpy(pfn_sb->uuid, nd_pfn->uuid, 16); memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16); pfn_sb->version_major = cpu_to_le16(1);
- pfn_sb->version_minor = cpu_to_le16(2);
- pfn_sb->version_minor = cpu_to_le16(3); pfn_sb->start_pad = cpu_to_le32(start_pad); pfn_sb->end_trunc = cpu_to_le32(end_trunc); pfn_sb->align = cpu_to_le32(nd_pfn->align);
How will this minor version 3 be used? If we are not having start_pad/end_trunc updated in pfn_sb, how will the older kernel enable these namesapces?
Do we need a patch like https://lore.kernel.org/linux-mm/20190604091357.32213-2-aneesh.kumar@linux.i...
-aneesh
linux-stable-mirror@lists.linaro.org