Lately Linux has encountered platforms that collide Persistent Memory regions between each other, specifically cases where ->start_pad needed to be non-zero. This lead to commit ae86cbfef381 "libnvdimm, pfn: Pad pfn namespaces relative to other regions". That commit allowed namespaces to be mapped with devm_memremap_pages(). However dax operations on those configurations currently fail if attempted within the ->start_pad range because pmem_device->data_offset was still relative to raw resource base not relative to the section aligned resource range mapped by devm_memremap_pages().
Luckily __bdev_dax_supported() caught these failures and simply disabled dax. However, to fix this situation a non-backwards compatible change needs to be made to the interpretation of the nd_pfn info-block. ->start_pad needs to be accounted in ->map.map_offset (formerly ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be adjusted to the section aligned resource base used to establish ->map.map formerly (formerly ->virt_addr).
See patch 7 "libnvdimm/pfn: Fix 'start_pad' implementation" for more details, and the ndctl patch series "Improve support + testing for labels + info-blocks" for the corresponding regression test.
---
Dan Williams (7): libnvdimm/pfn: Account for PAGE_SIZE > info-block-size in nd_pfn_init() libnvdimm/pmem: Honor force_raw for legacy pmem regions dax: Check the end of the block-device capacity with dax_direct_access() libnvdimm/pfn: Introduce super-block minimum version requirements libnvdimm/pfn: Remove dax_label_reserve libnvdimm/pfn: Introduce 'struct pfn_map_info' libnvdimm/pfn: Fix 'start_pad' implementation
drivers/dax/pmem.c | 9 +- drivers/dax/super.c | 39 ++++++-- drivers/nvdimm/namespace_devs.c | 4 + drivers/nvdimm/nd.h | 15 +++ drivers/nvdimm/pfn.h | 4 + drivers/nvdimm/pfn_devs.c | 181 ++++++++++++++++++++++++++++----------- drivers/nvdimm/pmem.c | 111 +++++++++++------------- drivers/nvdimm/pmem.h | 12 --- tools/testing/nvdimm/pmem-dax.c | 15 ++- 9 files changed, 244 insertions(+), 146 deletions(-)
For recovery, where non-dax access is needed to a given physical address range, and testing, allow the 'force_raw' attribute to override the default establishment of a dev_pagemap.
Otherwise without this capability it is possible to end up with a namespace that can not be activated due to corrupted info-block, and one that can not be repaired due to a section collision.
Cc: stable@vger.kernel.org Fixes: 004f1afbe199 ("libnvdimm, pmem: direct map legacy pmem by default") Signed-off-by: Dan Williams dan.j.williams@intel.com --- drivers/nvdimm/namespace_devs.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c index 4b077555ac70..33a3b23b3db7 100644 --- a/drivers/nvdimm/namespace_devs.c +++ b/drivers/nvdimm/namespace_devs.c @@ -138,6 +138,7 @@ bool nd_is_uuid_unique(struct device *dev, u8 *uuid) bool pmem_should_map_pages(struct device *dev) { struct nd_region *nd_region = to_nd_region(dev->parent); + struct nd_namespace_common *ndns = to_ndns(dev); struct nd_namespace_io *nsio;
if (!IS_ENABLED(CONFIG_ZONE_DEVICE)) @@ -149,6 +150,9 @@ bool pmem_should_map_pages(struct device *dev) if (is_nd_pfn(dev) || is_nd_btt(dev)) return false;
+ if (ndns->force_raw) + return false; + nsio = to_nd_namespace_io(dev); if (region_intersects(nsio->res.start, resource_size(&nsio->res), IORESOURCE_SYSTEM_RAM,
In the beginning the pmem driver simply passed the persistent memory resource range to memremap and was done. With the introduction of devm_memremap_pages() and vmem_altmap the implementation needed to contend with metadata at the start of the resource to indicate whether the vmemmap is located in System RAM or Persistent Memory, and reserve vmemmap capacity in pmem for the latter case.
The indication of metadata space was communicated in the nd_pfn->data_offset property and it was defined to be identical to the pmem_device->data_offset property, i.e. relative to the raw resource base of the namespace. Up until this point in the driver's development pmem_device->phys_addr == __pa(pmem_device->virt_addr). This implementation was fine up until the discovery of platforms with physical address layouts that mapped Persistent Memory and System RAM to the same Linux memory hotplug section (128MB span).
The nd_pfn->start_pad and nd_pfn->end_trunc properties were introduced to pad and truncate the capacity to fit within an exclusive Linux memory hotplug section span, and it was at this point where the ->start_pad definition did not comprehend the pmem_device->phys_addr to pmem_device->virt_addr relationship. Platforms in the wild typically only collided 'System RAM' at the end of the Persistent Memory range so ->start_pad was often zero.
Lately Linux has encountered platforms that collide Persistent Memory regions between each other, specifically cases where ->start_pad needed to be non-zero. This lead to commit ae86cbfef381 "libnvdimm, pfn: Pad pfn namespaces relative to other regions". That commit allowed namespaces to be mapped with devm_memremap_pages(). However dax operations on those configurations currently fail if attempted within the ->start_pad range because pmem_device->data_offset was still relative to raw resource base not relative to the section aligned resource range mapped by devm_memremap_pages().
Luckily __bdev_dax_supported() caught these failures and simply disabled dax. However, to fix this situation a non-backwards compatible change needs to be made to the interpretation of the nd_pfn info-block. ->start_pad needs to be accounted in ->map.map_offset (formerly ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be adjusted to the section aligned resource base used to establish ->map.map formerly (formerly ->virt_addr).
The guiding principles of the info-block compatibility fixup is to maintain the interpretation of ->data_offset for implementations like the EFI driver that only care about data_access not dax, but cause older Linux implementations that care about the mode and dax to fail to parse the new info-block.
A unit test in ndctl is introduced to test the collision case going forward as well as how new kernels interpret older info-block configurations. Recall that the __bdev_dax_supported() checks have been strengthened to catch more failure cases (changed in a previous patch) and the new test caught several regressions in the course of developing this fixed support.
Fixes: ae86cbfef381 ("libnvdimm, pfn: Pad pfn namespaces relative to other regions") Cc: stable@vger.kernel.org Signed-off-by: Dan Williams dan.j.williams@intel.com --- drivers/nvdimm/nd.h | 2 + drivers/nvdimm/pfn.h | 2 - drivers/nvdimm/pfn_devs.c | 163 +++++++++++++++++++++++++++++---------------- 3 files changed, 109 insertions(+), 58 deletions(-)
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h index 4339d338928b..7c33e7f7fae0 100644 --- a/drivers/nvdimm/nd.h +++ b/drivers/nvdimm/nd.h @@ -198,6 +198,8 @@ enum nd_pfn_mode { PFN_MODE_NONE, PFN_MODE_RAM, PFN_MODE_PMEM, + PFN3_MODE_RAM, + PFN3_MODE_PMEM, };
struct nd_pfn { diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h index 710cb743fad6..53563a67c48d 100644 --- a/drivers/nvdimm/pfn.h +++ b/drivers/nvdimm/pfn.h @@ -20,7 +20,7 @@ #define PFN_SIG_LEN 16 #define PFN_SIG "NVDIMM_PFN_INFO\0" #define DAX_SIG "NVDIMM_DAX_INFO\0" -#define PFN_VERSION_SUPPORT 2 +#define PFN_VERSION_SUPPORT 3
struct nd_pfn_sb { u8 signature[PFN_SIG_LEN]; diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c index 1c2a0e707da3..56fa75a299b7 100644 --- a/drivers/nvdimm/pfn_devs.c +++ b/drivers/nvdimm/pfn_devs.c @@ -210,6 +210,36 @@ static ssize_t namespace_store(struct device *dev, } static DEVICE_ATTR_RW(namespace);
+static u32 pfn_start_pad(struct nd_pfn *nd_pfn) +{ + struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb; + + /* starting v1.3 start_pad is accounted in dataoff */ + if (__le16_to_cpu(pfn_sb->version_minor) < 3) + return __le32_to_cpu(pfn_sb->start_pad); + return 0; +} + +/* + * Where does data start relative to the start of the namespace resource + * base? + */ +static u64 pfn_dataoff(struct nd_pfn *nd_pfn) +{ + return __le64_to_cpu(nd_pfn->pfn_sb->dataoff); +} + +/* + * How much of the namespace resource capacity is taking up by padding, + * outside of what is accounted in the data_offset? + */ +static u32 pfn_pad(struct nd_pfn *nd_pfn) +{ + struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb; + + return pfn_start_pad(nd_pfn) + __le32_to_cpu(pfn_sb->end_trunc); +} + static ssize_t resource_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -218,14 +248,11 @@ static ssize_t resource_show(struct device *dev,
device_lock(dev); if (dev->driver) { - struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb; - u64 offset = __le64_to_cpu(pfn_sb->dataoff); struct nd_namespace_common *ndns = nd_pfn->ndns; - u32 start_pad = __le32_to_cpu(pfn_sb->start_pad); struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
rc = sprintf(buf, "%#llx\n", (unsigned long long) nsio->res.start - + start_pad + offset); + + pfn_dataoff(nd_pfn) + pfn_start_pad(nd_pfn)); } else { /* no address to convey if the pfn instance is disabled */ rc = -ENXIO; @@ -244,16 +271,12 @@ static ssize_t size_show(struct device *dev,
device_lock(dev); if (dev->driver) { - struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb; - u64 offset = __le64_to_cpu(pfn_sb->dataoff); struct nd_namespace_common *ndns = nd_pfn->ndns; - u32 start_pad = __le32_to_cpu(pfn_sb->start_pad); - u32 end_trunc = __le32_to_cpu(pfn_sb->end_trunc); struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
rc = sprintf(buf, "%llu\n", (unsigned long long) - resource_size(&nsio->res) - start_pad - - end_trunc - offset); + resource_size(&nsio->res) - pfn_dataoff(nd_pfn) + - pfn_pad(nd_pfn)); } else { /* no size to convey if the pfn instance is disabled */ rc = -ENXIO; @@ -422,10 +445,10 @@ static int nd_pfn_clear_memmap_errors(struct nd_pfn *nd_pfn)
int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig) { + unsigned long align; u64 checksum, offset; enum nd_pfn_mode mode; struct nd_namespace_io *nsio; - unsigned long align, start_pad; struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb; struct nd_namespace_common *ndns = nd_pfn->ndns; const u8 *parent_uuid = nd_dev_to_uuid(&ndns->dev); @@ -465,20 +488,29 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig) if (__le16_to_cpu(pfn_sb->min_version) > PFN_VERSION_SUPPORT) return -EINVAL;
- switch (le32_to_cpu(pfn_sb->mode)) { + mode = le32_to_cpu(pfn_sb->mode); + /* + * PFN3_MODEs are used to make older Linux implementations fail + * to parse the info-block. + */ + switch (mode) { case PFN_MODE_RAM: case PFN_MODE_PMEM: break; + case PFN3_MODE_RAM: + mode = PFN_MODE_RAM; + break; + case PFN3_MODE_PMEM: + mode = PFN_MODE_PMEM; + break; default: return -ENXIO; }
align = le32_to_cpu(pfn_sb->align); offset = le64_to_cpu(pfn_sb->dataoff); - start_pad = le32_to_cpu(pfn_sb->start_pad); if (align == 0) align = 1UL << ilog2(offset); - mode = le32_to_cpu(pfn_sb->mode);
if (!nd_pfn->uuid) { /* @@ -534,7 +566,8 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig) return -EBUSY; }
- if ((align && !IS_ALIGNED(nsio->res.start + offset + start_pad, align)) + if ((align && !IS_ALIGNED(nsio->res.start + offset + + pfn_start_pad(nd_pfn), align)) || !IS_ALIGNED(offset, PAGE_SIZE)) { dev_err(&nd_pfn->dev, "bad offset: %#llx dax disabled align: %#lx\n", @@ -586,31 +619,23 @@ int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns) } EXPORT_SYMBOL(nd_pfn_probe);
-static u32 info_block_reserve(void) +static u32 info_block_reserve(u32 start_pad) { - return ALIGN(SZ_8K, PAGE_SIZE); + u32 reserve = ALIGN(SZ_8K, PAGE_SIZE); + + /* + * Does the start_pad subsume the info-block at the start of the + * raw resource base? + */ + if (start_pad <= reserve) + return reserve - start_pad; + return 0; }
/* * We hotplug memory at section granularity, pad the reserved area from * the previous section base to the namespace base address. */ -static unsigned long init_altmap_base(resource_size_t base) -{ - unsigned long base_pfn = PHYS_PFN(base); - - return PFN_SECTION_ALIGN_DOWN(base_pfn); -} - -static unsigned long init_altmap_reserve(resource_size_t base) -{ - unsigned long reserve = info_block_reserve() >> PAGE_SHIFT; - unsigned long base_pfn = PHYS_PFN(base); - - reserve += base_pfn - PFN_SECTION_ALIGN_DOWN(base_pfn); - return reserve; -} - static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap, struct pfn_map_info *mi) { @@ -618,31 +643,39 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap, struct vmem_altmap *altmap = &pgmap->altmap; struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb; u64 offset = le64_to_cpu(pfn_sb->dataoff); - u32 start_pad = __le32_to_cpu(pfn_sb->start_pad); + u32 start_pad = __le32_to_cpu(pfn_sb->start_pad), map_start_pad; u32 end_trunc = __le32_to_cpu(pfn_sb->end_trunc); - u32 reserve = info_block_reserve(); + u32 reserve = info_block_reserve(start_pad); struct nd_namespace_common *ndns = nd_pfn->ndns; struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev); resource_size_t base = nsio->res.start + start_pad; + struct device *dev = &nd_pfn->dev; struct vmem_altmap __altmap = { - .base_pfn = init_altmap_base(base), - .reserve = init_altmap_reserve(base), + .base_pfn = PHYS_PFN(base), + .reserve = PHYS_PFN(reserve), };
memcpy(res, &nsio->res, sizeof(*res)); res->start += start_pad; res->end -= end_trunc;
+ /* + * map_start_pad is adjusted to 0 for pre-v1.3 infoblocks to + * preserve a bug in the implementation that can only be fixed + * by migrating to a v1.3+ configuration. + * + * Post v1.3 start_pad is accounted in ->data_offset, and + * ensures that: + * __va(map_base) == __pa(map) + * + * map_pad is only non-zero when ensuring backwards + * compatibility with pre-v1.3 configurations. + */ + map_start_pad = start_pad - pfn_start_pad(nd_pfn); *mi = (struct pfn_map_info) { - /* - * TODO fix implementation to properly account for - * 'start_pad' in map_base, and map_offset. As is, the - * fact that __va(map_base) != __pa(map) leads - * mistranslations in pmem_direct_access(). - */ - .map_base = nsio->res.start, - .map_pad = start_pad, - .map_offset = offset, + .map_base = nsio->res.start + map_start_pad, + .map_pad = pfn_start_pad(nd_pfn), + .map_offset = offset - map_start_pad, .map_size = resource_size(res), };
@@ -653,14 +686,14 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap, pgmap->altmap_valid = false; } else if (nd_pfn->mode == PFN_MODE_PMEM) { nd_pfn->npfns = PFN_SECTION_ALIGN_UP((resource_size(res) + - pfn_start_pad(nd_pfn) - offset) / PAGE_SIZE); if (le64_to_cpu(nd_pfn->pfn_sb->npfns) > nd_pfn->npfns) - dev_info(&nd_pfn->dev, - "number of pfns truncated from %lld to %ld\n", + dev_info(dev, "number of pfns truncated from %lld to %ld\n", le64_to_cpu(nd_pfn->pfn_sb->npfns), nd_pfn->npfns); memcpy(altmap, &__altmap, sizeof(*altmap)); - altmap->free = PHYS_PFN(offset - reserve); + altmap->free = PHYS_PFN(mi->map_offset - reserve); altmap->alloc = 0; pgmap->altmap_valid = true; } else @@ -712,7 +745,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn) { struct nd_namespace_common *ndns = nd_pfn->ndns; struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev); - u32 start_pad, end_trunc, reserve = info_block_reserve(); + u32 start_pad, end_trunc, reserve; resource_size_t start, size; struct nd_region *nd_region; struct nd_pfn_sb *pfn_sb; @@ -750,6 +783,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn) if (start_pad + end_trunc) dev_info(&nd_pfn->dev, "%s alignment collision, truncate %d bytes\n", dev_name(&ndns->dev), start_pad + end_trunc); + reserve = info_block_reserve(start_pad);
/* * Note, we use 64 here for the standard size of struct page, @@ -769,29 +803,44 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn) */ offset = ALIGN(start + reserve + 64 * npfns, max(nd_pfn->align, PMD_SIZE)) - start; - } else if (nd_pfn->mode == PFN_MODE_RAM) + offset += start_pad; + } else if (nd_pfn->mode == PFN_MODE_RAM) { offset = ALIGN(start + reserve, nd_pfn->align) - start; - else + offset += start_pad; + } else return -ENXIO;
- if (offset + start_pad + end_trunc >= size) { + if (offset + end_trunc >= size) { dev_err(&nd_pfn->dev, "%s unable to satisfy requested alignment\n", dev_name(&ndns->dev)); return -ENXIO; }
- npfns = (size - offset - start_pad - end_trunc) / SZ_4K; - pfn_sb->mode = cpu_to_le32(nd_pfn->mode); + + npfns = (size - offset - end_trunc) / SZ_4K; + pfn_sb->dataoff = cpu_to_le64(offset); pfn_sb->npfns = cpu_to_le64(npfns); memcpy(pfn_sb->signature, sig, PFN_SIG_LEN); memcpy(pfn_sb->uuid, nd_pfn->uuid, 16); memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16); pfn_sb->version_major = cpu_to_le16(1); - pfn_sb->version_minor = cpu_to_le16(2); - pfn_sb->min_version = cpu_to_le16(2); pfn_sb->start_pad = cpu_to_le32(start_pad); pfn_sb->end_trunc = cpu_to_le32(end_trunc); + if (start_pad) { + /* + * Require implementations to account for start_pad in + * data_offset. Use PFN3_MODE to cause versions older + * than the introduction of min_version to fail. + */ + pfn_sb->mode = cpu_to_le32(nd_pfn->mode + 2); + pfn_sb->version_minor = cpu_to_le16(3); + pfn_sb->min_version = cpu_to_le16(3); + } else { + pfn_sb->mode = cpu_to_le32(nd_pfn->mode); + pfn_sb->version_minor = cpu_to_le16(2); + pfn_sb->min_version = cpu_to_le16(2); + } pfn_sb->align = cpu_to_le32(nd_pfn->align); checksum = nd_sb_checksum((struct nd_gen_sb *) pfn_sb); pfn_sb->checksum = cpu_to_le64(checksum);
Hi, Dan,
Thanks for the comprehensive write-up. Comments below.
Dan Williams dan.j.williams@intel.com writes:
In the beginning the pmem driver simply passed the persistent memory resource range to memremap and was done. With the introduction of devm_memremap_pages() and vmem_altmap the implementation needed to contend with metadata at the start of the resource to indicate whether the vmemmap is located in System RAM or Persistent Memory, and reserve vmemmap capacity in pmem for the latter case.
The indication of metadata space was communicated in the nd_pfn->data_offset property and it was defined to be identical to the pmem_device->data_offset property, i.e. relative to the raw resource base of the namespace. Up until this point in the driver's development pmem_device->phys_addr == __pa(pmem_device->virt_addr). This implementation was fine up until the discovery of platforms with physical address layouts that mapped Persistent Memory and System RAM to the same Linux memory hotplug section (128MB span).
The nd_pfn->start_pad and nd_pfn->end_trunc properties were introduced to pad and truncate the capacity to fit within an exclusive Linux memory hotplug section span, and it was at this point where the ->start_pad definition did not comprehend the pmem_device->phys_addr to pmem_device->virt_addr relationship. Platforms in the wild typically only collided 'System RAM' at the end of the Persistent Memory range so ->start_pad was often zero.
Lately Linux has encountered platforms that collide Persistent Memory regions between each other, specifically cases where ->start_pad needed to be non-zero. This lead to commit ae86cbfef381 "libnvdimm, pfn: Pad pfn namespaces relative to other regions". That commit allowed namespaces to be mapped with devm_memremap_pages(). However dax operations on those configurations currently fail if attempted within the ->start_pad range because pmem_device->data_offset was still relative to raw resource base not relative to the section aligned resource range mapped by devm_memremap_pages().
Luckily __bdev_dax_supported() caught these failures and simply disabled dax.
Let me make sure I understand the current state of things. Assume a machine with two persistent memory ranges overlapping the same hotplug memory section. Let's take the example from the ndctl github issue[1]:
187c000000-967bffffff : Persistent Memory
/sys/bus/nd/devices/region0/resource: 0x187c000000 /sys/bus/nd/devices/region1/resource: 0x577c000000
Create a namespace in region1. That namespace will have a start_pad of 64MiB. The problem is that, while the correct offset was specified when laying out the struct pages (via arch_add_memory), the data_offset for the pmem block device itself does not take the start_pad into account (despite the comment in the nd_pfn_sb data structure!). As a result, the block device starts at the beginning of the address range, but struct pages only exist for the address space starting 64MiB into the range. __bdev_dax_supported fails, because it tries to perform a direct_access call on sector 0, and there's no pgmap for the address corresponding to that sector.
So, we can't simply make the code correct (by adding the start_pad to pmem->data_offset) without bumping the superblock version, because that would change the size of the block device, and the location of data on that block device would all be off by 64MiB (and you'd lose the first 64MiB). Mass hysteria.
However, to fix this situation a non-backwards compatible change needs to be made to the interpretation of the nd_pfn info-block. ->start_pad needs to be accounted in ->map.map_offset (formerly ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be adjusted to the section aligned resource base used to establish ->map.map formerly (formerly ->virt_addr).
The guiding principles of the info-block compatibility fixup is to maintain the interpretation of ->data_offset for implementations like the EFI driver that only care about data_access not dax, but cause older Linux implementations that care about the mode and dax to fail to parse the new info-block.
What if the core mm grew support for hotplug on sub-section boundaries? Would't that fix this problem (and others)?
-Jeff
[ add linux-mm ]
On Thu, Feb 21, 2019 at 3:47 PM Jeff Moyer jmoyer@redhat.com wrote:
Hi, Dan,
Thanks for the comprehensive write-up. Comments below.
Dan Williams dan.j.williams@intel.com writes:
In the beginning the pmem driver simply passed the persistent memory resource range to memremap and was done. With the introduction of devm_memremap_pages() and vmem_altmap the implementation needed to contend with metadata at the start of the resource to indicate whether the vmemmap is located in System RAM or Persistent Memory, and reserve vmemmap capacity in pmem for the latter case.
The indication of metadata space was communicated in the nd_pfn->data_offset property and it was defined to be identical to the pmem_device->data_offset property, i.e. relative to the raw resource base of the namespace. Up until this point in the driver's development pmem_device->phys_addr == __pa(pmem_device->virt_addr). This implementation was fine up until the discovery of platforms with physical address layouts that mapped Persistent Memory and System RAM to the same Linux memory hotplug section (128MB span).
The nd_pfn->start_pad and nd_pfn->end_trunc properties were introduced to pad and truncate the capacity to fit within an exclusive Linux memory hotplug section span, and it was at this point where the ->start_pad definition did not comprehend the pmem_device->phys_addr to pmem_device->virt_addr relationship. Platforms in the wild typically only collided 'System RAM' at the end of the Persistent Memory range so ->start_pad was often zero.
Lately Linux has encountered platforms that collide Persistent Memory regions between each other, specifically cases where ->start_pad needed to be non-zero. This lead to commit ae86cbfef381 "libnvdimm, pfn: Pad pfn namespaces relative to other regions". That commit allowed namespaces to be mapped with devm_memremap_pages(). However dax operations on those configurations currently fail if attempted within the ->start_pad range because pmem_device->data_offset was still relative to raw resource base not relative to the section aligned resource range mapped by devm_memremap_pages().
Luckily __bdev_dax_supported() caught these failures and simply disabled dax.
Let me make sure I understand the current state of things. Assume a machine with two persistent memory ranges overlapping the same hotplug memory section. Let's take the example from the ndctl github issue[1]:
187c000000-967bffffff : Persistent Memory
/sys/bus/nd/devices/region0/resource: 0x187c000000 /sys/bus/nd/devices/region1/resource: 0x577c000000
Create a namespace in region1. That namespace will have a start_pad of 64MiB. The problem is that, while the correct offset was specified when laying out the struct pages (via arch_add_memory), the data_offset for the pmem block device itself does not take the start_pad into account (despite the comment in the nd_pfn_sb data structure!).
Unfortunately, yes.
As a result, the block device starts at the beginning of the address range, but struct pages only exist for the address space starting 64MiB into the range. __bdev_dax_supported fails, because it tries to perform a direct_access call on sector 0, and there's no pgmap for the address corresponding to that sector.
So, we can't simply make the code correct (by adding the start_pad to pmem->data_offset) without bumping the superblock version, because that would change the size of the block device, and the location of data on that block device would all be off by 64MiB (and you'd lose the first 64MiB). Mass hysteria.
Correct. Systems with this bug are working fine without DAX because everything is aligned in that case. We can't change the interpretation of the fields to make DAX work without losing access to existing data at the proper offsets through the non-DAX path.
However, to fix this situation a non-backwards compatible change needs to be made to the interpretation of the nd_pfn info-block. ->start_pad needs to be accounted in ->map.map_offset (formerly ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be adjusted to the section aligned resource base used to establish ->map.map formerly (formerly ->virt_addr).
The guiding principles of the info-block compatibility fixup is to maintain the interpretation of ->data_offset for implementations like the EFI driver that only care about data_access not dax, but cause older Linux implementations that care about the mode and dax to fail to parse the new info-block.
What if the core mm grew support for hotplug on sub-section boundaries? Would't that fix this problem (and others)?
Yes, I think it would, and I had patches along these lines [2]. Last time I looked at this I was asked by core-mm folks to await some general refactoring of hotplug [3], and I wasn't proud about some of the hacks I used to make it work. In general I'm less confident about being able to get sub-section-hotplug over the goal line (core-mm resistance to hotplug complexity) vs the local hacks in nvdimm to deal with this breakage.
Local hacks are always a sad choice, but I think leaving these configurations stranded for another kernel cycle is not tenable. It wasn't until the github issue did I realize that the problem was happening in the wild on NVDIMM-N platforms.
[2]: https://lore.kernel.org/lkml/148964440651.19438.2288075389153762985.stgit@dw... [3]: https://lore.kernel.org/lkml/20170319163531.GA25835@dhcp22.suse.cz/
-Jeff
Dan Williams dan.j.williams@intel.com writes:
However, to fix this situation a non-backwards compatible change needs to be made to the interpretation of the nd_pfn info-block. ->start_pad needs to be accounted in ->map.map_offset (formerly ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be adjusted to the section aligned resource base used to establish ->map.map formerly (formerly ->virt_addr).
The guiding principles of the info-block compatibility fixup is to maintain the interpretation of ->data_offset for implementations like the EFI driver that only care about data_access not dax, but cause older Linux implementations that care about the mode and dax to fail to parse the new info-block.
What if the core mm grew support for hotplug on sub-section boundaries? Would't that fix this problem (and others)?
Yes, I think it would, and I had patches along these lines [2]. Last time I looked at this I was asked by core-mm folks to await some general refactoring of hotplug [3], and I wasn't proud about some of the hacks I used to make it work. In general I'm less confident about being able to get sub-section-hotplug over the goal line (core-mm resistance to hotplug complexity) vs the local hacks in nvdimm to deal with this breakage.
You first posted that patch series in December of 2016. How long do we wait for this refactoring to happen?
Meanwhile, we've been kicking this can down the road for far too long. Simple namespace creation fails to work. For example:
# ndctl create-namespace -m fsdax -s 128m Error: '--size=' must align to interleave-width: 6 and alignment: 2097152 did you intend --size=132M?
failed to create namespace: Invalid argument
ok, I can't actually create a small, section-aligned namespace. Let's bump it up:
# ndctl create-namespace -m fsdax -s 132m { "dev":"namespace1.0", "mode":"fsdax", "map":"dev", "size":"126.00 MiB (132.12 MB)", "uuid":"2a5f8fe0-69e2-46bf-98bc-0f5667cd810a", "raw_uuid":"f7324317-5cd2-491e-8cd1-ad03770593f2", "sector_size":512, "blockdev":"pmem1", "numa_node":1 }
Great! Now let's create another one.
# ndctl create-namespace -m fsdax -s 132m libndctl: ndctl_pfn_enable: pfn1.1: failed to enable Error: namespace1.2: failed to enable
failed to create namespace: No such device or address
(along with a kernel warning spew)
And at this point, all further ndctl create-namespace commands fail. Lovely. This is a wart that was acceptable only because a fix was coming. 2+ years later, and we're still adding hacks to work around it (and there have been *several* hacks).
Local hacks are always a sad choice, but I think leaving these configurations stranded for another kernel cycle is not tenable. It wasn't until the github issue did I realize that the problem was happening in the wild on NVDIMM-N platforms.
I understand the desire for expediency. At some point, though, we have to address the root of the problem.
-Jeff
-Jeff
On Fri, Feb 22, 2019 at 7:42 AM Jeff Moyer jmoyer@redhat.com wrote:
Dan Williams dan.j.williams@intel.com writes:
However, to fix this situation a non-backwards compatible change needs to be made to the interpretation of the nd_pfn info-block. ->start_pad needs to be accounted in ->map.map_offset (formerly ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be adjusted to the section aligned resource base used to establish ->map.map formerly (formerly ->virt_addr).
The guiding principles of the info-block compatibility fixup is to maintain the interpretation of ->data_offset for implementations like the EFI driver that only care about data_access not dax, but cause older Linux implementations that care about the mode and dax to fail to parse the new info-block.
What if the core mm grew support for hotplug on sub-section boundaries? Would't that fix this problem (and others)?
Yes, I think it would, and I had patches along these lines [2]. Last time I looked at this I was asked by core-mm folks to await some general refactoring of hotplug [3], and I wasn't proud about some of the hacks I used to make it work. In general I'm less confident about being able to get sub-section-hotplug over the goal line (core-mm resistance to hotplug complexity) vs the local hacks in nvdimm to deal with this breakage.
You first posted that patch series in December of 2016. How long do we wait for this refactoring to happen?
Meanwhile, we've been kicking this can down the road for far too long. Simple namespace creation fails to work. For example:
# ndctl create-namespace -m fsdax -s 128m Error: '--size=' must align to interleave-width: 6 and alignment: 2097152 did you intend --size=132M?
failed to create namespace: Invalid argument
ok, I can't actually create a small, section-aligned namespace. Let's bump it up:
# ndctl create-namespace -m fsdax -s 132m { "dev":"namespace1.0", "mode":"fsdax", "map":"dev", "size":"126.00 MiB (132.12 MB)", "uuid":"2a5f8fe0-69e2-46bf-98bc-0f5667cd810a", "raw_uuid":"f7324317-5cd2-491e-8cd1-ad03770593f2", "sector_size":512, "blockdev":"pmem1", "numa_node":1 }
Great! Now let's create another one.
# ndctl create-namespace -m fsdax -s 132m libndctl: ndctl_pfn_enable: pfn1.1: failed to enable Error: namespace1.2: failed to enable
failed to create namespace: No such device or address
(along with a kernel warning spew)
I assume you're seeing this on the libnvdimm-pending branch?
And at this point, all further ndctl create-namespace commands fail. Lovely. This is a wart that was acceptable only because a fix was coming. 2+ years later, and we're still adding hacks to work around it (and there have been *several* hacks).
True.
Local hacks are always a sad choice, but I think leaving these configurations stranded for another kernel cycle is not tenable. It wasn't until the github issue did I realize that the problem was happening in the wild on NVDIMM-N platforms.
I understand the desire for expediency. At some point, though, we have to address the root of the problem.
Well, you've defibrillated me back to reality. We've suffered the incomplete broken hacks for 2 years, what's another 10 weeks? I'll dust off the sub-section patches and take another run at it.
Dan Williams dan.j.williams@intel.com writes:
Great! Now let's create another one.
# ndctl create-namespace -m fsdax -s 132m libndctl: ndctl_pfn_enable: pfn1.1: failed to enable Error: namespace1.2: failed to enable
failed to create namespace: No such device or address
(along with a kernel warning spew)
I assume you're seeing this on the libnvdimm-pending branch?
Yes, but also on linus' master branch. Things have been operating in this manner for some time.
I understand the desire for expediency. At some point, though, we have to address the root of the problem.
Well, you've defibrillated me back to reality. We've suffered the incomplete broken hacks for 2 years, what's another 10 weeks? I'll dust off the sub-section patches and take another run at it.
OK, thanks. Let me know if I can help at all.
Cheers, Jeff
On Tue, Feb 12, 2019 at 1:37 PM Dan Williams dan.j.williams@intel.com wrote:
Lately Linux has encountered platforms that collide Persistent Memory regions between each other, specifically cases where ->start_pad needed to be non-zero. This lead to commit ae86cbfef381 "libnvdimm, pfn: Pad pfn namespaces relative to other regions". That commit allowed namespaces to be mapped with devm_memremap_pages(). However dax operations on those configurations currently fail if attempted within the ->start_pad range because pmem_device->data_offset was still relative to raw resource base not relative to the section aligned resource range mapped by devm_memremap_pages().
Luckily __bdev_dax_supported() caught these failures and simply disabled dax. However, to fix this situation a non-backwards compatible change needs to be made to the interpretation of the nd_pfn info-block. ->start_pad needs to be accounted in ->map.map_offset (formerly ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be adjusted to the section aligned resource base used to establish ->map.map formerly (formerly ->virt_addr).
See patch 7 "libnvdimm/pfn: Fix 'start_pad' implementation" for more details, and the ndctl patch series "Improve support + testing for labels + info-blocks" for the corresponding regression test.
Hello valued reviewers, can I plead for a sanity check of at least "libnvdimm/pfn: Introduce super-block minimum version requirements" and "libnvdimm/pfn: Fix 'start_pad' implementation"? In particular Jeff / Johannes this has end user / distro impact in that users may lose access to namespaces that are upgraded to v1.3 info-blocks and then boot an old kernel. I did not see a way around that sharp edge.
Dan Williams (7): libnvdimm/pfn: Account for PAGE_SIZE > info-block-size in nd_pfn_init() libnvdimm/pmem: Honor force_raw for legacy pmem regions dax: Check the end of the block-device capacity with dax_direct_access() libnvdimm/pfn: Introduce super-block minimum version requirements libnvdimm/pfn: Remove dax_label_reserve libnvdimm/pfn: Introduce 'struct pfn_map_info' libnvdimm/pfn: Fix 'start_pad' implementation
drivers/dax/pmem.c | 9 +- drivers/dax/super.c | 39 ++++++-- drivers/nvdimm/namespace_devs.c | 4 + drivers/nvdimm/nd.h | 15 +++ drivers/nvdimm/pfn.h | 4 + drivers/nvdimm/pfn_devs.c | 181 ++++++++++++++++++++++++++++----------- drivers/nvdimm/pmem.c | 111 +++++++++++------------- drivers/nvdimm/pmem.h | 12 --- tools/testing/nvdimm/pmem-dax.c | 15 ++- 9 files changed, 244 insertions(+), 146 deletions(-)
On 20/02/2019 18:11, Dan Williams wrote:
Hello valued reviewers, can I plead for a sanity check of at least "libnvdimm/pfn: Introduce super-block minimum version requirements" and "libnvdimm/pfn: Fix 'start_pad' implementation"? In particular Jeff / Johannes this has end user / distro impact in that users may lose access to namespaces that are upgraded to v1.3 info-blocks and then boot an old kernel. I did not see a way around that sharp edge.
Oh sorry it was on my todo list and then I forgot about it again, I'll give it a shot tomorrow.
Byte, Johannes
Dan Williams dan.j.williams@intel.com writes:
On Tue, Feb 12, 2019 at 1:37 PM Dan Williams dan.j.williams@intel.com wrote:
Lately Linux has encountered platforms that collide Persistent Memory regions between each other, specifically cases where ->start_pad needed to be non-zero. This lead to commit ae86cbfef381 "libnvdimm, pfn: Pad pfn namespaces relative to other regions". That commit allowed namespaces to be mapped with devm_memremap_pages(). However dax operations on those configurations currently fail if attempted within the ->start_pad range because pmem_device->data_offset was still relative to raw resource base not relative to the section aligned resource range mapped by devm_memremap_pages().
Luckily __bdev_dax_supported() caught these failures and simply disabled dax. However, to fix this situation a non-backwards compatible change needs to be made to the interpretation of the nd_pfn info-block. ->start_pad needs to be accounted in ->map.map_offset (formerly ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be adjusted to the section aligned resource base used to establish ->map.map formerly (formerly ->virt_addr).
See patch 7 "libnvdimm/pfn: Fix 'start_pad' implementation" for more details, and the ndctl patch series "Improve support + testing for labels + info-blocks" for the corresponding regression test.
Hello valued reviewers, can I plead for a sanity check of at least "libnvdimm/pfn: Introduce super-block minimum version requirements" and "libnvdimm/pfn: Fix 'start_pad' implementation"? In particular Jeff / Johannes this has end user / distro impact in that users may lose access to namespaces that are upgraded to v1.3 info-blocks and then boot an old kernel. I did not see a way around that sharp edge.
Yes, I'll take a look.
Cheers, Jeff
linux-stable-mirror@lists.linaro.org