On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
Summary:
CXL RAM support allows for the dynamic provisioning of new CXL RAM regions, and more routinely, assembling a region from an existing configuration established by platform-firmware. The latter is motivated by CXL memory RAS (Reliability, Availability and Serviceability) support, which requires associating device events with System Physical Address ranges and vice versa.
The 'Soft Reserved' policy rework arranges for performance differentiated memory like CXL attached DRAM, or high-bandwidth memory, to be designated for 'System RAM' by default, rather than the device-dax dedicated access mode. The current device-dax default is confusing and surprising for the majority of users, who do not expect memory to be quarantined for dedicated access by default. Most users expect all 'System RAM'-capable memory to show up in free(1).
Details:
Recall that the Linux 'Soft Reserved' designation for memory is a reaction to platform-firmware, like EFI EDK2, delineating memory with the EFI Specific Purpose Memory attribute (EFI_MEMORY_SP). An alternative way to think of that attribute is that it specifies the *not* general-purpose memory pool. It is memory that may be too precious for general usage or not performant enough for some hot data structures. However, in the absence of explicit policy it should just be 'System RAM' by default.
Rather than require every distribution to ship a udev policy to assign dax devices to dax_kmem (the device-memory hotplug driver) just make that the kernel default. This is similar to the rationale in:
commit 8604d9e534a3 ("memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE")
With this change, the relatively niche use case of accessing this memory via mapping a device-dax instance can be achieved by building with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, or specifying memhp_default_state=offline at boot, and then using:
daxctl reconfigure-device $device -m devdax --force
...to shift the corresponding address range to device-dax access.
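As an illustration of the precedence described above, here is a minimal sketch (not kernel code) of how the boot parameter and the build option combine. The function names are hypothetical, and treating a later memhp_default_state= occurrence as overriding an earlier one is an assumption of this sketch.

```python
VALID_STATES = {"online", "offline", "online_kernel", "online_movable"}


def effective_memhp_default_state(cmdline: str, config_default_online: bool) -> str:
    """Return the hotplug onlining policy implied by a kernel command line.

    memhp_default_state= on the command line takes precedence; otherwise
    the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE build option decides.
    """
    state = "online" if config_default_online else "offline"
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "memhp_default_state" and sep and value in VALID_STATES:
            state = value  # assumption: the last occurrence wins
    return state


def stays_offline_for_devdax(cmdline: str, config_default_online: bool) -> bool:
    """True when hot-added memory is left offline by default."""
    return effective_memhp_default_state(cmdline, config_default_online) == "offline"
```

When the second helper returns True, the range has not been onlined as 'System RAM', which is the state the daxctl command above expects to start from.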
The process of assembling a device-dax instance for a given CXL region device configuration is similar to the process of assembling a Device-Mapper or MDRAID storage-device array. Specifically, asynchronous probing by the PCI and driver core enumerates all CXL endpoints and their decoders. Then, once enough decoders have arrived to describe a given region, that region is passed to the device-dax subsystem where it is subject to the above 'dax_kmem' policy. This assignment and policy choice is only possible if memory is set aside by the 'Soft Reserved' designation. Otherwise, CXL memory that is mapped as 'System RAM' becomes immutable by CXL driver mechanisms, but is still enumerated for RAS purposes.
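The "wait until enough decoders have arrived" step can be pictured with a toy model. This is only a sketch of the idea, not the code in drivers/cxl/core/region.c; the class names, the method name, and the choice to key regions by their address range are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PendingRegion:
    """Toy stand-in for a region still waiting on endpoint decoders."""
    interleave_ways: int
    targets: dict = field(default_factory=dict)  # position -> decoder name

    def complete(self) -> bool:
        return len(self.targets) == self.interleave_ways


class RegionAssembler:
    """Collects decoders in arrival order and reports a region once it is whole."""

    def __init__(self):
        self.pending = {}  # address range -> PendingRegion

    def decoder_arrived(self, hpa_range, ways, position, decoder):
        """Record one decoder; return targets in position order once complete, else None."""
        if not 0 <= position < ways:
            raise ValueError("position outside interleave set")
        region = self.pending.setdefault(hpa_range, PendingRegion(ways))
        if region.interleave_ways != ways:
            raise ValueError("decoders disagree on interleave ways")
        if position in region.targets:
            raise ValueError("position already claimed")
        region.targets[position] = decoder
        if not region.complete():
            return None
        # All positions filled: hand the region off (here, just return it).
        del self.pending[hpa_range]
        return [region.targets[i] for i in range(ways)]
```

Decoders may arrive in any order, as with asynchronous probing; only the call that fills the last open position returns the assembled target list.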
This series is also available via:
https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/c...
...and has gone through some preview testing in various forms.
Tested-by: Fan Ni <fan.ni@samsung.com>
Ran the following tests with the patch (with volatile support in QEMU). Note: CXL-related code is compiled as modules and loaded before use.
For pmem setup, tried three topologies (1HB1RP1Mem, 1HB2RP2Mem, 1HB2RP4Mem with a CXL switch). The memdev is either provided on the command line when launching QEMU or hot added to the guest with the device_add command in the QEMU monitor.
The following operations were performed:
1. create-region with cxl cmd
2. create name-space with ndctl cmd
3. convert cxl mem to ram with daxctl cmd
4. online the memory with daxctl cmd
5. let app use the memory (numactl --membind=1 htop)
Results: No regression.
For volatile memory (hot add with the device_add command), mainly tested the 1HB1RP1Mem case (passthrough):
1. The device is correctly discovered after hot add (cxl list, may need cxl enable-memdev).
2. Creating a ram region (cxl create-region) succeeded; after creating the region, a dax device is shown under /dev/.
3. Onlining the memory passes, and the memory is shown on another NUMA node.
4. Letting an app use the memory (numactl --membind=1 htop) passed.
Dan Williams (18):
      cxl/Documentation: Update references to attributes added in v6.0
      cxl/region: Add a mode attribute for regions
      cxl/region: Support empty uuids for non-pmem regions
      cxl/region: Validate region mode vs decoder mode
      cxl/region: Add volatile region creation support
      cxl/region: Refactor attach_target() for autodiscovery
      cxl/region: Move region-position validation to a helper
      kernel/range: Uplevel the cxl subsystem's range_contains() helper
      cxl/region: Enable CONFIG_CXL_REGION to be toggled
      cxl/region: Fix passthrough-decoder detection
      cxl/region: Add region autodiscovery
      tools/testing/cxl: Define a fixed volatile configuration to parse
      dax/hmem: Move HMAT and Soft reservation probe initcall level
      dax/hmem: Drop unnecessary dax_hmem_remove()
      dax/hmem: Convey the dax range via memregion_info()
      dax/hmem: Move hmem device registration to dax_hmem.ko
      dax: Assign RAM regions to memory-hotplug by default
      cxl/dax: Create dax devices for CXL RAM regions
 Documentation/ABI/testing/sysfs-bus-cxl |   64 +-
 MAINTAINERS                             |    1
 drivers/acpi/numa/hmat.c                |    4
 drivers/cxl/Kconfig                     |   12
 drivers/cxl/acpi.c                      |    3
 drivers/cxl/core/core.h                 |    7
 drivers/cxl/core/hdm.c                  |    8
 drivers/cxl/core/pci.c                  |    5
 drivers/cxl/core/port.c                 |   34 +
 drivers/cxl/core/region.c               |  848 ++++++++++++++++++++++++++++---
 drivers/cxl/cxl.h                       |   46 ++
 drivers/cxl/cxlmem.h                    |    3
 drivers/cxl/port.c                      |   26 +
 drivers/dax/Kconfig                     |   17 +
 drivers/dax/Makefile                    |    2
 drivers/dax/bus.c                       |   53 +-
 drivers/dax/bus.h                       |   12
 drivers/dax/cxl.c                       |   53 ++
 drivers/dax/device.c                    |    3
 drivers/dax/hmem/Makefile               |    3
 drivers/dax/hmem/device.c               |  102 ++--
 drivers/dax/hmem/hmem.c                 |  148 +++++
 drivers/dax/kmem.c                      |    1
 include/linux/dax.h                     |    7
 include/linux/memregion.h               |    2
 include/linux/range.h                   |    5
 lib/stackinit_kunit.c                   |    6
 tools/testing/cxl/test/cxl.c            |  146 +++++
 28 files changed, 1355 insertions(+), 266 deletions(-)
 create mode 100644 drivers/dax/cxl.c
base-commit: 172738bbccdb4ef76bdd72fc72a315c741c39161