On Wed, Oct 23, 2024 at 03:12:07PM +0200, Robert Richter wrote:
On 22.10.24 18:43:15, Dan Williams wrote:
Changes since v1 [1]:
- Fix some misspellings missed by checkpatch in changelogs (Jonathan)
- Add comments explaining the order of objects in drivers/cxl/Makefile (Jonathan)
- Rename attach_device => cxl_rescan_attach (Jonathan)
- Fixup Zijun's email (Zijun)
Original cover:
Gregory's modest proposal to fix CXL cxl_mem_probe() failures due to delayed arrival of the CXL "root" infrastructure [1] prompted questions of how the existing mechanism for retrying cxl_mem_probe() could be failing.
I found a similar issue with the region creation.
A region is created with the first endpoint found and immediately added as device which triggers cxl_region_probe(). Now, in interleaving setups the region state comes into commit state only after the last endpoint was probed. So the probe must be repeated until all endpoints were enumerated. I ended up with this change:
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c index a07b62254596..c78704e435e5 100644 --- a/drivers/cxl/core/region.c +++ b/drivers/cxl/core/region.c @@ -3775,8 +3775,8 @@ static int cxl_region_probe(struct device *dev) } if (p->state < CXL_CONFIG_COMMIT) {
dev_dbg(&cxlr->dev, "config state: %d\n", p->state);
rc = -ENXIO;
rc = dev_err_probe(&cxlr->dev, -EPROBE_DEFER,
goto out; }"region config state: %d\n", p->state);
I have not experienced any out of order operations since applying Dan's v1 of this patch set, do you still see this after applying the existing set?
Probably this is indicative of needing another SOFTDEP / ordering issue.
~Gregory
-- 2.39.5
I don't see an init order issue here as the mem module is always up before the regions are probed.
-Robert
The critical missing piece in the debug was that Gregory's setup had almost all CXL modules built-in to the kernel.
On the way to that discovery several other bugs and init-order corner cases were discovered.
The main fix is to make sure the drivers/cxl/Makefile object order supports root CXL ports being fully initialized upon cxl_acpi_probe() exit. The modular case has some similar potential holes that are fixed with MODULE_SOFTDEP() and other fix ups. Finally, an attempt to update cxl_test to reproduce the original report resulted in the discovery of a separate long standing use after free bug in cxl_region_detach().
Dan Williams (6): cxl/port: Fix CXL port initialization order when the subsystem is built-in cxl/port: Fix cxl_bus_rescan() vs bus_rescan_devices() cxl/acpi: Ensure ports ready at cxl_acpi_probe() return cxl/port: Fix use-after-free, permit out-of-order decoder shutdown cxl/port: Prevent out-of-order decoder allocation cxl/test: Improve init-order fidelity relative to real-world systems
drivers/base/core.c | 35 +++++++ drivers/cxl/Kconfig | 1 drivers/cxl/Makefile | 20 +++- drivers/cxl/acpi.c | 7 + drivers/cxl/core/hdm.c | 50 +++++++++-- drivers/cxl/core/port.c | 13 ++- drivers/cxl/core/region.c | 91 ++++++++++--------- drivers/cxl/cxl.h | 3 - include/linux/device.h | 3 + tools/testing/cxl/test/cxl.c | 200 +++++++++++++++++++++++------------------- tools/testing/cxl/test/mem.c | 1 11 files changed, 269 insertions(+), 155 deletions(-)
base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b