Changes since v1 [1]: - Fix some misspellings missed by checkpatch in changelogs (Jonathan) - Add comments explaining the order of objects in drivers/cxl/Makefile (Jonathan) - Rename attach_device => cxl_rescan_attach (Jonathan) - Fixup Zijun's email (Zijun)
[1]: http://lore.kernel.org/172862483180.2150669.5564474284074502692.stgit@dwilli...
---
Original cover:
Gregory's modest proposal to fix CXL cxl_mem_probe() failures due to delayed arrival of the CXL "root" infrastructure [1] prompted questions of how the existing mechanism for retrying cxl_mem_probe() could be failing.
The critical missing piece in the debug was that Gregory's setup had almost all CXL modules built-in to the kernel.
On the way to that discovery several other bugs and init-order corner cases were discovered.
The main fix is to make sure the drivers/cxl/Makefile object order supports root CXL ports being fully initialized upon cxl_acpi_probe() exit. The modular case has some similar potential holes that are fixed with MODULE_SOFTDEP() and other fix ups. Finally, an attempt to update cxl_test to reproduce the original report resulted in the discovery of a separate long standing use after free bug in cxl_region_detach().
[2]: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net
---
Dan Williams (6): cxl/port: Fix CXL port initialization order when the subsystem is built-in cxl/port: Fix cxl_bus_rescan() vs bus_rescan_devices() cxl/acpi: Ensure ports ready at cxl_acpi_probe() return cxl/port: Fix use-after-free, permit out-of-order decoder shutdown cxl/port: Prevent out-of-order decoder allocation cxl/test: Improve init-order fidelity relative to real-world systems
drivers/base/core.c | 35 +++++++ drivers/cxl/Kconfig | 1 drivers/cxl/Makefile | 20 +++- drivers/cxl/acpi.c | 7 + drivers/cxl/core/hdm.c | 50 +++++++++-- drivers/cxl/core/port.c | 13 ++- drivers/cxl/core/region.c | 91 ++++++++++--------- drivers/cxl/cxl.h | 3 - include/linux/device.h | 3 + tools/testing/cxl/test/cxl.c | 200 +++++++++++++++++++++++------------------- tools/testing/cxl/test/mem.c | 1 11 files changed, 269 insertions(+), 155 deletions(-)
base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
When the CXL subsystem is built-in the module init order is determined by Makefile order. That order violates expectations. The expectation is that cxl_acpi and cxl_mem can race to attach and that if cxl_acpi wins the race cxl_mem will find the enabled CXL root ports it needs and if cxl_acpi loses the race it will retrigger cxl_mem to attach via cxl_bus_rescan(). That only works if cxl_acpi can assume ports are enabled immediately upon cxl_acpi_probe() return. That in turn can only happen in the CONFIG_CXL_ACPI=y case if the cxl_port object appears before the cxl_acpi object in the Makefile.
Fix up the order to prevent initialization failures, and make sure that cxl_port is built-in if cxl_acpi is also built-in.
As for what contributed to this not being found earlier, the CXL regression environment, cxl_test, builds all CXL functionality as a module to allow to symbol mocking and other dynamic reload tests. As a result there is no regression coverage for the built-in case.
Reported-by: Gregory Price gourry@gourry.net Closes: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net Tested-by: Gregory Price gourry@gourry.net Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver") Cc: stable@vger.kernel.org Cc: Davidlohr Bueso dave@stgolabs.net Cc: Jonathan Cameron jonathan.cameron@huawei.com Cc: Dave Jiang dave.jiang@intel.com Cc: Alison Schofield alison.schofield@intel.com Cc: Vishal Verma vishal.l.verma@intel.com Cc: Ira Weiny ira.weiny@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com --- drivers/cxl/Kconfig | 1 + drivers/cxl/Makefile | 20 ++++++++++++++------ 2 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig index 29c192f20082..876469e23f7a 100644 --- a/drivers/cxl/Kconfig +++ b/drivers/cxl/Kconfig @@ -60,6 +60,7 @@ config CXL_ACPI default CXL_BUS select ACPI_TABLE_LIB select ACPI_HMAT + select CXL_PORT help Enable support for host managed device memory (HDM) resources published by a platform's ACPI CXL memory layout description. See diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile index db321f48ba52..2caa90fa4bf2 100644 --- a/drivers/cxl/Makefile +++ b/drivers/cxl/Makefile @@ -1,13 +1,21 @@ # SPDX-License-Identifier: GPL-2.0 + +# Order is important here for the built-in case: +# - 'core' first for fundamental init +# - 'port' before platform root drivers like 'acpi' so that CXL-root ports +# are immediately enabled +# - 'mem' and 'pmem' before endpoint drivers so that memdevs are +# immediately enabled +# - 'pci' last, also mirrors the hardware enumeration hierarchy obj-y += core/ -obj-$(CONFIG_CXL_PCI) += cxl_pci.o -obj-$(CONFIG_CXL_MEM) += cxl_mem.o +obj-$(CONFIG_CXL_PORT) += cxl_port.o obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o -obj-$(CONFIG_CXL_PORT) += cxl_port.o +obj-$(CONFIG_CXL_MEM) += cxl_mem.o +obj-$(CONFIG_CXL_PCI) += cxl_pci.o
-cxl_mem-y := mem.o -cxl_pci-y := pci.o +cxl_port-y := port.o cxl_acpi-y := acpi.o cxl_pmem-y := pmem.o security.o -cxl_port-y := port.o +cxl_mem-y := mem.o +cxl_pci-y := pci.o
On Tue, 22 Oct 2024 18:43:24 -0700 Dan Williams dan.j.williams@intel.com wrote:
When the CXL subsystem is built-in the module init order is determined by Makefile order. That order violates expectations. The expectation is that cxl_acpi and cxl_mem can race to attach and that if cxl_acpi wins the race cxl_mem will find the enabled CXL root ports it needs and if cxl_acpi loses the race it will retrigger cxl_mem to attach via cxl_bus_rescan(). That only works if cxl_acpi can assume ports are enabled immediately upon cxl_acpi_probe() return. That in turn can only happen in the CONFIG_CXL_ACPI=y case if the cxl_port object appears before the cxl_acpi object in the Makefile.
Fix up the order to prevent initialization failures, and make sure that cxl_port is built-in if cxl_acpi is also built-in.
As for what contributed to this not being found earlier, the CXL regression environment, cxl_test, builds all CXL functionality as a module to allow to symbol mocking and other dynamic reload tests. As a result there is no regression coverage for the built-in case.
Reported-by: Gregory Price gourry@gourry.net Closes: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net Tested-by: Gregory Price gourry@gourry.net Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver")
I don't like this due to likely long term fragility, but any other solution is probably more painful. Long term we should really get a regression test for these ordering issues in place in one of the CIs.
Reviewed-by: Jonathan Cameron Jonathan.Cameron@huawei.com
Cc: stable@vger.kernel.org Cc: Davidlohr Bueso dave@stgolabs.net Cc: Jonathan Cameron jonathan.cameron@huawei.com Cc: Dave Jiang dave.jiang@intel.com Cc: Alison Schofield alison.schofield@intel.com Cc: Vishal Verma vishal.l.verma@intel.com Cc: Ira Weiny ira.weiny@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com
drivers/cxl/Kconfig | 1 + drivers/cxl/Makefile | 20 ++++++++++++++------ 2 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig index 29c192f20082..876469e23f7a 100644 --- a/drivers/cxl/Kconfig +++ b/drivers/cxl/Kconfig @@ -60,6 +60,7 @@ config CXL_ACPI default CXL_BUS select ACPI_TABLE_LIB select ACPI_HMAT
- select CXL_PORT help Enable support for host managed device memory (HDM) resources published by a platform's ACPI CXL memory layout description. See
diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile index db321f48ba52..2caa90fa4bf2 100644 --- a/drivers/cxl/Makefile +++ b/drivers/cxl/Makefile @@ -1,13 +1,21 @@ # SPDX-License-Identifier: GPL-2.0
+# Order is important here for the built-in case: +# - 'core' first for fundamental init +# - 'port' before platform root drivers like 'acpi' so that CXL-root ports +# are immediately enabled +# - 'mem' and 'pmem' before endpoint drivers so that memdevs are +# immediately enabled +# - 'pci' last, also mirrors the hardware enumeration hierarchy obj-y += core/ -obj-$(CONFIG_CXL_PCI) += cxl_pci.o -obj-$(CONFIG_CXL_MEM) += cxl_mem.o +obj-$(CONFIG_CXL_PORT) += cxl_port.o obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o -obj-$(CONFIG_CXL_PORT) += cxl_port.o +obj-$(CONFIG_CXL_MEM) += cxl_mem.o +obj-$(CONFIG_CXL_PCI) += cxl_pci.o -cxl_mem-y := mem.o -cxl_pci-y := pci.o +cxl_port-y := port.o cxl_acpi-y := acpi.o cxl_pmem-y := pmem.o security.o -cxl_port-y := port.o +cxl_mem-y := mem.o +cxl_pci-y := pci.o
Jonathan Cameron wrote:
On Tue, 22 Oct 2024 18:43:24 -0700 Dan Williams dan.j.williams@intel.com wrote:
When the CXL subsystem is built-in the module init order is determined by Makefile order. That order violates expectations. The expectation is that cxl_acpi and cxl_mem can race to attach and that if cxl_acpi wins the race cxl_mem will find the enabled CXL root ports it needs and if cxl_acpi loses the race it will retrigger cxl_mem to attach via cxl_bus_rescan(). That only works if cxl_acpi can assume ports are enabled immediately upon cxl_acpi_probe() return. That in turn can only happen in the CONFIG_CXL_ACPI=y case if the cxl_port object appears before the cxl_acpi object in the Makefile.
Fix up the order to prevent initialization failures, and make sure that cxl_port is built-in if cxl_acpi is also built-in.
As for what contributed to this not being found earlier, the CXL regression environment, cxl_test, builds all CXL functionality as a module to allow to symbol mocking and other dynamic reload tests. As a result there is no regression coverage for the built-in case.
Reported-by: Gregory Price gourry@gourry.net Closes: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net Tested-by: Gregory Price gourry@gourry.net Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver")
I don't like this due to likely long term fragility, but any other
Please be specific about the fragility and how is this different than any other Makefile order fragility, like the many cases in drivers/Makefile/, or patches fixing up initcall order?
Now, an argument can be made that there are too many CXL sub-objects and more can be merged into a monolithic cxl_core object. The flipside of that is reduced testability, at least via symbol mocking techniques. Just look at the recent case where the fact that drivers/cxl/core/region.c is built into cxl_core.o rather than its own cxl_region.o object results in an in-line code change to support cxl_test [1]. There are tradeoffs.
solution is probably more painful. Long term we should really get a regression test for these ordering issues in place in one of the CIs.
The final patch in this series does improve cxl_test's ability to catch regressions in module init order, and that ordering change did uncover a bug. The system works! 😅
Going further the test mode that is needed, in addition to QEMU emulation and cxl_test interface mocking, is kunit or similar [2] infrastructure for some function-scope unit tests.
[1]: http://lore.kernel.org/20241022030054.258942-1-lizhijian@fujitsu.com [2]: http://lore.kernel.org/170795677066.3697776.12587812713093908173.stgit@ubunt...
On Thu, 24 Oct 2024 09:19:17 -0700 Dan Williams dan.j.williams@intel.com wrote:
Jonathan Cameron wrote:
On Tue, 22 Oct 2024 18:43:24 -0700 Dan Williams dan.j.williams@intel.com wrote:
When the CXL subsystem is built-in the module init order is determined by Makefile order. That order violates expectations. The expectation is that cxl_acpi and cxl_mem can race to attach and that if cxl_acpi wins the race cxl_mem will find the enabled CXL root ports it needs and if cxl_acpi loses the race it will retrigger cxl_mem to attach via cxl_bus_rescan(). That only works if cxl_acpi can assume ports are enabled immediately upon cxl_acpi_probe() return. That in turn can only happen in the CONFIG_CXL_ACPI=y case if the cxl_port object appears before the cxl_acpi object in the Makefile.
Fix up the order to prevent initialization failures, and make sure that cxl_port is built-in if cxl_acpi is also built-in.
As for what contributed to this not being found earlier, the CXL regression environment, cxl_test, builds all CXL functionality as a module to allow to symbol mocking and other dynamic reload tests. As a result there is no regression coverage for the built-in case.
Reported-by: Gregory Price gourry@gourry.net Closes: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net Tested-by: Gregory Price gourry@gourry.net Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver")
I don't like this due to likely long term fragility, but any other
Please be specific about the fragility and how is this different than any other Makefile order fragility, like the many cases in drivers/Makefile/, or patches fixing up initcall order?
Sure, I don't like any of them ;) Mostly was having a grumpy day rather than suggesting a change in this.
Now, an argument can be made that there are too many CXL sub-objects and more can be merged into a monolithic cxl_core object. The flipside of that is reduced testability, at least via symbol mocking techniques. Just look at the recent case where the fact that drivers/cxl/core/region.c is built into cxl_core.o rather than its own cxl_region.o object results in an in-line code change to support cxl_test [1]. There are tradeoffs.
Absolutely agree.
solution is probably more painful. Long term we should really get a regression test for these ordering issues in place in one of the CIs.
The final patch in this series does improve cxl_test's ability to catch regressions in module init order, and that ordering change did uncover a bug. The system works! 😅
That was indeed good to see!
Going further the test mode that is needed, in addition to QEMU emulation and cxl_test interface mocking, is kunit or similar [2] infrastructure for some function-scope unit tests.
Yeah, we have a long way to go on testing (common problem!)
Jonathan
On 10/23/24 02:43, Dan Williams wrote:
When the CXL subsystem is built-in the module init order is determined by Makefile order. That order violates expectations. The expectation is that cxl_acpi and cxl_mem can race to attach and that if cxl_acpi wins the race cxl_mem will find the enabled CXL root ports it needs and if cxl_acpi loses the race it will retrigger cxl_mem to attach via cxl_bus_rescan(). That only works if cxl_acpi can assume ports are enabled immediately upon cxl_acpi_probe() return. That in turn can only happen in the CONFIG_CXL_ACPI=y case if the cxl_port object appears before the cxl_acpi object in the Makefile.
I'm having problems with understanding this. The acpi module is initialised following the initcall levels, as defined by the code with the subsys_initcall(cxl_acpi_init), and the cxl_mem module is not, so AFAIK, there should not be any race there with the acpi module always being initialised first. It I'm right, the problem should be another one we do not know yet ...
Fix up the order to prevent initialization failures, and make sure that cxl_port is built-in if cxl_acpi is also built-in.
... or forcing cxl_port to be built-in is enough. I wonder how, without it, the cxl root ports can be there for cxl_mem ...
As for what contributed to this not being found earlier, the CXL regression environment, cxl_test, builds all CXL functionality as a module to allow to symbol mocking and other dynamic reload tests. As a result there is no regression coverage for the built-in case.
Reported-by: Gregory Price gourry@gourry.net Closes: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net Tested-by: Gregory Price gourry@gourry.net Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver") Cc: stable@vger.kernel.org Cc: Davidlohr Bueso dave@stgolabs.net Cc: Jonathan Cameron jonathan.cameron@huawei.com Cc: Dave Jiang dave.jiang@intel.com Cc: Alison Schofield alison.schofield@intel.com Cc: Vishal Verma vishal.l.verma@intel.com Cc: Ira Weiny ira.weiny@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com
drivers/cxl/Kconfig | 1 + drivers/cxl/Makefile | 20 ++++++++++++++------ 2 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig index 29c192f20082..876469e23f7a 100644 --- a/drivers/cxl/Kconfig +++ b/drivers/cxl/Kconfig @@ -60,6 +60,7 @@ config CXL_ACPI default CXL_BUS select ACPI_TABLE_LIB select ACPI_HMAT
- select CXL_PORT help Enable support for host managed device memory (HDM) resources published by a platform's ACPI CXL memory layout description. See
diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile index db321f48ba52..2caa90fa4bf2 100644 --- a/drivers/cxl/Makefile +++ b/drivers/cxl/Makefile @@ -1,13 +1,21 @@ # SPDX-License-Identifier: GPL-2.0
+# Order is important here for the built-in case: +# - 'core' first for fundamental init +# - 'port' before platform root drivers like 'acpi' so that CXL-root ports +# are immediately enabled +# - 'mem' and 'pmem' before endpoint drivers so that memdevs are +# immediately enabled +# - 'pci' last, also mirrors the hardware enumeration hierarchy obj-y += core/ -obj-$(CONFIG_CXL_PCI) += cxl_pci.o -obj-$(CONFIG_CXL_MEM) += cxl_mem.o +obj-$(CONFIG_CXL_PORT) += cxl_port.o obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o -obj-$(CONFIG_CXL_PORT) += cxl_port.o +obj-$(CONFIG_CXL_MEM) += cxl_mem.o +obj-$(CONFIG_CXL_PCI) += cxl_pci.o -cxl_mem-y := mem.o -cxl_pci-y := pci.o +cxl_port-y := port.o cxl_acpi-y := acpi.o cxl_pmem-y := pmem.o security.o -cxl_port-y := port.o +cxl_mem-y := mem.o +cxl_pci-y := pci.o
Alejandro Lucero Palau wrote:
On 10/23/24 02:43, Dan Williams wrote:
When the CXL subsystem is built-in the module init order is determined by Makefile order. That order violates expectations. The expectation is that cxl_acpi and cxl_mem can race to attach and that if cxl_acpi wins the race cxl_mem will find the enabled CXL root ports it needs and if cxl_acpi loses the race it will retrigger cxl_mem to attach via cxl_bus_rescan(). That only works if cxl_acpi can assume ports are enabled immediately upon cxl_acpi_probe() return. That in turn can only happen in the CONFIG_CXL_ACPI=y case if the cxl_port object appears before the cxl_acpi object in the Makefile.
I'm having problems with understanding this. The acpi module is initialised following the initcall levels, as defined by the code with the subsys_initcall(cxl_acpi_init), and the cxl_mem module is not, so AFAIK, there should not be any race there with the acpi module always being initialised first. It I'm right, the problem should be another one we do not know yet ...
This is a valid point, and I do think that cxl_port should also move to subsys_initcall() for completeness.
However, the reason this Makefile change works, even though cxl_acpi finishes init before cxl_port when both are built-in, is due to device discovery order.
With the old Makefile order it is possible for cxl_mem to race cxl_acpi_probe() in a way that defeats the cxl_bus_rescan() that is there to resolve device discovery races.
Fix up the order to prevent initialization failures, and make sure that cxl_port is built-in if cxl_acpi is also built-in.
... or forcing cxl_port to be built-in is enough. I wonder how, without it, the cxl root ports can be there for cxl_mem ...
It does not need to be there for cxl_mem. It is ok for cxl_mem to load and complete enumeration well before cxl_acpi ever arrives. As long as cxl_bus_rescan() enables those devices after the fact then everything is ok.
The problematic case being fixed is the opposite, i.e. that cxl_bus_rescan() completes and never triggers again after cxl_mem has failed to find the root ports.
On 10/24/24 17:32, Dan Williams wrote:
Alejandro Lucero Palau wrote:
On 10/23/24 02:43, Dan Williams wrote:
When the CXL subsystem is built-in the module init order is determined by Makefile order. That order violates expectations. The expectation is that cxl_acpi and cxl_mem can race to attach and that if cxl_acpi wins the race cxl_mem will find the enabled CXL root ports it needs and if cxl_acpi loses the race it will retrigger cxl_mem to attach via cxl_bus_rescan(). That only works if cxl_acpi can assume ports are enabled immediately upon cxl_acpi_probe() return. That in turn can only happen in the CONFIG_CXL_ACPI=y case if the cxl_port object appears before the cxl_acpi object in the Makefile.
I'm having problems with understanding this. The acpi module is initialised following the initcall levels, as defined by the code with the subsys_initcall(cxl_acpi_init), and the cxl_mem module is not, so AFAIK, there should not be any race there with the acpi module always being initialised first. It I'm right, the problem should be another one we do not know yet ...
This is a valid point, and I do think that cxl_port should also move to subsys_initcall() for completeness.
However, the reason this Makefile change works, even though cxl_acpi finishes init before cxl_port when both are built-in, is due to device discovery order.
With the old Makefile order it is possible for cxl_mem to race cxl_acpi_probe() in a way that defeats the cxl_bus_rescan() that is there to resolve device discovery races.
OK. Then rephrasing the commit would help.
Apart from that:
Tested-by: Alejandro Lucero alucerop@amd.com
Reviewed-by: Alejandro Lucero alucerop@amd.com
Fix up the order to prevent initialization failures, and make sure that cxl_port is built-in if cxl_acpi is also built-in.
... or forcing cxl_port to be built-in is enough. I wonder how, without it, the cxl root ports can be there for cxl_mem ...
It does not need to be there for cxl_mem. It is ok for cxl_mem to load and complete enumeration well before cxl_acpi ever arrives. As long as cxl_bus_rescan() enables those devices after the fact then everything is ok.
The problematic case being fixed is the opposite, i.e. that cxl_bus_rescan() completes and never triggers again after cxl_mem has failed to find the root ports.
Alejandro Lucero Palau wrote:
On 10/24/24 17:32, Dan Williams wrote:
Alejandro Lucero Palau wrote:
On 10/23/24 02:43, Dan Williams wrote:
When the CXL subsystem is built-in the module init order is determined by Makefile order. That order violates expectations. The expectation is that cxl_acpi and cxl_mem can race to attach and that if cxl_acpi wins the race cxl_mem will find the enabled CXL root ports it needs and if cxl_acpi loses the race it will retrigger cxl_mem to attach via cxl_bus_rescan(). That only works if cxl_acpi can assume ports are enabled immediately upon cxl_acpi_probe() return. That in turn can only happen in the CONFIG_CXL_ACPI=y case if the cxl_port object appears before the cxl_acpi object in the Makefile.
I'm having problems with understanding this. The acpi module is initialised following the initcall levels, as defined by the code with the subsys_initcall(cxl_acpi_init), and the cxl_mem module is not, so AFAIK, there should not be any race there with the acpi module always being initialised first. It I'm right, the problem should be another one we do not know yet ...
This is a valid point, and I do think that cxl_port should also move to subsys_initcall() for completeness.
However, the reason this Makefile change works, even though cxl_acpi finishes init before cxl_port when both are built-in, is due to device discovery order.
With the old Makefile order it is possible for cxl_mem to race cxl_acpi_probe() in a way that defeats the cxl_bus_rescan() that is there to resolve device discovery races.
OK. Then rephrasing the commit would help.
That and moving cxl_port to subsys_initcall(). Will respin this one.
Apart from that:
Tested-by: Alejandro Lucero alucerop@amd.com
Reviewed-by: Alejandro Lucero alucerop@amd.com
Thanks!
Dan Williams wrote:
When the CXL subsystem is built-in the module init order is determined by Makefile order. That order violates expectations. The expectation is that cxl_acpi and cxl_mem can race to attach and that if cxl_acpi wins the race cxl_mem will find the enabled CXL root ports it needs and if cxl_acpi loses the race it will retrigger cxl_mem to attach via cxl_bus_rescan(). That only works if cxl_acpi can assume ports are enabled immediately upon cxl_acpi_probe() return. That in turn can only happen in the CONFIG_CXL_ACPI=y case if the cxl_port object appears before the cxl_acpi object in the Makefile.
Fix up the order to prevent initialization failures, and make sure that cxl_port is built-in if cxl_acpi is also built-in.
As for what contributed to this not being found earlier, the CXL regression environment, cxl_test, builds all CXL functionality as a module to allow to symbol mocking and other dynamic reload tests. As a result there is no regression coverage for the built-in case.
Reported-by: Gregory Price gourry@gourry.net Closes: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net Tested-by: Gregory Price gourry@gourry.net Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver") Cc: stable@vger.kernel.org Cc: Davidlohr Bueso dave@stgolabs.net Cc: Jonathan Cameron jonathan.cameron@huawei.com Cc: Dave Jiang dave.jiang@intel.com Cc: Alison Schofield alison.schofield@intel.com Cc: Vishal Verma vishal.l.verma@intel.com Cc: Ira Weiny ira.weiny@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com
Reviewed-by: Ira Weiny ira.weiny@intel.com
[snip]
When the CXL subsystem is built-in the module init order is determined by Makefile order. That order violates expectations. The expectation is that cxl_acpi and cxl_mem can race to attach. If cxl_acpi wins the race, cxl_mem will find the enabled CXL root ports it needs. If cxl_acpi loses the race it will retrigger cxl_mem to attach via cxl_bus_rescan(). That flow only works if cxl_acpi can assume ports are enabled immediately upon cxl_acpi_probe() return. That in turn can only happen in the CONFIG_CXL_ACPI=y case if the cxl_port driver is registered before cxl_acpi_probe() runs.
Fix up the order to prevent initialization failures. Ensure that cxl_port is built-in when cxl_acpi is also built-in, arrange for Makefile order to resolve the subsys_initcall() order of cxl_port and cxl_acpi, and arrange for Makefile order to resolve the device_initcall() (module_init()) order of the remaining objects.
As for what contributed to this not being found earlier, the CXL regression environment, cxl_test, builds all CXL functionality as a module to allow to symbol mocking and other dynamic reload tests. As a result there is no regression coverage for the built-in case.
Reported-by: Gregory Price gourry@gourry.net Closes: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net Tested-by: Gregory Price gourry@gourry.net Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver") Cc: stable@vger.kernel.org Cc: Davidlohr Bueso dave@stgolabs.net Cc: Jonathan Cameron jonathan.cameron@huawei.com Cc: Dave Jiang dave.jiang@intel.com Cc: Alison Schofield alison.schofield@intel.com Cc: Vishal Verma vishal.l.verma@intel.com Cc: Ira Weiny ira.weiny@intel.com Reviewed-by: Jonathan Cameron Jonathan.Cameron@huawei.com Reviewed-by: Ira Weiny ira.weiny@intel.com Tested-by: Alejandro Lucero alucerop@amd.com Reviewed-by: Alejandro Lucero alucerop@amd.com Signed-off-by: Dan Williams dan.j.williams@intel.com --- Changes since v2 - Partial re-roll patch1 to address comments and collect tags - Move cxl_port to subsys_initcall, clarify changelog (Alejandro)
drivers/cxl/Kconfig | 1 + drivers/cxl/Makefile | 20 ++++++++++++++------ drivers/cxl/port.c | 17 ++++++++++++++++- 3 files changed, 31 insertions(+), 7 deletions(-)
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig index 29c192f20082..876469e23f7a 100644 --- a/drivers/cxl/Kconfig +++ b/drivers/cxl/Kconfig @@ -60,6 +60,7 @@ config CXL_ACPI default CXL_BUS select ACPI_TABLE_LIB select ACPI_HMAT + select CXL_PORT help Enable support for host managed device memory (HDM) resources published by a platform's ACPI CXL memory layout description. See diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile index db321f48ba52..2caa90fa4bf2 100644 --- a/drivers/cxl/Makefile +++ b/drivers/cxl/Makefile @@ -1,13 +1,21 @@ # SPDX-License-Identifier: GPL-2.0 + +# Order is important here for the built-in case: +# - 'core' first for fundamental init +# - 'port' before platform root drivers like 'acpi' so that CXL-root ports +# are immediately enabled +# - 'mem' and 'pmem' before endpoint drivers so that memdevs are +# immediately enabled +# - 'pci' last, also mirrors the hardware enumeration hierarchy obj-y += core/ -obj-$(CONFIG_CXL_PCI) += cxl_pci.o -obj-$(CONFIG_CXL_MEM) += cxl_mem.o +obj-$(CONFIG_CXL_PORT) += cxl_port.o obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o -obj-$(CONFIG_CXL_PORT) += cxl_port.o +obj-$(CONFIG_CXL_MEM) += cxl_mem.o +obj-$(CONFIG_CXL_PCI) += cxl_pci.o
-cxl_mem-y := mem.o -cxl_pci-y := pci.o +cxl_port-y := port.o cxl_acpi-y := acpi.o cxl_pmem-y := pmem.o security.o -cxl_port-y := port.o +cxl_mem-y := mem.o +cxl_pci-y := pci.o diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c index 861dde65768f..9dc394295e1f 100644 --- a/drivers/cxl/port.c +++ b/drivers/cxl/port.c @@ -208,7 +208,22 @@ static struct cxl_driver cxl_port_driver = { }, };
-module_cxl_driver(cxl_port_driver); +static int __init cxl_port_init(void) +{ + return cxl_driver_register(&cxl_port_driver); +} +/* + * Be ready to immediately enable ports emitted by the platform CXL root + * (e.g. cxl_acpi) when CONFIG_CXL_PORT=y. + */ +subsys_initcall(cxl_port_init); + +static void __exit cxl_port_exit(void) +{ + cxl_driver_unregister(&cxl_port_driver); +} +module_exit(cxl_port_exit); + MODULE_DESCRIPTION("CXL: Port enumeration and services"); MODULE_LICENSE("GPL v2"); MODULE_IMPORT_NS(CXL);
In support of investigating an initialization failure report [1], cxl_test was updated to register mock memory-devices after the mock root-port/bus device had been registered. That led to cxl_test crashing with a use-after-free bug with the following signature:
cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1 cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1 cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0 1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1 [..] cxld_unregister: cxl decoder14.0: cxl_region_decode_reset: cxl_region region3: mock_decoder_reset: cxl_port port3: decoder3.0 reset 2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1 cxl_endpoint_decoder_release: cxl decoder14.0: [..] cxld_unregister: cxl decoder7.0: 3) cxl_region_decode_reset: cxl_region region3: Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI [..] RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core] [..] Call Trace: <TASK> cxl_region_decode_reset+0x69/0x190 [cxl_core] cxl_region_detach+0xe8/0x210 [cxl_core] cxl_decoder_kill_region+0x27/0x40 [cxl_core] cxld_unregister+0x5d/0x60 [cxl_core]
At 1) a region has been established with 2 endpoint decoders (7.0 and 14.0). Those endpoints share a common switch-decoder in the topology (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits the "out of order reset case" in the switch decoder. The effect though is that region3 cleanup is aborted leaving it in-tact and referencing decoder14.0. At 3) the second attempt to teardown region3 trips over the stale decoder14.0 object which has long since been deleted.
The fix here is to recognize that the CXL specification places no mandate on in-order shutdown of switch-decoders, the driver enforces in-order allocation, and hardware enforces in-order commit. So, rather than fail and leave objects dangling, always remove them.
In support of making cxl_region_decode_reset() always succeed, cxl_region_invalidate_memregion() failures are turned into warnings. Crashing the kernel is ok there since system integrity is at risk if caches cannot be managed around physical address mutation events like CXL region destruction.
A new device_for_each_child_reverse_from() is added to cleanup port->commit_end after all dependent decoders have been disabled. In other words if decoders are allocated 0->1->2 and disabled 1->2->0 then port->commit_end only decrements from 2 after 2 has been disabled, and it decrements all the way to zero since 1 was disabled previously.
Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1] Cc: stable@vger.kernel.org Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware") Reviewed-by: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Davidlohr Bueso dave@stgolabs.net Cc: Dave Jiang dave.jiang@intel.com Cc: Alison Schofield alison.schofield@intel.com Cc: Ira Weiny ira.weiny@intel.com Cc: Zijun Hu quic_zijuhu@quicinc.com Signed-off-by: Dan Williams dan.j.williams@intel.com --- drivers/base/core.c | 35 +++++++++++++++++++++++++++++ drivers/cxl/core/hdm.c | 50 +++++++++++++++++++++++++++++++++++------- drivers/cxl/core/region.c | 48 +++++++++++----------------------------- drivers/cxl/cxl.h | 3 ++- include/linux/device.h | 3 +++ tools/testing/cxl/test/cxl.c | 14 ++++-------- 6 files changed, 100 insertions(+), 53 deletions(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c index a4c853411a6b..e42f1ad73078 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -4037,6 +4037,41 @@ int device_for_each_child_reverse(struct device *parent, void *data, } EXPORT_SYMBOL_GPL(device_for_each_child_reverse);
+/** + * device_for_each_child_reverse_from - device child iterator in reversed order. + * @parent: parent struct device. + * @from: optional starting point in child list + * @fn: function to be called for each device. + * @data: data for the callback. + * + * Iterate over @parent's child devices, starting at @from, and call @fn + * for each, passing it @data. This helper is identical to + * device_for_each_child_reverse() when @from is NULL. + * + * @fn is checked each iteration. If it returns anything other than 0, + * iteration stop and that value is returned to the caller of + * device_for_each_child_reverse_from(); + */ +int device_for_each_child_reverse_from(struct device *parent, + struct device *from, const void *data, + int (*fn)(struct device *, const void *)) +{ + struct klist_iter i; + struct device *child; + int error = 0; + + if (!parent->p) + return 0; + + klist_iter_init_node(&parent->p->klist_children, &i, + (from ? &from->p->knode_parent : NULL)); + while ((child = prev_device(&i)) && !error) + error = fn(child, data); + klist_iter_exit(&i); + return error; +} +EXPORT_SYMBOL_GPL(device_for_each_child_reverse_from); + /** * device_find_child - device iterator for locating a particular device. * @parent: parent struct device diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c index 3df10517a327..223c273c0cd1 100644 --- a/drivers/cxl/core/hdm.c +++ b/drivers/cxl/core/hdm.c @@ -712,7 +712,44 @@ static int cxl_decoder_commit(struct cxl_decoder *cxld) return 0; }
-static int cxl_decoder_reset(struct cxl_decoder *cxld) +static int commit_reap(struct device *dev, const void *data) +{ + struct cxl_port *port = to_cxl_port(dev->parent); + struct cxl_decoder *cxld; + + if (!is_switch_decoder(dev) && !is_endpoint_decoder(dev)) + return 0; + + cxld = to_cxl_decoder(dev); + if (port->commit_end == cxld->id && + ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) { + port->commit_end--; + dev_dbg(&port->dev, "reap: %s commit_end: %d\n", + dev_name(&cxld->dev), port->commit_end); + } + + return 0; +} + +void cxl_port_commit_reap(struct cxl_decoder *cxld) +{ + struct cxl_port *port = to_cxl_port(cxld->dev.parent); + + lockdep_assert_held_write(&cxl_region_rwsem); + + /* + * Once the highest committed decoder is disabled, free any other + * decoders that were pinned allocated by out-of-order release. + */ + port->commit_end--; + dev_dbg(&port->dev, "reap: %s commit_end: %d\n", dev_name(&cxld->dev), + port->commit_end); + device_for_each_child_reverse_from(&port->dev, &cxld->dev, NULL, + commit_reap); +} +EXPORT_SYMBOL_NS_GPL(cxl_port_commit_reap, CXL); + +static void cxl_decoder_reset(struct cxl_decoder *cxld) { struct cxl_port *port = to_cxl_port(cxld->dev.parent); struct cxl_hdm *cxlhdm = dev_get_drvdata(&port->dev); @@ -721,14 +758,14 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld) u32 ctrl;
if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0) - return 0; + return;
- if (port->commit_end != id) { + if (port->commit_end == id) + cxl_port_commit_reap(cxld); + else dev_dbg(&port->dev, "%s: out of order reset, expected decoder%d.%d\n", dev_name(&cxld->dev), port->id, port->commit_end); - return -EBUSY; - }
down_read(&cxl_dpa_rwsem); ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id)); @@ -741,7 +778,6 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld) writel(0, hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id)); up_read(&cxl_dpa_rwsem);
- port->commit_end--; cxld->flags &= ~CXL_DECODER_F_ENABLE;
/* Userspace is now responsible for reconfiguring this decoder */ @@ -751,8 +787,6 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld) cxled = to_cxl_endpoint_decoder(&cxld->dev); cxled->state = CXL_DECODER_STATE_MANUAL; } - - return 0; }
static int cxl_setup_hdm_decoder_from_dvsec( diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c index e701e4b04032..3478d2058303 100644 --- a/drivers/cxl/core/region.c +++ b/drivers/cxl/core/region.c @@ -232,8 +232,8 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr) "Bypassing cpu_cache_invalidate_memregion() for testing!\n"); return 0; } else { - dev_err(&cxlr->dev, - "Failed to synchronize CPU cache state\n"); + dev_WARN(&cxlr->dev, + "Failed to synchronize CPU cache state\n"); return -ENXIO; } } @@ -242,19 +242,17 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr) return 0; }
-static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) +static void cxl_region_decode_reset(struct cxl_region *cxlr, int count) { struct cxl_region_params *p = &cxlr->params; - int i, rc = 0; + int i;
/* - * Before region teardown attempt to flush, and if the flush - * fails cancel the region teardown for data consistency - * concerns + * Before region teardown attempt to flush, evict any data cached for + * this region, or scream loudly about missing arch / platform support + * for CXL teardown. */ - rc = cxl_region_invalidate_memregion(cxlr); - if (rc) - return rc; + cxl_region_invalidate_memregion(cxlr);
for (i = count - 1; i >= 0; i--) { struct cxl_endpoint_decoder *cxled = p->targets[i]; @@ -277,23 +275,17 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) cxl_rr = cxl_rr_load(iter, cxlr); cxld = cxl_rr->decoder; if (cxld->reset) - rc = cxld->reset(cxld); - if (rc) - return rc; + cxld->reset(cxld); set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); }
endpoint_reset: - rc = cxled->cxld.reset(&cxled->cxld); - if (rc) - return rc; + cxled->cxld.reset(&cxled->cxld); set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); }
/* all decoders associated with this region have been torn down */ clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); - - return 0; }
static int commit_decoder(struct cxl_decoder *cxld) @@ -409,16 +401,8 @@ static ssize_t commit_store(struct device *dev, struct device_attribute *attr, * still pending. */ if (p->state == CXL_CONFIG_RESET_PENDING) { - rc = cxl_region_decode_reset(cxlr, p->interleave_ways); - /* - * Revert to committed since there may still be active - * decoders associated with this region, or move forward - * to active to mark the reset successful - */ - if (rc) - p->state = CXL_CONFIG_COMMIT; - else - p->state = CXL_CONFIG_ACTIVE; + cxl_region_decode_reset(cxlr, p->interleave_ways); + p->state = CXL_CONFIG_ACTIVE; } }
@@ -2054,13 +2038,7 @@ static int cxl_region_detach(struct cxl_endpoint_decoder *cxled) get_device(&cxlr->dev);
if (p->state > CXL_CONFIG_ACTIVE) { - /* - * TODO: tear down all impacted regions if a device is - * removed out of order - */ - rc = cxl_region_decode_reset(cxlr, p->interleave_ways); - if (rc) - goto out; + cxl_region_decode_reset(cxlr, p->interleave_ways); p->state = CXL_CONFIG_ACTIVE; }
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index 0d8b810a51f0..5406e3ab3d4a 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -359,7 +359,7 @@ struct cxl_decoder { struct cxl_region *region; unsigned long flags; int (*commit)(struct cxl_decoder *cxld); - int (*reset)(struct cxl_decoder *cxld); + void (*reset)(struct cxl_decoder *cxld); };
/* @@ -730,6 +730,7 @@ static inline bool is_cxl_root(struct cxl_port *port) int cxl_num_decoders_committed(struct cxl_port *port); bool is_cxl_port(const struct device *dev); struct cxl_port *to_cxl_port(const struct device *dev); +void cxl_port_commit_reap(struct cxl_decoder *cxld); struct pci_bus; int devm_cxl_register_pci_bus(struct device *host, struct device *uport_dev, struct pci_bus *bus); diff --git a/include/linux/device.h b/include/linux/device.h index b4bde8d22697..667cb6db9019 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -1078,6 +1078,9 @@ int device_for_each_child(struct device *dev, void *data, int (*fn)(struct device *dev, void *data)); int device_for_each_child_reverse(struct device *dev, void *data, int (*fn)(struct device *dev, void *data)); +int device_for_each_child_reverse_from(struct device *parent, + struct device *from, const void *data, + int (*fn)(struct device *, const void *)); struct device *device_find_child(struct device *dev, void *data, int (*match)(struct device *dev, void *data)); struct device *device_find_child_by_name(struct device *parent, diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c index 90d5afd52dd0..c5bbd89b3192 100644 --- a/tools/testing/cxl/test/cxl.c +++ b/tools/testing/cxl/test/cxl.c @@ -693,26 +693,22 @@ static int mock_decoder_commit(struct cxl_decoder *cxld) return 0; }
-static int mock_decoder_reset(struct cxl_decoder *cxld) +static void mock_decoder_reset(struct cxl_decoder *cxld) { struct cxl_port *port = to_cxl_port(cxld->dev.parent); int id = cxld->id;
if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0) - return 0; + return;
dev_dbg(&port->dev, "%s reset\n", dev_name(&cxld->dev)); - if (port->commit_end != id) { + if (port->commit_end == id) + cxl_port_commit_reap(cxld); + else dev_dbg(&port->dev, "%s: out of order reset, expected decoder%d.%d\n", dev_name(&cxld->dev), port->id, port->commit_end); - return -EBUSY; - } - - port->commit_end--; cxld->flags &= ~CXL_DECODER_F_ENABLE; - - return 0; }
static void default_mock_decoder(struct cxl_decoder *cxld)
Dan Williams wrote:
In support of investigating an initialization failure report [1], cxl_test was updated to register mock memory-devices after the mock root-port/bus device had been registered. That led to cxl_test crashing with a use-after-free bug with the following signature:
cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1 cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1 cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0
- cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1 [..] cxld_unregister: cxl decoder14.0: cxl_region_decode_reset: cxl_region region3: mock_decoder_reset: cxl_port port3: decoder3.0 reset
- mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1 cxl_endpoint_decoder_release: cxl decoder14.0: [..] cxld_unregister: cxl decoder7.0:
- cxl_region_decode_reset: cxl_region region3: Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI [..] RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core] [..] Call Trace:
<TASK> cxl_region_decode_reset+0x69/0x190 [cxl_core] cxl_region_detach+0xe8/0x210 [cxl_core] cxl_decoder_kill_region+0x27/0x40 [cxl_core] cxld_unregister+0x5d/0x60 [cxl_core]
At 1) a region has been established with 2 endpoint decoders (7.0 and 14.0). Those endpoints share a common switch-decoder in the topology (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits the "out of order reset case" in the switch decoder. The effect though is that region3 cleanup is aborted leaving it in-tact and referencing decoder14.0. At 3) the second attempt to teardown region3 trips over the stale decoder14.0 object which has long since been deleted.
The fix here is to recognize that the CXL specification places no mandate on in-order shutdown of switch-decoders, the driver enforces in-order allocation, and hardware enforces in-order commit. So, rather than fail and leave objects dangling, always remove them.
In support of making cxl_region_decode_reset() always succeed, cxl_region_invalidate_memregion() failures are turned into warnings. Crashing the kernel is ok there since system integrity is at risk if caches cannot be managed around physical address mutation events like CXL region destruction.
A new device_for_each_child_reverse_from() is added to cleanup port->commit_end after all dependent decoders have been disabled. In other words if decoders are allocated 0->1->2 and disabled 1->2->0 then port->commit_end only decrements from 2 after 2 has been disabled, and it decrements all the way to zero since 1 was disabled previously.
Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1] Cc: stable@vger.kernel.org Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware") Reviewed-by: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Davidlohr Bueso dave@stgolabs.net Cc: Dave Jiang dave.jiang@intel.com Cc: Alison Schofield alison.schofield@intel.com Cc: Ira Weiny ira.weiny@intel.com Cc: Zijun Hu quic_zijuhu@quicinc.com Signed-off-by: Dan Williams dan.j.williams@intel.com
Reviewed-by: Ira Weiny ira.weiny@intel.com
[snip]
On 22.10.24 18:43:15, Dan Williams wrote:
Changes since v1 [1]:
- Fix some misspellings missed by checkpatch in changelogs (Jonathan)
- Add comments explaining the order of objects in drivers/cxl/Makefile (Jonathan)
- Rename attach_device => cxl_rescan_attach (Jonathan)
- Fixup Zijun's email (Zijun)
Original cover:
Gregory's modest proposal to fix CXL cxl_mem_probe() failures due to delayed arrival of the CXL "root" infrastructure [1] prompted questions of how the existing mechanism for retrying cxl_mem_probe() could be failing.
I found a similar issue with the region creation.
A region is created with the first endpoint found and immediately added as device which triggers cxl_region_probe(). Now, in interleaving setups the region state comes into commit state only after the last endpoint was probed. So the probe must be repeated until all endpoints were enumerated. I ended up with this change:
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c index a07b62254596..c78704e435e5 100644 --- a/drivers/cxl/core/region.c +++ b/drivers/cxl/core/region.c @@ -3775,8 +3775,8 @@ static int cxl_region_probe(struct device *dev) }
if (p->state < CXL_CONFIG_COMMIT) { - dev_dbg(&cxlr->dev, "config state: %d\n", p->state); - rc = -ENXIO; + rc = dev_err_probe(&cxlr->dev, -EPROBE_DEFER, + "region config state: %d\n", p->state); goto out; }
On Wed, Oct 23, 2024 at 03:12:07PM +0200, Robert Richter wrote:
On 22.10.24 18:43:15, Dan Williams wrote:
Changes since v1 [1]:
- Fix some misspellings missed by checkpatch in changelogs (Jonathan)
- Add comments explaining the order of objects in drivers/cxl/Makefile (Jonathan)
- Rename attach_device => cxl_rescan_attach (Jonathan)
- Fixup Zijun's email (Zijun)
Original cover:
Gregory's modest proposal to fix CXL cxl_mem_probe() failures due to delayed arrival of the CXL "root" infrastructure [1] prompted questions of how the existing mechanism for retrying cxl_mem_probe() could be failing.
I found a similar issue with the region creation.
A region is created with the first endpoint found and immediately added as device which triggers cxl_region_probe(). Now, in interleaving setups the region state comes into commit state only after the last endpoint was probed. So the probe must be repeated until all endpoints were enumerated. I ended up with this change:
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c index a07b62254596..c78704e435e5 100644 --- a/drivers/cxl/core/region.c +++ b/drivers/cxl/core/region.c @@ -3775,8 +3775,8 @@ static int cxl_region_probe(struct device *dev) } if (p->state < CXL_CONFIG_COMMIT) {
dev_dbg(&cxlr->dev, "config state: %d\n", p->state);
rc = -ENXIO;
rc = dev_err_probe(&cxlr->dev, -EPROBE_DEFER,
goto out; }"region config state: %d\n", p->state);
I have not experienced any out of order operations since applying Dan's v1 of this patch set, do you still see this after applying the existing set?
Probably this is indicative of needing another SOFTDEP / ordering issue.
~Gregory
-- 2.39.5
I don't see an init order issue here as the mem module is always up before the regions are probed.
-Robert
The critical missing piece in the debug was that Gregory's setup had almost all CXL modules built-in to the kernel.
On the way to that discovery several other bugs and init-order corner cases were discovered.
The main fix is to make sure the drivers/cxl/Makefile object order supports root CXL ports being fully initialized upon cxl_acpi_probe() exit. The modular case has some similar potential holes that are fixed with MODULE_SOFTDEP() and other fix ups. Finally, an attempt to update cxl_test to reproduce the original report resulted in the discovery of a separate long standing use after free bug in cxl_region_detach().
Dan Williams (6): cxl/port: Fix CXL port initialization order when the subsystem is built-in cxl/port: Fix cxl_bus_rescan() vs bus_rescan_devices() cxl/acpi: Ensure ports ready at cxl_acpi_probe() return cxl/port: Fix use-after-free, permit out-of-order decoder shutdown cxl/port: Prevent out-of-order decoder allocation cxl/test: Improve init-order fidelity relative to real-world systems
drivers/base/core.c | 35 +++++++ drivers/cxl/Kconfig | 1 drivers/cxl/Makefile | 20 +++- drivers/cxl/acpi.c | 7 + drivers/cxl/core/hdm.c | 50 +++++++++-- drivers/cxl/core/port.c | 13 ++- drivers/cxl/core/region.c | 91 ++++++++++--------- drivers/cxl/cxl.h | 3 - include/linux/device.h | 3 + tools/testing/cxl/test/cxl.c | 200 +++++++++++++++++++++++------------------- tools/testing/cxl/test/mem.c | 1 11 files changed, 269 insertions(+), 155 deletions(-)
base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
Robert Richter wrote:
On 22.10.24 18:43:15, Dan Williams wrote:
Changes since v1 [1]:
- Fix some misspellings missed by checkpatch in changelogs (Jonathan)
- Add comments explaining the order of objects in drivers/cxl/Makefile (Jonathan)
- Rename attach_device => cxl_rescan_attach (Jonathan)
- Fixup Zijun's email (Zijun)
Original cover:
Gregory's modest proposal to fix CXL cxl_mem_probe() failures due to delayed arrival of the CXL "root" infrastructure [1] prompted questions of how the existing mechanism for retrying cxl_mem_probe() could be failing.
I found a similar issue with the region creation.
A region is created with the first endpoint found and immediately added as device which triggers cxl_region_probe(). Now, in interleaving setups the region state comes into commit state only after the last endpoint was probed. So the probe must be repeated until all endpoints were enumerated. I ended up with this change:
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c index a07b62254596..c78704e435e5 100644 --- a/drivers/cxl/core/region.c +++ b/drivers/cxl/core/region.c @@ -3775,8 +3775,8 @@ static int cxl_region_probe(struct device *dev) } if (p->state < CXL_CONFIG_COMMIT) {
dev_dbg(&cxlr->dev, "config state: %d\n", p->state);
rc = -ENXIO;
rc = dev_err_probe(&cxlr->dev, -EPROBE_DEFER,
"region config state: %d\n", p->state);
I would argue EPROBE_DEFER is not appropriate because there is no guarantee that the other members of the region show up, and if they do they will re-trigger probe. So "probe must be repeated until all endpoints were enumerated" is the case either way. I.e. either more endpoint arrival triggers re-probe or EPROBE_DEFER triggers extra redundant probing *and* still results in a probe attempts as endpoints arrive.
So a dev_dbg() plus -ENXIO return on uncommited region state is expected.
goto out;
} -- 2.39.5
I don't see an init order issue here as the mem module is always up before the regions are probed.
Right, cxl_endpoint_port_probe() triggers region discovery and cxl_endpoint_port_probe() currently only triggers after cxl_mem has registered an endpoint port.
The failure this set is address is unwanted cxl_mem_probe() failures.
On 23.10.24 13:34:36, Dan Williams wrote:
Robert Richter wrote:
if (p->state < CXL_CONFIG_COMMIT) {
dev_dbg(&cxlr->dev, "config state: %d\n", p->state);
rc = -ENXIO;
rc = dev_err_probe(&cxlr->dev, -EPROBE_DEFER,
"region config state: %d\n", p->state);
I would argue EPROBE_DEFER is not appropriate because there is no guarantee that the other members of the region show up, and if they do they will re-trigger probe. So "probe must be repeated until all endpoints were enumerated" is the case either way. I.e. either more endpoint arrival triggers re-probe or EPROBE_DEFER triggers extra redundant probing *and* still results in a probe attempts as endpoints arrive.
So a dev_dbg() plus -ENXIO return on uncommited region state is expected.
So, the region device keeps failing a probe until all endpoints are collected. This triggered by cxl_add_to_region() after the region went into CXL_CONFIG_COMMIT state. Looks reasonable.
The setup I was using showed various probe failures so I 'fixed' this issue without noticing the region device was reprobed later successfully. Thanks for explaining.
-Robert
linux-stable-mirror@lists.linaro.org