Jonathan Cameron wrote:
On Tue, 22 Oct 2024 18:43:24 -0700 Dan Williams dan.j.williams@intel.com wrote:
When the CXL subsystem is built-in the module init order is determined by Makefile order. That order violates expectations. The expectation is that cxl_acpi and cxl_mem can race to attach and that if cxl_acpi wins the race cxl_mem will find the enabled CXL root ports it needs and if cxl_acpi loses the race it will retrigger cxl_mem to attach via cxl_bus_rescan(). That only works if cxl_acpi can assume ports are enabled immediately upon cxl_acpi_probe() return. That in turn can only happen in the CONFIG_CXL_ACPI=y case if the cxl_port object appears before the cxl_acpi object in the Makefile.
Fix up the order to prevent initialization failures, and make sure that cxl_port is built-in if cxl_acpi is also built-in.
As for what contributed to this not being found earlier, the CXL regression environment, cxl_test, builds all CXL functionality as a module to allow to symbol mocking and other dynamic reload tests. As a result there is no regression coverage for the built-in case.
Reported-by: Gregory Price gourry@gourry.net Closes: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net Tested-by: Gregory Price gourry@gourry.net Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver")
I don't like this due to likely long term fragility, but any other
Please be specific about the fragility and how is this different than any other Makefile order fragility, like the many cases in drivers/Makefile/, or patches fixing up initcall order?
Now, an argument can be made that there are too many CXL sub-objects and more can be merged into a monolithic cxl_core object. The flipside of that is reduced testability, at least via symbol mocking techniques. Just look at the recent case where the fact that drivers/cxl/core/region.c is built into cxl_core.o rather than its own cxl_region.o object results in an in-line code change to support cxl_test [1]. There are tradeoffs.
solution is probably more painful. Long term we should really get a regression test for these ordering issues in place in one of the CIs.
The final patch in this series does improve cxl_test's ability to catch regressions in module init order, and that ordering change did uncover a bug. The system works! 😅
Going further the test mode that is needed, in addition to QEMU emulation and cxl_test interface mocking, is kunit or similar [2] infrastructure for some function-scope unit tests.
[1]: http://lore.kernel.org/20241022030054.258942-1-lizhijian@fujitsu.com [2]: http://lore.kernel.org/170795677066.3697776.12587812713093908173.stgit@ubunt...