On Fri, Jun 26, 2020 at 8:43 PM Dan Williams dan.j.williams@intel.com wrote:
On Fri, Jun 26, 2020 at 7:22 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Fri, Jun 26, 2020 at 2:06 AM Dan Williams dan.j.williams@intel.com wrote:
Quoting the documentation:
Some persistent memory devices run a firmware locally on the device / "DIMM" to perform tasks like media management, capacity provisioning, and health monitoring. The process of updating that firmware typically involves a reboot because it has implications for in-flight memory transactions. However, reboots are disruptive and at least the Intel persistent memory platform implementation, described by the Intel ACPI DSM specification [1], has added support for activating firmware at runtime.
[1]: https://docs.pmem.io/persistent-memory/
The approach taken is to abstract the Intel platform specific mechanism behind a libnvdimm-generic sysfs interface. The interface could support runtime-firmware-activation on another architecture without need to change userspace tooling.
The ACPI NFIT implementation involves a set of device-specific methods (DSMs) to 'arm' individual devices for activation and a bus-level 'trigger' method to execute the activation. Informational / enumeration methods are also provided at the bus and device level.
One complicating aspect of the memory device firmware activation is that the memory controller may need to be quiesced, i.e. no memory cycles, during the activation. While the platform has mechanisms to support holding off in-flight DMA during the activation, the device response to that delay is potentially undefined. The platform may reject a runtime firmware update if, for example, a PCI-E device does not support its completion timeout value being increased to meet the activation time. Outside of device timeouts, the quiesce period may also violate application timeouts.
Given the above device and application timeout considerations, the implementation defaults to hooking into the suspend path to trigger the activation, i.e. a suspend-resume cycle (at least up to the syscore suspend point) is required.
Well, that doesn't work if the suspend method for the system is set to suspend-to-idle (for example, via /sys/power/mem_sleep), because the syscore callbacks are not invoked in that case.
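To make the s2idle caveat concrete, here is a minimal sketch of how one might check which suspend method is currently selected. It relies on /sys/power/mem_sleep marking the active choice with square brackets (for example "s2idle [deep]"); the helper names selected_mode and syscore_skipped are made up for this illustration and are not part of any existing tool.

```shell
#!/bin/sh
# Print the currently selected entry from a sysfs "choice" string such as
# "s2idle [deep]", where the active value is wrapped in square brackets.
# (Hypothetical helper name, for illustration only.)
selected_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Succeed if the selected suspend method is suspend-to-idle, i.e. the
# case in which the syscore callbacks are not invoked.
syscore_skipped() {
    [ "$(selected_mode "$1")" = "s2idle" ]
}

mem_sleep_file=/sys/power/mem_sleep
if [ -r "$mem_sleep_file" ]; then
    current=$(cat "$mem_sleep_file")
    if syscore_skipped "$current"; then
        echo "mem_sleep is s2idle: syscore callbacks will not run"
    else
        echo "mem_sleep is $(selected_mode "$current")"
    fi
fi
```

The parsing is kept separate from the file read so it can be exercised on a plain string, on a machine without that sysfs file.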
Also, you probably don't need the device power state toggling that happens during regular suspend/resume (you may not even want it for some devices).
The hibernation freeze/thaw may be a better match and there is some test support in there already that may be kind of co-opted for your use case.
Hmm, yes I guess freeze should be sufficient to quiesce most device-DMA in the general case as applications will stop sending requests.
It is expected to be sufficient to quiesce all of them.
If that is not the case, the integrity of the hibernation image cannot be guaranteed on the system in question.
I do expect some RDMA devices will happily keep on transmitting, but that likely will need explicit mitigation. It also appears the suspend callback for at least one RDMA device, mlx5_suspend(), is rather violent, as it appears to fully tear down the device context, not just suspend operations.
To be clear, what debug interface were you thinking I could glom onto to just trigger firmware-activate at the end of the freeze phase?
Functionally, the same as for suspend, but using the hibernation interface, so "echo platform > /sys/power/pm_test" followed by "echo disk > /sys/power/state".
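As a rough sketch, the two writes above could be wrapped in a script like the one below. It assumes a kernel built with PM debugging so that /sys/power/pm_test exists. The dry-run default, the write_sysfs helper, and the final reset of pm_test to "none" are additions for this illustration, not something proposed in this thread.

```shell
#!/bin/sh
# Dry-run by default: the writes are printed, not performed.
# Set DRY_RUN=0 and run as root to really write to sysfs.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: $1 = value, $2 = sysfs path.
write_sysfs() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "echo $1 > $2"
    else
        echo "$1" > "$2"
    fi
}

hibernation_test_cycle() {
    # Select the "platform" test level for the next transition.
    write_sysfs platform /sys/power/pm_test
    # Start a hibernation transition; in test mode it stops early
    # and resumes instead of writing an image and powering off.
    write_sysfs disk /sys/power/state
    # Assumption of this sketch: put pm_test back to its default.
    write_sysfs none /sys/power/pm_test
}

hibernation_test_cycle
```

Run without arguments, the script only prints the three writes it would make, which is a safe way to review the sequence before trying it on a test machine.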
But it might be cleaner to introduce a special "hibernation mode", i.e. one more item in /sys/power/disk, that will trigger what you need (in analogy with "test_resume").
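For reference, userspace could detect such a mode the same way it can detect "test_resume" today, by looking at the list in /sys/power/disk. The sketch below only checks for the existing "test_resume" entry, since no name for a new mode has been settled here; disk_mode_listed is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if mode $1 appears in a /sys/power/disk style list $2, e.g.
# "[platform] shutdown reboot suspend test_resume". The brackets that
# mark the active mode are stripped before comparing.
disk_mode_listed() {
    for word in $(printf '%s' "$2" | sed 's/[][]//g'); do
        [ "$word" = "$1" ] && return 0
    done
    return 1
}

disk_file=/sys/power/disk
if [ -r "$disk_file" ]; then
    if disk_mode_listed test_resume "$(cat "$disk_file")"; then
        echo "test_resume is available"
    else
        echo "test_resume is not listed"
    fi
fi
```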