From: Konstantin Sinyuk <konstantin.sinyuk@intel.com>
[ Upstream commit a0d866bab184161ba155b352650083bf6695e50e ]
Dirty state can occur when the host VM undergoes a reset while the device does not. In such a case, the driver must reset the device before it can be used again. As part of this reset, the device capabilities are zeroed. Therefore, the driver must read the Preboot status again to learn the Preboot state, capabilities, and security configuration.
Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES

- The new retry at `drivers/accel/habanalabs/gaudi2/gaudi2.c:3508` ensures `hl_fw_read_preboot_status()` is run again immediately after a dirty-state recovery reset; without it, the reset leaves the device’s preboot capability registers cleared, so the driver would continue with stale or zeroed security/capability data and fail to bring the card back after a host-only reboot (the scenario described in the commit message).
- `hl_fw_read_preboot_status()` repopulates `asic_prop` fields such as `fw_preboot_cpu_boot_dev_sts[01]`, `dynamic_fw_load`, and `fw_security_enabled` (`drivers/accel/habanalabs/common/firmware_if.c:1564-1605`); these values are what the rest of initialization uses to pick the firmware loading path and security posture, so skipping the re-read after `hw_fini()` leads directly to broken or insecure configuration on the recovered device.
- The change is tightly scoped to the Gaudi2 early-init dirty-path, reuses the existing error handling (`goto pci_fini;` and the `reset_on_preboot_fail` guard), and does not touch unrelated subsystems, so regression risk is minimal while it fixes a real user-visible recovery bug.
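
The ordering issue described above can be illustrated with a small userspace mock. This is only a sketch under stated assumptions, not driver code: `struct mock_dev`, the `MOCK_CAP_*` bits, and all `mock_*` functions are hypothetical stand-ins, and the mock models "capabilities are zeroed by the reset" by clearing the driver's cached copy, since the commit message does not say exactly where the zeroing happens.

```c
#include <stdbool.h>

/* Hypothetical capability bits; the real preboot status layout differs. */
#define MOCK_CAP_SECURITY 0x1u
#define MOCK_CAP_DYN_LOAD 0x2u

/* Hypothetical stand-in for the driver/device state, not struct hl_device. */
struct mock_dev {
	bool hw_dirty;             /* device kept state across a host-only reset */
	unsigned int preboot_caps; /* what preboot reports when queried */
	unsigned int cached_caps;  /* driver-side copy (role of asic_prop fields) */
	bool fw_security_enabled;  /* derived from cached_caps */
};

/* Mock of the preboot status read: refresh the driver's cached view. */
static int mock_read_preboot_status(struct mock_dev *d)
{
	d->cached_caps = d->preboot_caps;
	d->fw_security_enabled = (d->cached_caps & MOCK_CAP_SECURITY) != 0;
	return 0;
}

/* Mock of the reset: clears the dirty state and, by assumption, zeroes caps. */
static int mock_hw_fini(struct mock_dev *d)
{
	d->hw_dirty = false;
	d->cached_caps = 0;
	d->fw_security_enabled = false;
	return 0;
}

/* Mirrors only the ordering of the early-init flow touched by the patch. */
static int mock_early_init(struct mock_dev *d, bool reread_after_reset)
{
	int rc = mock_read_preboot_status(d);

	if (rc)
		return rc;

	if (d->hw_dirty) {
		rc = mock_hw_fini(d);
		if (rc)
			return rc;

		/* The step the patch adds: read preboot status again. */
		if (reread_after_reset) {
			rc = mock_read_preboot_status(d);
			if (rc)
				return rc;
		}
	}

	return 0;
}
```

In this mock, calling `mock_early_init()` on a dirty device with `reread_after_reset` set to `false` leaves `cached_caps` at 0 and `fw_security_enabled` false, while passing `true` restores both from `preboot_caps`. The real function additionally tears down PCI state and honours `reset_on_preboot_fail` on failure, which the mock omits.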
 drivers/accel/habanalabs/gaudi2/gaudi2.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c
index 5722e4128d3ce..3df72a5d024a6 100644
--- a/drivers/accel/habanalabs/gaudi2/gaudi2.c
+++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c
@@ -3150,7 +3150,6 @@ static int gaudi2_early_init(struct hl_device *hdev)
 	rc = hl_fw_read_preboot_status(hdev);
 	if (rc) {
 		if (hdev->reset_on_preboot_fail)
-			/* we are already on failure flow, so don't check if hw_fini fails. */
 			hdev->asic_funcs->hw_fini(hdev, true, false);
 		goto pci_fini;
 	}
@@ -3162,6 +3161,13 @@ static int gaudi2_early_init(struct hl_device *hdev)
 			dev_err(hdev->dev, "failed to reset HW in dirty state (%d)\n", rc);
 			goto pci_fini;
 		}
+
+		rc = hl_fw_read_preboot_status(hdev);
+		if (rc) {
+			if (hdev->reset_on_preboot_fail)
+				hdev->asic_funcs->hw_fini(hdev, true, false);
+			goto pci_fini;
+		}
 	}
 
 	return 0;