On Mon, May 19, 2025 at 6:23 PM Mathias Nyman mathias.nyman@linux.intel.com wrote:
On 17.5.2025 7.39, Roy Luo wrote:
This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c.
Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() helper") was introduced to workaround watchdog timeout issues on some platforms, allowing xhci_reset() to bail out early without waiting for the reset to complete.
Skipping the xhci handshake during a reset is a dangerous move. The xhci specification explicitly states that certain registers cannot be accessed during reset in section 5.4.1 USB Command Register (USBCMD), Host Controller Reset (HCRST) field: "This bit is cleared to '0' by the Host Controller when the reset process is complete. Software cannot terminate the reset process early by writinga '0' to this bit and shall not write any xHC Operational or Runtime registers until while HCRST is '1'."
This behavior causes a regression on SNPS DWC3 USB controller with dual-role capability. When the DWC3 controller exits host mode and removes xhci while a reset is still in progress, and then tries to configure its hardware for device mode, the ongoing reset leads to register access issues; specifically, all register reads returns 0. These issues extend beyond the xhci register space (which is expected during a reset) and affect the entire DWC3 IP block, causing the DWC3 device mode to malfunction.
I agree with you and Thinh that waiting for the HCRST bit to clear during reset is the right thing to do, especially now when we know skipping it causes issues for SNPS DWC3, even if it's only during remove phase.
But reverting this patch will re-introduce the issue originally worked around by Udipto Goswami, causing regression.
Best thing to do would be to wait for HCRST to clear for all other platforms except the one with the issue.
Udipto Goswami, can you recall the platforms that needed this workaroud? and do we have an easy way to detect those?
Hi Mathias,
From what I recall, we saw this issue coming up on our QCOM mobile platforms but it was not consistent. It was only reported in long runs i believe. The most recent instance when I pushed this patch was with platform SM8650, it was a watchdog timeout issue where xhci_reset() -> xhci_handshake() polling read timeout upon xhci remove. Unfortunately I was not able to simulate the scenario for more granular testing and had validated it with long hours stress testing. The callstack was like so:
Full call stack on core6: -000|readl([X19] addr = 0xFFFFFFC03CC08020) -001|xhci_handshake(inline) -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000) -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?) -003|xhci_plat_runtime_resume([locdesc] dev = ?) -004|pm_generic_runtime_resume([locdesc] dev = ?) -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev = 0xFFFFFF890F619C10) -006|rpm_callback(inline) -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10, [NSD:0xFFFFFFC041453AD4] rpmflags = 4) -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4) -008|pm_runtime_get_sync(inline) -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00) -009|platform_remove([X19] _dev = 0xFFFFFF890F619C10) -010|device_remove(inline) -010|__device_release_driver(inline) -010|device_release_driver_internal([X20] dev = 0xFFFFFF890F619C10, [?] drv = ?, [X19] parent = 0x0) -011|device_release_driver(inline) -011|bus_remove_device([X19] dev = 0xFFFFFF890F619C10) -012|device_del([X20] dev = 0xFFFFFF890F619C10) -013|platform_device_del(inline) -013|platform_device_unregister([X19] pdev = 0xFFFFFF890F619C00) -014|dwc3_host_exit(inline) -014|__dwc3_set_mode([X20] work = 0xFFFFFF886072F840) -015|process_one_work([X19] worker = 0xFFFFFF887AEE46C0, [X21] work = 0xFFFFFF886072F840) -016|worker_thread([X19] __worker = 0xFFFFFF887AEE46C0) -017|kthread([X22] _create = 0xFFFFFF8937350600) -018|ret_from_fork(asm) ---|end of frame