On Tue, May 20, 2025 at 9:18 AM Mathias Nyman mathias.nyman@linux.intel.com wrote:
On 19.5.2025 21.13, Udipto Goswami wrote:
On Mon, May 19, 2025 at 6:23 PM Mathias Nyman mathias.nyman@linux.intel.com wrote:
On 17.5.2025 7.39, Roy Luo wrote:
This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c.
Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() helper") was introduced to workaround watchdog timeout issues on some platforms, allowing xhci_reset() to bail out early without waiting for the reset to complete.
Skipping the xhci handshake during a reset is a dangerous move. The xhci specification explicitly states that certain registers cannot be accessed during reset in section 5.4.1 USB Command Register (USBCMD), Host Controller Reset (HCRST) field: "This bit is cleared to '0' by the Host Controller when the reset process is complete. Software cannot terminate the reset process early by writinga '0' to this bit and shall not write any xHC Operational or Runtime registers until while HCRST is '1'."
This behavior causes a regression on SNPS DWC3 USB controller with dual-role capability. When the DWC3 controller exits host mode and removes xhci while a reset is still in progress, and then tries to configure its hardware for device mode, the ongoing reset leads to register access issues; specifically, all register reads returns 0. These issues extend beyond the xhci register space (which is expected during a reset) and affect the entire DWC3 IP block, causing the DWC3 device mode to malfunction.
I agree with you and Thinh that waiting for the HCRST bit to clear during reset is the right thing to do, especially now when we know skipping it causes issues for SNPS DWC3, even if it's only during remove phase.
But reverting this patch will re-introduce the issue originally worked around by Udipto Goswami, causing regression.
Best thing to do would be to wait for HCRST to clear for all other platforms except the one with the issue.
Udipto Goswami, can you recall the platforms that needed this workaroud? and do we have an easy way to detect those?
Hi Mathias,
From what I recall, we saw this issue coming up on our QCOM mobile platforms but it was not consistent. It was only reported in long runs i believe. The most recent instance when I pushed this patch was with platform SM8650, it was a watchdog timeout issue where xhci_reset() -> xhci_handshake() polling read timeout upon xhci remove. Unfortunately I was not able to simulate the scenario for more granular testing and had validated it with long hours stress testing. The callstack was like so:
Full call stack on core6: -000|readl([X19] addr = 0xFFFFFFC03CC08020) -001|xhci_handshake(inline) -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000) -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?) -003|xhci_plat_runtime_resume([locdesc] dev = ?) -004|pm_generic_runtime_resume([locdesc] dev = ?) -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev = 0xFFFFFF890F619C10) -006|rpm_callback(inline) -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10, [NSD:0xFFFFFFC041453AD4] rpmflags = 4) -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4) -008|pm_runtime_get_sync(inline) -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00)
Thank you for clarifying this.
So patch avoids the long timeout by always cutting xhci reinit path short in xhci_resume() if resume was caused by pm_runtime_get_sync() call in xhci_plat_remove()
void xhci_plat_remove(struct platform_device *dev) { xhci->xhc_state |= XHCI_STATE_REMOVING; pm_runtime_get_sync(&dev->dev); ... }
I think we can revert this patch, and just make sure that we don't reset the host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set. Just return immediately instead.
Just to be sure, are you proposing that we skip xhci_reset() within the reinit path of xhci_resume()? If we do that, could that lead to issues with subsequent operations in the reinit sequence, such as xhci_init() or xhci_run()?
Do you prefer to group the change to skip xhci_reset() within the reinit path together with this revert? or do you want it to be sent and reviewed separately?
Thanks, Roy