On Thu, May 22, 2025 at 5:24 AM Mathias Nyman mathias.nyman@intel.com wrote:
On 22.5.2025 5.21, Roy Luo wrote:
Udipto Goswami, can you recall the platforms that needed this workaroud? and do we have an easy way to detect those?
Hi Mathias,
From what I recall, we saw this issue coming up on our QCOM mobile platforms but it was not consistent. It was only reported in long runs i believe. The most recent instance when I pushed this patch was with platform SM8650, it was a watchdog timeout issue where xhci_reset() -> xhci_handshake() polling read timeout upon xhci remove. Unfortunately I was not able to simulate the scenario for more granular testing and had validated it with long hours stress testing. The callstack was like so:
Full call stack on core6: -000|readl([X19] addr = 0xFFFFFFC03CC08020) -001|xhci_handshake(inline) -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000) -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?) -003|xhci_plat_runtime_resume([locdesc] dev = ?) -004|pm_generic_runtime_resume([locdesc] dev = ?) -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev = 0xFFFFFF890F619C10) -006|rpm_callback(inline) -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10, [NSD:0xFFFFFFC041453AD4] rpmflags = 4) -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4) -008|pm_runtime_get_sync(inline) -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00)
Thank you for clarifying this.
So patch avoids the long timeout by always cutting xhci reinit path short in xhci_resume() if resume was caused by pm_runtime_get_sync() call in xhci_plat_remove()
void xhci_plat_remove(struct platform_device *dev) { xhci->xhc_state |= XHCI_STATE_REMOVING; pm_runtime_get_sync(&dev->dev); ... }
I think we can revert this patch, and just make sure that we don't reset the host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set. Just return immediately instead.
Just to be sure, are you proposing that we skip xhci_reset() within the reinit path of xhci_resume()? If we do that, could that lead to issues with subsequent operations in the reinit sequence, such as xhci_init() or xhci_run()?
I suggest to only skip xhci_reset in xhci_resume() if XHCI_STATE_REMOVING is set.
This should be similar to what is going on already.
xhci_reset() currently returns -ENODEV if XHCI_STATE_REMOVING is set, unless reset completes extremely fast. xhci_resume() bails out if xhci_reset() returns error:
xhci_resume() ... if (power_lost) { ... retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC); spin_unlock_irq(&xhci->lock); if (retval) return retval;
Do you prefer to group the change to skip xhci_reset() within the reinit path together with this revert? or do you want it to be sent and reviewed separately?
First a patch that bails out from xhci_resume() if XHCI_STATE_REMOVING is set and we are in the reinit (power_lost) path about to call xhci_reset();
Then a second patch that reverts 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state()
Does this sound reasonable?
should avoid the QCOM 10sec watchdog issue as next xhci_rest() called in xhci_remove() path has a short 250ms timeout, and ensure the SNPS DWC3 USB regression won't trigger.
Thanks Mathias
Thanks for the clarification! SGTM. I've sent out a new patchset accordingly https://lore.kernel.org/linux-usb/20250522190912.457583-1-royluo@google.com/
Thanks, Roy