On Mon, 19 May 2025 23:43:21 +0530, Udipto Goswami wrote:
Hi Mathias,
From what I recall, we saw this issue coming up on our QCOM mobile platforms but it was not consistent. It was only reported in long runs i believe. The most recent instance when I pushed this patch was with platform SM8650, it was a watchdog timeout issue where xhci_reset() -> xhci_handshake() polling read timeout upon xhci remove.
Was it some system-wide watchdog, i.e. unrelated tasks were locking up?
It looks similar to that command abort freeze: xhci_resume() calls xhci_reset() under xhci->lock, and xhci_handshake() spins for a few seconds with the spinlock held. Anything else (workers, IRQs) trying to grab the lock will also spin and delay unrelated things.
Not sure why your commit message says "Use this helper in places where xhci_handshake is called unlocked and has a long timeout", because you end up calling it from two places where the lock is (incorrectly) held. That's why adding the early bailout helped, I guess.
Unfortunately I was not able to simulate the scenario for more granular testing and had validated it with long hours stress testing.
Looking at xhci_resume(), it will call xhci_reset() if the controller has known bugs (e.g. the RESET_ON_RESUME quirk) or it fails resuming.
I guess you could simulate this case by forcing the quirk with a module parameter and adding some extra delay to xhci_handshake(), so you are not dependent on the hardware actually failing in any manner.
Full call stack on core6: -000|readl([X19] addr = 0xFFFFFFC03CC08020) -001|xhci_handshake(inline) -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000) -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?) -003|xhci_plat_runtime_resume([locdesc] dev = ?) -004|pm_generic_runtime_resume([locdesc] dev = ?) -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev = 0xFFFFFF890F619C10) -006|rpm_callback(inline) -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10, [NSD:0xFFFFFFC041453AD4] rpmflags = 4) -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4) -008|pm_runtime_get_sync(inline) -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00) -009|platform_remove([X19] _dev = 0xFFFFFF890F619C10) -010|device_remove(inline)