On Thu, 17 Oct 2024 16:10:39 +0300, Mathias Nyman wrote:
Hmm, wouldn't a long and partially cached TD basically become corrupted by this overwrite?
Unlikely but not impossible. We already turn all cancelled TDs that we don't stop on into no-ops, so those would already now experience the same problem.
No, I think they wouldn't. The note in xHCI 1.2, section 4.6.9, on page 135 states clearly that the xHC shall invalidate cached TRBs other than those of the current TD.
Same page, point 3, mentions that software "may not modify" the current TD, whatever on earth that is supposed to mean. Unfortunately, I can't find a clear "shall not" in 4.6.9, but I would read it as such.
We stopped the endpoint, and issued a 'Set TR Deq' command which is supposed to clear the xHC TRB cache. I find it hard to believe the xHC would continue by keeping only some select TRBs of a TD in its cache.
The idea is, if Set TR Deq fails, the xHC preserves its transfer state and cache and tries to continue. If the TD wasn't fully cached when the xHC stopped, the cached copy remains incomplete. When the xHC restarts, the missing TRBs are fetched from memory, where they are No Ops by now, yielding an invalid TD (e.g. a No Op chained at the end).
So it may turn out that instead of "EP TRB ptr not part of current TD" something else would show up, perhaps TRB Errors.
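To make the scenario concrete, here is a toy model in plain C. It is only a sketch of the reasoning above, not driver code: struct toy_trb, td_to_noop_toy() and the other names are made up for illustration, and it assumes that converting a TRB to a No Op also clears its chain bit. The controller's view is the cached prefix of the original TD followed by whatever is in memory after the overwrite.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_TD_LEN 4

enum toy_type { TOY_NORMAL, TOY_NOOP };

/* Hypothetical, simplified TRB: only the fields this example needs. */
struct toy_trb {
	enum toy_type type;
	bool chain;
};

/* Build a TD of Normal TRBs, all chained except the last one. */
static void build_td(struct toy_trb *td, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		td[i].type = TOY_NORMAL;
		td[i].chain = (i + 1 < n);
	}
}

/* Assumption: the No Op conversion clears the chain bit as well. */
static void td_to_noop_toy(struct toy_trb *td, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		td[i].type = TOY_NOOP;
		td[i].chain = false;
	}
}

/* A chained Normal TRB followed directly by a No Op: the TD never ends properly. */
static bool chained_into_noop(const struct toy_trb *view, size_t n)
{
	for (size_t i = 0; i + 1 < n; i++) {
		if (view[i].type == TOY_NORMAL && view[i].chain &&
		    view[i + 1].type == TOY_NOOP)
			return true;
	}
	return false;
}

/*
 * Simulate a controller that cached the first `cached` TRBs before the
 * driver overwrote the TD in memory, then resumed using its stale cache.
 */
static bool resume_sees_invalid_td(size_t cached)
{
	struct toy_trb mem[TOY_TD_LEN], view[TOY_TD_LEN];

	assert(cached <= TOY_TD_LEN);
	build_td(mem, TOY_TD_LEN);
	for (size_t i = 0; i < cached; i++)
		view[i] = mem[i];		/* stale cached prefix */
	td_to_noop_toy(mem, TOY_TD_LEN);	/* driver cancels the TD */
	for (size_t i = cached; i < TOY_TD_LEN; i++)
		view[i] = mem[i];		/* rest fetched after restart */
	return chained_into_noop(view, TOY_TD_LEN);
}
```

In this model the view is consistent only when nothing or everything was cached; any partial prefix ends in a chained Normal TRB running into a No Op.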
But let's say we end up corrupting the TD. It might still be better than allowing xHC to process the TRBs and write to DMA addresses that might be freed/reused already.
There is some truth to that, I guess. It's a bummer that those bugs are here in the first place and no one seems to know where they come from.
Was this tested on HW? I suppose it wouldn't be hard to corrupt a Set TR Deq command to make it fail, e.g. by using stream 0xffff or something like that. It may be harder to come up with a realistic test case with long TDs.
Regards, Michal