Unplugging a USB3.0 webcam while streaming results in errors like this:
[ 132.646387] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646446] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8630 trb-start 000000002fdf8640 trb-end 000000002fdf8650 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0 [ 132.646560] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646568] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8660 trb-start 000000002fdf8670 trb-end 000000002fdf8670 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0
If an error is detected while processing the last TRB of an isoc TD, the Etron xHC generates two transfer events for the TRB where the error was detected. The first event can be any sort of error (like USB Transaction or Babble Detected, etc), and the final event is Success.
The xHCI driver will handle the TD after the first event and remove it from its internal list, and then print an "Transfer event TRB DMA ptr not part of current TD" error message after the final event.
Commit 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") is designed to address isoc transaction errors, but unfortunately it doesn't account for this scenario.
To work around this by reusing the logic that handles isoc transaction errors, but continuing to wait for the final event when this condition occurs. Sometimes we see the Stopped event after an error mid TD, this is a normal event for a pending TD and we can think of it as the final event we are waiting for.
Check if the XHCI_ETRON_HOST quirk flag is set before invoking the workaround in process_isoc_td().
Fixes: 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") Cc: stable@vger.kernel.org Signed-off-by: Kuangyi Chiang ki.chiang65@gmail.com --- drivers/usb/host/xhci-ring.c | 29 +++++++++++++++++++++-------- 1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 965bffce301e..936fd9151ba8 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -2452,8 +2452,10 @@ static void process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep, switch (trb_comp_code) { case COMP_SUCCESS: /* Don't overwrite status if TD had an error, see xHCI 4.9.1 */ - if (td->error_mid_td) + if (td->error_mid_td) { + td->error_mid_td = false; break; + } if (remaining) { frame->status = short_framestatus; sum_trbs_for_length = true; @@ -2468,25 +2470,36 @@ static void process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep, case COMP_BANDWIDTH_OVERRUN_ERROR: frame->status = -ECOMM; break; + case COMP_USB_TRANSACTION_ERROR: case COMP_BABBLE_DETECTED_ERROR: sum_trbs_for_length = true; fallthrough; case COMP_ISOCH_BUFFER_OVERRUN: frame->status = -EOVERFLOW; + if (trb_comp_code == COMP_USB_TRANSACTION_ERROR) + frame->status = -EPROTO; if (ep_trb != td->end_trb) td->error_mid_td = true; + else + td->error_mid_td = false; + + /* + * If an error is detected on the last TRB of the TD, + * wait for the final event. + */ + if ((xhci->quirks & XHCI_ETRON_HOST) && + td->urb->dev->speed >= USB_SPEED_SUPER && + ep_trb == td->end_trb) + td->error_mid_td = true; break; case COMP_INCOMPATIBLE_DEVICE_ERROR: case COMP_STALL_ERROR: frame->status = -EPROTO; break; - case COMP_USB_TRANSACTION_ERROR: - frame->status = -EPROTO; - sum_trbs_for_length = true; - if (ep_trb != td->end_trb) - td->error_mid_td = true; - break; case COMP_STOPPED: + /* Think of it as the final event if TD had an error */ + if (td->error_mid_td) + td->error_mid_td = false; sum_trbs_for_length = true; break; case COMP_STOPPED_SHORT_PACKET: @@ -2519,7 +2532,7 @@ static void process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
finish_td: /* Don't give back TD yet if we encountered an error mid TD */ - if (td->error_mid_td && ep_trb != td->end_trb) { + if (td->error_mid_td) { xhci_dbg(xhci, "Error mid isoc TD, wait for final completion event\n"); td->urb_length_set = true; return;
On 5.2.2025 7.37, Kuangyi Chiang wrote:
Unplugging a USB3.0 webcam while streaming results in errors like this:
[ 132.646387] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646446] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8630 trb-start 000000002fdf8640 trb-end 000000002fdf8650 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0 [ 132.646560] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646568] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8660 trb-start 000000002fdf8670 trb-end 000000002fdf8670 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0
If an error is detected while processing the last TRB of an isoc TD, the Etron xHC generates two transfer events for the TRB where the error was detected. The first event can be any sort of error (like USB Transaction or Babble Detected, etc), and the final event is Success.
The xHCI driver will handle the TD after the first event and remove it from its internal list, and then print an "Transfer event TRB DMA ptr not part of current TD" error message after the final event.
Commit 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") is designed to address isoc transaction errors, but unfortunately it doesn't account for this scenario.
To work around this by reusing the logic that handles isoc transaction errors, but continuing to wait for the final event when this condition occurs. Sometimes we see the Stopped event after an error mid TD, this is a normal event for a pending TD and we can think of it as the final event we are waiting for.
Not giving back the TD when we get an event for the last TRB in the TD sounds risky. With this change we assume all old and future ETRON hosts will trigger this additional spurious success event.
I think we could handle this more like the XHCI_SPURIOUS_SUCCESS case seen with short transfers, and just silence the error message.
Are there any other issues besides the error message seen?
Thanks Mathias
On 5.2.2025 16.17, Mathias Nyman wrote:
On 5.2.2025 7.37, Kuangyi Chiang wrote:
Unplugging a USB3.0 webcam while streaming results in errors like this:
[ 132.646387] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646446] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8630 trb-start 000000002fdf8640 trb-end 000000002fdf8650 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0 [ 132.646560] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646568] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8660 trb-start 000000002fdf8670 trb-end 000000002fdf8670 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0
If an error is detected while processing the last TRB of an isoc TD, the Etron xHC generates two transfer events for the TRB where the error was detected. The first event can be any sort of error (like USB Transaction or Babble Detected, etc), and the final event is Success.
The xHCI driver will handle the TD after the first event and remove it from its internal list, and then print an "Transfer event TRB DMA ptr not part of current TD" error message after the final event.
Commit 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") is designed to address isoc transaction errors, but unfortunately it doesn't account for this scenario.
To work around this by reusing the logic that handles isoc transaction errors, but continuing to wait for the final event when this condition occurs. Sometimes we see the Stopped event after an error mid TD, this is a normal event for a pending TD and we can think of it as the final event we are waiting for.
Not giving back the TD when we get an event for the last TRB in the TD sounds risky. With this change we assume all old and future ETRON hosts will trigger this additional spurious success event.
I think we could handle this more like the XHCI_SPURIOUS_SUCCESS case seen with short transfers, and just silence the error message.
Are there any other issues besides the error message seen?
Would something like this work: (untested, compiles)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 965bffce301e..81d45ddebace 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -2644,6 +2644,22 @@ static int handle_transferless_tx_event(struct xhci_hcd *xhci, struct xhci_virt_ return 0; }
+static bool xhci_spurious_success_tx_event(struct xhci_hcd *xhci, + struct xhci_ring *ring) +{ + switch(ring->last_td_comp_code) { + case COMP_SHORT_PACKET: + return xhci->quirks & XHCI_SPURIOUS_SUCCESS; + case COMP_USB_TRANSACTION_ERROR: + case COMP_BABBLE_DETECTED_ERROR: + case COMP_ISOCH_BUFFER_OVERRUN: + return xhci->quirks & XHCI_ETRON_HOST && + ring->type == TYPE_ISOC; + default: + return false; + } +} + /* * If this function returns an error condition, it means it got a Transfer * event with a corrupted Slot ID, Endpoint ID, or TRB DMA address. @@ -2697,8 +2713,8 @@ static int handle_tx_event(struct xhci_hcd *xhci, case COMP_SUCCESS: if (EVENT_TRB_LEN(le32_to_cpu(event->transfer_len)) != 0) { trb_comp_code = COMP_SHORT_PACKET; - xhci_dbg(xhci, "Successful completion on short TX for slot %u ep %u with last td short %d\n", - slot_id, ep_index, ep_ring->last_td_was_short); + xhci_dbg(xhci, "Successful completion on short TX for slot %u ep %u with last td comp code %d\n", + slot_id, ep_index, ep_ring->last_td_comp_code); } break; case COMP_SHORT_PACKET: @@ -2846,7 +2862,7 @@ static int handle_tx_event(struct xhci_hcd *xhci, */ if (trb_comp_code != COMP_STOPPED && trb_comp_code != COMP_STOPPED_LENGTH_INVALID && - !ep_ring->last_td_was_short) { + !xhci_spurious_success_tx_event(xhci, ep_ring)) { xhci_warn(xhci, "Event TRB for slot %u ep %u with no TDs queued\n", slot_id, ep_index); } @@ -2890,11 +2906,10 @@ static int handle_tx_event(struct xhci_hcd *xhci,
/* * Some hosts give a spurious success event after a short - * transfer. Ignore it. + * transfer or error on last TRB. Ignore it. */ - if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) && - ep_ring->last_td_was_short) { - ep_ring->last_td_was_short = false; + if (xhci_spurious_success_tx_event(xhci, ep_ring)) { + ep_ring->last_td_comp_code = trb_comp_code; return 0; }
@@ -2922,10 +2937,7 @@ static int handle_tx_event(struct xhci_hcd *xhci, */ } while (ep->skip);
- if (trb_comp_code == COMP_SHORT_PACKET) - ep_ring->last_td_was_short = true; - else - ep_ring->last_td_was_short = false; + ep_ring->last_td_comp_code = trb_comp_code;
ep_trb = &ep_seg->trbs[(ep_trb_dma - ep_seg->dma) / sizeof(*ep_trb)]; trace_xhci_handle_transfer(ep_ring, (struct xhci_generic_trb *) ep_trb, ep_trb_dma); diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h index 779b01dee068..acc8d3c7a199 100644 --- a/drivers/usb/host/xhci.h +++ b/drivers/usb/host/xhci.h @@ -1371,7 +1371,7 @@ struct xhci_ring { unsigned int num_trbs_free; /* used only by xhci DbC */ unsigned int bounce_buf_len; enum xhci_ring_type type; - bool last_td_was_short; + u32 last_td_comp_code; struct radix_tree_root *trb_address_map; };
Hi,
Mathias Nyman mathias.nyman@linux.intel.com 於 2025年2月5日 週三 下午11:16寫道:
On 5.2.2025 16.17, Mathias Nyman wrote:
On 5.2.2025 7.37, Kuangyi Chiang wrote:
Unplugging a USB3.0 webcam while streaming results in errors like this:
[ 132.646387] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646446] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8630 trb-start 000000002fdf8640 trb-end 000000002fdf8650 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0 [ 132.646560] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646568] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8660 trb-start 000000002fdf8670 trb-end 000000002fdf8670 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0
If an error is detected while processing the last TRB of an isoc TD, the Etron xHC generates two transfer events for the TRB where the error was detected. The first event can be any sort of error (like USB Transaction or Babble Detected, etc), and the final event is Success.
The xHCI driver will handle the TD after the first event and remove it from its internal list, and then print an "Transfer event TRB DMA ptr not part of current TD" error message after the final event.
Commit 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") is designed to address isoc transaction errors, but unfortunately it doesn't account for this scenario.
To work around this by reusing the logic that handles isoc transaction errors, but continuing to wait for the final event when this condition occurs. Sometimes we see the Stopped event after an error mid TD, this is a normal event for a pending TD and we can think of it as the final event we are waiting for.
Not giving back the TD when we get an event for the last TRB in the TD sounds risky. With this change we assume all old and future ETRON hosts will trigger this additional spurious success event.
I think we could handle this more like the XHCI_SPURIOUS_SUCCESS case seen with short transfers, and just silence the error message.
Are there any other issues besides the error message seen?
Would something like this work: (untested, compiles)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 965bffce301e..81d45ddebace 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -2644,6 +2644,22 @@ static int handle_transferless_tx_event(struct xhci_hcd *xhci, struct xhci_virt_ return 0; }
+static bool xhci_spurious_success_tx_event(struct xhci_hcd *xhci,
struct xhci_ring *ring)
+{
switch(ring->last_td_comp_code) {
case COMP_SHORT_PACKET:
return xhci->quirks & XHCI_SPURIOUS_SUCCESS;
case COMP_USB_TRANSACTION_ERROR:
case COMP_BABBLE_DETECTED_ERROR:
case COMP_ISOCH_BUFFER_OVERRUN:
return xhci->quirks & XHCI_ETRON_HOST &&
ring->type == TYPE_ISOC;
default:
return false;
}
+}
- /*
- If this function returns an error condition, it means it got a Transfer
- event with a corrupted Slot ID, Endpoint ID, or TRB DMA address.
@@ -2697,8 +2713,8 @@ static int handle_tx_event(struct xhci_hcd *xhci, case COMP_SUCCESS: if (EVENT_TRB_LEN(le32_to_cpu(event->transfer_len)) != 0) { trb_comp_code = COMP_SHORT_PACKET;
xhci_dbg(xhci, "Successful completion on short TX for slot %u ep %u with last td short %d\n",
slot_id, ep_index, ep_ring->last_td_was_short);
xhci_dbg(xhci, "Successful completion on short TX for slot %u ep %u with last td comp code %d\n",
slot_id, ep_index, ep_ring->last_td_comp_code); } break; case COMP_SHORT_PACKET:
@@ -2846,7 +2862,7 @@ static int handle_tx_event(struct xhci_hcd *xhci, */ if (trb_comp_code != COMP_STOPPED && trb_comp_code != COMP_STOPPED_LENGTH_INVALID &&
!ep_ring->last_td_was_short) {
!xhci_spurious_success_tx_event(xhci, ep_ring)) { xhci_warn(xhci, "Event TRB for slot %u ep %u with no TDs queued\n", slot_id, ep_index); }
@@ -2890,11 +2906,10 @@ static int handle_tx_event(struct xhci_hcd *xhci,
/* * Some hosts give a spurious success event after a short
* transfer. Ignore it.
* transfer or error on last TRB. Ignore it. */
if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) &&
ep_ring->last_td_was_short) {
ep_ring->last_td_was_short = false;
if (xhci_spurious_success_tx_event(xhci, ep_ring)) {
ep_ring->last_td_comp_code = trb_comp_code; return 0; }
@@ -2922,10 +2937,7 @@ static int handle_tx_event(struct xhci_hcd *xhci, */ } while (ep->skip);
if (trb_comp_code == COMP_SHORT_PACKET)
ep_ring->last_td_was_short = true;
else
ep_ring->last_td_was_short = false;
ep_ring->last_td_comp_code = trb_comp_code; ep_trb = &ep_seg->trbs[(ep_trb_dma - ep_seg->dma) / sizeof(*ep_trb)]; trace_xhci_handle_transfer(ep_ring, (struct xhci_generic_trb *) ep_trb, ep_trb_dma);
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h index 779b01dee068..acc8d3c7a199 100644 --- a/drivers/usb/host/xhci.h +++ b/drivers/usb/host/xhci.h @@ -1371,7 +1371,7 @@ struct xhci_ring { unsigned int num_trbs_free; /* used only by xhci DbC */ unsigned int bounce_buf_len; enum xhci_ring_type type;
bool last_td_was_short;
};u32 last_td_comp_code; struct radix_tree_root *trb_address_map;
It looks better than my patch v2, I will test it later.
Thanks, Kuangyi Chiang
Kuangyi Chiang ki.chiang65@gmail.com 於 2025年2月7日 週五 上午9:38寫道:
Hi,
Mathias Nyman mathias.nyman@linux.intel.com 於 2025年2月5日 週三 下午11:16寫道:
On 5.2.2025 16.17, Mathias Nyman wrote:
On 5.2.2025 7.37, Kuangyi Chiang wrote:
Unplugging a USB3.0 webcam while streaming results in errors like this:
[ 132.646387] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646446] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8630 trb-start 000000002fdf8640 trb-end 000000002fdf8650 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0 [ 132.646560] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646568] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8660 trb-start 000000002fdf8670 trb-end 000000002fdf8670 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0
If an error is detected while processing the last TRB of an isoc TD, the Etron xHC generates two transfer events for the TRB where the error was detected. The first event can be any sort of error (like USB Transaction or Babble Detected, etc), and the final event is Success.
The xHCI driver will handle the TD after the first event and remove it from its internal list, and then print an "Transfer event TRB DMA ptr not part of current TD" error message after the final event.
Commit 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") is designed to address isoc transaction errors, but unfortunately it doesn't account for this scenario.
To work around this by reusing the logic that handles isoc transaction errors, but continuing to wait for the final event when this condition occurs. Sometimes we see the Stopped event after an error mid TD, this is a normal event for a pending TD and we can think of it as the final event we are waiting for.
Not giving back the TD when we get an event for the last TRB in the TD sounds risky. With this change we assume all old and future ETRON hosts will trigger this additional spurious success event.
I think we could handle this more like the XHCI_SPURIOUS_SUCCESS case seen with short transfers, and just silence the error message.
Are there any other issues besides the error message seen?
Would something like this work: (untested, compiles)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 965bffce301e..81d45ddebace 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -2644,6 +2644,22 @@ static int handle_transferless_tx_event(struct xhci_hcd *xhci, struct xhci_virt_ return 0; }
+static bool xhci_spurious_success_tx_event(struct xhci_hcd *xhci,
struct xhci_ring *ring)
+{
switch(ring->last_td_comp_code) {
case COMP_SHORT_PACKET:
return xhci->quirks & XHCI_SPURIOUS_SUCCESS;
case COMP_USB_TRANSACTION_ERROR:
case COMP_BABBLE_DETECTED_ERROR:
case COMP_ISOCH_BUFFER_OVERRUN:
return xhci->quirks & XHCI_ETRON_HOST &&
ring->type == TYPE_ISOC;
default:
return false;
}
+}
- /*
- If this function returns an error condition, it means it got a Transfer
- event with a corrupted Slot ID, Endpoint ID, or TRB DMA address.
@@ -2697,8 +2713,8 @@ static int handle_tx_event(struct xhci_hcd *xhci, case COMP_SUCCESS: if (EVENT_TRB_LEN(le32_to_cpu(event->transfer_len)) != 0) { trb_comp_code = COMP_SHORT_PACKET;
xhci_dbg(xhci, "Successful completion on short TX for slot %u ep %u with last td short %d\n",
slot_id, ep_index, ep_ring->last_td_was_short);
xhci_dbg(xhci, "Successful completion on short TX for slot %u ep %u with last td comp code %d\n",
slot_id, ep_index, ep_ring->last_td_comp_code); } break; case COMP_SHORT_PACKET:
@@ -2846,7 +2862,7 @@ static int handle_tx_event(struct xhci_hcd *xhci, */ if (trb_comp_code != COMP_STOPPED && trb_comp_code != COMP_STOPPED_LENGTH_INVALID &&
!ep_ring->last_td_was_short) {
!xhci_spurious_success_tx_event(xhci, ep_ring)) { xhci_warn(xhci, "Event TRB for slot %u ep %u with no TDs queued\n", slot_id, ep_index); }
@@ -2890,11 +2906,10 @@ static int handle_tx_event(struct xhci_hcd *xhci,
/* * Some hosts give a spurious success event after a short
* transfer. Ignore it.
* transfer or error on last TRB. Ignore it. */
if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) &&
ep_ring->last_td_was_short) {
ep_ring->last_td_was_short = false;
if (xhci_spurious_success_tx_event(xhci, ep_ring)) {
ep_ring->last_td_comp_code = trb_comp_code; return 0; }
@@ -2922,10 +2937,7 @@ static int handle_tx_event(struct xhci_hcd *xhci, */ } while (ep->skip);
if (trb_comp_code == COMP_SHORT_PACKET)
ep_ring->last_td_was_short = true;
else
ep_ring->last_td_was_short = false;
ep_ring->last_td_comp_code = trb_comp_code; ep_trb = &ep_seg->trbs[(ep_trb_dma - ep_seg->dma) / sizeof(*ep_trb)]; trace_xhci_handle_transfer(ep_ring, (struct xhci_generic_trb *) ep_trb, ep_trb_dma);
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h index 779b01dee068..acc8d3c7a199 100644 --- a/drivers/usb/host/xhci.h +++ b/drivers/usb/host/xhci.h @@ -1371,7 +1371,7 @@ struct xhci_ring { unsigned int num_trbs_free; /* used only by xhci DbC */ unsigned int bounce_buf_len; enum xhci_ring_type type;
bool last_td_was_short;
};u32 last_td_comp_code; struct radix_tree_root *trb_address_map;
It looks better than my patch v2, I will test it later.
Yes, it works. I applied and tested it under Linux-6.13.1. This issue has been resolved. I have tested Etron EJ168 and EJ188, both have the same problem.
What should I do next?
Thanks, Kuangyi Chiang
Thanks, Kuangyi Chiang
Not giving back the TD when we get an event for the last TRB in the TD sounds risky. With this change we assume all old and future ETRON hosts will trigger this additional spurious success event.
error_mid_td can cope with hosts which don't produce the extra success event, it was done this way to deal with buggy NECs. The cost is one more ESIT of latency on TDs with error.
I don't know how many Etron chips Kuangyi Chiang tested, but I have one here and it definitely has this double event bug.
These are old chips, not sure if Etron is still making them or if anyone would want to buy (large chip, USB 3.0 only, barely functional streams, no UAS on Linux and on Windows with stock MS drivers).
I think we could handle this more like the XHCI_SPURIOUS_SUCCESS case seen with short transfers, and just silence the error message.
That's a little dodgy because it frees the TD before the HC is completely done with it. *Probably* no problem with data buffers (no sensible reason to DMA into them after an earlier error), but we could overwrite the transfer ring in rare cases and IDK if it would or wouldn't cause problems in this particular case.
Same applies to the "short packet" case existing today. I thought about fixing it, but IIRC I ran into some differences between HCs or out of spec behavior and it got tricky.
Maybe it would make sense to separate giveback (and freeing of the data buffer by class drivers) from transfer ring inc_deq(). Do the former when we reasonably believe the HC won't touch the buffers anymore, do the latter when we are sure that it's in the next TD.
Not ideal, but easier and better than the status quo.
Regards, Michal
On 6.2.2025 0.42, Michał Pecio wrote:
Not giving back the TD when we get an event for the last TRB in the TD sounds risky. With this change we assume all old and future ETRON hosts will trigger this additional spurious success event.
error_mid_td can cope with hosts which don't produce the extra success event, it was done this way to deal with buggy NECs. The cost is one more ESIT of latency on TDs with error.
It makes giving back the TD depend on a future event we can't guarantee.
I still think it better fits the spurious success case. It's not an error mid TD, it's a spurious success event sent by host after a completion (error) event for the last TRB in the TD.
Making this change to error_mid_td code also makes that code more confusing and harder to follow.
I think we could handle this more like the XHCI_SPURIOUS_SUCCESS case seen with short transfers, and just silence the error message.
That's a little dodgy because it frees the TD before the HC is completely done with it. *Probably* no problem with data buffers (no sensible reason to DMA into them after an earlier error), but we could overwrite the transfer ring in rare cases and IDK if it would or wouldn't cause problems in this particular case.
We did get an event for the last TRB in the TD, so in normal cases this TD should be considered complete, and given back.
I don't think the controller has any reason to touch data buffers at this stage either. Can't recall any iommu/dma issues related to this.
Same applies to the "short packet" case existing today. I thought about fixing it, but IIRC I ran into some differences between HCs or out of spec behavior and it got tricky.
For the short transfer case this is more valid concern. Here we give back the TD after an event mid TD, and we know hardware might still walk the rest of the TD. It shouldn't touch data buffers either as short transfer indicates all data has been written.
Maybe it would make sense to separate giveback (and freeing of the data buffer by class drivers) from transfer ring inc_deq(). Do the former when we reasonably believe the HC won't touch the buffers anymore, do the latter when we are sure that it's in the next TD.
This sounds reasonable, makes sense to keep the software dequeue pointer where hardware last reported its position. Currently we advance it to where we assume hardware will be next.
But this is a separate project. Might need some work around in the driver.
Thanks Mathias
On Fri, 7 Feb 2025 14:06:54 +0200, Mathias Nyman wrote:
On 6.2.2025 0.42, Michał Pecio wrote:
error_mid_td can cope with hosts which don't produce the extra success event, it was done this way to deal with buggy NECs. The cost is one more ESIT of latency on TDs with error.
It makes giving back the TD depend on a future event we can't guarantee.
For the record, this is not the same disaster as failing to give back an unlinked URB. Situation here is no different from usual 'error mid TD' case on buggy HCs (known so far: NEC uPD720200 and most if not all of ASMedia, including at least 1st gen AMD Promontory).
We are owed an event in the next ESIT, worst case it will be something weird like Missed Service or Ring Underrun. I've sent you some patches for that, they also apply to the existing NEC/ASMedia problem.
I still think it better fits the spurious success case. It's not an error mid TD, it's a spurious success event sent by host after a completion (error) event for the last TRB in the TD.
Legally you are right of course, but materially we know what happens: the damn thing still holds an internal reference to the last TRB for some unknown reason. We would need to know that it doesn't actually use it for anything and will not mind the TRB being overwritten.
This may well be true, but I guess I prefer known evils over unknown ones so I immediately suggested using 'erorr mid TD' instead.
Making this change to error_mid_td code also makes that code more confusing and harder to follow.
I will see if I can come up with something clean.
I will also try the patch you sent, it looks like it would work.
One thing I don't like is that it fails to distinguish the known spurious events from other invalid events due to hardware or kernel bugs. In the past last_td_was_short caused me similar problems.
Regards, Michal
Hi,
Thank you for the review.
Mathias Nyman mathias.nyman@linux.intel.com 於 2025年2月5日 週三 下午10:16寫道:
On 5.2.2025 7.37, Kuangyi Chiang wrote:
Unplugging a USB3.0 webcam while streaming results in errors like this:
[ 132.646387] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646446] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8630 trb-start 000000002fdf8640 trb-end 000000002fdf8650 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0 [ 132.646560] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646568] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8660 trb-start 000000002fdf8670 trb-end 000000002fdf8670 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0
If an error is detected while processing the last TRB of an isoc TD, the Etron xHC generates two transfer events for the TRB where the error was detected. The first event can be any sort of error (like USB Transaction or Babble Detected, etc), and the final event is Success.
The xHCI driver will handle the TD after the first event and remove it from its internal list, and then print an "Transfer event TRB DMA ptr not part of current TD" error message after the final event.
Commit 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") is designed to address isoc transaction errors, but unfortunately it doesn't account for this scenario.
To work around this by reusing the logic that handles isoc transaction errors, but continuing to wait for the final event when this condition occurs. Sometimes we see the Stopped event after an error mid TD, this is a normal event for a pending TD and we can think of it as the final event we are waiting for.
Not giving back the TD when we get an event for the last TRB in the TD sounds risky. With this change we assume all old and future ETRON hosts will trigger this additional spurious success event.
I think we could handle this more like the XHCI_SPURIOUS_SUCCESS case seen with short transfers, and just silence the error message.
Yes, this was my initial solution as well. See my patch v2 [1].
[1] https://lore.kernel.org/all/20241028025337.6372-6-ki.chiang65@gmail.com
Are there any other issues besides the error message seen?
There are no other issues.
Thanks Mathias
Thanks, Kuangyi Chiang
case COMP_STOPPED:
/* Think of it as the final event if TD had an error */
if (td->error_mid_td)
sum_trbs_for_length = true; break;td->error_mid_td = false;
What was the reason for this part?
As written it is going to cause problems, the driver will forget about earlier errors if the endpoint is stopped and resumed on the same TD.
I think that the whole patch could be much simpler, like:
case X_ERROR: frame->status = X; td->error_mid_td = true; break; case Y_ERROR: frame->status = Y; td->error_mid_td = true; break;
and then
if (error_mid_td && (ep_trb != td->end_trb || ETRON && SUPERSPEED)) { // error mid TD, wait for final event }
finish_td()
Regards, Michal
Hi,
Thank you for the review.
Michał Pecio michal.pecio@gmail.com 於 2025年2月6日 週四 上午5:45寫道:
case COMP_STOPPED:
/* Think of it as the final event if TD had an error */
if (td->error_mid_td)
td->error_mid_td = false; sum_trbs_for_length = true; break;
What was the reason for this part?
To prevent the driver from printing the following debug message twice:
"Error mid isoc TD, wait for final completion event"
This can happen if the driver queues a Stop Endpoint command after mid isoc TD error occurred, see my debug messages below:
[ 110.514149] xhci_hcd 0000:01:00.0: xhci_queue_isoc_tx [ 110.514164] xhci_hcd 0000:01:00.0: @000000002d119510 59600000 00000000 00009000 800b1725 [ 110.514169] xhci_hcd 0000:01:00.0: @000000002d119520 59609000 00000000 00107000 800b1515 [ 110.514173] xhci_hcd 0000:01:00.0: @000000002d119530 59610000 00000000 00002000 00000625 ... [ 110.530263] xhci_hcd 0000:01:00.0: xhci_handle_event_trb [ 110.530266] xhci_hcd 0000:01:00.0: @000000010afe6350 2d119510 00000000 04009000 01138001 [ 110.530271] xhci_hcd 0000:01:00.0: Error mid isoc TD, wait for final completion event [ 110.530373] xhci_hcd 0000:01:00.0: xhci_handle_event_trb [ 110.530378] xhci_hcd 0000:01:00.0: @000000010afe6360 2d119510 00000000 01009000 01138001 [ 110.530387] xhci_hcd 0000:01:00.0: xhci_handle_event_trb [ 110.530391] xhci_hcd 0000:01:00.0: @000000010afe6370 2d119520 00000000 04007000 01138001 [ 110.530395] xhci_hcd 0000:01:00.0: Error mid isoc TD, wait for final completion event [ 110.530430] xhci_hcd 0000:01:00.0: queue_command [ 110.530434] xhci_hcd 0000:01:00.0: @000000010afe5230 00000000 00000000 00000000 01133c01 [ 110.530470] xhci_hcd 0000:01:00.0: xhci_handle_event_trb [ 110.530473] xhci_hcd 0000:01:00.0: @000000010afe6380 2d119520 00000000 1a000000 01138001 [ 110.530478] xhci_hcd 0000:01:00.0: Error mid isoc TD, wait for final completion event [ 110.530481] xhci_hcd 0000:01:00.0: xhci_handle_event_trb [ 110.530484] xhci_hcd 0000:01:00.0: @000000010afe6390 0afe5230 00000001 01000000 01008401 ...
This may become confusing.
As written it is going to cause problems, the driver will forget about earlier errors if the endpoint is stopped and resumed on the same TD.
Yes, this can happen, I didn't account for this scenario.
I think that the whole patch could be much simpler, like:
case X_ERROR: frame->status = X; td->error_mid_td = true; break; case Y_ERROR: frame->status = Y; td->error_mid_td = true; break;
and then
if (error_mid_td && (ep_trb != td->end_trb || ETRON && SUPERSPEED)) { // error mid TD, wait for final event }
finish_td()
Yes, this is much simpler than my patch, but we still need to fix the issue with debug message being printed twice.
Maybe silence the debug message like this:
if (error_mid_td && (ep_trb != td->end_trb || ETRON && SUPERSPEED)) { if (trb_comp_code != COMP_STOPPED) xhci_dbg(xhci, "Error mid isoc TD, wait for final completion event\n"); ... }
, or do nothing. Could you help with some suggestions?
Regards, Michal
Thanks, Kuangyi Chiang
On Fri, 7 Feb 2025 14:59:25 +0800, Kuangyi Chiang wrote:
case COMP_STOPPED:
/* Think of it as the final event if TD had an error */
if (td->error_mid_td)
td->error_mid_td = false; sum_trbs_for_length = true; break;
What was the reason for this part?
To prevent the driver from printing the following debug message twice:
"Error mid isoc TD, wait for final completion event"
This can happen if the driver queues a Stop Endpoint command after mid isoc TD error occurred, see my debug messages below:
I see. Not sure if it's a big problem, dynamic debug is disabled by default and anyone using it needs to read the code anyway to understand what those messages mean. And when you read the code it becomes obvious why the message can show up twice (or even more, in fact).
I would even say that it is helpful, because it shows that control flow passes exactly as expected when the Stopped event is handled. And it's nothing new, this debug code always worked like that on all HCs.
Regards, Michal
Michał Pecio michal.pecio@gmail.com 於 2025年2月7日 週五 下午5:51寫道:
On Fri, 7 Feb 2025 14:59:25 +0800, Kuangyi Chiang wrote:
case COMP_STOPPED:
/* Think of it as the final event if TD had an error */
if (td->error_mid_td)
td->error_mid_td = false; sum_trbs_for_length = true; break;
What was the reason for this part?
To prevent the driver from printing the following debug message twice:
"Error mid isoc TD, wait for final completion event"
This can happen if the driver queues a Stop Endpoint command after mid isoc TD error occurred, see my debug messages below:
I see. Not sure if it's a big problem, dynamic debug is disabled by default and anyone using it needs to read the code anyway to understand what those messages mean. And when you read the code it becomes obvious why the message can show up twice (or even more, in fact).
I would even say that it is helpful, because it shows that control flow passes exactly as expected when the Stopped event is handled. And it's nothing new, this debug code always worked like that on all HCs.
Got it, thanks for your suggestion.
Regards, Michal
Thanks, Kuangyi Chiang
linux-stable-mirror@lists.linaro.org