From qjv001@qjv001-XeonWs Tue Oct 18 15:37:29 2022 From: qjv001 qjv001@qjv001-XeonWs To: Thinh Nguyen Thinh.Nguyen@synopsys.com Subject: Re: [PATCH v3 2/6] usb: dwc3: gadget: cancel requests instead of release after missed isoc References: 20221017205446.523796-1-w36195@motorola.com 20221017205446.523796-3-w36195@motorola.com 20221017213031.tqb575hdzli7jlbh@synopsys.com Y04K/HoUigF5FYBA@p1g3 20221018184535.3g3sm35picdeuajs@synopsys.com MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: 20221018184535.3g3sm35picdeuajs@synopsys.com X-Mutt-References: 20221018184535.3g3sm35picdeuajs@synopsys.com X-Mutt-Fcc: ~/sent Status: RO Date: Tue, 18 Oct 2022 15:37:29 -0500 Content-Length: 5434 Lines: 124
Hi Thinh,
On Tue, Oct 18, 2022 at 06:45:40PM +0000, Thinh Nguyen wrote:
Hi Dan,
On Mon, Oct 17, 2022, Dan Vacura wrote:
Hi Thinh,
On Mon, Oct 17, 2022 at 09:30:38PM +0000, Thinh Nguyen wrote:
On Mon, Oct 17, 2022, Dan Vacura wrote:
From: Jeff Vanhoof qjv001@motorola.com
arm-smmu related crashes seen after a Missed ISOC interrupt when no_interrupt=1 is used. This can happen if the hardware is still using the data associated with a TRB after the usb_request's ->complete call has been made. Instead of immediately releasing a request when a Missed ISOC interrupt has occurred, this change will add logic to cancel the request instead where it will eventually be released when the END_TRANSFER command has completed. This logic is similar to some of the cleanup done in dwc3_gadget_ep_dequeue.
This doesn't sound right. How did you determine that the hardware is still using the data associated with the TRB? Did you check the TRB's HWO bit?
The problem we're seeing was mentioned in the summary of this patch series, issue #1. Basically, with the following patch https://urldefense.com/v3/__https://patchwork.kernel.org/project/linux-usb/p... integrated a smmu panic is occurring on our Android device with the 5.15 kernel which is:
<3>[ 718.314900][ T803] arm-smmu 15000000.apps-smmu: Unhandled arm-smmu context fault from a600000.dwc3!
The uvc gadget driver appears to be the first (and only) gadget that uses the no_interrupt=1 logic, so this seems to be a new condition for the dwc3 driver. In our configuration, we have up to 64 requests and the no_interrupt=1 for up to 15 requests. The list size of dep->started_list would get up to that amount when looping through to cleanup the completed requests. From testing and debugging the smmu panic occurs when a -EXDEV status shows up and right after dwc3_gadget_ep_cleanup_completed_request() was visited. The conclusion we had was the requests were getting returned to the gadget too early.
As I mentioned, if the status is updated to missed isoc, that means that the controller returned ownership of the TRB to the driver. At least for the particular request with -EXDEV, its TRBs are completed. I'm not clear on your conclusion.
Do we know where did the crash occur? Is it from dwc3 driver or from uvc driver, and at what line? It'd great if we can see the driver log.
To interject, what should happen in dwc3_gadget_ep_reclaim_completed_trb if the IOC bit is not set (but the IMI bit is) and -EXDEV status is passed into it? If the function returns 0, another attempt to reclaim may occur. If this happens and the next request did have the HWO bit set, the function would return 1 but dwc3_gadget_ep_cleanup_completed_request would still call dwc3_gadget_giveback.
As a test (without this patch), I added a check to see if HWO bit was set in dwc3_gadget_ep_cleanup_completed_requests(). If the usecase was ISOC and the HWO bit was set I avoided calling dwc3_gadget_ep_cleanup_completed_request(). This seemed to also avoid the iommu related crash being seen.
Is there an issue in this area that needs to be corrected instead? Not having interrupts set for each request may be causing some new issues to be uncovered.
As far as the crash seen without this patch, no good stacktrace is given. Line provided for crash varied a bit, but tended to appear towards the end of dwc3_stop_active_transfer() or dwc3_gadget_endpoint_trbs_complete().
Since dwc3_gadget_endpoint_trbs_complete() can be called from multiple locations, I duplicated the function to help identify which path it was likely being called from. At the time of the crashes seen, dwc3_gadget_endpoint_transfer_in_progress() appeared to be the caller.
dwc3_gadget_endpoint_transfer_in_progress() ->dwc3_gadget_endpoint_trbs_complete() (crashed towards end of here) ->dwc3_stop_active_transfer() (sometimes crashed towards end of here)
I hope this clarifies things a bit.
The dwc3 driver would only give back the requests if the TRBs of the associated requests are completed or when the device is disconnected. If the TRB indicated missed isoc, that means that the TRB is completed and its status was updated.
Interesting, the device is not disconnected as we don't get the -ESHUTDOWN status back and with this patch in place things continue after a -EXDEV status is received.
Actually, minor correction here: a recent change b44c0e7fef51 ("usb: dwc3: gadget: conditionally remove requests") changed -ESHUTDOWN request status to -ECONNRESET when disable endpoint. This doesn't look right.
While disabling endpoint may also apply for other cases such as switching alternate interface in addition to disconnect, -ESHUTDOWN seems more fitting there.
btw, we don't have "usb: dwc3: gadget: conditionally remove requests" in our baseline
Hi Michael,
Can you help clarify for the change above? This changed the usage of requests. Now requests returned by disconnection won't be returned as -ESHUTDOWN.
There's a special case which dwc3 may give back requests early is the case of the device disconnecting. The requests should be returned with -ESHUTDOWN, and the gadget driver shouldn't be re-using the requests on de-initialization anyway.
We should not issue End Transfer command just because of missed isoc. We may want issue End Transfer if the gadget driver is too slow and unable to feed requests in time (causing underrun and missed isoc) to resync with the host, but we already handle that.
Hmm, isn't that what happens when we get into this condition in dwc3_gadget_endpoint_trbs_complete():
if (usb_endpoint_xfer_isoc(dep->endpoint.desc) && list_empty(&dep->started_list) && (list_empty(&dep->pending_list) || status == -EXDEV)) dwc3_stop_active_transfer(dep, true, true);
Yes, it's being handled there.
I'm still not clear what's the problem you're seeing. Do you have the crash log? Tracepoints?
Appreciate the support!
Thanks, Thinh
Thanks, Jeff
On Tue, Oct 18, 2022, Jeffrey Vanhoof wrote:
Hi Thinh,
On Tue, Oct 18, 2022 at 06:45:40PM +0000, Thinh Nguyen wrote:
Hi Dan,
On Mon, Oct 17, 2022, Dan Vacura wrote:
Hi Thinh,
On Mon, Oct 17, 2022 at 09:30:38PM +0000, Thinh Nguyen wrote:
On Mon, Oct 17, 2022, Dan Vacura wrote:
From: Jeff Vanhoof qjv001@motorola.com
arm-smmu related crashes seen after a Missed ISOC interrupt when no_interrupt=1 is used. This can happen if the hardware is still using the data associated with a TRB after the usb_request's ->complete call has been made. Instead of immediately releasing a request when a Missed ISOC interrupt has occurred, this change will add logic to cancel the request instead where it will eventually be released when the END_TRANSFER command has completed. This logic is similar to some of the cleanup done in dwc3_gadget_ep_dequeue.
This doesn't sound right. How did you determine that the hardware is still using the data associated with the TRB? Did you check the TRB's HWO bit?
The problem we're seeing was mentioned in the summary of this patch series, issue #1. Basically, with the following patch https://urldefense.com/v3/__https://patchwork.kernel.org/project/linux-usb/p... integrated a smmu panic is occurring on our Android device with the 5.15 kernel which is:
<3>[ 718.314900][ T803] arm-smmu 15000000.apps-smmu: Unhandled arm-smmu context fault from a600000.dwc3!
The uvc gadget driver appears to be the first (and only) gadget that uses the no_interrupt=1 logic, so this seems to be a new condition for the dwc3 driver. In our configuration, we have up to 64 requests and the no_interrupt=1 for up to 15 requests. The list size of dep->started_list would get up to that amount when looping through to cleanup the completed requests. From testing and debugging the smmu panic occurs when a -EXDEV status shows up and right after dwc3_gadget_ep_cleanup_completed_request() was visited. The conclusion we had was the requests were getting returned to the gadget too early.
As I mentioned, if the status is updated to missed isoc, that means that the controller returned ownership of the TRB to the driver. At least for the particular request with -EXDEV, its TRBs are completed. I'm not clear on your conclusion.
Do we know where did the crash occur? Is it from dwc3 driver or from uvc driver, and at what line? It'd great if we can see the driver log.
To interject, what should happen in dwc3_gadget_ep_reclaim_completed_trb if the IOC bit is not set (but the IMI bit is) and -EXDEV status is passed into it?
Hm... we may have overlooked this case for no_interrupt scenario. If IMI is set, then there will be an interrupt when there's missed isoc regardless of whether no_interrupt is set by the gadget driver.
If the function returns 0, another attempt to reclaim may occur. If this happens and the next request did have the HWO bit set, the function would return 1 but dwc3_gadget_ep_cleanup_completed_request would still call dwc3_gadget_giveback.
As a test (without this patch), I added a check to see if HWO bit was set in dwc3_gadget_ep_cleanup_completed_requests(). If the usecase was ISOC and the HWO bit was set I avoided calling dwc3_gadget_ep_cleanup_completed_request(). This seemed to also avoid the iommu related crash being seen.
Is there an issue in this area that needs to be corrected instead? Not having interrupts set for each request may be causing some new issues to be uncovered.
As far as the crash seen without this patch, no good stacktrace is given. Line provided for crash varied a bit, but tended to appear towards the end of dwc3_stop_active_transfer() or dwc3_gadget_endpoint_trbs_complete().
Since dwc3_gadget_endpoint_trbs_complete() can be called from multiple locations, I duplicated the function to help identify which path it was likely being called from. At the time of the crashes seen, dwc3_gadget_endpoint_transfer_in_progress() appeared to be the caller.
dwc3_gadget_endpoint_transfer_in_progress() ->dwc3_gadget_endpoint_trbs_complete() (crashed towards end of here) ->dwc3_stop_active_transfer() (sometimes crashed towards end of here)
I hope this clarifies things a bit.
Can we try this? Let me know if it resolves your issue.
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..8352f4b5dd9f 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
+ if (usb_endpoint_xfer_isoc(dep->endpoint.desc) && + (event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain) + return 1; + if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
Thanks, Thinh
Hi Thinh,
On Tue, Oct 18, 2022 at 10:35:30PM +0000, Thinh Nguyen wrote:
On Tue, Oct 18, 2022, Jeffrey Vanhoof wrote:
Hi Thinh,
On Tue, Oct 18, 2022 at 06:45:40PM +0000, Thinh Nguyen wrote:
Hi Dan,
On Mon, Oct 17, 2022, Dan Vacura wrote:
Hi Thinh,
On Mon, Oct 17, 2022 at 09:30:38PM +0000, Thinh Nguyen wrote:
On Mon, Oct 17, 2022, Dan Vacura wrote:
From: Jeff Vanhoof qjv001@motorola.com
arm-smmu related crashes seen after a Missed ISOC interrupt when no_interrupt=1 is used. This can happen if the hardware is still using the data associated with a TRB after the usb_request's ->complete call has been made. Instead of immediately releasing a request when a Missed ISOC interrupt has occurred, this change will add logic to cancel the request instead where it will eventually be released when the END_TRANSFER command has completed. This logic is similar to some of the cleanup done in dwc3_gadget_ep_dequeue.
This doesn't sound right. How did you determine that the hardware is still using the data associated with the TRB? Did you check the TRB's HWO bit?
The problem we're seeing was mentioned in the summary of this patch series, issue #1. Basically, with the following patch https://urldefense.com/v3/__https://patchwork.kernel.org/project/linux-usb/p... integrated a smmu panic is occurring on our Android device with the 5.15 kernel which is:
<3>[ 718.314900][ T803] arm-smmu 15000000.apps-smmu: Unhandled arm-smmu context fault from a600000.dwc3!
The uvc gadget driver appears to be the first (and only) gadget that uses the no_interrupt=1 logic, so this seems to be a new condition for the dwc3 driver. In our configuration, we have up to 64 requests and the no_interrupt=1 for up to 15 requests. The list size of dep->started_list would get up to that amount when looping through to cleanup the completed requests. From testing and debugging the smmu panic occurs when a -EXDEV status shows up and right after dwc3_gadget_ep_cleanup_completed_request() was visited. The conclusion we had was the requests were getting returned to the gadget too early.
As I mentioned, if the status is updated to missed isoc, that means that the controller returned ownership of the TRB to the driver. At least for the particular request with -EXDEV, its TRBs are completed. I'm not clear on your conclusion.
Do we know where did the crash occur? Is it from dwc3 driver or from uvc driver, and at what line? It'd great if we can see the driver log.
To interject, what should happen in dwc3_gadget_ep_reclaim_completed_trb if the IOC bit is not set (but the IMI bit is) and -EXDEV status is passed into it?
Hm... we may have overlooked this case for no_interrupt scenario. If IMI is set, then there will be an interrupt when there's missed isoc regardless of whether no_interrupt is set by the gadget driver.
If the function returns 0, another attempt to reclaim may occur. If this happens and the next request did have the HWO bit set, the function would return 1 but dwc3_gadget_ep_cleanup_completed_request would still call dwc3_gadget_giveback.
As a test (without this patch), I added a check to see if HWO bit was set in dwc3_gadget_ep_cleanup_completed_requests(). If the usecase was ISOC and the HWO bit was set I avoided calling dwc3_gadget_ep_cleanup_completed_request(). This seemed to also avoid the iommu related crash being seen.
Is there an issue in this area that needs to be corrected instead? Not having interrupts set for each request may be causing some new issues to be uncovered.
As far as the crash seen without this patch, no good stacktrace is given. Line provided for crash varied a bit, but tended to appear towards the end of dwc3_stop_active_transfer() or dwc3_gadget_endpoint_trbs_complete().
Since dwc3_gadget_endpoint_trbs_complete() can be called from multiple locations, I duplicated the function to help identify which path it was likely being called from. At the time of the crashes seen, dwc3_gadget_endpoint_transfer_in_progress() appeared to be the caller.
dwc3_gadget_endpoint_transfer_in_progress() ->dwc3_gadget_endpoint_trbs_complete() (crashed towards end of here) ->dwc3_stop_active_transfer() (sometimes crashed towards end of here)
I hope this clarifies things a bit.
Can we try this? Let me know if it resolves your issue.
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..8352f4b5dd9f 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
With this change it doesn't seem to crash but unfortunately the output completely hangs after the first missed isoc. At the moment I do not understand why this might happen.
Thanks, Thinh
Note that I haven't quite learned correctly how to reply correct to the mailing list. I appologize for messing up the thread a bit.
Regards, Jeff
On Tue, Oct 18, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Tue, Oct 18, 2022 at 10:35:30PM +0000, Thinh Nguyen wrote:
On Tue, Oct 18, 2022, Jeffrey Vanhoof wrote:
Hi Thinh,
On Tue, Oct 18, 2022 at 06:45:40PM +0000, Thinh Nguyen wrote:
Hi Dan,
On Mon, Oct 17, 2022, Dan Vacura wrote:
Hi Thinh,
On Mon, Oct 17, 2022 at 09:30:38PM +0000, Thinh Nguyen wrote:
On Mon, Oct 17, 2022, Dan Vacura wrote: > From: Jeff Vanhoof qjv001@motorola.com > > arm-smmu related crashes seen after a Missed ISOC interrupt when > no_interrupt=1 is used. This can happen if the hardware is still using > the data associated with a TRB after the usb_request's ->complete call > has been made. Instead of immediately releasing a request when a Missed > ISOC interrupt has occurred, this change will add logic to cancel the > request instead where it will eventually be released when the > END_TRANSFER command has completed. This logic is similar to some of the > cleanup done in dwc3_gadget_ep_dequeue.
This doesn't sound right. How did you determine that the hardware is still using the data associated with the TRB? Did you check the TRB's HWO bit?
The problem we're seeing was mentioned in the summary of this patch series, issue #1. Basically, with the following patch https://urldefense.com/v3/__https://patchwork.kernel.org/project/linux-usb/p... integrated a smmu panic is occurring on our Android device with the 5.15 kernel which is:
<3>[ 718.314900][ T803] arm-smmu 15000000.apps-smmu: Unhandled arm-smmu context fault from a600000.dwc3!
The uvc gadget driver appears to be the first (and only) gadget that uses the no_interrupt=1 logic, so this seems to be a new condition for the dwc3 driver. In our configuration, we have up to 64 requests and the no_interrupt=1 for up to 15 requests. The list size of dep->started_list would get up to that amount when looping through to cleanup the completed requests. From testing and debugging the smmu panic occurs when a -EXDEV status shows up and right after dwc3_gadget_ep_cleanup_completed_request() was visited. The conclusion we had was the requests were getting returned to the gadget too early.
As I mentioned, if the status is updated to missed isoc, that means that the controller returned ownership of the TRB to the driver. At least for the particular request with -EXDEV, its TRBs are completed. I'm not clear on your conclusion.
Do we know where did the crash occur? Is it from dwc3 driver or from uvc driver, and at what line? It'd great if we can see the driver log.
To interject, what should happen in dwc3_gadget_ep_reclaim_completed_trb if the IOC bit is not set (but the IMI bit is) and -EXDEV status is passed into it?
Hm... we may have overlooked this case for no_interrupt scenario. If IMI is set, then there will be an interrupt when there's missed isoc regardless of whether no_interrupt is set by the gadget driver.
If the function returns 0, another attempt to reclaim may occur. If this happens and the next request did have the HWO bit set, the function would return 1 but dwc3_gadget_ep_cleanup_completed_request would still call dwc3_gadget_giveback.
As a test (without this patch), I added a check to see if HWO bit was set in dwc3_gadget_ep_cleanup_completed_requests(). If the usecase was ISOC and the HWO bit was set I avoided calling dwc3_gadget_ep_cleanup_completed_request(). This seemed to also avoid the iommu related crash being seen.
Is there an issue in this area that needs to be corrected instead? Not having interrupts set for each request may be causing some new issues to be uncovered.
As far as the crash seen without this patch, no good stacktrace is given. Line provided for crash varied a bit, but tended to appear towards the end of dwc3_stop_active_transfer() or dwc3_gadget_endpoint_trbs_complete().
Since dwc3_gadget_endpoint_trbs_complete() can be called from multiple locations, I duplicated the function to help identify which path it was likely being called from. At the time of the crashes seen, dwc3_gadget_endpoint_transfer_in_progress() appeared to be the caller.
dwc3_gadget_endpoint_transfer_in_progress() ->dwc3_gadget_endpoint_trbs_complete() (crashed towards end of here) ->dwc3_stop_active_transfer() (sometimes crashed towards end of here)
I hope this clarifies things a bit.
Can we try this? Let me know if it resolves your issue.
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..8352f4b5dd9f 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
With this change it doesn't seem to crash but unfortunately the output completely hangs after the first missed isoc. At the moment I do not understand why this might happen.
Can you capture the driver tracepoints with the change above?
Note that I haven't quite learned correctly how to reply correct to the mailing list. I appologize for messing up the thread a bit.
Seems fine to me. As long as I can read and understand, I've no issue. :)
Thanks, Thinh
Hi Thinh,
On Wed, Oct 19, 2022 at 02:02:53AM +0000, Thinh Nguyen wrote:
On Tue, Oct 18, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Tue, Oct 18, 2022 at 10:35:30PM +0000, Thinh Nguyen wrote:
On Tue, Oct 18, 2022, Jeffrey Vanhoof wrote:
Hi Thinh,
On Tue, Oct 18, 2022 at 06:45:40PM +0000, Thinh Nguyen wrote:
Hi Dan,
On Mon, Oct 17, 2022, Dan Vacura wrote:
Hi Thinh,
On Mon, Oct 17, 2022 at 09:30:38PM +0000, Thinh Nguyen wrote: > On Mon, Oct 17, 2022, Dan Vacura wrote: > > From: Jeff Vanhoof qjv001@motorola.com > > > > arm-smmu related crashes seen after a Missed ISOC interrupt when > > no_interrupt=1 is used. This can happen if the hardware is still using > > the data associated with a TRB after the usb_request's ->complete call > > has been made. Instead of immediately releasing a request when a Missed > > ISOC interrupt has occurred, this change will add logic to cancel the > > request instead where it will eventually be released when the > > END_TRANSFER command has completed. This logic is similar to some of the > > cleanup done in dwc3_gadget_ep_dequeue. > > This doesn't sound right. How did you determine that the hardware is > still using the data associated with the TRB? Did you check the TRB's > HWO bit?
The problem we're seeing was mentioned in the summary of this patch series, issue #1. Basically, with the following patch https://urldefense.com/v3/__https://patchwork.kernel.org/project/linux-usb/p... integrated a smmu panic is occurring on our Android device with the 5.15 kernel which is:
<3>[ 718.314900][ T803] arm-smmu 15000000.apps-smmu: Unhandled arm-smmu context fault from a600000.dwc3!
The uvc gadget driver appears to be the first (and only) gadget that uses the no_interrupt=1 logic, so this seems to be a new condition for the dwc3 driver. In our configuration, we have up to 64 requests and the no_interrupt=1 for up to 15 requests. The list size of dep->started_list would get up to that amount when looping through to cleanup the completed requests. From testing and debugging the smmu panic occurs when a -EXDEV status shows up and right after dwc3_gadget_ep_cleanup_completed_request() was visited. The conclusion we had was the requests were getting returned to the gadget too early.
As I mentioned, if the status is updated to missed isoc, that means that the controller returned ownership of the TRB to the driver. At least for the particular request with -EXDEV, its TRBs are completed. I'm not clear on your conclusion.
Do we know where did the crash occur? Is it from dwc3 driver or from uvc driver, and at what line? It'd great if we can see the driver log.
To interject, what should happen in dwc3_gadget_ep_reclaim_completed_trb if the IOC bit is not set (but the IMI bit is) and -EXDEV status is passed into it?
Hm... we may have overlooked this case for no_interrupt scenario. If IMI is set, then there will be an interrupt when there's missed isoc regardless of whether no_interrupt is set by the gadget driver.
If the function returns 0, another attempt to reclaim may occur. If this happens and the next request did have the HWO bit set, the function would return 1 but dwc3_gadget_ep_cleanup_completed_request would still call dwc3_gadget_giveback.
As a test (without this patch), I added a check to see if HWO bit was set in dwc3_gadget_ep_cleanup_completed_requests(). If the usecase was ISOC and the HWO bit was set I avoided calling dwc3_gadget_ep_cleanup_completed_request(). This seemed to also avoid the iommu related crash being seen.
Is there an issue in this area that needs to be corrected instead? Not having interrupts set for each request may be causing some new issues to be uncovered.
As far as the crash seen without this patch, no good stacktrace is given. Line provided for crash varied a bit, but tended to appear towards the end of dwc3_stop_active_transfer() or dwc3_gadget_endpoint_trbs_complete().
Since dwc3_gadget_endpoint_trbs_complete() can be called from multiple locations, I duplicated the function to help identify which path it was likely being called from. At the time of the crashes seen, dwc3_gadget_endpoint_transfer_in_progress() appeared to be the caller.
dwc3_gadget_endpoint_transfer_in_progress() ->dwc3_gadget_endpoint_trbs_complete() (crashed towards end of here) ->dwc3_stop_active_transfer() (sometimes crashed towards end of here)
I hope this clarifies things a bit.
Can we try this? Let me know if it resolves your issue.
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..8352f4b5dd9f 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
With this change it doesn't seem to crash but unfortunately the output completely hangs after the first missed isoc. At the moment I do not understand why this might happen.
Can you capture the driver tracepoints with the change above?
Due to the size of the log, have sent it to directly to you.
From what I can gather from the log, with the current changes it seems that after a missed isoc event few requests are staying longer than expected in the started_list (not getting reclaimed) and this is preventing the transmission from stopping/starting again, and opening the door for continuous stream of missed isoc events that cause what appears to the user as a frozen video.
So one thought, if IOC bit is not set every frame, but IMI bit is, when a missed isoc related interrupt occurs it seems likely that more than one trb request will need to be reclaimed, but the current set of changes is not handling this.
In the good transfer case this issue seems to be taken care of since the IOC bit is not set every frame and the reclaimation will loop through every item in the started_list and only stop if there are no additional trbs or if one has its HWO bit set. Although I am a bit surprised that we are not yet seeing iommu related panics here too since the trb is being reclaimed and given back even the HWO bit still set, but maybe I am misunderstanding something. In my earlier testing, if a missed isoc event was received and we attempted to reclaim/giveback a trb with its HWO bit set, a iommu related panic would be seen.
Can you propose an additional change to handle freeing up the extra trbs in the missed isoc case? I think that somehow the HWO bit should be checked before entering dwc3_gadget_ep_cleanup_completed_request in order to avoid the trb being given back too soon.
Your thoughts?
Note that I haven't quite learned correctly how to reply correct to the mailing list. I appologize for messing up the thread a bit.
Seems fine to me. As long as I can read and understand, I've no issue. :)
Thanks, Thinh
Thanks, Jeff
On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Wed, Oct 19, 2022 at 02:02:53AM +0000, Thinh Nguyen wrote:
On Tue, Oct 18, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Tue, Oct 18, 2022 at 10:35:30PM +0000, Thinh Nguyen wrote:
On Tue, Oct 18, 2022, Jeffrey Vanhoof wrote:
Hi Thinh,
On Tue, Oct 18, 2022 at 06:45:40PM +0000, Thinh Nguyen wrote:
Hi Dan,
On Mon, Oct 17, 2022, Dan Vacura wrote: > Hi Thinh, > > On Mon, Oct 17, 2022 at 09:30:38PM +0000, Thinh Nguyen wrote: > > On Mon, Oct 17, 2022, Dan Vacura wrote: > > > From: Jeff Vanhoof qjv001@motorola.com > > > > > > arm-smmu related crashes seen after a Missed ISOC interrupt when > > > no_interrupt=1 is used. This can happen if the hardware is still using > > > the data associated with a TRB after the usb_request's ->complete call > > > has been made. Instead of immediately releasing a request when a Missed > > > ISOC interrupt has occurred, this change will add logic to cancel the > > > request instead where it will eventually be released when the > > > END_TRANSFER command has completed. This logic is similar to some of the > > > cleanup done in dwc3_gadget_ep_dequeue. > > > > This doesn't sound right. How did you determine that the hardware is > > still using the data associated with the TRB? Did you check the TRB's > > HWO bit? > > The problem we're seeing was mentioned in the summary of this patch > series, issue #1. Basically, with the following patch > https://urldefense.com/v3/__https://patchwork.kernel.org/project/linux-usb/p... > integrated a smmu panic is occurring on our Android device with the 5.15 > kernel which is: > > <3>[ 718.314900][ T803] arm-smmu 15000000.apps-smmu: Unhandled arm-smmu context fault from a600000.dwc3! > > The uvc gadget driver appears to be the first (and only) gadget that > uses the no_interrupt=1 logic, so this seems to be a new condition for > the dwc3 driver. In our configuration, we have up to 64 requests and the > no_interrupt=1 for up to 15 requests. The list size of dep->started_list > would get up to that amount when looping through to cleanup the > completed requests. From testing and debugging the smmu panic occurs > when a -EXDEV status shows up and right after > dwc3_gadget_ep_cleanup_completed_request() was visited. The conclusion > we had was the requests were getting returned to the gadget too early.
As I mentioned, if the status is updated to missed isoc, that means that the controller returned ownership of the TRB to the driver. At least for the particular request with -EXDEV, its TRBs are completed. I'm not clear on your conclusion.
Do we know where did the crash occur? Is it from dwc3 driver or from uvc driver, and at what line? It'd great if we can see the driver log.
To interject, what should happen in dwc3_gadget_ep_reclaim_completed_trb if the IOC bit is not set (but the IMI bit is) and -EXDEV status is passed into it?
Hm... we may have overlooked this case for no_interrupt scenario. If IMI is set, then there will be an interrupt when there's missed isoc regardless of whether no_interrupt is set by the gadget driver.
If the function returns 0, another attempt to reclaim may occur. If this happens and the next request did have the HWO bit set, the function would return 1 but dwc3_gadget_ep_cleanup_completed_request would still call dwc3_gadget_giveback.
As a test (without this patch), I added a check to see if HWO bit was set in dwc3_gadget_ep_cleanup_completed_requests(). If the usecase was ISOC and the HWO bit was set I avoided calling dwc3_gadget_ep_cleanup_completed_request(). This seemed to also avoid the iommu related crash being seen.
Is there an issue in this area that needs to be corrected instead? Not having interrupts set for each request may be causing some new issues to be uncovered.
As far as the crash seen without this patch, no good stacktrace is given. Line provided for crash varied a bit, but tended to appear towards the end of dwc3_stop_active_transfer() or dwc3_gadget_endpoint_trbs_complete().
Since dwc3_gadget_endpoint_trbs_complete() can be called from multiple locations, I duplicated the function to help identify which path it was likely being called from. At the time of the crashes seen, dwc3_gadget_endpoint_transfer_in_progress() appeared to be the caller.
dwc3_gadget_endpoint_transfer_in_progress() ->dwc3_gadget_endpoint_trbs_complete() (crashed towards end of here) ->dwc3_stop_active_transfer() (sometimes crashed towards end of here)
I hope this clarifies things a bit.
Can we try this? Let me know if it resolves your issue.
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..8352f4b5dd9f 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
With this change it doesn't seem to crash but unfortunately the output completely hangs after the first missed isoc. At the moment I do not understand why this might happen.
Can you capture the driver tracepoints with the change above?
Due to the size of the log, have sent it to directly to you.
Just got it. Thanks.
From what I can gather from the log, with the current changes it seems that after a missed isoc event few requests are staying longer than expected in the started_list (not getting reclaimed) and this is preventing the transmission from stopping/starting again, and opening the door for continuous stream of missed isoc events that cause what appears to the user as a frozen video.
So one thought, if IOC bit is not set every frame, but IMI bit is, when a missed isoc related interrupt occurs it seems likely that more than one trb request will need to be reclaimed, but the current set of changes is not handling this.
In the good transfer case this issue seems to be taken care of since the IOC bit is not set every frame and the reclaimation will loop through every item in the started_list and only stop if there are no additional trbs or if one has
It should stop at the request that associated with the interrupt event, whether it's because of IMI or IOC.
its HWO bit set. Although I am a bit surprised that we are not yet seeing iommu related panics here too since the trb is being reclaimed and given back even the HWO bit still set, but maybe I am misunderstanding something. In my earlier testing, if a missed isoc event was received and we attempted to reclaim/giveback a trb with its HWO bit set, a iommu related panic would be seen.
If the controller processed the TRB, it would clear the HWO bit after completion. There are cases which the HWO bit is still set for some TRBs, but the request is completed (e.g. short transfers causing the controller to stop processing further). That those cases, the driver would clear the HWO bit manually.
Can you propose an additional change to handle freeing up the extra trbs in the missed isoc case? I think that somehow the HWO bit should be checked before entering dwc3_gadget_ep_cleanup_completed_request in order to avoid the trb being given back too soon.
We should not check for TRBs of requests beyond the request associated with the interrupt event.
Your thoughts?
The capture shows underrun, and it causes missed isoc.
Let's take a look at the first missed isoc request (req ffffff88f834db00)
<...>-22551 [002] d..2 13985.789684: dwc3_ep_queue: ep2in: req ffffff88f834db00 length 0/49152 zsi ==> -115 <...>-22551 [002] dn.2 13985.789728: dwc3_prepare_trb: ep2in: trb ffffffc016071970 (E152:D149) buf 00000000ea800000 size 1x 49152 ctrl 00000461 (Hlcs:Sc:isoc-first)
Each request uses a one TRB. From above, you can see that there are only 3 intervals prepared (E152 - D149 = 3).
The timestamp of the last request completion is 13985.788919. The timestamp which the queued request is 13985.789728. We can roughly estimate the diff is at least 809us.
3 intervals (3x125us) is less than 809us. So missed isoc is expected:
irq/399-dwc3-15269 [002] d..1 13985.790754: dwc3_event: event (f9acc08a): ep2in: Transfer In Progress [63916] (sIM) irq/399-dwc3-15269 [002] d..1 13985.790758: dwc3_complete_trb: ep2in: trb ffffffc016071970 (E154:D152) buf 00000000ea800000 size 1x 49152 ctrl 3e6a0460 (hlcs:Sc:isoc-first) irq/399-dwc3-15269 [002] d..1 13985.790808: dwc3_gadget_giveback: ep2in: req ffffff88f834db00 length 0/49152 zsi ==> -18
After this point, the uvc driver keeps playing catch up to stay in sync with the host. It tries, but it couldn't catch up. Also, occasionally something seems to be blocking dwc3, creating time gaps. Maybe because of a spin_lock held somewhere?
The logic to detect underrun doesn't trigger because the queued list is always non-empty, but the queued requests are expected to be missed already. So you keep seeing missed isoc.
There are a few things you can mitigate this issue: 1) Don't set IMI if the request indicates no_interrupt. This reduces the time software needs to handle interrupts. 2) Improve the underrun detection logic. 3) Increase the queuing frequency from the uvc to keep the request queue full. Note that reduce/avoid setting no_interrupt will allow the controller driver to update uvc often to keep requeuing new requests.
Best option is 3), but maybe we can do all 3.
For 2), you can set IMI for all isoc request as it is now. On missed isoc, check for the TRB's scheduled uframe (from TRB info) and compare it to the current uframe (from DSTS) for the number of intervals in between. With the number of queued requests, you can calculate whether the gadget driver queued enough requests. If it doesn't, send End Transfer command and cancel all the queued requests. The next set of requests will be in-sync again.
BR, Thinh
PS. On a side note, I notice that there are reports of issue when using SG right? Please note that dwc3 driver only allocates 256 TRBs (including a link TRB). Each SG entry takes a TRB. If a request has many SG entries, that eats up the available TRBs. So, even though the UVC may queue many requests, not all of them are prepared immediately. If the TRB ring is full, the driver need to wait for more TRBs to free up before preparing more. From the log, I see that it's sending 48KB. Let's say the uvc splits it into PAGE_SIZE of 4KB, that would take at least 12 TRB per request. (Side thought: I'm not sure why UVC needs SG in the first place with its current implementation)
Hi Thinh, On Wed, Oct 19, 2022 at 07:08:27PM +0000, Thinh Nguyen wrote:
On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Wed, Oct 19, 2022 at 02:02:53AM +0000, Thinh Nguyen wrote:
On Tue, Oct 18, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Tue, Oct 18, 2022 at 10:35:30PM +0000, Thinh Nguyen wrote:
On Tue, Oct 18, 2022, Jeffrey Vanhoof wrote:
Hi Thinh,
On Tue, Oct 18, 2022 at 06:45:40PM +0000, Thinh Nguyen wrote: > Hi Dan, > > On Mon, Oct 17, 2022, Dan Vacura wrote: > > Hi Thinh, > > > > On Mon, Oct 17, 2022 at 09:30:38PM +0000, Thinh Nguyen wrote: > > > On Mon, Oct 17, 2022, Dan Vacura wrote: > > > > From: Jeff Vanhoof qjv001@motorola.com > > > > > > > > arm-smmu related crashes seen after a Missed ISOC interrupt when > > > > no_interrupt=1 is used. This can happen if the hardware is still using > > > > the data associated with a TRB after the usb_request's ->complete call > > > > has been made. Instead of immediately releasing a request when a Missed > > > > ISOC interrupt has occurred, this change will add logic to cancel the > > > > request instead where it will eventually be released when the > > > > END_TRANSFER command has completed. This logic is similar to some of the > > > > cleanup done in dwc3_gadget_ep_dequeue. > > > > > > This doesn't sound right. How did you determine that the hardware is > > > still using the data associated with the TRB? Did you check the TRB's > > > HWO bit? > > > > The problem we're seeing was mentioned in the summary of this patch > > series, issue #1. Basically, with the following patch > > https://urldefense.com/v3/__https://patchwork.kernel.org/project/linux-usb/p... > > integrated a smmu panic is occurring on our Android device with the 5.15 > > kernel which is: > > > > <3>[ 718.314900][ T803] arm-smmu 15000000.apps-smmu: Unhandled arm-smmu context fault from a600000.dwc3! > > > > The uvc gadget driver appears to be the first (and only) gadget that > > uses the no_interrupt=1 logic, so this seems to be a new condition for > > the dwc3 driver. In our configuration, we have up to 64 requests and the > > no_interrupt=1 for up to 15 requests. The list size of dep->started_list > > would get up to that amount when looping through to cleanup the > > completed requests. From testing and debugging the smmu panic occurs > > when a -EXDEV status shows up and right after > > dwc3_gadget_ep_cleanup_completed_request() was visited. The conclusion > > we had was the requests were getting returned to the gadget too early. > > As I mentioned, if the status is updated to missed isoc, that means that > the controller returned ownership of the TRB to the driver. At least for > the particular request with -EXDEV, its TRBs are completed. I'm not > clear on your conclusion. > > Do we know where did the crash occur? Is it from dwc3 driver or from uvc > driver, and at what line? It'd great if we can see the driver log. >
To interject, what should happen in dwc3_gadget_ep_reclaim_completed_trb if the IOC bit is not set (but the IMI bit is) and -EXDEV status is passed into it?
Hm... we may have overlooked this case for no_interrupt scenario. If IMI is set, then there will be an interrupt when there's missed isoc regardless of whether no_interrupt is set by the gadget driver.
If the function returns 0, another attempt to reclaim may occur. If this happens and the next request did have the HWO bit set, the function would return 1 but dwc3_gadget_ep_cleanup_completed_request would still call dwc3_gadget_giveback.
As a test (without this patch), I added a check to see if HWO bit was set in dwc3_gadget_ep_cleanup_completed_requests(). If the usecase was ISOC and the HWO bit was set I avoided calling dwc3_gadget_ep_cleanup_completed_request(). This seemed to also avoid the iommu related crash being seen.
Is there an issue in this area that needs to be corrected instead? Not having interrupts set for each request may be causing some new issues to be uncovered.
As far as the crash seen without this patch, no good stacktrace is given. Line provided for crash varied a bit, but tended to appear towards the end of dwc3_stop_active_transfer() or dwc3_gadget_endpoint_trbs_complete().
Since dwc3_gadget_endpoint_trbs_complete() can be called from multiple locations, I duplicated the function to help identify which path it was likely being called from. At the time of the crashes seen, dwc3_gadget_endpoint_transfer_in_progress() appeared to be the caller.
dwc3_gadget_endpoint_transfer_in_progress() ->dwc3_gadget_endpoint_trbs_complete() (crashed towards end of here) ->dwc3_stop_active_transfer() (sometimes crashed towards end of here)
I hope this clarifies things a bit.
Can we try this? Let me know if it resolves your issue.
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..8352f4b5dd9f 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
With this change it doesn't seem to crash but unfortunately the output completely hangs after the first missed isoc. At the moment I do not understand why this might happen.
Can you capture the driver tracepoints with the change above?
Due to the size of the log, have sent it to directly to you.
Just got it. Thanks.
From what I can gather from the log, with the current changes it seems that after a missed isoc event few requests are staying longer than expected in the started_list (not getting reclaimed) and this is preventing the transmission from stopping/starting again, and opening the door for continuous stream of missed isoc events that cause what appears to the user as a frozen video.
So one thought, if IOC bit is not set every frame, but IMI bit is, when a missed isoc related interrupt occurs it seems likely that more than one trb request will need to be reclaimed, but the current set of changes is not handling this.
In the good transfer case this issue seems to be taken care of since the IOC bit is not set every frame and the reclaimation will loop through every item in the started_list and only stop if there are no additional trbs or if one has
It should stop at the request that associated with the interrupt event, whether it's because of IMI or IOC.
In this case I was concerned that if multipled queued reqs did not have IOC bit set, but there was a missed isoc on one of the last reqs, whether or not we would reclaim all of the requests up to the missed isoc related req. I'm not sure if my concern is valid or not.
its HWO bit set. Although I am a bit surprised that we are not yet seeing iommu related panics here too since the trb is being reclaimed and given back even the HWO bit still set, but maybe I am misunderstanding something. In my earlier testing, if a missed isoc event was received and we attempted to reclaim/giveback a trb with its HWO bit set, a iommu related panic would be seen.
If the controller processed the TRB, it would clear the HWO bit after completion. There are cases which the HWO bit is still set for some TRBs, but the request is completed (e.g. short transfers causing the controller to stop processing further). That those cases, the driver would clear the HWO bit manually.
Can you propose an additional change to handle freeing up the extra trbs in the missed isoc case? I think that somehow the HWO bit should be checked before entering dwc3_gadget_ep_cleanup_completed_request in order to avoid the trb being given back too soon.
We should not check for TRBs of requests beyond the request associated with the interrupt event.
Your thoughts?
The capture shows underrun, and it causes missed isoc.
Let's take a look at the first missed isoc request (req ffffff88f834db00)
<...>-22551 [002] d..2 13985.789684: dwc3_ep_queue: ep2in: req ffffff88f834db00 length 0/49152 zsi ==> -115 <...>-22551 [002] dn.2 13985.789728: dwc3_prepare_trb: ep2in: trb ffffffc016071970 (E152:D149) buf 00000000ea800000 size 1x 49152 ctrl 00000461 (Hlcs:Sc:isoc-first)
Each request uses a one TRB. From above, you can see that there are only 3 intervals prepared (E152 - D149 = 3).
The timestamp of the last request completion is 13985.788919. The timestamp which the queued request is 13985.789728. We can roughly estimate the diff is at least 809us.
3 intervals (3x125us) is less than 809us. So missed isoc is expected:
irq/399-dwc3-15269 [002] d..1 13985.790754: dwc3_event: event (f9acc08a): ep2in: Transfer In Progress [63916] (sIM) irq/399-dwc3-15269 [002] d..1 13985.790758: dwc3_complete_trb: ep2in: trb ffffffc016071970 (E154:D152) buf 00000000ea800000 size 1x 49152 ctrl 3e6a0460 (hlcs:Sc:isoc-first) irq/399-dwc3-15269 [002] d..1 13985.790808: dwc3_gadget_giveback: ep2in: req ffffff88f834db00 length 0/49152 zsi ==> -18
After this point, the uvc driver keeps playing catch up to stay in sync with the host. It tries, but it couldn't catch up. Also, occasionally something seems to be blocking dwc3, creating time gaps. Maybe because of a spin_lock held somewhere?
Could the time gaps be created by the interrupt frequency changes? They completely change up the timing of when the transfers are kicked in dwc3 and when uvc_video_complete/uvcg_video_pump is called.
The logic to detect underrun doesn't trigger because the queued list is always non-empty, but the queued requests are expected to be missed already. So you keep seeing missed isoc.
There are a few things you can mitigate this issue:
- Don't set IMI if the request indicates no_interrupt. This reduces the time software needs to handle interrupts.
I did try this out earlier and it did not prevent the video freeze. It does make sense what you are suggesting, but because it didn't work for me it made me think that not all reqs are being reclaimed after a missed isoc is seen. I'll revisit this area again.
- Improve the underrun detection logic.
I like this idea a lot but I'm not up to the task just yet. Will attempt to follow your recommendations below and see where I get with this.
- Increase the queuing frequency from the uvc to keep the request queue full. Note that reduce/avoid setting no_interrupt will allow the controller driver to update uvc often to keep requeuing new requests.
Best option is 3), but maybe we can do all 3.
I think that this is our best option too. Dan provided a patch to make the interrupt skip logic configurable in the uvc driver: https://patchwork.kernel.org/project/linux-usb/patch/20221018215044.765044-6... https://lore.kernel.org/linux-usb/20221018215044.765044-6-w36195@motorola.co...
For 2), you can set IMI for all isoc request as it is now. On missed isoc, check for the TRB's scheduled uframe (from TRB info) and compare it to the current uframe (from DSTS) for the number of intervals in between. With the number of queued requests, you can calculate whether the gadget driver queued enough requests. If it doesn't, send End Transfer command and cancel all the queued requests. The next set of requests will be in-sync again.
BR, Thinh
PS. On a side note, I notice that there are reports of issue when using SG right? Please note that dwc3 driver only allocates 256 TRBs (including a link TRB). Each SG entry takes a TRB. If a request has many SG entries, that eats up the available TRBs. So, even though the UVC may queue many requests, not all of them are prepared immediately. If the TRB ring is full, the driver need to wait for more TRBs to free up before preparing more. From the log, I see that it's sending 48KB. Let's say the uvc splits it into PAGE_SIZE of 4KB, that would take at least 12 TRB per request. (Side thought: I'm not sure why UVC needs SG in the first place with its current implementation)
On our platform I am seeing 2 items in the sg list being sent out from the uvc driver. The 1st item in the list is a 2 byte uvc header and the 2nd item is the 48KBs of data. To me this seems inefficient but sort of makes sense why it was done, likely to avoid a memory copy just for the 2 byte header. But I share your concern here, it's possible that other users wont be so lucky and will wind up having a lot of page sized items in the sg list.
We are also hoping to make the use of sg configurable with the change: https://patchwork.kernel.org/project/linux-usb/patch/20221018215044.765044-7... https://lore.kernel.org/linux-usb/20221018215044.765044-7-w36195@motorola.co...
Thanks, Jeff
Hi,
On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
Hi Thinh, On Wed, Oct 19, 2022 at 07:08:27PM +0000, Thinh Nguyen wrote:
On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
<snip>
From what I can gather from the log, with the current changes it seems that after a missed isoc event few requests are staying longer than expected in the started_list (not getting reclaimed) and this is preventing the transmission from stopping/starting again, and opening the door for continuous stream of missed isoc events that cause what appears to the user as a frozen video.
So one thought, if IOC bit is not set every frame, but IMI bit is, when a missed isoc related interrupt occurs it seems likely that more than one trb request will need to be reclaimed, but the current set of changes is not handling this.
In the good transfer case this issue seems to be taken care of since the IOC bit is not set every frame and the reclaimation will loop through every item in the started_list and only stop if there are no additional trbs or if one has
It should stop at the request that associated with the interrupt event, whether it's because of IMI or IOC.
In this case I was concerned that if multipled queued reqs did not have IOC bit set, but there was a missed isoc on one of the last reqs, whether or not we would reclaim all of the requests up to the missed isoc related req. I'm not sure if my concern is valid or not.
There should be no problem. If there's an interrupt event indicating a TRB completion, the driver will give back all the requests up to the request associated with the interrupt event, and the controller will continue processing the remaining TRBs. On the next TRB completion event, the driver will again give back all the requests up to the request associated with that event.
its HWO bit set. Although I am a bit surprised that we are not yet seeing iommu related panics here too since the trb is being reclaimed and given back even the HWO bit still set, but maybe I am misunderstanding something. In my earlier testing, if a missed isoc event was received and we attempted to reclaim/giveback a trb with its HWO bit set, a iommu related panic would be seen.
If the controller processed the TRB, it would clear the HWO bit after completion. There are cases which the HWO bit is still set for some TRBs, but the request is completed (e.g. short transfers causing the controller to stop processing further). That those cases, the driver would clear the HWO bit manually.
Can you propose an additional change to handle freeing up the extra trbs in the missed isoc case? I think that somehow the HWO bit should be checked before entering dwc3_gadget_ep_cleanup_completed_request in order to avoid the trb being given back too soon.
We should not check for TRBs of requests beyond the request associated with the interrupt event.
Your thoughts?
The capture shows underrun, and it causes missed isoc.
Let's take a look at the first missed isoc request (req ffffff88f834db00)
<...>-22551 [002] d..2 13985.789684: dwc3_ep_queue: ep2in: req ffffff88f834db00 length 0/49152 zsi ==> -115 <...>-22551 [002] dn.2 13985.789728: dwc3_prepare_trb: ep2in: trb ffffffc016071970 (E152:D149) buf 00000000ea800000 size 1x 49152 ctrl 00000461 (Hlcs:Sc:isoc-first)
Each request uses a one TRB. From above, you can see that there are only 3 intervals prepared (E152 - D149 = 3).
The timestamp of the last request completion is 13985.788919. The timestamp which the queued request is 13985.789728. We can roughly estimate the diff is at least 809us.
3 intervals (3x125us) is less than 809us. So missed isoc is expected:
irq/399-dwc3-15269 [002] d..1 13985.790754: dwc3_event: event (f9acc08a): ep2in: Transfer In Progress [63916] (sIM) irq/399-dwc3-15269 [002] d..1 13985.790758: dwc3_complete_trb: ep2in: trb ffffffc016071970 (E154:D152) buf 00000000ea800000 size 1x 49152 ctrl 3e6a0460 (hlcs:Sc:isoc-first) irq/399-dwc3-15269 [002] d..1 13985.790808: dwc3_gadget_giveback: ep2in: req ffffff88f834db00 length 0/49152 zsi ==> -18
After this point, the uvc driver keeps playing catch up to stay in sync with the host. It tries, but it couldn't catch up. Also, occasionally something seems to be blocking dwc3, creating time gaps. Maybe because of a spin_lock held somewhere?
Could the time gaps be created by the interrupt frequency changes? They completely change up the timing of when the transfers are kicked in dwc3 and when uvc_video_complete/uvcg_video_pump is called.
You can check all the "fastrpc_msg" tracepoints in the log. These steps seem to slow down the queuing process in uvc.
The logic to detect underrun doesn't trigger because the queued list is always non-empty, but the queued requests are expected to be missed already. So you keep seeing missed isoc.
There are a few things you can mitigate this issue:
- Don't set IMI if the request indicates no_interrupt. This reduces the time software needs to handle interrupts.
I did try this out earlier and it did not prevent the video freeze. It does make sense what you are suggesting, but because it didn't work for me it made me think that not all reqs are being reclaimed after a missed isoc is seen. I'll revisit this area again.
I don't expect this to improve much.
- Improve the underrun detection logic.
I like this idea a lot but I'm not up to the task just yet. Will attempt to follow your recommendations below and see where I get with this.
- Increase the queuing frequency from the uvc to keep the request queue full. Note that reduce/avoid setting no_interrupt will allow the controller driver to update uvc often to keep requeuing new requests.
Best option is 3), but maybe we can do all 3.
I think that this is our best option too. Dan provided a patch to make the interrupt skip logic configurable in the uvc driver: https://urldefense.com/v3/__https://patchwork.kernel.org/project/linux-usb/p... https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20221018215044...
For 2), you can set IMI for all isoc request as it is now. On missed isoc, check for the TRB's scheduled uframe (from TRB info) and compare it to the current uframe (from DSTS) for the number of intervals in between. With the number of queued requests, you can calculate whether the gadget driver queued enough requests. If it doesn't, send End Transfer command and cancel all the queued requests. The next set of requests will be in-sync again.
BR, Thinh
PS. On a side note, I notice that there are reports of issue when using SG right? Please note that dwc3 driver only allocates 256 TRBs (including a link TRB). Each SG entry takes a TRB. If a request has many SG entries, that eats up the available TRBs. So, even though the UVC may queue many requests, not all of them are prepared immediately. If the TRB ring is full, the driver need to wait for more TRBs to free up before preparing more. From the log, I see that it's sending 48KB. Let's say the uvc splits it into PAGE_SIZE of 4KB, that would take at least 12 TRB per request. (Side thought: I'm not sure why UVC needs SG in the first place with its current implementation)
On our platform I am seeing 2 items in the sg list being sent out from the uvc driver. The 1st item in the list is a 2 byte uvc header and the 2nd item is the 48KBs of data. To me this seems inefficient but sort of makes sense why it was done, likely to avoid a memory copy just for the 2 byte header. But I
That doesn't make sense to me, but I'm sure there's a real reason. Just curious, that's all. :)
share your concern here, it's possible that other users wont be so lucky and will wind up having a lot of page sized items in the sg list.
We are also hoping to make the use of sg configurable with the change: https://urldefense.com/v3/__https://patchwork.kernel.org/project/linux-usb/p... https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20221018215044...
Ok.
Thanks, Thinh
Hi Thinh,
On Wed, Oct 19, 2022 at 11:06:08PM +0000, Thinh Nguyen wrote:
Hi,
On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
Hi Thinh, On Wed, Oct 19, 2022 at 07:08:27PM +0000, Thinh Nguyen wrote:
On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
<snip>
From what I can gather from the log, with the current changes it seems that after a missed isoc event few requests are staying longer than expected in the started_list (not getting reclaimed) and this is preventing the transmission from stopping/starting again, and opening the door for continuous stream of missed isoc events that cause what appears to the user as a frozen video.
So one thought, if IOC bit is not set every frame, but IMI bit is, when a missed isoc related interrupt occurs it seems likely that more than one trb request will need to be reclaimed, but the current set of changes is not handling this.
In the good transfer case this issue seems to be taken care of since the IOC bit is not set every frame and the reclaimation will loop through every item in the started_list and only stop if there are no additional trbs or if one has
It should stop at the request that associated with the interrupt event, whether it's because of IMI or IOC.
In this case I was concerned that if multipled queued reqs did not have IOC bit set, but there was a missed isoc on one of the last reqs, whether or not we would reclaim all of the requests up to the missed isoc related req. I'm not sure if my concern is valid or not.
There should be no problem. If there's an interrupt event indicating a TRB completion, the driver will give back all the requests up to the request associated with the interrupt event, and the controller will continue processing the remaining TRBs. On the next TRB completion event, the driver will again give back all the requests up to the request associated with that event.
I was testing with the following patch you suggested:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..8352f4b5dd9f 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
At this time the IMI bit was set for every frame. With these changes it appeared in case of missed isoc that sometimes not all requests would be reclaimed (enqueued != dequeued even 100ms after the last interrupt was handled). If the 1st req in the started_list was fine (IMI set, but not IOC), and a later req was the one actually missed, because of this status check the reclaimation could stop early and not clean up to the appropriate req. As suggested yesterday, I also tried only setting the IMI bit when no_interrupt is not set, however I was still seeing the complete freezes. After analyzing this issue a bit, I have updated the diff to look more like this:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index dfaf9ac24c4f..bb800a81815b 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -1230,8 +1230,9 @@ static void __dwc3_prepare_one_trb(struct dwc3_ep *dep, struct dwc3_trb *trb, trb->ctrl = DWC3_TRBCTL_ISOCHRONOUS; }
- /* always enable Interrupt on Missed ISOC */ - trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI; + /* enable Interrupt on Missed ISOC */ + if ((!no_interrupt && !chain) || must_interrupt) + trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI; break;
case USB_ENDPOINT_XFER_BULK: @@ -3195,6 +3196,11 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
+ if (usb_endpoint_xfer_isoc(dep->endpoint.desc) && + (event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain + && (trb->ctrl & DWC3_TRB_CTRL_ISP_IMI)) + return 1; + if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
Where the trb must have the IMI set before returning early. This seemed to make the freezes recoverable.
its HWO bit set. Although I am a bit surprised that we are not yet seeing iommu related panics here too since the trb is being reclaimed and given back even the HWO bit still set, but maybe I am misunderstanding something. In my earlier testing, if a missed isoc event was received and we attempted to reclaim/giveback a trb with its HWO bit set, a iommu related panic would be seen.
If the controller processed the TRB, it would clear the HWO bit after completion. There are cases which the HWO bit is still set for some TRBs, but the request is completed (e.g. short transfers causing the controller to stop processing further). That those cases, the driver would clear the HWO bit manually.
Can you propose an additional change to handle freeing up the extra trbs in the missed isoc case? I think that somehow the HWO bit should be checked before entering dwc3_gadget_ep_cleanup_completed_request in order to avoid the trb being given back too soon.
We should not check for TRBs of requests beyond the request associated with the interrupt event.
Your thoughts?
The capture shows underrun, and it causes missed isoc.
Let's take a look at the first missed isoc request (req ffffff88f834db00)
<...>-22551 [002] d..2 13985.789684: dwc3_ep_queue: ep2in: req ffffff88f834db00 length 0/49152 zsi ==> -115 <...>-22551 [002] dn.2 13985.789728: dwc3_prepare_trb: ep2in: trb ffffffc016071970 (E152:D149) buf 00000000ea800000 size 1x 49152 ctrl 00000461 (Hlcs:Sc:isoc-first)
Each request uses a one TRB. From above, you can see that there are only 3 intervals prepared (E152 - D149 = 3).
The timestamp of the last request completion is 13985.788919. The timestamp which the queued request is 13985.789728. We can roughly estimate the diff is at least 809us.
3 intervals (3x125us) is less than 809us. So missed isoc is expected:
irq/399-dwc3-15269 [002] d..1 13985.790754: dwc3_event: event (f9acc08a): ep2in: Transfer In Progress [63916] (sIM) irq/399-dwc3-15269 [002] d..1 13985.790758: dwc3_complete_trb: ep2in: trb ffffffc016071970 (E154:D152) buf 00000000ea800000 size 1x 49152 ctrl 3e6a0460 (hlcs:Sc:isoc-first) irq/399-dwc3-15269 [002] d..1 13985.790808: dwc3_gadget_giveback: ep2in: req ffffff88f834db00 length 0/49152 zsi ==> -18
After this point, the uvc driver keeps playing catch up to stay in sync with the host. It tries, but it couldn't catch up. Also, occasionally something seems to be blocking dwc3, creating time gaps. Maybe because of a spin_lock held somewhere?
Could the time gaps be created by the interrupt frequency changes? They completely change up the timing of when the transfers are kicked in dwc3 and when uvc_video_complete/uvcg_video_pump is called.
You can check all the "fastrpc_msg" tracepoints in the log. These steps seem to slow down the queuing process in uvc.
The large time gaps appear related to the low fps used for the camera configuration. I was able to confirm this by adding additional trace points in the uvc driver to see if pump was being called after new work was queued up by v4l2.
Some additional thoughts:
On the scheduling side, one issue I've seen: 1) uvc/dwc3 may take a few millisecs to queue up the requests for a single frame. How quickly this is completed could make for some of the differences seen from one platform to another.
2) If the isoc transfer has not yet started and uvc begins to queue up requests to dwc3, it appears that the isoc transfer will almost immediately start. And if the remaining requests for the frame do not come quickly enough, there is a potential for the transfer to stop or a missed isoc event to occur.
Maybe this happens frequently on my platform because the debug kernel I am using has a bit too much overhead. Will allowing a bit more time for initially queuing up some data before kicking off the transfer be prudent? Is 500us really enough for a mobile platform? We've always have had issues with random frame drops, but on much older platforms I was able to reduce the frequency of them by delaying the start by pushing out the frame_number by 16 or 32.
Also, one problem I am seeing specifically with the skip interrupts: 1) If a missed isoc event occurs, dwc3_gadget_giveback will call the request's ->complete callback where the status -EXDEV is passed. In uvc_video_complete it will begin to cancel the queue if this status is seen. If this occurs while in the middle of queuing up the requests for the same frame, then it's quite possible that the last set of requests will have had no_interrupt=1 set.
2) In dwc3 after the missed isoc event, if the remaining set of requests in the started_list do not have the IOC bit set, then no further interrupts will occur. And because the started_list is not emptied the transfer will not be stopped. Only after uvc queues up the requests for the next frame (which may come much later, i.e. 30-100ms) will the transfers appear to continue, however additional missed isoc events will be received and the cycle could continue.
I believe that this is a problem that should be addressed somehow. Not sure if the uvc driver needs to send a dummy request when an error is received, or if it should just send one more additional request with the no_interrupt bit not set, or if it needs to just send the complete frame in spite of the error. Hopefully I'm not just completely lost on this stuff.
Using the above patch, I grabbed the following trace snippets to show what is being seen:
Normal looking missed isoc related trace where enqueued == dequeued and transfer ends as expected: kworker/u17:5-8169 [000] d..2 187.669572: dwc3_gadget_ep_cmd: ep2in: cmd 'Start Transfer' [49540406] params 00000000 efff5f30 00000000 --> status: Successful irq/398-dwc3-9882 [000] d..1 187.672504: dwc3_complete_trb: ep2in: trb ffffffc016396f30 (E250:D244) buf 00000000ed470000 size 1x 0 ctrl 12550c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d..1 187.672804: dwc3_complete_trb: ep2in: trb ffffffc016396f40 (E250:D245) buf 00000000ec5e0000 size 1x 0 ctrl 12558060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d..1 187.672885: dwc3_complete_trb: ep2in: trb ffffffc016396f50 (E250:D246) buf 00000000ecd30000 size 1x 0 ctrl 12560060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d..1 187.672942: dwc3_complete_trb: ep2in: trb ffffffc016396f60 (E250:D247) buf 00000000ecd40000 size 1x 0 ctrl 12568060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d..1 187.672993: dwc3_complete_trb: ep2in: trb ffffffc016396f70 (E250:D248) buf 00000000edfb0000 size 1x 0 ctrl 12570060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d..1 187.673049: dwc3_complete_trb: ep2in: trb ffffffc016396f80 (E250:D249) buf 00000000edfa0000 size 1x 0 ctrl 12578c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d..1 187.674615: dwc3_complete_trb: ep2in: trb ffffffc016396f90 (E254:D250) buf 00000000eca80000 size 1x 0 ctrl 12580060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d..1 187.674722: dwc3_complete_trb: ep2in: trb ffffffc016396fa0 (E254:D251) buf 00000000edfa0000 size 1x 49152 ctrl 12588060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 187.674793: usb_gadget_giveback_request: ep2in: req ffffff894498a600 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 187.675008: dwc3_complete_trb: ep2in: trb ffffffc016396fb0 (E254:D252) buf 00000000edfb0000 size 1x 49152 ctrl 12590060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 187.675093: usb_gadget_giveback_request: ep2in: req ffffff892bcd3500 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 187.675120: dwc3_complete_trb: ep2in: trb ffffffc016396fc0 (E254:D253) buf 00000000ecd40000 size 1x 49152 ctrl 12598060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 187.675207: usb_gadget_giveback_request: ep2in: req ffffff8918f48200 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 187.675225: dwc3_complete_trb: ep2in: trb ffffffc016396fd0 (E254:D254) buf 00000000ecd30000 size 1x 49152 ctrl 125a0c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d... 187.675274: usb_gadget_giveback_request: ep2in: req ffffff892bcd3600 length 0/49152 sgs 0/0 stream 0 zsI status -18 --> 0 irq/398-dwc3-9882 [000] d..1 187.675364: dwc3_gadget_ep_cmd: ep2in: cmd 'End Transfer' [50d08] params 00000000 00000000 00000000 --> status: Successful
Abnormal trace where the 'End Transfer' is delayed by a few frames: kworker/u17:5-8169 [000] dn.2 190.804817: dwc3_gadget_ep_cmd: ep2in: cmd 'Start Transfer' [ab500406] params 00000000 efff5290 00000000 --> status: Successful irq/398-dwc3-9882 [000] d..1 190.809070: dwc3_complete_trb: ep2in: trb ffffffc016396290 (E51:D42) buf 00000000ef6b0000 size 1x 0 ctrl 2ad40c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d..1 190.809478: dwc3_complete_trb: ep2in: trb ffffffc0163962a0 (E53:D43) buf 00000000ef060000 size 1x 49152 ctrl 2ad48060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.809534: usb_gadget_giveback_request: ep2in: req ffffff8918f48200 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.809752: dwc3_complete_trb: ep2in: trb ffffffc0163962b0 (E54:D44) buf 00000000ec5a0000 size 1x 49152 ctrl 2ad50060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.809878: usb_gadget_giveback_request: ep2in: req ffffff8859af4e00 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.809966: dwc3_complete_trb: ep2in: trb ffffffc0163962c0 (E54:D45) buf 00000000ef180000 size 1x 49152 ctrl 2ad58060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.810037: usb_gadget_giveback_request: ep2in: req ffffff8918f48000 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.810058: dwc3_complete_trb: ep2in: trb ffffffc0163962d0 (E54:D46) buf 00000000ef070000 size 1x 49152 ctrl 2ad60060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.810139: usb_gadget_giveback_request: ep2in: req ffffff894498a500 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.810159: dwc3_complete_trb: ep2in: trb ffffffc0163962e0 (E54:D47) buf 00000000eeae0000 size 1x 49152 ctrl 2ad68c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d... 190.810238: usb_gadget_giveback_request: ep2in: req ffffff88bddb9f00 length 0/49152 sgs 0/0 stream 0 zsI status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.810293: dwc3_complete_trb: ep2in: trb ffffffc0163962f0 (E54:D48) buf 00000000ec0d0000 size 1x 49152 ctrl 2ad70060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.810368: usb_gadget_giveback_request: ep2in: req ffffff892bcd3500 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.810389: dwc3_complete_trb: ep2in: trb ffffffc016396300 (E54:D49) buf 00000000ec0e0000 size 1x 0 ctrl 2ad78060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d..1 190.810454: dwc3_complete_trb: ep2in: trb ffffffc016396310 (E54:D50) buf 00000000ec0f0000 size 1x 0 ctrl 2ad80060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d..1 190.810526: dwc3_complete_trb: ep2in: trb ffffffc016396320 (E54:D51) buf 00000000efa20000 size 1x 49152 ctrl 2ad88060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.810591: usb_gadget_giveback_request: ep2in: req ffffff8859af5800 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.810609: dwc3_complete_trb: ep2in: trb ffffffc016396330 (E54:D52) buf 00000000efa30000 size 1x 49152 ctrl 2ad90c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d... 190.810676: usb_gadget_giveback_request: ep2in: req ffffff88bddb8300 length 0/49152 sgs 0/0 stream 0 zsI status -18 --> 0 ... irq/398-dwc3-9882 [000] d..1 190.919720: dwc3_complete_trb: ep2in: trb ffffffc016396340 (E56:D53) buf 00000000efa40000 size 1x 49152 ctrl 2ad98060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.919804: usb_gadget_giveback_request: ep2in: req ffffff894498a900 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.920065: dwc3_complete_trb: ep2in: trb ffffffc016396350 (E57:D54) buf 00000000efa50000 size 1x 49152 ctrl 2ada0060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.920116: usb_gadget_giveback_request: ep2in: req ffffff88bddb8b00 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.920175: dwc3_complete_trb: ep2in: trb ffffffc016396360 (E57:D55) buf 00000000efa60000 size 1x 49152 ctrl 2ada8c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d... 190.920223: usb_gadget_giveback_request: ep2in: req ffffff892bcd3600 length 0/49152 sgs 0/0 stream 0 zsI status -18 --> 0 ... irq/398-dwc3-9882 [000] d..1 190.992902: dwc3_complete_trb: ep2in: trb ffffffc016396370 (E59:D56) buf 00000000efa90000 size 1x 49152 ctrl 2adb0060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.993014: usb_gadget_giveback_request: ep2in: req ffffff88a0b9b400 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.993365: dwc3_complete_trb: ep2in: trb ffffffc016396380 (E60:D57) buf 00000000efaa0000 size 1x 49152 ctrl 2adb8060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.993439: usb_gadget_giveback_request: ep2in: req ffffff8859af4f00 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.993528: dwc3_complete_trb: ep2in: trb ffffffc016396390 (E60:D58) buf 00000000efab0000 size 1x 49152 ctrl 2adc0c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d... 190.993622: usb_gadget_giveback_request: ep2in: req ffffff8859af4e00 length 0/49152 sgs 0/0 stream 0 zsI status -18 --> 0 ... irq/398-dwc3-9882 [000] d..1 191.104270: dwc3_complete_trb: ep2in: trb ffffffc0163963a0 (E61:D59) buf 00000000ee950000 size 1x 49152 ctrl 2adc8060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 191.104410: usb_gadget_giveback_request: ep2in: req ffffff8918f48000 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 191.104648: dwc3_complete_trb: ep2in: trb ffffffc0163963b0 (E61:D60) buf 00000000ed530000 size 1x 49152 ctrl 2add0060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 191.104745: usb_gadget_giveback_request: ep2in: req ffffff894498a500 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 191.104834: dwc3_complete_trb: ep2in: trb ffffffc0163963c0 (E61:D61) buf 00000000ef610000 size 1x 49152 ctrl 2add8c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d... 191.104925: usb_gadget_giveback_request: ep2in: req ffffff8859af5800 length 0/49152 sgs 0/0 stream 0 zsI status -18 --> 0 irq/398-dwc3-9882 [000] d..1 191.105098: dwc3_gadget_ep_cmd: ep2in: cmd 'End Transfer' [50d08] params 00000000 00000000 00000000 --> status: Successful
Where you can see that in between frames enqueued != dequeued, and this can cause the video to freeze/skip for more than a single frame. The length of the freeze is all timing related and depends on if an interrupt occurs on the last req or not in the started_list.
So, to sort of rehash things again, we came up with the patch: [PATCH v3 2/6] usb: dwc3: gadget: cancel requests instead of release after missed isoc to workaround the iommu related crashes being seen with the skip interrupt solution.
As an alternative solution, we are now using the diff mentioned above, and it seems to avoid the crash, however we are seeing longer freezes with it, I believe due to the explaination I provided above.
If you agree with my analysis, what should we do from here? What's the best way to ensure that the started_list can get cleaned up after a missed isoc with the skip interrupt solution enabled?
The logic to detect underrun doesn't trigger because the queued list is always non-empty, but the queued requests are expected to be missed already. So you keep seeing missed isoc.
There are a few things you can mitigate this issue:
- Don't set IMI if the request indicates no_interrupt. This reduces the time software needs to handle interrupts.
I did try this out earlier and it did not prevent the video freeze. It does make sense what you are suggesting, but because it didn't work for me it made me think that not all reqs are being reclaimed after a missed isoc is seen. I'll revisit this area again.
I don't expect this to improve much.
- Improve the underrun detection logic.
I like this idea a lot but I'm not up to the task just yet. Will attempt to follow your recommendations below and see where I get with this.
- Increase the queuing frequency from the uvc to keep the request queue full. Note that reduce/avoid setting no_interrupt will allow the controller driver to update uvc often to keep requeuing new requests.
Best option is 3), but maybe we can do all 3.
I think that this is our best option too. Dan provided a patch to make the interrupt skip logic configurable in the uvc driver: https://urldefense.com/v3/__https://patchwork.kernel.org/project/linux-usb/p... https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20221018215044...
For 2), you can set IMI for all isoc request as it is now. On missed isoc, check for the TRB's scheduled uframe (from TRB info) and compare it to the current uframe (from DSTS) for the number of intervals in between. With the number of queued requests, you can calculate whether the gadget driver queued enough requests. If it doesn't, send End Transfer command and cancel all the queued requests. The next set of requests will be in-sync again.
BR, Thinh
PS. On a side note, I notice that there are reports of issue when using SG right? Please note that dwc3 driver only allocates 256 TRBs (including a link TRB). Each SG entry takes a TRB. If a request has many SG entries, that eats up the available TRBs. So, even though the UVC may queue many requests, not all of them are prepared immediately. If the TRB ring is full, the driver need to wait for more TRBs to free up before preparing more. From the log, I see that it's sending 48KB. Let's say the uvc splits it into PAGE_SIZE of 4KB, that would take at least 12 TRB per request. (Side thought: I'm not sure why UVC needs SG in the first place with its current implementation)
On our platform I am seeing 2 items in the sg list being sent out from the uvc driver. The 1st item in the list is a 2 byte uvc header and the 2nd item is the 48KBs of data. To me this seems inefficient but sort of makes sense why it was done, likely to avoid a memory copy just for the 2 byte header. But I
That doesn't make sense to me, but I'm sure there's a real reason. Just curious, that's all. :)
share your concern here, it's possible that other users wont be so lucky and will wind up having a lot of page sized items in the sg list.
We are also hoping to make the use of sg configurable with the change: https://urldefense.com/v3/__https://patchwork.kernel.org/project/linux-usb/p... https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20221018215044...
Ok.
Thanks, Thinh
Thanks, Jeff
Hi Thinh,
On Wed, Oct 19, 2022 at 11:06:08PM +0000, Thinh Nguyen wrote:
Hi,
<snip>
The logic to detect underrun doesn't trigger because the queued list is always non-empty, but the queued requests are expected to be missed already. So you keep seeing missed isoc.
There are a few things you can mitigate this issue:
- Don't set IMI if the request indicates no_interrupt. This reduces the time software needs to handle interrupts.
- Improve the underrun detection logic.
- Increase the queuing frequency from the uvc to keep the request queue full. Note that reduce/avoid setting no_interrupt will allow the controller driver to update uvc often to keep requeuing new requests.
Best option is 3), but maybe we can do all 3.
I forgot about your option 2. Will start looking into it.
Thanks, Jeff
On Thu, Oct 20, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Wed, Oct 19, 2022 at 11:06:08PM +0000, Thinh Nguyen wrote:
Hi,
On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
Hi Thinh, On Wed, Oct 19, 2022 at 07:08:27PM +0000, Thinh Nguyen wrote:
On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
<snip>
From what I can gather from the log, with the current changes it seems that after a missed isoc event few requests are staying longer than expected in the started_list (not getting reclaimed) and this is preventing the transmission from stopping/starting again, and opening the door for continuous stream of missed isoc events that cause what appears to the user as a frozen video.
So one thought, if IOC bit is not set every frame, but IMI bit is, when a missed isoc related interrupt occurs it seems likely that more than one trb request will need to be reclaimed, but the current set of changes is not handling this.
In the good transfer case this issue seems to be taken care of since the IOC bit is not set every frame and the reclaimation will loop through every item in the started_list and only stop if there are no additional trbs or if one has
It should stop at the request that associated with the interrupt event, whether it's because of IMI or IOC.
In this case I was concerned that if multipled queued reqs did not have IOC bit set, but there was a missed isoc on one of the last reqs, whether or not we would reclaim all of the requests up to the missed isoc related req. I'm not sure if my concern is valid or not.
There should be no problem. If there's an interrupt event indicating a TRB completion, the driver will give back all the requests up to the request associated with the interrupt event, and the controller will continue processing the remaining TRBs. On the next TRB completion event, the driver will again give back all the requests up to the request associated with that event.
I was testing with the following patch you suggested:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..8352f4b5dd9f 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
At this time the IMI bit was set for every frame. With these changes it appeared in case of missed isoc that sometimes not all requests would be reclaimed (enqueued != dequeued even 100ms after the last interrupt was handled). If the 1st req in the started_list was fine (IMI set, but not IOC), and a later req was the one actually missed, because of this status check the reclaimation could stop early and not clean up to the appropriate req. As
Oops. You're right.
suggested yesterday, I also tried only setting the IMI bit when no_interrupt is not set, however I was still seeing the complete freezes. After analyzing this issue a bit, I have updated the diff to look more like this:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index dfaf9ac24c4f..bb800a81815b 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -1230,8 +1230,9 @@ static void __dwc3_prepare_one_trb(struct dwc3_ep *dep, struct dwc3_trb *trb, trb->ctrl = DWC3_TRBCTL_ISOCHRONOUS; }
/* always enable Interrupt on Missed ISOC */
trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI;
/* enable Interrupt on Missed ISOC */
if ((!no_interrupt && !chain) || must_interrupt)
break;trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI;
Either all or none of the TRBs of a request is set with IMI, and not some.
case USB_ENDPOINT_XFER_BULK: @@ -3195,6 +3196,11 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain
&& (trb->ctrl & DWC3_TRB_CTRL_ISP_IMI))
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
Where the trb must have the IMI set before returning early. This seemed to make the freezes recoverable.
Can you try this revised change:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..a69d8c28d86b 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3654,7 +3654,7 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if ((trb->ctrl & DWC3_TRB_CTRL_HWO) && status != -ESHUTDOWN) return 1;
- if (event->status & DEPEVT_STATUS_SHORT && !chain) + if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain) return 1;
if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || @@ -3673,6 +3673,7 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, struct scatterlist *s; unsigned int num_queued = req->num_queued_sgs; unsigned int i; + bool missed_isoc = false; int ret = 0;
for_each_sg(sg, s, num_queued, i) { @@ -3681,12 +3682,18 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, req->sg = sg_next(s); req->num_queued_sgs--;
+ if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC) + missed_isoc = true; + ret = dwc3_gadget_ep_reclaim_completed_trb(dep, req, trb, event, status, true); if (ret) break; }
+ if (missed_isoc) + ret = 1; + return ret; }
its HWO bit set. Although I am a bit surprised that we are not yet seeing iommu related panics here too since the trb is being reclaimed and given back even the HWO bit still set, but maybe I am misunderstanding something. In my earlier testing, if a missed isoc event was received and we attempted to reclaim/giveback a trb with its HWO bit set, a iommu related panic would be seen.
If the controller processed the TRB, it would clear the HWO bit after completion. There are cases which the HWO bit is still set for some TRBs, but the request is completed (e.g. short transfers causing the controller to stop processing further). That those cases, the driver would clear the HWO bit manually.
Can you propose an additional change to handle freeing up the extra trbs in the missed isoc case? I think that somehow the HWO bit should be checked before entering dwc3_gadget_ep_cleanup_completed_request in order to avoid the trb being given back too soon.
We should not check for TRBs of requests beyond the request associated with the interrupt event.
Your thoughts?
The capture shows underrun, and it causes missed isoc.
Let's take a look at the first missed isoc request (req ffffff88f834db00)
<...>-22551 [002] d..2 13985.789684: dwc3_ep_queue: ep2in: req ffffff88f834db00 length 0/49152 zsi ==> -115 <...>-22551 [002] dn.2 13985.789728: dwc3_prepare_trb: ep2in: trb ffffffc016071970 (E152:D149) buf 00000000ea800000 size 1x 49152 ctrl 00000461 (Hlcs:Sc:isoc-first)
Each request uses a one TRB. From above, you can see that there are only 3 intervals prepared (E152 - D149 = 3).
The timestamp of the last request completion is 13985.788919. The timestamp which the queued request is 13985.789728. We can roughly estimate the diff is at least 809us.
3 intervals (3x125us) is less than 809us. So missed isoc is expected:
irq/399-dwc3-15269 [002] d..1 13985.790754: dwc3_event: event (f9acc08a): ep2in: Transfer In Progress [63916] (sIM) irq/399-dwc3-15269 [002] d..1 13985.790758: dwc3_complete_trb: ep2in: trb ffffffc016071970 (E154:D152) buf 00000000ea800000 size 1x 49152 ctrl 3e6a0460 (hlcs:Sc:isoc-first) irq/399-dwc3-15269 [002] d..1 13985.790808: dwc3_gadget_giveback: ep2in: req ffffff88f834db00 length 0/49152 zsi ==> -18
After this point, the uvc driver keeps playing catch up to stay in sync with the host. It tries, but it couldn't catch up. Also, occasionally something seems to be blocking dwc3, creating time gaps. Maybe because of a spin_lock held somewhere?
Could the time gaps be created by the interrupt frequency changes? They completely change up the timing of when the transfers are kicked in dwc3 and when uvc_video_complete/uvcg_video_pump is called.
You can check all the "fastrpc_msg" tracepoints in the log. These steps seem to slow down the queuing process in uvc.
The large time gaps appear related to the low fps used for the camera configuration. I was able to confirm this by adding additional trace points in the uvc driver to see if pump was being called after new work was queued up by v4l2.
Some additional thoughts:
On the scheduling side, one issue I've seen:
- uvc/dwc3 may take a few millisecs to queue up the requests for a single
frame. How quickly this is completed could make for some of the differences seen from one platform to another.
Are you referring to a video frame or usb periodic frame/microframe? The dwc3 driver doesn't take much time at all taking queued requests to prepare and pass TRBs to the controller (much less than a ms). The UVC has to keep up and feed requests to the interval that the endpoint is configured, which is 125us per request for UVC.
The controller doesn't take much time completing the TRB either. The time for DMA to complete and the interrupt to be asserted should be fast relative to an interval. However, you're at the mercy of when the OS handle the interrupt. Different setup will have different interrupt handler latency, and it may not be consistent. So, your gadget driver queuing frequency needs to account for that (ie. maintain the frequency at a little less than 125us per request).
- If the isoc transfer has not yet started and uvc begins to queue up requests to
dwc3, it appears that the isoc transfer will almost immediately start. And if the remaining requests for the frame do not come quickly enough, there is a potential for the transfer to stop or a missed isoc event to occur.
The isoc transfers start the moment the SetInterface control request completes. That tells the host to start polling (sending ACK TP) for isoc transfers, but there may not be any real data. If there's no data, the controller will send ZLP (for IN direction) for every poll.
When there's real data, the driver will schedule at what uframe that first data should be sent. The gadget driver is supposed to maintain a constant amount of data for the endpoint after that. Otherwise we get underrun and missed isoc.
Maybe this happens frequently on my platform because the debug kernel I am using has a bit too much overhead. Will allowing a bit more time for initially queuing up some data before kicking off the transfer be prudent? Is 500us really enough for a mobile platform? We've always have had issues with random frame drops, but on much older platforms I was able to reduce the frequency of them by delaying the start by pushing out the frame_number by 16 or 32.
The 500us should be enough. As long as you maintain a constant flow of requests to avoid underrun, missed isoc is less likely. The odd thing with UVC is that it treats isoc endpoint as if it's bulk, and "pump" every once a while with a set of requests. So the endpoint has to restart for every pump to avoid underrun.
Also, one problem I am seeing specifically with the skip interrupts:
- If a missed isoc event occurs, dwc3_gadget_giveback will call the request's
->complete callback where the status -EXDEV is passed. In uvc_video_complete it will begin to cancel the queue if this status is seen. If this occurs while in the middle of queuing up the requests for the same frame, then it's quite possible that the last set of requests will have had no_interrupt=1 set.
That's not good. Also, it seems like it's blocking the queuing?
- In dwc3 after the missed isoc event, if the remaining set of requests in the
started_list do not have the IOC bit set, then no further interrupts will occur. And because the started_list is not emptied the transfer will not be stopped. Only after uvc queues up the requests for the next frame (which may come much later, i.e. 30-100ms) will the transfers appear to continue, however additional missed isoc events will be received and the cycle could continue.
If the gadget driver decides to not receive interrupt, then yes, this can happen. If there's no interrupt telling the dwc3 driver to start cleaning up the used TRBs, then dwc3 driver is not going to automatically to do so. However, if the gadget driver keeps queuing more request until the TRB ring is full, the last TRB of the TRB ring will be set with IOC, so there will be an interrupt.
I believe that this is a problem that should be addressed somehow. Not sure if
This seems to be more of an issue from the gadget driver right? It should track and maintain the list of request ensuring that at least one of the request in the queue has IOC.
One thing we can change is don't set give back the request with no_interrupt and keep it in a list until a request with IOC to return all.
the uvc driver needs to send a dummy request when an error is received, or if it should just send one more additional request with the no_interrupt bit not set, or if it needs to just send the complete frame in spite of the error. Hopefully I'm not just completely lost on this stuff.
Using the above patch, I grabbed the following trace snippets to show what is being seen:
Normal looking missed isoc related trace where enqueued == dequeued and transfer ends as expected: kworker/u17:5-8169 [000] d..2 187.669572: dwc3_gadget_ep_cmd: ep2in: cmd 'Start Transfer' [49540406] params 00000000 efff5f30 00000000 --> status: Successful irq/398-dwc3-9882 [000] d..1 187.672504: dwc3_complete_trb: ep2in: trb ffffffc016396f30 (E250:D244) buf 00000000ed470000 size 1x 0 ctrl 12550c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d..1 187.672804: dwc3_complete_trb: ep2in: trb ffffffc016396f40 (E250:D245) buf 00000000ec5e0000 size 1x 0 ctrl 12558060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d..1 187.672885: dwc3_complete_trb: ep2in: trb ffffffc016396f50 (E250:D246) buf 00000000ecd30000 size 1x 0 ctrl 12560060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d..1 187.672942: dwc3_complete_trb: ep2in: trb ffffffc016396f60 (E250:D247) buf 00000000ecd40000 size 1x 0 ctrl 12568060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d..1 187.672993: dwc3_complete_trb: ep2in: trb ffffffc016396f70 (E250:D248) buf 00000000edfb0000 size 1x 0 ctrl 12570060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d..1 187.673049: dwc3_complete_trb: ep2in: trb ffffffc016396f80 (E250:D249) buf 00000000edfa0000 size 1x 0 ctrl 12578c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d..1 187.674615: dwc3_complete_trb: ep2in: trb ffffffc016396f90 (E254:D250) buf 00000000eca80000 size 1x 0 ctrl 12580060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d..1 187.674722: dwc3_complete_trb: ep2in: trb ffffffc016396fa0 (E254:D251) buf 00000000edfa0000 size 1x 49152 ctrl 12588060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 187.674793: usb_gadget_giveback_request: ep2in: req ffffff894498a600 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 187.675008: dwc3_complete_trb: ep2in: trb ffffffc016396fb0 (E254:D252) buf 00000000edfb0000 size 1x 49152 ctrl 12590060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 187.675093: usb_gadget_giveback_request: ep2in: req ffffff892bcd3500 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 187.675120: dwc3_complete_trb: ep2in: trb ffffffc016396fc0 (E254:D253) buf 00000000ecd40000 size 1x 49152 ctrl 12598060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 187.675207: usb_gadget_giveback_request: ep2in: req ffffff8918f48200 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 187.675225: dwc3_complete_trb: ep2in: trb ffffffc016396fd0 (E254:D254) buf 00000000ecd30000 size 1x 49152 ctrl 125a0c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d... 187.675274: usb_gadget_giveback_request: ep2in: req ffffff892bcd3600 length 0/49152 sgs 0/0 stream 0 zsI status -18 --> 0 irq/398-dwc3-9882 [000] d..1 187.675364: dwc3_gadget_ep_cmd: ep2in: cmd 'End Transfer' [50d08] params 00000000 00000000 00000000 --> status: Successful
Abnormal trace where the 'End Transfer' is delayed by a few frames: kworker/u17:5-8169 [000] dn.2 190.804817: dwc3_gadget_ep_cmd: ep2in: cmd 'Start Transfer' [ab500406] params 00000000 efff5290 00000000 --> status: Successful irq/398-dwc3-9882 [000] d..1 190.809070: dwc3_complete_trb: ep2in: trb ffffffc016396290 (E51:D42) buf 00000000ef6b0000 size 1x 0 ctrl 2ad40c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d..1 190.809478: dwc3_complete_trb: ep2in: trb ffffffc0163962a0 (E53:D43) buf 00000000ef060000 size 1x 49152 ctrl 2ad48060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.809534: usb_gadget_giveback_request: ep2in: req ffffff8918f48200 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.809752: dwc3_complete_trb: ep2in: trb ffffffc0163962b0 (E54:D44) buf 00000000ec5a0000 size 1x 49152 ctrl 2ad50060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.809878: usb_gadget_giveback_request: ep2in: req ffffff8859af4e00 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.809966: dwc3_complete_trb: ep2in: trb ffffffc0163962c0 (E54:D45) buf 00000000ef180000 size 1x 49152 ctrl 2ad58060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.810037: usb_gadget_giveback_request: ep2in: req ffffff8918f48000 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.810058: dwc3_complete_trb: ep2in: trb ffffffc0163962d0 (E54:D46) buf 00000000ef070000 size 1x 49152 ctrl 2ad60060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.810139: usb_gadget_giveback_request: ep2in: req ffffff894498a500 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.810159: dwc3_complete_trb: ep2in: trb ffffffc0163962e0 (E54:D47) buf 00000000eeae0000 size 1x 49152 ctrl 2ad68c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d... 190.810238: usb_gadget_giveback_request: ep2in: req ffffff88bddb9f00 length 0/49152 sgs 0/0 stream 0 zsI status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.810293: dwc3_complete_trb: ep2in: trb ffffffc0163962f0 (E54:D48) buf 00000000ec0d0000 size 1x 49152 ctrl 2ad70060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.810368: usb_gadget_giveback_request: ep2in: req ffffff892bcd3500 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.810389: dwc3_complete_trb: ep2in: trb ffffffc016396300 (E54:D49) buf 00000000ec0e0000 size 1x 0 ctrl 2ad78060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d..1 190.810454: dwc3_complete_trb: ep2in: trb ffffffc016396310 (E54:D50) buf 00000000ec0f0000 size 1x 0 ctrl 2ad80060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d..1 190.810526: dwc3_complete_trb: ep2in: trb ffffffc016396320 (E54:D51) buf 00000000efa20000 size 1x 49152 ctrl 2ad88060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.810591: usb_gadget_giveback_request: ep2in: req ffffff8859af5800 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.810609: dwc3_complete_trb: ep2in: trb ffffffc016396330 (E54:D52) buf 00000000efa30000 size 1x 49152 ctrl 2ad90c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d... 190.810676: usb_gadget_giveback_request: ep2in: req ffffff88bddb8300 length 0/49152 sgs 0/0 stream 0 zsI status -18 --> 0 ... irq/398-dwc3-9882 [000] d..1 190.919720: dwc3_complete_trb: ep2in: trb ffffffc016396340 (E56:D53) buf 00000000efa40000 size 1x 49152 ctrl 2ad98060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.919804: usb_gadget_giveback_request: ep2in: req ffffff894498a900 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.920065: dwc3_complete_trb: ep2in: trb ffffffc016396350 (E57:D54) buf 00000000efa50000 size 1x 49152 ctrl 2ada0060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.920116: usb_gadget_giveback_request: ep2in: req ffffff88bddb8b00 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.920175: dwc3_complete_trb: ep2in: trb ffffffc016396360 (E57:D55) buf 00000000efa60000 size 1x 49152 ctrl 2ada8c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d... 190.920223: usb_gadget_giveback_request: ep2in: req ffffff892bcd3600 length 0/49152 sgs 0/0 stream 0 zsI status -18 --> 0 ... irq/398-dwc3-9882 [000] d..1 190.992902: dwc3_complete_trb: ep2in: trb ffffffc016396370 (E59:D56) buf 00000000efa90000 size 1x 49152 ctrl 2adb0060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.993014: usb_gadget_giveback_request: ep2in: req ffffff88a0b9b400 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.993365: dwc3_complete_trb: ep2in: trb ffffffc016396380 (E60:D57) buf 00000000efaa0000 size 1x 49152 ctrl 2adb8060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 190.993439: usb_gadget_giveback_request: ep2in: req ffffff8859af4f00 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 190.993528: dwc3_complete_trb: ep2in: trb ffffffc016396390 (E60:D58) buf 00000000efab0000 size 1x 49152 ctrl 2adc0c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d... 190.993622: usb_gadget_giveback_request: ep2in: req ffffff8859af4e00 length 0/49152 sgs 0/0 stream 0 zsI status -18 --> 0 ... irq/398-dwc3-9882 [000] d..1 191.104270: dwc3_complete_trb: ep2in: trb ffffffc0163963a0 (E61:D59) buf 00000000ee950000 size 1x 49152 ctrl 2adc8060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 191.104410: usb_gadget_giveback_request: ep2in: req ffffff8918f48000 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 191.104648: dwc3_complete_trb: ep2in: trb ffffffc0163963b0 (E61:D60) buf 00000000ed530000 size 1x 49152 ctrl 2add0060 (hlcs:sc:isoc-first) irq/398-dwc3-9882 [000] d... 191.104745: usb_gadget_giveback_request: ep2in: req ffffff894498a500 length 0/49152 sgs 0/0 stream 0 zsi status -18 --> 0 irq/398-dwc3-9882 [000] d..1 191.104834: dwc3_complete_trb: ep2in: trb ffffffc0163963c0 (E61:D61) buf 00000000ef610000 size 1x 49152 ctrl 2add8c60 (hlcs:SC:isoc-first) irq/398-dwc3-9882 [000] d... 191.104925: usb_gadget_giveback_request: ep2in: req ffffff8859af5800 length 0/49152 sgs 0/0 stream 0 zsI status -18 --> 0 irq/398-dwc3-9882 [000] d..1 191.105098: dwc3_gadget_ep_cmd: ep2in: cmd 'End Transfer' [50d08] params 00000000 00000000 00000000 --> status: Successful
Where you can see that in between frames enqueued != dequeued, and this can cause the video to freeze/skip for more than a single frame. The length of the freeze is all timing related and depends on if an interrupt occurs on the last req or not in the started_list.
So, to sort of rehash things again, we came up with the patch: [PATCH v3 2/6] usb: dwc3: gadget: cancel requests instead of release after missed isoc to workaround the iommu related crashes being seen with the skip interrupt solution.
We shouldn't just End Transfer when there's missed isoc. That's not good.
As an alternative solution, we are now using the diff mentioned above, and it seems to avoid the crash, however we are seeing longer freezes with it, I believe due to the explaination I provided above.
If you agree with my analysis, what should we do from here? What's the best way to ensure that the started_list can get cleaned up after a missed isoc with the skip interrupt solution enabled?
See my previous comments.
The logic to detect underrun doesn't trigger because the queued list is always non-empty, but the queued requests are expected to be missed already. So you keep seeing missed isoc.
There are a few things you can mitigate this issue:
- Don't set IMI if the request indicates no_interrupt. This reduces the time software needs to handle interrupts.
I did try this out earlier and it did not prevent the video freeze. It does make sense what you are suggesting, but because it didn't work for me it made me think that not all reqs are being reclaimed after a missed isoc is seen. I'll revisit this area again.
I don't expect this to improve much.
- Improve the underrun detection logic.
I like this idea a lot but I'm not up to the task just yet. Will attempt to follow your recommendations below and see where I get with this.
- Increase the queuing frequency from the uvc to keep the request queue full. Note that reduce/avoid setting no_interrupt will allow the controller driver to update uvc often to keep requeuing new requests.
Best option is 3), but maybe we can do all 3.
I think that this is our best option too. Dan provided a patch to make the interrupt skip logic configurable in the uvc driver: https://urldefense.com/v3/__https://patchwork.kernel.org/project/linux-usb/p... https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20221018215044...
For 2), you can set IMI for all isoc request as it is now. On missed isoc, check for the TRB's scheduled uframe (from TRB info) and compare it to the current uframe (from DSTS) for the number of intervals in between. With the number of queued requests, you can calculate whether the gadget driver queued enough requests. If it doesn't, send End Transfer command and cancel all the queued requests. The next set of requests will be in-sync again.
BR, Thinh
PS. On a side note, I notice that there are reports of issue when using SG right? Please note that dwc3 driver only allocates 256 TRBs (including a link TRB). Each SG entry takes a TRB. If a request has many SG entries, that eats up the available TRBs. So, even though the UVC may queue many requests, not all of them are prepared immediately. If the TRB ring is full, the driver need to wait for more TRBs to free up before preparing more. From the log, I see that it's sending 48KB. Let's say the uvc splits it into PAGE_SIZE of 4KB, that would take at least 12 TRB per request. (Side thought: I'm not sure why UVC needs SG in the first place with its current implementation)
On our platform I am seeing 2 items in the sg list being sent out from the uvc driver. The 1st item in the list is a 2 byte uvc header and the 2nd item is the 48KBs of data. To me this seems inefficient but sort of makes sense why it was done, likely to avoid a memory copy just for the 2 byte header. But I
That doesn't make sense to me, but I'm sure there's a real reason. Just curious, that's all. :)
share your concern here, it's possible that other users wont be so lucky and will wind up having a lot of page sized items in the sg list.
We are also hoping to make the use of sg configurable with the change: https://urldefense.com/v3/__https://patchwork.kernel.org/project/linux-usb/p... https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20221018215044...
Ok.
Thanks, Thinh
On Thu, Oct 20, 2022, Thinh Nguyen wrote:
On Thu, Oct 20, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Wed, Oct 19, 2022 at 11:06:08PM +0000, Thinh Nguyen wrote:
Hi,
On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
Hi Thinh, On Wed, Oct 19, 2022 at 07:08:27PM +0000, Thinh Nguyen wrote:
On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
<snip>
From what I can gather from the log, with the current changes it seems that after a missed isoc event few requests are staying longer than expected in the started_list (not getting reclaimed) and this is preventing the transmission from stopping/starting again, and opening the door for continuous stream of missed isoc events that cause what appears to the user as a frozen video.
So one thought, if IOC bit is not set every frame, but IMI bit is, when a missed isoc related interrupt occurs it seems likely that more than one trb request will need to be reclaimed, but the current set of changes is not handling this.
In the good transfer case this issue seems to be taken care of since the IOC bit is not set every frame and the reclaimation will loop through every item in the started_list and only stop if there are no additional trbs or if one has
It should stop at the request that associated with the interrupt event, whether it's because of IMI or IOC.
In this case I was concerned that if multipled queued reqs did not have IOC bit set, but there was a missed isoc on one of the last reqs, whether or not we would reclaim all of the requests up to the missed isoc related req. I'm not sure if my concern is valid or not.
There should be no problem. If there's an interrupt event indicating a TRB completion, the driver will give back all the requests up to the request associated with the interrupt event, and the controller will continue processing the remaining TRBs. On the next TRB completion event, the driver will again give back all the requests up to the request associated with that event.
I was testing with the following patch you suggested:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..8352f4b5dd9f 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
At this time the IMI bit was set for every frame. With these changes it appeared in case of missed isoc that sometimes not all requests would be reclaimed (enqueued != dequeued even 100ms after the last interrupt was handled). If the 1st req in the started_list was fine (IMI set, but not IOC), and a later req was the one actually missed, because of this status check the reclaimation could stop early and not clean up to the appropriate req. As
Oops. You're right.
suggested yesterday, I also tried only setting the IMI bit when no_interrupt is not set, however I was still seeing the complete freezes. After analyzing this issue a bit, I have updated the diff to look more like this:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index dfaf9ac24c4f..bb800a81815b 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -1230,8 +1230,9 @@ static void __dwc3_prepare_one_trb(struct dwc3_ep *dep, struct dwc3_trb *trb, trb->ctrl = DWC3_TRBCTL_ISOCHRONOUS; }
/* always enable Interrupt on Missed ISOC */
trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI;
/* enable Interrupt on Missed ISOC */
if ((!no_interrupt && !chain) || must_interrupt)
break;trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI;
Either all or none of the TRBs of a request is set with IMI, and not some.
case USB_ENDPOINT_XFER_BULK: @@ -3195,6 +3196,11 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain
&& (trb->ctrl & DWC3_TRB_CTRL_ISP_IMI))
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
Where the trb must have the IMI set before returning early. This seemed to make the freezes recoverable.
Can you try this revised change:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..a69d8c28d86b 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3654,7 +3654,7 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if ((trb->ctrl & DWC3_TRB_CTRL_HWO) && status != -ESHUTDOWN) return 1;
- if (event->status & DEPEVT_STATUS_SHORT && !chain)
I accidentally deleted a couple of lines here.
- if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain) return 1;
if ((trb->ctrl & DWC3_TRB_CTRL_IOC) ||
I meant to do this:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..cb65371572ee 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,9 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
+ if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain) + return 1; + if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1; @@ -3673,6 +3676,7 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, struct scatterlist *s; unsigned int num_queued = req->num_queued_sgs; unsigned int i; + bool missed_isoc = false; int ret = 0;
for_each_sg(sg, s, num_queued, i) { @@ -3681,12 +3685,18 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, req->sg = sg_next(s); req->num_queued_sgs--;
+ if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC) + missed_isoc = true; + ret = dwc3_gadget_ep_reclaim_completed_trb(dep, req, trb, event, status, true); if (ret) break; }
+ if (missed_isoc) + ret = 1; + return ret; }
BR, Thinh
Hi Thinh,
On Fri, Oct 21, 2022 at 12:55:51AM +0000, Thinh Nguyen wrote:
On Thu, Oct 20, 2022, Thinh Nguyen wrote:
On Thu, Oct 20, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Wed, Oct 19, 2022 at 11:06:08PM +0000, Thinh Nguyen wrote:
Hi,
On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
Hi Thinh, On Wed, Oct 19, 2022 at 07:08:27PM +0000, Thinh Nguyen wrote:
On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
<snip>
> > From what I can gather from the log, with the current changes it seems that > after a missed isoc event few requests are staying longer than expected in the > started_list (not getting reclaimed) and this is preventing the transmission > from stopping/starting again, and opening the door for continuous stream of > missed isoc events that cause what appears to the user as a frozen video. > > So one thought, if IOC bit is not set every frame, but IMI bit is, when a > missed isoc related interrupt occurs it seems likely that more than one trb > request will need to be reclaimed, but the current set of changes is not > handling this. > > In the good transfer case this issue seems to be taken care of since the IOC > bit is not set every frame and the reclaimation will loop through every item in > the started_list and only stop if there are no additional trbs or if one has
It should stop at the request that associated with the interrupt event, whether it's because of IMI or IOC.
In this case I was concerned that if multipled queued reqs did not have IOC bit set, but there was a missed isoc on one of the last reqs, whether or not we would reclaim all of the requests up to the missed isoc related req. I'm not sure if my concern is valid or not.
There should be no problem. If there's an interrupt event indicating a TRB completion, the driver will give back all the requests up to the request associated with the interrupt event, and the controller will continue processing the remaining TRBs. On the next TRB completion event, the driver will again give back all the requests up to the request associated with that event.
I was testing with the following patch you suggested:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..8352f4b5dd9f 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
At this time the IMI bit was set for every frame. With these changes it appeared in case of missed isoc that sometimes not all requests would be reclaimed (enqueued != dequeued even 100ms after the last interrupt was handled). If the 1st req in the started_list was fine (IMI set, but not IOC), and a later req was the one actually missed, because of this status check the reclaimation could stop early and not clean up to the appropriate req. As
Oops. You're right.
suggested yesterday, I also tried only setting the IMI bit when no_interrupt is not set, however I was still seeing the complete freezes. After analyzing this issue a bit, I have updated the diff to look more like this:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index dfaf9ac24c4f..bb800a81815b 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -1230,8 +1230,9 @@ static void __dwc3_prepare_one_trb(struct dwc3_ep *dep, struct dwc3_trb *trb, trb->ctrl = DWC3_TRBCTL_ISOCHRONOUS; }
/* always enable Interrupt on Missed ISOC */
trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI;
/* enable Interrupt on Missed ISOC */
if ((!no_interrupt && !chain) || must_interrupt)
break;trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI;
Either all or none of the TRBs of a request is set with IMI, and not some.
case USB_ENDPOINT_XFER_BULK: @@ -3195,6 +3196,11 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain
&& (trb->ctrl & DWC3_TRB_CTRL_ISP_IMI))
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
Where the trb must have the IMI set before returning early. This seemed to make the freezes recoverable.
Can you try this revised change:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..a69d8c28d86b 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3654,7 +3654,7 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if ((trb->ctrl & DWC3_TRB_CTRL_HWO) && status != -ESHUTDOWN) return 1;
- if (event->status & DEPEVT_STATUS_SHORT && !chain)
I accidentally deleted a couple of lines here.
- if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain) return 1;
if ((trb->ctrl & DWC3_TRB_CTRL_IOC) ||
I meant to do this:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..cb65371572ee 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,9 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
@@ -3673,6 +3676,7 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, struct scatterlist *s; unsigned int num_queued = req->num_queued_sgs; unsigned int i;
- bool missed_isoc = false; int ret = 0;
for_each_sg(sg, s, num_queued, i) { @@ -3681,12 +3685,18 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, req->sg = sg_next(s); req->num_queued_sgs--;
if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC)
missed_isoc = true;
- ret = dwc3_gadget_ep_reclaim_completed_trb(dep, req, trb, event, status, true); if (ret) break; }
- if (missed_isoc)
ret = 1;
- return ret;
}
BR, Thinh
I tried out the following patch diff you provided and I did not see any iommu related crashes with these changes:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index dfaf9ac24c4f..50287437d6de 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3195,6 +3195,9 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
+ if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain) + return 1; + if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1; @@ -3211,6 +3214,7 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, struct scatterlist *s; unsigned int num_queued = req->num_queued_sgs; unsigned int i; + bool missed_isoc = false; int ret = 0;
for_each_sg(sg, s, num_queued, i) { @@ -3219,12 +3223,18 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, req->sg = sg_next(s); req->num_queued_sgs--;
+ if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC) + missed_isoc = true; + ret = dwc3_gadget_ep_reclaim_completed_trb(dep, req, trb, event, status, true); if (ret) break; }
+ if (missed_isoc) + ret = 1; + return ret; }
As we discussed earlier, when uvc's complete function is called, if an -EXDEV is returned in the request's status, the uvc driver will begin to cancel its queue. With the current skip interrupt implementation in the uvc driver, if this occurs while the uvc driver is pumping the current frame, then there is no guarentee that the last request(s) will have had 'no_interrupt=0'. If the last requests passed to dwc3 had 'no_interrupt=1', these requests would eventually be placed at the end of the started_list. Since the IOC bit will not be set, and if no missed isoc event occurs on these requests, then the dwc3 driver will not be interrupted, leaving those remaining requests sitting in the started_list, and dwc3 will not perform an 'End Transfer' as expected. Once the uvc driver begins to pump the requests for the next frame, then it most likely will result in additional missed isoc events, with the result being an extended video freeze seen by the user.
I hope that other uvc driver maintainers can chime in here to help determine the correct path forward. With the skip interrupt implementation, the uvc driver should guarentee that the last request sent to dwc3 has 'no_interrupt=0', otherwise if a missed isoc error occurs, it becomes very likely that the next immediate set of frames could be dropped/cancelled because the dwc3 driver could not perform a timely 'End Transfer'.
For testing I implemented the following changes to see what I could do for this issue. Note that I am on an older implementation and it's missing a lot of the sg related implementation. The idea here is that if the queue is empty, and that req_int_count is non-zero then the last request likely had 'no_interrupt=1' set. And if this is the case then we will want to send some dummy request to dwc3 with 'no_interrupt=0' set to make sure that no requests get stuck in its started_list.
diff --git a/drivers/usb/gadget/function/uvc_video.c b/drivers/usb/gadget/function/uvc_video.c index a00786bc6e60..cb82af6bf453 100644 --- a/drivers/usb/gadget/function/uvc_video.c +++ b/drivers/usb/gadget/function/uvc_video.c @@ -253,6 +253,70 @@ uvc_video_alloc_requests(struct uvc_video *video) * Video streaming */
+/* + * uvcg_video_check_pump_fake - Pumps fake/empty video data into a USB request + * + * This function fills an available USB requests (listed in req_free) with + * fake/empty video data to ensure that the last request had no_interrupt == 0 + */ +static void uvcg_video_check_pump_fake(struct work_struct *work) +{ + struct uvc_video *video = container_of(work, struct uvc_video, pump); + struct uvc_video_queue *queue = &video->queue; + struct usb_request *req = NULL; + unsigned long flags; + uint8_t *data = NULL; + int ret; + + if (video->ep->enabled && list_empty(&queue->irqqueue) && + video->req_int_count != 0) { + /* Retrieve the first available USB request, protected by the + * request lock. + */ + spin_lock_irqsave(&video->req_lock, flags); + if (list_empty(&video->req_free)) { + spin_unlock_irqrestore(&video->req_lock, flags); + return; + } + req = list_first_entry(&video->req_free, struct usb_request, + list); + list_del(&req->list); + spin_unlock_irqrestore(&video->req_lock, flags); + + /* Set up the fake/empty uvc request */ + data = req->buf; + data[0] = 2; + data[1] = UVC_STREAM_EOH | video->fid | UVC_STREAM_EOF; + req->length = 2; + req->no_interrupt = 0; + + /* Queue the USB request */ + spin_lock_irqsave(&queue->irqlock, flags); + ret = uvcg_video_ep_queue(video, req); + spin_unlock_irqrestore(&queue->irqlock, flags); + + if (ret < 0) { + uvcg_queue_cancel(queue, 0); + } else { + /* Endpoint now owns the request */ + req = NULL; + + /* Reset req_int_count to force an interrupt on the next request + * and to avoid going through uvcg_video_check_pump_fake again. + */ + video->req_int_count = 0; + } + } + + if (!req) + return; + + spin_lock_irqsave(&video->req_lock, flags); + list_add_tail(&req->list, &video->req_free); + spin_unlock_irqrestore(&video->req_lock, flags); + return; +} + /* * uvcg_video_pump - Pump video data into the USB requests * @@ -268,6 +332,8 @@ static void uvcg_video_pump(struct work_struct *work) unsigned long flags; int ret;
+ uvcg_video_check_pump_fake(work); + while (video->ep->enabled) { /* Retrieve the first available USB request, protected by the * request lock. @@ -318,7 +384,8 @@ static void uvcg_video_pump(struct work_struct *work) /* Endpoint now owns the request */ req = NULL; } - video->req_int_count++; + if (buf->state != UVC_BUF_STATE_DONE) + video->req_int_count++; }
if (!req)
Alternatively we may just not want to cancel the queue upon receiving -EXDEV and this could solve the problem too, but I don't think that it's such a great idea, especially if things start falling behind.
I hope that someone more fluent in this area of code can take a crack at improving/fixing this issue.
The changes above do seem to help dwc3 timely end its transfers, but mainly for cases where some requests are missed but the next immediate ones are not (i'm talking within a couple of hundred microseconds). Most of the time if missed isocs occurs for a frame that the remaining reqs in the started_list will likely also error out and the list will be emptied and dwc3 will still timely send 'End Transfer'. In reality this is to cover a corner case that can adversely affect the quality of the video being watched. Just wanted to be upfront with these details.
Thinh, any pointers on how we should proceed from here? It looks like your changes are working well.
Thanks, Jeff
On Fri, Oct 21, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Fri, Oct 21, 2022 at 12:55:51AM +0000, Thinh Nguyen wrote:
On Thu, Oct 20, 2022, Thinh Nguyen wrote:
On Thu, Oct 20, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Wed, Oct 19, 2022 at 11:06:08PM +0000, Thinh Nguyen wrote:
Hi,
On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
Hi Thinh, On Wed, Oct 19, 2022 at 07:08:27PM +0000, Thinh Nguyen wrote: > On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
<snip>
> > > > From what I can gather from the log, with the current changes it seems that > > after a missed isoc event few requests are staying longer than expected in the > > started_list (not getting reclaimed) and this is preventing the transmission > > from stopping/starting again, and opening the door for continuous stream of > > missed isoc events that cause what appears to the user as a frozen video. > > > > So one thought, if IOC bit is not set every frame, but IMI bit is, when a > > missed isoc related interrupt occurs it seems likely that more than one trb > > request will need to be reclaimed, but the current set of changes is not > > handling this. > > > > In the good transfer case this issue seems to be taken care of since the IOC > > bit is not set every frame and the reclaimation will loop through every item in > > the started_list and only stop if there are no additional trbs or if one has > > It should stop at the request that associated with the interrupt event, > whether it's because of IMI or IOC.
In this case I was concerned that if multipled queued reqs did not have IOC bit set, but there was a missed isoc on one of the last reqs, whether or not we would reclaim all of the requests up to the missed isoc related req. I'm not sure if my concern is valid or not.
There should be no problem. If there's an interrupt event indicating a TRB completion, the driver will give back all the requests up to the request associated with the interrupt event, and the controller will continue processing the remaining TRBs. On the next TRB completion event, the driver will again give back all the requests up to the request associated with that event.
I was testing with the following patch you suggested:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..8352f4b5dd9f 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
At this time the IMI bit was set for every frame. With these changes it appeared in case of missed isoc that sometimes not all requests would be reclaimed (enqueued != dequeued even 100ms after the last interrupt was handled). If the 1st req in the started_list was fine (IMI set, but not IOC), and a later req was the one actually missed, because of this status check the reclaimation could stop early and not clean up to the appropriate req. As
Oops. You're right.
suggested yesterday, I also tried only setting the IMI bit when no_interrupt is not set, however I was still seeing the complete freezes. After analyzing this issue a bit, I have updated the diff to look more like this:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index dfaf9ac24c4f..bb800a81815b 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -1230,8 +1230,9 @@ static void __dwc3_prepare_one_trb(struct dwc3_ep *dep, struct dwc3_trb *trb, trb->ctrl = DWC3_TRBCTL_ISOCHRONOUS; }
/* always enable Interrupt on Missed ISOC */
trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI;
/* enable Interrupt on Missed ISOC */
if ((!no_interrupt && !chain) || must_interrupt)
break;trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI;
Either all or none of the TRBs of a request is set with IMI, and not some.
case USB_ENDPOINT_XFER_BULK: @@ -3195,6 +3196,11 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain
&& (trb->ctrl & DWC3_TRB_CTRL_ISP_IMI))
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
Where the trb must have the IMI set before returning early. This seemed to make the freezes recoverable.
Can you try this revised change:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..a69d8c28d86b 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3654,7 +3654,7 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if ((trb->ctrl & DWC3_TRB_CTRL_HWO) && status != -ESHUTDOWN) return 1;
- if (event->status & DEPEVT_STATUS_SHORT && !chain)
I accidentally deleted a couple of lines here.
- if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain) return 1;
if ((trb->ctrl & DWC3_TRB_CTRL_IOC) ||
I meant to do this:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..cb65371572ee 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,9 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
@@ -3673,6 +3676,7 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, struct scatterlist *s; unsigned int num_queued = req->num_queued_sgs; unsigned int i;
- bool missed_isoc = false; int ret = 0;
for_each_sg(sg, s, num_queued, i) { @@ -3681,12 +3685,18 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, req->sg = sg_next(s); req->num_queued_sgs--;
if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC)
missed_isoc = true;
- ret = dwc3_gadget_ep_reclaim_completed_trb(dep, req, trb, event, status, true); if (ret) break; }
- if (missed_isoc)
ret = 1;
- return ret;
}
BR, Thinh
I tried out the following patch diff you provided and I did not see any iommu related crashes with these changes:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index dfaf9ac24c4f..50287437d6de 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3195,6 +3195,9 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain)
return 1;
if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
@@ -3211,6 +3214,7 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, struct scatterlist *s; unsigned int num_queued = req->num_queued_sgs; unsigned int i;
bool missed_isoc = false; int ret = 0;
for_each_sg(sg, s, num_queued, i) { @@ -3219,12 +3223,18 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, req->sg = sg_next(s); req->num_queued_sgs--;
if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC)
missed_isoc = true;
ret = dwc3_gadget_ep_reclaim_completed_trb(dep, req, trb, event, status, true); if (ret) break; }
if (missed_isoc)
ret = 1;
return ret;
}
As we discussed earlier, when uvc's complete function is called, if an -EXDEV is returned in the request's status, the uvc driver will begin to cancel its queue. With the current skip interrupt implementation in the uvc driver, if this occurs while the uvc driver is pumping the current frame, then there is no guarentee that the last request(s) will have had 'no_interrupt=0'. If the last requests passed to dwc3 had 'no_interrupt=1', these requests would eventually be placed at the end of the started_list. Since the IOC bit will not be set, and if no missed isoc event occurs on these requests, then the dwc3 driver will not be interrupted, leaving those remaining requests sitting in the started_list, and dwc3 will not perform an 'End Transfer' as expected. Once the uvc driver begins to pump the requests for the next frame, then it most likely will result in additional missed isoc events, with the result being an extended video freeze seen by the user.
I hope that other uvc driver maintainers can chime in here to help determine the correct path forward. With the skip interrupt implementation, the uvc driver should guarentee that the last request sent to dwc3 has 'no_interrupt=0', otherwise
Rather than guarenteeing no_interrupt or not, it's more important that the UVC maintains a constant queue of requests to the controller driver. Isoc transfers are meant to be sent at a constant rate which the endpoint is configured.
I recalled Dan mentioned that UVC gadget driver can queue up to 64 requests with no_interrupt=1 up to 15 requests. But I keep seeing that the gadget driver only "pumps" 16 requests and doesn't continue until they are completed. We can almost guarantee that it's going to be underrun. Can UVC "pumps" multiple times at once?
if a missed isoc error occurs, it becomes very likely that the next immediate set of frames could be dropped/cancelled because the dwc3 driver could not perform a timely 'End Transfer'.
For testing I implemented the following changes to see what I could do for this issue. Note that I am on an older implementation and it's missing a lot of the
Please use the latest kernel, there are a lot of fixes/improvement to dwc3 every kernel version.
sg related implementation. The idea here is that if the queue is empty, and that req_int_count is non-zero then the last request likely had 'no_interrupt=1' set. And if this is the case then we will want to send some dummy request to dwc3 with 'no_interrupt=0' set to make sure that no requests get stuck in its started_list.
This is not efficient and unnecessary.
<snip>
Alternatively we may just not want to cancel the queue upon receiving -EXDEV and this could solve the problem too, but I don't think that it's such a great idea, especially if things start falling behind.
I hope that someone more fluent in this area of code can take a crack at improving/fixing this issue.
The changes above do seem to help dwc3 timely end its transfers, but mainly for cases where some requests are missed but the next immediate ones are not (i'm talking within a couple of hundred microseconds). Most of the time if missed isocs occurs for a frame that the remaining reqs in the started_list will likely also error out and the list will be emptied and dwc3 will still timely send 'End Transfer'. In reality this is to cover a corner case that can adversely affect the quality of the video being watched. Just wanted to be upfront with these details.
Thinh, any pointers on how we should proceed from here? It looks like your changes are working well.
You can add the underrun detection check to dwc3 whenever it receives a new request.
ie. When the new request comes, check if the last prepared TRB's HWO bit is cleared and if the endpoint is started, send End Transfer command to reschedule the isoc transfers for the incoming requests.
This is probably the simpler workaround to the underrun issue of UVC.
BR, Thinh
Hi Thinh,
On Fri, Oct 21, 2022 at 04:43:52PM +0000, Thinh Nguyen wrote:
On Fri, Oct 21, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Fri, Oct 21, 2022 at 12:55:51AM +0000, Thinh Nguyen wrote:
On Thu, Oct 20, 2022, Thinh Nguyen wrote:
On Thu, Oct 20, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Wed, Oct 19, 2022 at 11:06:08PM +0000, Thinh Nguyen wrote:
Hi,
On Wed, Oct 19, 2022, Jeff Vanhoof wrote: > Hi Thinh, > On Wed, Oct 19, 2022 at 07:08:27PM +0000, Thinh Nguyen wrote: > > On Wed, Oct 19, 2022, Jeff Vanhoof wrote:
<snip>
> > > > > > From what I can gather from the log, with the current changes it seems that > > > after a missed isoc event few requests are staying longer than expected in the > > > started_list (not getting reclaimed) and this is preventing the transmission > > > from stopping/starting again, and opening the door for continuous stream of > > > missed isoc events that cause what appears to the user as a frozen video. > > > > > > So one thought, if IOC bit is not set every frame, but IMI bit is, when a > > > missed isoc related interrupt occurs it seems likely that more than one trb > > > request will need to be reclaimed, but the current set of changes is not > > > handling this. > > > > > > In the good transfer case this issue seems to be taken care of since the IOC > > > bit is not set every frame and the reclaimation will loop through every item in > > > the started_list and only stop if there are no additional trbs or if one has > > > > It should stop at the request that associated with the interrupt event, > > whether it's because of IMI or IOC. > > In this case I was concerned that if multipled queued reqs did not have IOC bit > set, but there was a missed isoc on one of the last reqs, whether or not we would > reclaim all of the requests up to the missed isoc related req. I'm not sure if > my concern is valid or not. >
There should be no problem. If there's an interrupt event indicating a TRB completion, the driver will give back all the requests up to the request associated with the interrupt event, and the controller will continue processing the remaining TRBs. On the next TRB completion event, the driver will again give back all the requests up to the request associated with that event.
I was testing with the following patch you suggested:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..8352f4b5dd9f 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
At this time the IMI bit was set for every frame. With these changes it appeared in case of missed isoc that sometimes not all requests would be reclaimed (enqueued != dequeued even 100ms after the last interrupt was handled). If the 1st req in the started_list was fine (IMI set, but not IOC), and a later req was the one actually missed, because of this status check the reclaimation could stop early and not clean up to the appropriate req. As
Oops. You're right.
suggested yesterday, I also tried only setting the IMI bit when no_interrupt is not set, however I was still seeing the complete freezes. After analyzing this issue a bit, I have updated the diff to look more like this:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index dfaf9ac24c4f..bb800a81815b 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -1230,8 +1230,9 @@ static void __dwc3_prepare_one_trb(struct dwc3_ep *dep, struct dwc3_trb *trb, trb->ctrl = DWC3_TRBCTL_ISOCHRONOUS; }
/* always enable Interrupt on Missed ISOC */
trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI;
/* enable Interrupt on Missed ISOC */
if ((!no_interrupt && !chain) || must_interrupt)
break;trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI;
Either all or none of the TRBs of a request is set with IMI, and not some.
case USB_ENDPOINT_XFER_BULK: @@ -3195,6 +3196,11 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain
&& (trb->ctrl & DWC3_TRB_CTRL_ISP_IMI))
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
Where the trb must have the IMI set before returning early. This seemed to make the freezes recoverable.
Can you try this revised change:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..a69d8c28d86b 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3654,7 +3654,7 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if ((trb->ctrl & DWC3_TRB_CTRL_HWO) && status != -ESHUTDOWN) return 1;
- if (event->status & DEPEVT_STATUS_SHORT && !chain)
I accidentally deleted a couple of lines here.
- if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain) return 1;
if ((trb->ctrl & DWC3_TRB_CTRL_IOC) ||
I meant to do this:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..cb65371572ee 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,9 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
@@ -3673,6 +3676,7 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, struct scatterlist *s; unsigned int num_queued = req->num_queued_sgs; unsigned int i;
- bool missed_isoc = false; int ret = 0;
for_each_sg(sg, s, num_queued, i) { @@ -3681,12 +3685,18 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, req->sg = sg_next(s); req->num_queued_sgs--;
if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC)
missed_isoc = true;
- ret = dwc3_gadget_ep_reclaim_completed_trb(dep, req, trb, event, status, true); if (ret) break; }
- if (missed_isoc)
ret = 1;
- return ret;
}
BR, Thinh
I tried out the following patch diff you provided and I did not see any iommu related crashes with these changes:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index dfaf9ac24c4f..50287437d6de 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3195,6 +3195,9 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain)
return 1;
if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
@@ -3211,6 +3214,7 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, struct scatterlist *s; unsigned int num_queued = req->num_queued_sgs; unsigned int i;
bool missed_isoc = false; int ret = 0;
for_each_sg(sg, s, num_queued, i) { @@ -3219,12 +3223,18 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, req->sg = sg_next(s); req->num_queued_sgs--;
if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC)
missed_isoc = true;
ret = dwc3_gadget_ep_reclaim_completed_trb(dep, req, trb, event, status, true); if (ret) break; }
if (missed_isoc)
ret = 1;
return ret;
}
As we discussed earlier, when uvc's complete function is called, if an -EXDEV is returned in the request's status, the uvc driver will begin to cancel its queue. With the current skip interrupt implementation in the uvc driver, if this occurs while the uvc driver is pumping the current frame, then there is no guarentee that the last request(s) will have had 'no_interrupt=0'. If the last requests passed to dwc3 had 'no_interrupt=1', these requests would eventually be placed at the end of the started_list. Since the IOC bit will not be set, and if no missed isoc event occurs on these requests, then the dwc3 driver will not be interrupted, leaving those remaining requests sitting in the started_list, and dwc3 will not perform an 'End Transfer' as expected. Once the uvc driver begins to pump the requests for the next frame, then it most likely will result in additional missed isoc events, with the result being an extended video freeze seen by the user.
I hope that other uvc driver maintainers can chime in here to help determine the correct path forward. With the skip interrupt implementation, the uvc driver should guarentee that the last request sent to dwc3 has 'no_interrupt=0', otherwise
Rather than guarenteeing no_interrupt or not, it's more important that the UVC maintains a constant queue of requests to the controller driver. Isoc transfers are meant to be sent at a constant rate which the endpoint is configured.
I agree with you on this, but it will probably always be a race with uvc queuing up one req at a time and dwc3 starting to transmit almost immediately. We can configure the streaming_interval on a product to kind of slow down or delay the usb transfers, but between dwc3 and uvc driver it would be nice to have an interface that would allow pre-queuing a certain number of reqs before the transfer is actually started. If that is not possible, then uvc could instead prepare a number of reqs ahead of time and attempt to queue them each as fast a possible in a very tight loop.
I recalled Dan mentioned that UVC gadget driver can queue up to 64 requests with no_interrupt=1 up to 15 requests. But I keep seeing that the gadget driver only "pumps" 16 requests and doesn't continue until they are completed. We can almost guarantee that it's going to be underrun. Can UVC "pumps" multiple times at once?
uvc will usually pump when new frames comes in or when a req's complete gets called. uvc should fill up front all the reqs required to transfer a frame (up to 64), but once the available reqs are filled or the frame completely queued up, it would take a new incoming frame or a kick via the complete call to have it attempt to fill any remaining reqs in the queue again. To me this looks ok to do, but for heavy transfers we have a somewhat smaller queue as a buffer, 48 reqs vs 64 reqs (64 - 16).
To note, for the very last request of a frame/buffer (the end of it) the uvc driver does set the no_interrupt to 0 for this request.
if a missed isoc error occurs, it becomes very likely that the next immediate set of frames could be dropped/cancelled because the dwc3 driver could not perform a timely 'End Transfer'.
For testing I implemented the following changes to see what I could do for this issue. Note that I am on an older implementation and it's missing a lot of the
Please use the latest kernel, there are a lot of fixes/improvement to dwc3 every kernel version.
I've been debugging the sg implementation and the skip interrupt implementation seperately, backporting what can be backported. I'm working off of a 5.10 kernel debugging various issues Dan Vacura was seeing on a 5.15 kernel on a newer product. What we had on our 5.10 based kernel was stable for uvc/dwc3, so we needed to understand what came in since that time that broke stability.
sg related implementation. The idea here is that if the queue is empty, and that req_int_count is non-zero then the last request likely had 'no_interrupt=1' set. And if this is the case then we will want to send some dummy request to dwc3 with 'no_interrupt=0' set to make sure that no requests get stuck in its started_list.
This is not efficient and unnecessary.
Agree, but to fix this in the uvc driver the correct way seemed a bit more complicated at first. I was thinking that the driver would always send one last request of the frame buffer once an error is seen, but I'm now thinking of a simpler solution. If we can update the uvc pump to prepare a number of requests and make sure that the last request has no_interrupt set to 0 before queuing them all up in a tight loop to the dwc3 driver, this would effectively solve this problem too. A bit of extra smarts this area might also address some of your concerns about uvc not pumping a lot of data at once (especially for the beginning of a frame). I believe we should prepare more reqs up front if the req queue is empty and less reqs if its already busy.
I'm hoping that someone can step up to help here :). If not, then this will be my next activity.
<snip>
Alternatively we may just not want to cancel the queue upon receiving -EXDEV and this could solve the problem too, but I don't think that it's such a great idea, especially if things start falling behind.
I hope that someone more fluent in this area of code can take a crack at improving/fixing this issue.
The changes above do seem to help dwc3 timely end its transfers, but mainly for cases where some requests are missed but the next immediate ones are not (i'm talking within a couple of hundred microseconds). Most of the time if missed isocs occurs for a frame that the remaining reqs in the started_list will likely also error out and the list will be emptied and dwc3 will still timely send 'End Transfer'. In reality this is to cover a corner case that can adversely affect the quality of the video being watched. Just wanted to be upfront with these details.
Thinh, any pointers on how we should proceed from here? It looks like your changes are working well.
You can add the underrun detection check to dwc3 whenever it receives a new request.
ie. When the new request comes, check if the last prepared TRB's HWO bit is cleared and if the endpoint is started, send End Transfer command to reschedule the isoc transfers for the incoming requests.
This is probably the simpler workaround to the underrun issue of UVC.
This sounds like a good optimization too by itself. Would it be possible for you to implement something here to help get me started? Even if it's not perfect, I'll take what I can get. We are running up against the clock for trying to close things out (changes must be released to mainline and backported to 5.15 for Android).
BR, Thinh
Thanks, Jeff
On Fri, Oct 21, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Fri, Oct 21, 2022 at 04:43:52PM +0000, Thinh Nguyen wrote:
On Fri, Oct 21, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Fri, Oct 21, 2022 at 12:55:51AM +0000, Thinh Nguyen wrote:
On Thu, Oct 20, 2022, Thinh Nguyen wrote:
On Thu, Oct 20, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Wed, Oct 19, 2022 at 11:06:08PM +0000, Thinh Nguyen wrote: > Hi, > > On Wed, Oct 19, 2022, Jeff Vanhoof wrote: > > Hi Thinh, > > On Wed, Oct 19, 2022 at 07:08:27PM +0000, Thinh Nguyen wrote: > > > On Wed, Oct 19, 2022, Jeff Vanhoof wrote: > > <snip> > > > > > > > > > From what I can gather from the log, with the current changes it seems that > > > > after a missed isoc event few requests are staying longer than expected in the > > > > started_list (not getting reclaimed) and this is preventing the transmission > > > > from stopping/starting again, and opening the door for continuous stream of > > > > missed isoc events that cause what appears to the user as a frozen video. > > > > > > > > So one thought, if IOC bit is not set every frame, but IMI bit is, when a > > > > missed isoc related interrupt occurs it seems likely that more than one trb > > > > request will need to be reclaimed, but the current set of changes is not > > > > handling this. > > > > > > > > In the good transfer case this issue seems to be taken care of since the IOC > > > > bit is not set every frame and the reclaimation will loop through every item in > > > > the started_list and only stop if there are no additional trbs or if one has > > > > > > It should stop at the request that associated with the interrupt event, > > > whether it's because of IMI or IOC. > > > > In this case I was concerned that if multipled queued reqs did not have IOC bit > > set, but there was a missed isoc on one of the last reqs, whether or not we would > > reclaim all of the requests up to the missed isoc related req. I'm not sure if > > my concern is valid or not. > > > > There should be no problem. If there's an interrupt event indicating a > TRB completion, the driver will give back all the requests up to the > request associated with the interrupt event, and the controller will > continue processing the remaining TRBs. On the next TRB completion > event, the driver will again give back all the requests up to the > request associated with that event. >
I was testing with the following patch you suggested:
> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c > index 61fba2b7389b..8352f4b5dd9f 100644 > --- a/drivers/usb/dwc3/gadget.c > +++ b/drivers/usb/dwc3/gadget.c > @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, > if (event->status & DEPEVT_STATUS_SHORT && !chain) > return 1; > > + if (usb_endpoint_xfer_isoc(dep->endpoint.desc) && > + (event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain) > + return 1; > + > if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || > (trb->ctrl & DWC3_TRB_CTRL_LST)) > return 1; >
At this time the IMI bit was set for every frame. With these changes it appeared in case of missed isoc that sometimes not all requests would be reclaimed (enqueued != dequeued even 100ms after the last interrupt was handled). If the 1st req in the started_list was fine (IMI set, but not IOC), and a later req was the one actually missed, because of this status check the reclaimation could stop early and not clean up to the appropriate req. As
Oops. You're right.
suggested yesterday, I also tried only setting the IMI bit when no_interrupt is not set, however I was still seeing the complete freezes. After analyzing this issue a bit, I have updated the diff to look more like this:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index dfaf9ac24c4f..bb800a81815b 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -1230,8 +1230,9 @@ static void __dwc3_prepare_one_trb(struct dwc3_ep *dep, struct dwc3_trb *trb, trb->ctrl = DWC3_TRBCTL_ISOCHRONOUS; }
/* always enable Interrupt on Missed ISOC */
trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI;
/* enable Interrupt on Missed ISOC */
if ((!no_interrupt && !chain) || must_interrupt)
break;trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI;
Either all or none of the TRBs of a request is set with IMI, and not some.
case USB_ENDPOINT_XFER_BULK: @@ -3195,6 +3196,11 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
(event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain
&& (trb->ctrl & DWC3_TRB_CTRL_ISP_IMI))
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
Where the trb must have the IMI set before returning early. This seemed to make the freezes recoverable.
Can you try this revised change:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..a69d8c28d86b 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3654,7 +3654,7 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if ((trb->ctrl & DWC3_TRB_CTRL_HWO) && status != -ESHUTDOWN) return 1;
- if (event->status & DEPEVT_STATUS_SHORT && !chain)
I accidentally deleted a couple of lines here.
- if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain) return 1;
if ((trb->ctrl & DWC3_TRB_CTRL_IOC) ||
I meant to do this:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..cb65371572ee 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,9 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
@@ -3673,6 +3676,7 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, struct scatterlist *s; unsigned int num_queued = req->num_queued_sgs; unsigned int i;
- bool missed_isoc = false; int ret = 0;
for_each_sg(sg, s, num_queued, i) { @@ -3681,12 +3685,18 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, req->sg = sg_next(s); req->num_queued_sgs--;
if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC)
missed_isoc = true;
- ret = dwc3_gadget_ep_reclaim_completed_trb(dep, req, trb, event, status, true); if (ret) break; }
- if (missed_isoc)
ret = 1;
- return ret;
}
BR, Thinh
I tried out the following patch diff you provided and I did not see any iommu related crashes with these changes:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index dfaf9ac24c4f..50287437d6de 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3195,6 +3195,9 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain)
return 1;
if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
@@ -3211,6 +3214,7 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, struct scatterlist *s; unsigned int num_queued = req->num_queued_sgs; unsigned int i;
bool missed_isoc = false; int ret = 0;
for_each_sg(sg, s, num_queued, i) { @@ -3219,12 +3223,18 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, req->sg = sg_next(s); req->num_queued_sgs--;
if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC)
missed_isoc = true;
ret = dwc3_gadget_ep_reclaim_completed_trb(dep, req, trb, event, status, true); if (ret) break; }
if (missed_isoc)
ret = 1;
return ret;
}
As we discussed earlier, when uvc's complete function is called, if an -EXDEV is returned in the request's status, the uvc driver will begin to cancel its queue. With the current skip interrupt implementation in the uvc driver, if this occurs while the uvc driver is pumping the current frame, then there is no guarentee that the last request(s) will have had 'no_interrupt=0'. If the last requests passed to dwc3 had 'no_interrupt=1', these requests would eventually be placed at the end of the started_list. Since the IOC bit will not be set, and if no missed isoc event occurs on these requests, then the dwc3 driver will not be interrupted, leaving those remaining requests sitting in the started_list, and dwc3 will not perform an 'End Transfer' as expected. Once the uvc driver begins to pump the requests for the next frame, then it most likely will result in additional missed isoc events, with the result being an extended video freeze seen by the user.
I hope that other uvc driver maintainers can chime in here to help determine the correct path forward. With the skip interrupt implementation, the uvc driver should guarentee that the last request sent to dwc3 has 'no_interrupt=0', otherwise
Rather than guarenteeing no_interrupt or not, it's more important that the UVC maintains a constant queue of requests to the controller driver. Isoc transfers are meant to be sent at a constant rate which the endpoint is configured.
I agree with you on this, but it will probably always be a race with uvc queuing up one req at a time and dwc3 starting to transmit almost immediately. We can configure the streaming_interval on a product to kind of slow down or
No, it doesn't work that way. The controller would only send the data when the host requests for it. The host will request for data every 125us. So a UVC request will complete roughly every 125us. There should be no race with uvc.
delay the usb transfers, but between dwc3 and uvc driver it would be nice to have an interface that would allow pre-queuing a certain number of reqs before the transfer is actually started. If that is not possible, then uvc could instead prepare a number of reqs ahead of time and attempt to queue them each as fast a possible in a very tight loop.
All UVC gadget driver has to do is to maintain multiple requests prepared ahead of time and don't starve the controller driver. That is, don't let the queue reaches 0. Let's say uvc can only pump 16 requests at a time, split at least every 8 request with no_interrupt=0. So that uvc will have time to feed more requests when it gets the notification of 8 requests completed. It would have roughly 8x125us to queue more requests before underrun.
I recalled Dan mentioned that UVC gadget driver can queue up to 64 requests with no_interrupt=1 up to 15 requests. But I keep seeing that the gadget driver only "pumps" 16 requests and doesn't continue until they are completed. We can almost guarantee that it's going to be underrun. Can UVC "pumps" multiple times at once?
uvc will usually pump when new frames comes in or when a req's complete gets called. uvc should fill up front all the reqs required to transfer a frame (up to 64), but once the available reqs are filled or the frame completely queued up, it would take a new incoming frame or a kick via the complete call to have it attempt to fill any remaining reqs in the queue again. To me this looks ok to do, but for heavy transfers we have a somewhat smaller queue as a buffer, 48 reqs vs 64 reqs (64 - 16).
To note, for the very last request of a frame/buffer (the end of it) the uvc driver does set the no_interrupt to 0 for this request.
if a missed isoc error occurs, it becomes very likely that the next immediate set of frames could be dropped/cancelled because the dwc3 driver could not perform a timely 'End Transfer'.
For testing I implemented the following changes to see what I could do for this issue. Note that I am on an older implementation and it's missing a lot of the
Please use the latest kernel, there are a lot of fixes/improvement to dwc3 every kernel version.
I've been debugging the sg implementation and the skip interrupt implementation seperately, backporting what can be backported. I'm working off of a 5.10 kernel debugging various issues Dan Vacura was seeing on a 5.15 kernel on a newer product. What we had on our 5.10 based kernel was stable for uvc/dwc3, so we needed to understand what came in since that time that broke stability.
sg related implementation. The idea here is that if the queue is empty, and that req_int_count is non-zero then the last request likely had 'no_interrupt=1' set. And if this is the case then we will want to send some dummy request to dwc3 with 'no_interrupt=0' set to make sure that no requests get stuck in its started_list.
This is not efficient and unnecessary.
Agree, but to fix this in the uvc driver the correct way seemed a bit more complicated at first. I was thinking that the driver would always send one last request of the frame buffer once an error is seen, but I'm now thinking of a simpler solution. If we can update the uvc pump to prepare a number of requests and make sure that the last request has no_interrupt set to 0 before queuing them all up in a tight loop to the dwc3 driver, this would effectively solve this problem too. A bit of extra smarts this area might also address some of your concerns about uvc not pumping a lot of data at once (especially for the beginning of a frame). I believe we should prepare more reqs up front if the req queue is empty and less reqs if its already busy.
I'm hoping that someone can step up to help here :). If not, then this will be my next activity.
<snip>
Alternatively we may just not want to cancel the queue upon receiving -EXDEV and this could solve the problem too, but I don't think that it's such a great idea, especially if things start falling behind.
I hope that someone more fluent in this area of code can take a crack at improving/fixing this issue.
The changes above do seem to help dwc3 timely end its transfers, but mainly for cases where some requests are missed but the next immediate ones are not (i'm talking within a couple of hundred microseconds). Most of the time if missed isocs occurs for a frame that the remaining reqs in the started_list will likely also error out and the list will be emptied and dwc3 will still timely send 'End Transfer'. In reality this is to cover a corner case that can adversely affect the quality of the video being watched. Just wanted to be upfront with these details.
Thinh, any pointers on how we should proceed from here? It looks like your changes are working well.
You can add the underrun detection check to dwc3 whenever it receives a new request.
ie. When the new request comes, check if the last prepared TRB's HWO bit is cleared and if the endpoint is started, send End Transfer command to reschedule the isoc transfers for the incoming requests.
This is probably the simpler workaround to the underrun issue of UVC.
This sounds like a good optimization too by itself. Would it be possible for you to implement something here to help get me started? Even if it's not perfect, I'll take what I can get. We are running up against the clock for trying to close things out (changes must be released to mainline and backported to 5.15 for Android).
At the moment, I have very limited bandwidth to implement and run tests. However, if you or someone provide a patch, I can help review.
It should look something like this (pseudo code):
ep_queue(request) { add_pending_list(request) if (ep is isoc and enabled) { if (prev_trb.HWO == 0) { send_cmd(End Transfer) return; } } }
On End Transfer command completion: reclaim_and_giveback_requests(started_list)
BR, Thinh
On Fri, Oct 21, 2022 at 07:09:55PM +0000, Thinh Nguyen wrote:
On Fri, Oct 21, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Fri, Oct 21, 2022 at 04:43:52PM +0000, Thinh Nguyen wrote:
On Fri, Oct 21, 2022, Jeff Vanhoof wrote:
Hi Thinh,
On Fri, Oct 21, 2022 at 12:55:51AM +0000, Thinh Nguyen wrote:
On Thu, Oct 20, 2022, Thinh Nguyen wrote:
On Thu, Oct 20, 2022, Jeff Vanhoof wrote: > Hi Thinh, > > On Wed, Oct 19, 2022 at 11:06:08PM +0000, Thinh Nguyen wrote: > > Hi, > > > > On Wed, Oct 19, 2022, Jeff Vanhoof wrote: > > > Hi Thinh, > > > On Wed, Oct 19, 2022 at 07:08:27PM +0000, Thinh Nguyen wrote: > > > > On Wed, Oct 19, 2022, Jeff Vanhoof wrote: > > > > <snip> > > > > > > > > > > > > From what I can gather from the log, with the current changes it seems that > > > > > after a missed isoc event few requests are staying longer than expected in the > > > > > started_list (not getting reclaimed) and this is preventing the transmission > > > > > from stopping/starting again, and opening the door for continuous stream of > > > > > missed isoc events that cause what appears to the user as a frozen video. > > > > > > > > > > So one thought, if IOC bit is not set every frame, but IMI bit is, when a > > > > > missed isoc related interrupt occurs it seems likely that more than one trb > > > > > request will need to be reclaimed, but the current set of changes is not > > > > > handling this. > > > > > > > > > > In the good transfer case this issue seems to be taken care of since the IOC > > > > > bit is not set every frame and the reclaimation will loop through every item in > > > > > the started_list and only stop if there are no additional trbs or if one has > > > > > > > > It should stop at the request that associated with the interrupt event, > > > > whether it's because of IMI or IOC. > > > > > > In this case I was concerned that if multipled queued reqs did not have IOC bit > > > set, but there was a missed isoc on one of the last reqs, whether or not we would > > > reclaim all of the requests up to the missed isoc related req. I'm not sure if > > > my concern is valid or not. > > > > > > > There should be no problem. If there's an interrupt event indicating a > > TRB completion, the driver will give back all the requests up to the > > request associated with the interrupt event, and the controller will > > continue processing the remaining TRBs. On the next TRB completion > > event, the driver will again give back all the requests up to the > > request associated with that event. > > > > I was testing with the following patch you suggested: > > > diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c > > index 61fba2b7389b..8352f4b5dd9f 100644 > > --- a/drivers/usb/dwc3/gadget.c > > +++ b/drivers/usb/dwc3/gadget.c > > @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, > > if (event->status & DEPEVT_STATUS_SHORT && !chain) > > return 1; > > > > + if (usb_endpoint_xfer_isoc(dep->endpoint.desc) && > > + (event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain) > > + return 1; > > + > > if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || > > (trb->ctrl & DWC3_TRB_CTRL_LST)) > > return 1; > > > > At this time the IMI bit was set for every frame. With these changes it > appeared in case of missed isoc that sometimes not all requests would be > reclaimed (enqueued != dequeued even 100ms after the last interrupt was > handled). If the 1st req in the started_list was fine (IMI set, but not IOC), > and a later req was the one actually missed, because of this status check the > reclaimation could stop early and not clean up to the appropriate req. As
Oops. You're right.
> suggested yesterday, I also tried only setting the IMI bit when no_interrupt is > not set, however I was still seeing the complete freezes. After analyzing this > issue a bit, I have updated the diff to look more like this: > > diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c > index dfaf9ac24c4f..bb800a81815b 100644 > --- a/drivers/usb/dwc3/gadget.c > +++ b/drivers/usb/dwc3/gadget.c > @@ -1230,8 +1230,9 @@ static void __dwc3_prepare_one_trb(struct dwc3_ep *dep, struct dwc3_trb *trb, > trb->ctrl = DWC3_TRBCTL_ISOCHRONOUS; > } > > - /* always enable Interrupt on Missed ISOC */ > - trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI; > + /* enable Interrupt on Missed ISOC */ > + if ((!no_interrupt && !chain) || must_interrupt) > + trb->ctrl |= DWC3_TRB_CTRL_ISP_IMI; > break;
Either all or none of the TRBs of a request is set with IMI, and not some.
> > case USB_ENDPOINT_XFER_BULK: > @@ -3195,6 +3196,11 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, > if (event->status & DEPEVT_STATUS_SHORT && !chain) > return 1; > > + if (usb_endpoint_xfer_isoc(dep->endpoint.desc) && > + (event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain > + && (trb->ctrl & DWC3_TRB_CTRL_ISP_IMI)) > + return 1; > + > if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || > (trb->ctrl & DWC3_TRB_CTRL_LST)) > return 1; > > Where the trb must have the IMI set before returning early. This seemed to make > the freezes recoverable.
Can you try this revised change:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..a69d8c28d86b 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3654,7 +3654,7 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if ((trb->ctrl & DWC3_TRB_CTRL_HWO) && status != -ESHUTDOWN) return 1;
- if (event->status & DEPEVT_STATUS_SHORT && !chain)
I accidentally deleted a couple of lines here.
- if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain) return 1;
if ((trb->ctrl & DWC3_TRB_CTRL_IOC) ||
I meant to do this:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 61fba2b7389b..cb65371572ee 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3657,6 +3657,9 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
- if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain)
return 1;
- if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
@@ -3673,6 +3676,7 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, struct scatterlist *s; unsigned int num_queued = req->num_queued_sgs; unsigned int i;
- bool missed_isoc = false; int ret = 0;
for_each_sg(sg, s, num_queued, i) { @@ -3681,12 +3685,18 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, req->sg = sg_next(s); req->num_queued_sgs--;
if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC)
missed_isoc = true;
- ret = dwc3_gadget_ep_reclaim_completed_trb(dep, req, trb, event, status, true); if (ret) break; }
- if (missed_isoc)
ret = 1;
- return ret;
}
BR, Thinh
I tried out the following patch diff you provided and I did not see any iommu related crashes with these changes:
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index dfaf9ac24c4f..50287437d6de 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -3195,6 +3195,9 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep, if (event->status & DEPEVT_STATUS_SHORT && !chain) return 1;
if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC && !chain)
return 1;
if ((trb->ctrl & DWC3_TRB_CTRL_IOC) || (trb->ctrl & DWC3_TRB_CTRL_LST)) return 1;
@@ -3211,6 +3214,7 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, struct scatterlist *s; unsigned int num_queued = req->num_queued_sgs; unsigned int i;
bool missed_isoc = false; int ret = 0;
for_each_sg(sg, s, num_queued, i) { @@ -3219,12 +3223,18 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct dwc3_ep *dep, req->sg = sg_next(s); req->num_queued_sgs--;
if (DWC3_TRB_SIZE_TRBSTS(trb->size) == DWC3_TRBSTS_MISSED_ISOC)
missed_isoc = true;
ret = dwc3_gadget_ep_reclaim_completed_trb(dep, req, trb, event, status, true); if (ret) break; }
if (missed_isoc)
ret = 1;
return ret;
}
As we discussed earlier, when uvc's complete function is called, if an -EXDEV is returned in the request's status, the uvc driver will begin to cancel its queue. With the current skip interrupt implementation in the uvc driver, if this occurs while the uvc driver is pumping the current frame, then there is no guarentee that the last request(s) will have had 'no_interrupt=0'. If the last requests passed to dwc3 had 'no_interrupt=1', these requests would eventually be placed at the end of the started_list. Since the IOC bit will not be set, and if no missed isoc event occurs on these requests, then the dwc3 driver will not be interrupted, leaving those remaining requests sitting in the started_list, and dwc3 will not perform an 'End Transfer' as expected. Once the uvc driver begins to pump the requests for the next frame, then it most likely will result in additional missed isoc events, with the result being an extended video freeze seen by the user.
I hope that other uvc driver maintainers can chime in here to help determine the correct path forward. With the skip interrupt implementation, the uvc driver should guarentee that the last request sent to dwc3 has 'no_interrupt=0', otherwise
Rather than guarenteeing no_interrupt or not, it's more important that the UVC maintains a constant queue of requests to the controller driver. Isoc transfers are meant to be sent at a constant rate which the endpoint is configured.
I agree with you on this, but it will probably always be a race with uvc queuing up one req at a time and dwc3 starting to transmit almost immediately. We can configure the streaming_interval on a product to kind of slow down or
No, it doesn't work that way. The controller would only send the data when the host requests for it. The host will request for data every 125us. So a UVC request will complete roughly every 125us. There should be no race with uvc.
I wasn't very clear here. I was talking about the UVC Gadget Driver, about how uvcg_video_pump queues up requests to the dwc3 driver via uvcg_video_ep_queue (->usb_ep_queue->dwc3_gadget_ep_queue).
delay the usb transfers, but between dwc3 and uvc driver it would be nice to have an interface that would allow pre-queuing a certain number of reqs before the transfer is actually started. If that is not possible, then uvc could instead prepare a number of reqs ahead of time and attempt to queue them each as fast a possible in a very tight loop.
All UVC gadget driver has to do is to maintain multiple requests prepared ahead of time and don't starve the controller driver. That is, don't let the queue reaches 0. Let's say uvc can only pump 16 requests at a time, split at least every 8 request with no_interrupt=0. So that uvc will have time to feed more requests when it gets the notification of 8 requests completed. It would have roughly 8x125us to queue more requests before underrun.
I think we are saying the same thing. I'll try to be more clear going forward. :)
I recalled Dan mentioned that UVC gadget driver can queue up to 64 requests with no_interrupt=1 up to 15 requests. But I keep seeing that the gadget driver only "pumps" 16 requests and doesn't continue until they are completed. We can almost guarantee that it's going to be underrun. Can UVC "pumps" multiple times at once?
uvc will usually pump when new frames comes in or when a req's complete gets called. uvc should fill up front all the reqs required to transfer a frame (up to 64), but once the available reqs are filled or the frame completely queued up, it would take a new incoming frame or a kick via the complete call to have it attempt to fill any remaining reqs in the queue again. To me this looks ok to do, but for heavy transfers we have a somewhat smaller queue as a buffer, 48 reqs vs 64 reqs (64 - 16).
To note, for the very last request of a frame/buffer (the end of it) the uvc driver does set the no_interrupt to 0 for this request.
if a missed isoc error occurs, it becomes very likely that the next immediate set of frames could be dropped/cancelled because the dwc3 driver could not perform a timely 'End Transfer'.
For testing I implemented the following changes to see what I could do for this issue. Note that I am on an older implementation and it's missing a lot of the
Please use the latest kernel, there are a lot of fixes/improvement to dwc3 every kernel version.
I've been debugging the sg implementation and the skip interrupt implementation seperately, backporting what can be backported. I'm working off of a 5.10 kernel debugging various issues Dan Vacura was seeing on a 5.15 kernel on a newer product. What we had on our 5.10 based kernel was stable for uvc/dwc3, so we needed to understand what came in since that time that broke stability.
sg related implementation. The idea here is that if the queue is empty, and that req_int_count is non-zero then the last request likely had 'no_interrupt=1' set. And if this is the case then we will want to send some dummy request to dwc3 with 'no_interrupt=0' set to make sure that no requests get stuck in its started_list.
This is not efficient and unnecessary.
Agree, but to fix this in the uvc driver the correct way seemed a bit more complicated at first. I was thinking that the driver would always send one last request of the frame buffer once an error is seen, but I'm now thinking of a simpler solution. If we can update the uvc pump to prepare a number of requests and make sure that the last request has no_interrupt set to 0 before queuing them all up in a tight loop to the dwc3 driver, this would effectively solve this problem too. A bit of extra smarts this area might also address some of your concerns about uvc not pumping a lot of data at once (especially for the beginning of a frame). I believe we should prepare more reqs up front if the req queue is empty and less reqs if its already busy.
I'm hoping that someone can step up to help here :). If not, then this will be my next activity.
<snip>
Alternatively we may just not want to cancel the queue upon receiving -EXDEV and this could solve the problem too, but I don't think that it's such a great idea, especially if things start falling behind.
I hope that someone more fluent in this area of code can take a crack at improving/fixing this issue.
The changes above do seem to help dwc3 timely end its transfers, but mainly for cases where some requests are missed but the next immediate ones are not (i'm talking within a couple of hundred microseconds). Most of the time if missed isocs occurs for a frame that the remaining reqs in the started_list will likely also error out and the list will be emptied and dwc3 will still timely send 'End Transfer'. In reality this is to cover a corner case that can adversely affect the quality of the video being watched. Just wanted to be upfront with these details.
Thinh, any pointers on how we should proceed from here? It looks like your changes are working well.
You can add the underrun detection check to dwc3 whenever it receives a new request.
ie. When the new request comes, check if the last prepared TRB's HWO bit is cleared and if the endpoint is started, send End Transfer command to reschedule the isoc transfers for the incoming requests.
This is probably the simpler workaround to the underrun issue of UVC.
This sounds like a good optimization too by itself. Would it be possible for you to implement something here to help get me started? Even if it's not perfect, I'll take what I can get. We are running up against the clock for trying to close things out (changes must be released to mainline and backported to 5.15 for Android).
At the moment, I have very limited bandwidth to implement and run tests. However, if you or someone provide a patch, I can help review.
It should look something like this (pseudo code):
ep_queue(request) { add_pending_list(request) if (ep is isoc and enabled) { if (prev_trb.HWO == 0) { send_cmd(End Transfer) return; } } }
On End Transfer command completion: reclaim_and_giveback_requests(started_list)
BR, Thinh
Thanks, Jeff
linux-stable-mirror@lists.linaro.org