Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event

18 Jul 2025

Hi Jakub,
On 7/16/25 11:39 PM, Jakub Kicinski wrote:
...
On Wed, 16 Jul 2025 14:54:46 +0000 John Ernberg wrote:
...
I ended up with the following log:
[   23.823289] cdc_ether 1-1.1:1.8 wwan0: network connection 0
[   23.830874] cdc_ether 1-1.1:1.8 wwan0: unlink urb start: 5 devflags=1880
[   23.840148] cdc_ether 1-1.1:1.8 wwan0: unlink urb counted 5
[   25.356741] cdc_ether 1-1.1:1.8 wwan0: network connection 1
[   25.364745] cdc_ether 1-1.1:1.8 wwan0: network connection 0
[   25.371106] cdc_ether 1-1.1:1.8 wwan0: unlink urb start: 5 devflags=880
[   25.378710] cdc_ether 1-1.1:1.8 wwan0: network connection 1
[   51.422757] rcu: INFO: rcu_sched self-detected stall on CPU
[   51.429081] rcu:     0-....: (6499 ticks this GP) idle=da7c/1/0x4000000000000000 softirq=2067/2067 fqs=2668
[   51.439717] rcu:              hardirqs   softirqs   csw/system
[   51.445897] rcu:      number:    62096      59017            0
[   51.452107] rcu:     cputime:        0      11397         1470   ==> 12996(ms)
[   51.459852] rcu:     (t=6500 jiffies g=2397 q=663 ncpus=2)
From a USB capture where the stall didn't happen I can see:

A bunch of CDC_NETWORK_CONNECTION events with Disconnected state (0).
Then a CDC_NETWORK_CONNECTION event with Connected state (1) once the
WWAN interface is turned on by the modem.

Followed by a Disconnected in the next USB INTR poll.
Followed by a Connected in the next USB INTR poll.

(I'm not sure if I can achieve a different timing with enough captures
or a faster system)
Which makes the off and on LINK_CHANGE events race on our system (ARM64
based, iMX8QXP) as they cannot be handled fast enough. Nothing stops
usbnet_link_change() from being called while the deferred work is running.
As Oliver points out, usbnet_resume_rx() causes scheduling which seems
unnecessary or maybe even inappropriate for all cases except when the
carrier was turned on during the race.
I gave the ZTE modem quirk a go anyway, despite the comment explaining a
different situation than what I am seeing, and it has no observable
effect on this RCU stall.
Currently drawing a blank on what the correct fix would be.
Thanks for the analysis, I think I may have misread the code.
What I was saying is that we are restoring the carrier while
we are still processing the previous carrier off event in
the workqueue. My thinking was that if we deferred the
netif_carrier_on() to the workqueue this race couldn't happen.
usbnet_bh() already checks netif_carrier_ok() - we're kinda duplicating
the carrier state with this RX_PAUSED workaround.
I don't feel strongly about this, but deferring the carrier_on()
to the workqueue would be a cleaner solution IMO.
I've been thinking about this idea, but I'm concerned about the opposite
direction. I cannot think of a way to fully guarantee that the carrier
isn't turned on again incorrectly if an off gets queued.
The most I came up with was adding an extra flag bit to set carrier on, 
and then test_and_clear_bit() it in the __handle_link_change() function.
And also clear_bit() in the usbnet_link_change() function if an off 
arrives. I cannot convince myself that there isn't a way for that to go 
sideways. But perhaps that would be robust enough?
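To make the flag idea concrete, here is a rough userspace model of it, not kernel code and not a proposed patch. It assumes C11 atomics as a stand-in for the kernel bit operations, and every `model_*` / `MODEL_*` name is hypothetical: `model_link_change()` stands in for usbnet_link_change(), `model_handle_link_change()` for __handle_link_change(), and `MODEL_EVENT_CARRIER_ON` for the extra flag bit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits for this model; not existing usbnet EVENT_* values. */
#define MODEL_EVENT_CARRIER_ON  (1u << 0)
#define MODEL_EVENT_LINK_CHANGE (1u << 1)

struct model_dev {
	atomic_uint flags;	/* stands in for dev->flags */
	bool carrier;		/* stands in for the netdev carrier state */
};

/* Userspace stand-in for test_and_clear_bit(). */
static bool model_test_and_clear(atomic_uint *flags, unsigned int mask)
{
	return (atomic_fetch_and(flags, ~mask) & mask) != 0;
}

/* Models usbnet_link_change(), i.e. the status (INTR) path. */
static void model_link_change(struct model_dev *dev, bool link)
{
	if (link) {
		/* Leave the carrier alone; ask the worker to turn it on. */
		atomic_fetch_or(&dev->flags, MODEL_EVENT_CARRIER_ON);
	} else {
		/* An off cancels any carrier-on the worker has not handled. */
		atomic_fetch_and(&dev->flags, ~MODEL_EVENT_CARRIER_ON);
		dev->carrier = false;
	}
	atomic_fetch_or(&dev->flags, MODEL_EVENT_LINK_CHANGE);
}

/* Models __handle_link_change(), i.e. the deferred work. */
static void model_handle_link_change(struct model_dev *dev)
{
	if (!model_test_and_clear(&dev->flags, MODEL_EVENT_LINK_CHANGE))
		return;

	/* The real handler would deal with the RX URBs around here. */

	if (model_test_and_clear(&dev->flags, MODEL_EVENT_CARRIER_ON))
		dev->carrier = true;
}

/* Walks through the two sequential orderings discussed above. */
static void model_demo(void)
{
	struct model_dev dev = { .carrier = false };

	atomic_init(&dev.flags, 0);

	/* on, then the worker runs: carrier comes up only in the worker */
	model_link_change(&dev, true);
	assert(!dev.carrier);
	model_handle_link_change(&dev);
	assert(dev.carrier);

	/* on, then off before the worker runs: carrier must stay down */
	model_link_change(&dev, true);
	model_link_change(&dev, false);
	model_handle_link_change(&dev);
	assert(!dev.carrier);
}
```

The model is single-threaded, so it only covers events that arrive strictly before or after the worker runs. An off that lands between the worker's test-and-clear and its carrier update is not covered, and that is the window that would still need closing in a real implementation.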
I've also considered the possibility of just not re-submitting the INTR 
poll URB until the last one was fully processed when handling a link 
change. But that might cause havoc with ASIX and Sierra devices as they 
are calling usbnet_link_change() in other ways than through the 
.status-callback. I don't have any of these devices so I cannot test 
them for regressions. So this path feels quite dangerous.
With a sub-driver property to enable this behavior it might work out?
Thanks! // John Ernberg
