Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event

16 Jul 2025

Hi Jakub, Oliver,
On 7/15/25 3:54 PM, Jakub Kicinski wrote:
...
On Tue, 15 Jul 2025 07:15:51 +0000 John Ernberg wrote:
...
...
I'm worried that this is still racy.
Since usbnet_bh checks if carrier is ok and __handle_link_change()
checks the opposite something must be out of sync if both run.
Most likely something restored the carrier while we're still handling
the previous carrier loss.
There could definitely be other factors, I'll try to dig some in
cdc_ether and see if something there could be causing problems for the
usbnet core.
I honestly kinda stopped digging when I found unlink_urbs() being
wrapped with a pause/unpause at another place tied to a commit seeing a
similar issue.
Looking at cdc_ether:
static void usbnet_cdc_zte_status(struct usbnet *dev, struct urb *urb)
[...]
         if (event->wValue &&
             netif_carrier_ok(dev->net))
                 netif_carrier_off(dev->net);
This looks sus. Is the Gemalto Cinterion PLS83-W ZTE based?
The modem isn't ZTE-based, but I did spot that function as well when 
checking cdc_ether.
The modem is advertising USB High Speed with a bInterval of 5, which in 
usbnet gets bumped to 7 to avoid too aggressive polling, if I read that 
code right.
Through some rudimentary logging with the inlined patch on 6.12.20 (mail 
client mangled):
-----------------8<----------------------

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index a6469235d904..7b4827d2c797 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -418,8 +418,10 @@ void usbnet_cdc_status(struct usbnet *dev, struct 
urb *urb)
         case USB_CDC_NOTIFY_NETWORK_CONNECTION:
                 netif_dbg(dev, timer, dev->net, "CDC: carrier %s\n",
                           event->wValue ? "on" : "off");
-               if (netif_carrier_ok(dev->net) != !!event->wValue)
+               if (netif_carrier_ok(dev->net) != !!event->wValue) {
+                       netdev_err(dev->net, "network connection %d\n", 
(int)event->wValue);
                         usbnet_link_change(dev, !!event->wValue, 0);
+               }
                 break;
         case USB_CDC_NOTIFY_SPEED_CHANGE:       /* tx/rx rates */
                 netif_dbg(dev, timer, dev->net, "CDC: speed change (len 
%d)\n",
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 1e99a6ef4945..dc3e3d87b4f7 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -721,6 +721,7 @@ static int unlink_urbs (struct usbnet *dev, struct 
sk_buff_head *q)
         struct sk_buff          *skb;
         int                     count = 0;
+       netdev_err(dev->net, "unlink urb start: %d devflags=%lx\n", 
skb_queue_len_lockless(q), dev->flags);
         spin_lock_irqsave (&q->lock, flags);
         while (!skb_queue_empty(q)) {
                 struct skb_data         *entry;
@@ -757,6 +758,7 @@ static int unlink_urbs (struct usbnet *dev, struct 
sk_buff_head *q)
                 spin_lock_irqsave(&q->lock, flags);
         }
         spin_unlock_irqrestore (&q->lock, flags);
+       netdev_err(dev->net, "unlink urb counted %d\n", count);
         return count;
  }
-----------------8<----------------------
I ended up with the following log:
[   23.823289] cdc_ether 1-1.1:1.8 wwan0: network connection 0
[   23.830874] cdc_ether 1-1.1:1.8 wwan0: unlink urb start: 5 devflags=1880
[   23.840148] cdc_ether 1-1.1:1.8 wwan0: unlink urb counted 5
[   25.356741] cdc_ether 1-1.1:1.8 wwan0: network connection 1
[   25.364745] cdc_ether 1-1.1:1.8 wwan0: network connection 0
[   25.371106] cdc_ether 1-1.1:1.8 wwan0: unlink urb start: 5 devflags=880
[   25.378710] cdc_ether 1-1.1:1.8 wwan0: network connection 1
[   51.422757] rcu: INFO: rcu_sched self-detected stall on CPU
[   51.429081] rcu:     0-....: (6499 ticks this GP) 
idle=da7c/1/0x4000000000000000 softirq=2067/2067 fqs=2668
[   51.439717] rcu:              hardirqs   softirqs   csw/system
[   51.445897] rcu:      number:    62096      59017            0
[   51.452107] rcu:     cputime:        0      11397         1470   ==> 
12996(ms)
[   51.459852] rcu:     (t=6500 jiffies g=2397 q=663 ncpus=2)
From a USB capture where the stall didn't happen I can see:
* A bunch of CDC_NETWORK_CONNECTION events with Disconnected state (0).
* Then a CDC_NETWORK_CONNECTION event with Connected state (1) once the 
WWAN interface is turned on by the modem.
* Followed by a Disconnected in the next USB INTR poll.
* Followed by a Connected in the next USB INTR poll.
(I'm not sure if I can achieve a different timing with enough captures 
or a faster system)
Which makes the off and on LINK_CHANGE events race on our system (ARM64 
based, iMX8QXP) as they cannot be handled fast enough. Nothing stops 
usbnet_link_change() from being called while the deferred work is running.
As Oliver points out usbnet_resume_rx() causes scheduling which seems 
unnecessary or maybe even inappropriate for all cases except when the 
carrier was turned on during the race.
I gave the ZTE modem quirk a go anyway, despite the comment explaining a 
different situation than what I am seeing, and it has no observable 
effect on this RCU stall.
Currently drawing a blank on what the correct fix would be.
Best regards // John Ernberg

    

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event