syzbot has bisected this bug to:
commit e7096c131e5161fa3b8e52a650d7719d2857adfd
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sun Dec 8 23:27:34 2019 +0000

    net: WireGuard secure network tunnel
bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=15258fcfe00000
start commit:   b2768df2 Merge branch 'for-linus' of git://git.kernel.org/..
git tree:       upstream
final crash:    https://syzkaller.appspot.com/x/report.txt?x=17258fcfe00000
console output: https://syzkaller.appspot.com/x/log.txt?x=13258fcfe00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=b7a70e992f2f9b68
dashboard link: https://syzkaller.appspot.com/bug?extid=0251e883fe39e7a0cb0a
userspace arch: i386
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=15f5f47fe00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11e8efb4100000
Reported-by: syzbot+0251e883fe39e7a0cb0a@syzkaller.appspotmail.com
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
On 4/26/20 10:57 AM, syzbot wrote:
> syzbot has bisected this bug to:
>
> commit e7096c131e51 ("net: WireGuard secure network tunnel")
> [...]
I have not looked at the repro closely, but WireGuard has some workers that might loop forever; cond_resched() might help a bit.
diff --git a/drivers/net/wireguard/receive.c b/drivers/net/wireguard/receive.c
index da3b782ab7d31df11e381529b144bcc494234a38..349a71e1907e081c61967c77c9f25a6ec5e57a24 100644
--- a/drivers/net/wireguard/receive.c
+++ b/drivers/net/wireguard/receive.c
@@ -518,6 +518,7 @@ void wg_packet_decrypt_worker(struct work_struct *work)
 				&PACKET_CB(skb)->keypair->receiving)) ?
 				PACKET_STATE_CRYPTED : PACKET_STATE_DEAD;
 		wg_queue_enqueue_per_peer_napi(skb, state);
+		cond_resched();
 	}
 }
diff --git a/drivers/net/wireguard/send.c b/drivers/net/wireguard/send.c
index 7348c10cbae3db54bfcb31f23c2753185735f876..f5b88693176c84b4bfdf8c4e05071481a3ce45b5 100644
--- a/drivers/net/wireguard/send.c
+++ b/drivers/net/wireguard/send.c
@@ -281,6 +281,7 @@ void wg_packet_tx_worker(struct work_struct *work)

 		wg_noise_keypair_put(keypair, false);
 		wg_peer_put(peer);
+		cond_resched();
 	}
 }
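The shape of those loops, for anyone not staring at the driver: a workqueue handler draining a queue, where a long backlog would otherwise pin the CPU on a non-preemptible kernel and can trip the RCU stall detector. A minimal sketch with hypothetical stand-ins (my_queue_dequeue(), my_handle_packet()), not the actual WireGuard code:

#include <linux/sched.h>
#include <linux/skbuff.h>
#include <linux/workqueue.h>

struct sk_buff *my_queue_dequeue(struct work_struct *work); /* stand-in */
void my_handle_packet(struct sk_buff *skb);                 /* stand-in */

static void my_packet_worker(struct work_struct *work)
{
	struct sk_buff *skb;

	while ((skb = my_queue_dequeue(work)) != NULL) {
		my_handle_packet(skb);
		/* Voluntary preemption point: lets other tasks (and
		 * RCU grace periods) make progress during a long
		 * drain on non-preemptible kernels. */
		cond_resched();
	}
}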
On Sun, Apr 26, 2020 at 1:40 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 4/26/20 10:57 AM, syzbot wrote:
> > syzbot has bisected this bug to:
> > commit e7096c131e51 ("net: WireGuard secure network tunnel")
> > [...]
>
> I have not looked at the repro closely, but WireGuard has some workers that might loop forever; cond_resched() might help a bit.
I'm working on this right now, but I'm having a bit of a difficult time getting it to reproduce locally...

The reports show the stall always happening at:
static struct sk_buff *sfq_dequeue(struct Qdisc *sch)
{
	struct sfq_sched_data *q = qdisc_priv(sch);
	struct sk_buff *skb;
	sfq_index a, next_a;
	struct sfq_slot *slot;

	/* No active slots */
	if (q->tail == NULL)
		return NULL;

next_slot:
	a = q->tail->next;
	slot = &q->slots[a];
Which is kind of interesting, because it's not like that should block or anything, unless there's some KASAN faulting happening.
It looks like part of the issue might be that I call udp_tunnel6_xmit_skb while holding rcu_read_lock_bh, in drivers/net/wireguard/socket.c. But I think there's good reason to do so, and udp_tunnel6_xmit_skb should be RCU-safe. In fact, every.single.other user of udp_tunnel6_xmit_skb in the kernel uses it with RCU locked. So, hm...
On Sun, Apr 26, 2020 at 1:52 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> It looks like part of the issue might be that I call udp_tunnel6_xmit_skb while holding rcu_read_lock_bh, in drivers/net/wireguard/socket.c. But I think there's good reason to do so, and udp_tunnel6_xmit_skb should be RCU-safe. In fact, every.single.other user of udp_tunnel6_xmit_skb in the kernel uses it with RCU locked. So, hm...
In the syzkaller log, it looks like several runs are hitting:
run #0: crashed: INFO: rcu detected stall in netlink_sendmsg
And other runs are hitting yet different functions. So actually, it's not clear that this is the fault of the call to udp_tunnel6_xmit_skb.
On 4/26/20 12:42 PM, Jason A. Donenfeld wrote:
> On Sun, Apr 26, 2020 at 1:40 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > I have not looked at the repro closely, but WireGuard has some workers that might loop forever; cond_resched() might help a bit.
>
> I'm working on this right now, but I'm having a bit of a difficult time getting it to reproduce locally...
>
> The reports show the stall always happening at sfq_dequeue(). [...]
>
> Which is kind of interesting, because it's not like that should block or anything, unless there's some KASAN faulting happening.
I am not really sure WireGuard is involved; the repro does not rely on it anyway.
On 4/26/20 1:26 PM, Eric Dumazet wrote:
> On 4/26/20 12:42 PM, Jason A. Donenfeld wrote:
> > [...]
>
> I am not really sure WireGuard is involved; the repro does not rely on it anyway.
Yes, do not spend too much time on this.

syzbot found its way into crazy qdisc settings over the last few days.

(I sent a patch yesterday for the choke qdisc; it seems similar checks are needed in sfq.)
On Sun, Apr 26, 2020 at 2:38 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 4/26/20 1:26 PM, Eric Dumazet wrote:
> > I am not really sure WireGuard is involved; the repro does not rely on it anyway.
>
> Yes, do not spend too much time on this.
>
> syzbot found its way into crazy qdisc settings over the last few days.
>
> (I sent a patch yesterday for the choke qdisc; it seems similar checks are needed in sfq.)
Ah, whew, okay. I had just begun instrumenting sfq (the highly technical term for "adding printks everywhere") to figure out what's going on. Looks like you've got a handle on it, so I'll let you have at it.
On the brighter side, it seems like Dmitry's and my effort to get full coverage of WireGuard has paid off in the sense that tons of packets wind up being shoveled through it in one way or another, which is good.
On 4/26/20 1:46 PM, Jason A. Donenfeld wrote:
> Ah, whew, okay. I had just begun instrumenting sfq (the highly technical term for "adding printks everywhere") to figure out what's going on. Looks like you've got a handle on it, so I'll let you have at it.
Yes, syzbot manages to put a zero in q->scaled_quantum.

I will send a fix.
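To spell out why that zero is fatal: in sfq_dequeue(), when a slot's allot is exhausted, the loop does slot->allot += q->scaled_quantum and jumps back to next_slot, so with scaled_quantum == 0 the allot never replenishes and the loop can cycle through the active slots forever without dequeuing anything. Something along these lines in sfq_change() would close it off; this is a hypothetical sketch using the SFQ_ALLOT_SIZE macro from net/sched/sch_sfq.c, not necessarily the actual fix:

/* Hypothetically, in sfq_change(): reject quantum values whose scaled
 * form is zero (e.g. via overflow of a huge quantum), so the dequeue
 * loop can always replenish a slot's allot and make forward progress. */
if (ctl->quantum) {
	unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum);

	if (scaled == 0 || scaled > SFQ_ALLOT_SIZE(INT_MAX))
		return -EINVAL;
}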
> On the brighter side, it seems like Dmitry's and my effort to get full coverage of WireGuard has paid off in the sense that tons of packets wind up being shoveled through it in one way or another, which is good.
Sure!
So in spite of this syzkaller bug being unrelated in the end, I've continued to think about the stack trace a bit, and combined with some other [potentially false alarm] bug reports I'm trying to wrap my head around, I'm a bit curious about ideal usage of the udp_tunnel API.
All the uses I've seen in the kernel (including wireguard) follow this pattern:
rcu_read_lock_bh();
sock = rcu_dereference(obj->sock);
...
udp_tunnel_xmit_skb(..., sock, ...);
rcu_read_unlock_bh();
udp_tunnel_xmit_skb calls iptunnel_xmit, which winds up in the usual ip_local_out path, which eventually winds up calling some other device's ndo_xmit, or gets queued up in a qdisc. Calls to udp_tunnel_xmit_skb aren't exactly cheap. So I wonder: is holding the RCU lock for all that time really a good thing?
A different pattern that avoids holding the rcu lock would be:
rcu_read_lock_bh();
sock = rcu_dereference(obj->sock);
sock_hold(sock);
rcu_read_unlock_bh();
...
udp_tunnel_xmit_skb(..., sock, ...);
sock_put(sock);
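Fleshed out, that second pattern might look something like this; struct my_tunnel and my_udp_tunnel_xmit() are hypothetical stand-ins (the latter eliding udp_tunnel_xmit_skb's long argument list), just to show the refcounting dance:

#include <linux/rcupdate.h>
#include <linux/skbuff.h>
#include <net/sock.h>

struct my_tunnel {
	struct sock __rcu *sock;
};

void my_udp_tunnel_xmit(struct sock *sk, struct sk_buff *skb); /* stand-in */

static void my_tunnel_send(struct my_tunnel *tun, struct sk_buff *skb)
{
	struct sock *sock;

	rcu_read_lock_bh();
	sock = rcu_dereference_bh(tun->sock);
	if (!sock) {
		rcu_read_unlock_bh();
		kfree_skb(skb);
		return;
	}
	sock_hold(sock);	/* pin the socket before dropping the lock */
	rcu_read_unlock_bh();

	/* The potentially expensive transmit now runs outside the RCU
	 * BH critical section; the reference keeps the socket alive. */
	my_udp_tunnel_xmit(sock, skb);
	sock_put(sock);
}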
This seems better, but I wonder if it has some drawbacks too. For example, sock_put has a comment that warns against incrementing the refcount in response to forwarded packets. And if this isn't necessary, it's marginally more costly than the first pattern.
Any opinions about this?
Jason