On 10/10/19 10:46 AM, Peter Zijlstra wrote:
On Thu, Oct 10, 2019 at 05:19:01PM +0000, Alexei Starovoitov wrote:
On 10/10/19 12:36 AM, Peter Zijlstra wrote:
On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A deadlock on rq_lock():
rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [...] Call Trace: try_to_wake_up+0x1ad/0x590 wake_up_q+0x54/0x80 rwsem_wake+0x8a/0xb0 bpf_get_stack+0x13c/0x150 bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000 bpf_overflow_handler+0x60/0x100 __perf_event_overflow+0x4f/0xf0 perf_swevent_overflow+0x99/0xc0 ___perf_sw_event+0xe7/0x120 __schedule+0x47d/0x620 schedule+0x29/0x90 futex_wait_queue_me+0xb9/0x110 futex_wait+0x139/0x230 do_futex+0x2ac/0xa50 __x64_sys_futex+0x13c/0x180 do_syscall_64+0x42/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index 052580c33d26..3b278f6b0c3e 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs, bool irq_work_busy = false; struct stack_map_irq_work *work = NULL;
- if (in_nmi()) {
- if (in_nmi() || this_rq_is_locked()) { work = this_cpu_ptr(&up_read_work); if (work->irq_work.flags & IRQ_WORK_BUSY) /* cannot queue more up_read, fallback */
This is horrific crap. Just say no to that get_build_id_offset() trainwreck.
this is not a helpful comment. What issues do you see with this approach?
It will still generate deadlocks if I place a tracepoint inside a lock that nests inside rq->lock, and it won't ever be able to detect that. Say do the very same thing on trace_hrtimer_start(), which is under cpu_base->lock, which nests inside rq->lock. That should give you an AB-BA.
tracepoints / perf-overflow should _never_ take locks.
All of stack_map_get_build_id_offset() is just disguisting games; I did tell you guys how to do lockless vma lookups a few years ago -- and yes, that is invasive core mm surgery. But this is just disguisting hacks for not wanting to do it right.
you mean speculative page fault stuff? That was my hope as well and I offered Laurent all the help to land it. Yet after a year since we've talked the patches are not any closer to landing. Any other 'invasive mm surgery' you have in mind?
Basically the only semi-sane thing to do with that trainwreck is s/in_nmi()/true/ and pray.
On top of that I just hate buildids in general.
Emotions aside... build_id is useful and used in production. It's used widely because it solves real problems. This dead lock is from real servers and not from some sanitizer wannabe. Hence we need to fix it as cleanly as possible and quickly. s/in_nmi/true/ is certainly an option. I'm worried about overhead of doing irq_work_queue() all the time. But I'm not familiar with mechanism enough to justify the concerns. Would it make sense to do s/in_nmi/irgs_disabled/ instead?