bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A deadlock on rq_lock():
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[...]
Call Trace:
 try_to_wake_up+0x1ad/0x590
 wake_up_q+0x54/0x80
 rwsem_wake+0x8a/0xb0
 bpf_get_stack+0x13c/0x150
 bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
 bpf_overflow_handler+0x60/0x100
 __perf_event_overflow+0x4f/0xf0
 perf_swevent_overflow+0x99/0xc0
 ___perf_sw_event+0xe7/0x120
 __schedule+0x47d/0x620
 schedule+0x29/0x90
 futex_wait_queue_me+0xb9/0x110
 futex_wait+0x139/0x230
 do_futex+0x2ac/0xa50
 __x64_sys_futex+0x13c/0x180
 do_syscall_64+0x42/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
This can be reproduced by:

1. Start a multi-thread program that does parallel mmap() and malloc();
2. taskset the program to 2 CPUs;
3. Attach bpf program to trace_sched_switch and gather stackmap with
   build-id, e.g. with trace.py from bcc tools:

     trace.py -U -p <pid> -s <some-bin,some-lib> t:sched:sched_switch
A sample reproducer is attached at the end.
Fix this by checking whether rq_lock is already locked. If so, postpone the up_read() to irq_work, just like the in_nmi() case.
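The condition relies on a this_rq_is_locked() helper that the diff below does not show; a minimal sketch of what such a helper might look like, assuming it is introduced in kernel/sched alongside this change:

/*
 * Hypothetical helper (not part of the diff below): report whether the
 * current CPU's runqueue lock is held, i.e. whether waking an rwsem
 * waiter from this context could recurse on rq->lock.
 */
bool this_rq_is_locked(void)
{
	struct rq *rq;

	rq = this_rq();
	return raw_spin_is_locked(&rq->lock);
}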
Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Cc: stable@vger.kernel.org # v4.17+
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Reproducer:
============================ 8< ============================
/* Headers and THREAD_COUNT added so the snippet builds stand-alone;
 * the THREAD_COUNT value is assumed, it was not given in the original. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define THREAD_COUNT 40

char *filename;

void *worker(void *p)
{
	void *ptr;
	int fd;
	char *pptr;

	fd = open(filename, O_RDONLY);
	if (fd < 0)
		return NULL;
	while (1) {
		struct timespec ts = {0, 1000 + rand() % 2000};

		ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
		usleep(1);
		if (ptr == MAP_FAILED) {
			printf("failed to mmap\n");
			break;
		}
		munmap(ptr, 4096 * 64);
		usleep(1);
		pptr = malloc(1);
		usleep(1);
		pptr[0] = 1;
		usleep(1);
		free(pptr);
		usleep(1);
		nanosleep(&ts, NULL);
	}
	close(fd);
	return NULL;
}

int main(int argc, char *argv[])
{
	int i;
	pthread_t threads[THREAD_COUNT];

	if (argc < 2)
		return 0;

	filename = argv[1];

	for (i = 0; i < THREAD_COUNT; i++) {
		if (pthread_create(threads + i, NULL, worker, NULL)) {
			fprintf(stderr, "Error creating thread\n");
			return 0;
		}
	}

	for (i = 0; i < THREAD_COUNT; i++)
		pthread_join(threads[i], NULL);
	return 0;
}
============================ 8< ============================
---
 kernel/bpf/stackmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 052580c33d26..3b278f6b0c3e 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	bool irq_work_busy = false;
 	struct stack_map_irq_work *work = NULL;
 
-	if (in_nmi()) {
+	if (in_nmi() || this_rq_is_locked()) {
 		work = this_cpu_ptr(&up_read_work);
 		if (work->irq_work.flags & IRQ_WORK_BUSY)
 			/* cannot queue more up_read, fallback */
--
2.17.1
On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A deadlock on rq_lock():
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[...]
Call Trace:
 try_to_wake_up+0x1ad/0x590
 wake_up_q+0x54/0x80
 rwsem_wake+0x8a/0xb0
 bpf_get_stack+0x13c/0x150
 bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
 bpf_overflow_handler+0x60/0x100
 __perf_event_overflow+0x4f/0xf0
 perf_swevent_overflow+0x99/0xc0
 ___perf_sw_event+0xe7/0x120
 __schedule+0x47d/0x620
 schedule+0x29/0x90
 futex_wait_queue_me+0xb9/0x110
 futex_wait+0x139/0x230
 do_futex+0x2ac/0xa50
 __x64_sys_futex+0x13c/0x180
 do_syscall_64+0x42/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 052580c33d26..3b278f6b0c3e 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	bool irq_work_busy = false;
 	struct stack_map_irq_work *work = NULL;
 
-	if (in_nmi()) {
+	if (in_nmi() || this_rq_is_locked()) {
 		work = this_cpu_ptr(&up_read_work);
 		if (work->irq_work.flags & IRQ_WORK_BUSY)
 			/* cannot queue more up_read, fallback */
This is horrific crap. Just say no to that get_build_id_offset() trainwreck.
On 10/10/19 12:36 AM, Peter Zijlstra wrote:
On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A deadlock on rq_lock():
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[...]
Call Trace:
 try_to_wake_up+0x1ad/0x590
 wake_up_q+0x54/0x80
 rwsem_wake+0x8a/0xb0
 bpf_get_stack+0x13c/0x150
 bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
 bpf_overflow_handler+0x60/0x100
 __perf_event_overflow+0x4f/0xf0
 perf_swevent_overflow+0x99/0xc0
 ___perf_sw_event+0xe7/0x120
 __schedule+0x47d/0x620
 schedule+0x29/0x90
 futex_wait_queue_me+0xb9/0x110
 futex_wait+0x139/0x230
 do_futex+0x2ac/0xa50
 __x64_sys_futex+0x13c/0x180
 do_syscall_64+0x42/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 052580c33d26..3b278f6b0c3e 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	bool irq_work_busy = false;
 	struct stack_map_irq_work *work = NULL;
 
-	if (in_nmi()) {
+	if (in_nmi() || this_rq_is_locked()) {
 		work = this_cpu_ptr(&up_read_work);
 		if (work->irq_work.flags & IRQ_WORK_BUSY)
 			/* cannot queue more up_read, fallback */
This is horrific crap. Just say no to that get_build_id_offset() trainwreck.
this is not a helpful comment. What issues do you see with this approach?
On Thu, Oct 10, 2019 at 05:19:01PM +0000, Alexei Starovoitov wrote:
On 10/10/19 12:36 AM, Peter Zijlstra wrote:
On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A deadlock on rq_lock():
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[...]
Call Trace:
 try_to_wake_up+0x1ad/0x590
 wake_up_q+0x54/0x80
 rwsem_wake+0x8a/0xb0
 bpf_get_stack+0x13c/0x150
 bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
 bpf_overflow_handler+0x60/0x100
 __perf_event_overflow+0x4f/0xf0
 perf_swevent_overflow+0x99/0xc0
 ___perf_sw_event+0xe7/0x120
 __schedule+0x47d/0x620
 schedule+0x29/0x90
 futex_wait_queue_me+0xb9/0x110
 futex_wait+0x139/0x230
 do_futex+0x2ac/0xa50
 __x64_sys_futex+0x13c/0x180
 do_syscall_64+0x42/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 052580c33d26..3b278f6b0c3e 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	bool irq_work_busy = false;
 	struct stack_map_irq_work *work = NULL;
 
-	if (in_nmi()) {
+	if (in_nmi() || this_rq_is_locked()) {
 		work = this_cpu_ptr(&up_read_work);
 		if (work->irq_work.flags & IRQ_WORK_BUSY)
 			/* cannot queue more up_read, fallback */
This is horrific crap. Just say no to that get_build_id_offset() trainwreck.
this is not a helpful comment. What issues do you see with this approach?
It will still generate deadlocks if I place a tracepoint inside a lock that nests inside rq->lock, and it won't ever be able to detect that. Say do the very same thing on trace_hrtimer_start(), which is under cpu_base->lock, which nests inside rq->lock. That should give you an AB-BA.
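To spell out the ordering in the scenario just described (an illustration only, assuming both sides end up contending on the same cpu_base->lock and the same rq->lock):

  task A, holding rq->lock               task B, holding cpu_base->lock
  ------------------------               ------------------------------
  hrtimer_start()                        trace_hrtimer_start() -> BPF prog
    -> wants cpu_base->lock                bpf_get_stack()
       (held by B)                           up_read(&mm->mmap_sem)
                                               rwsem_wake() -> try_to_wake_up()
                                                 -> wants rq->lock (held by A)

The proposed this_rq_is_locked() check never fires on B's side, because B's own rq->lock is not held there; only cpu_base->lock is.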
tracepoints / perf-overflow should _never_ take locks.
All of stack_map_get_build_id_offset() is just disgusting games; I did tell you guys how to do lockless vma lookups a few years ago -- and yes, that is invasive core mm surgery. But this is just disgusting hacks for not wanting to do it right.
Basically the only semi-sane thing to do with that trainwreck is s/in_nmi()/true/ and pray.
On top of that I just hate buildids in general.
On 10/10/19 10:46 AM, Peter Zijlstra wrote:
On Thu, Oct 10, 2019 at 05:19:01PM +0000, Alexei Starovoitov wrote:
On 10/10/19 12:36 AM, Peter Zijlstra wrote:
On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A deadlock on rq_lock():
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[...]
Call Trace:
 try_to_wake_up+0x1ad/0x590
 wake_up_q+0x54/0x80
 rwsem_wake+0x8a/0xb0
 bpf_get_stack+0x13c/0x150
 bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
 bpf_overflow_handler+0x60/0x100
 __perf_event_overflow+0x4f/0xf0
 perf_swevent_overflow+0x99/0xc0
 ___perf_sw_event+0xe7/0x120
 __schedule+0x47d/0x620
 schedule+0x29/0x90
 futex_wait_queue_me+0xb9/0x110
 futex_wait+0x139/0x230
 do_futex+0x2ac/0xa50
 __x64_sys_futex+0x13c/0x180
 do_syscall_64+0x42/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 052580c33d26..3b278f6b0c3e 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	bool irq_work_busy = false;
 	struct stack_map_irq_work *work = NULL;
 
-	if (in_nmi()) {
+	if (in_nmi() || this_rq_is_locked()) {
 		work = this_cpu_ptr(&up_read_work);
 		if (work->irq_work.flags & IRQ_WORK_BUSY)
 			/* cannot queue more up_read, fallback */
This is horrific crap. Just say no to that get_build_id_offset() trainwreck.
this is not a helpful comment. What issues do you see with this approach?
It will still generate deadlocks if I place a tracepoint inside a lock that nests inside rq->lock, and it won't ever be able to detect that. Say do the very same thing on trace_hrtimer_start(), which is under cpu_base->lock, which nests inside rq->lock. That should give you an AB-BA.
tracepoints / perf-overflow should _never_ take locks.
All of stack_map_get_build_id_offset() is just disgusting games; I did tell you guys how to do lockless vma lookups a few years ago -- and yes, that is invasive core mm surgery. But this is just disgusting hacks for not wanting to do it right.
You mean the speculative page fault stuff? That was my hope as well, and I offered Laurent all the help to land it. Yet a year after we talked, the patches are no closer to landing. Any other 'invasive mm surgery' you have in mind?
Basically the only semi-sane thing to do with that trainwreck is s/in_nmi()/true/ and pray.
On top of that I just hate buildids in general.
Emotions aside... build_id is useful and used in production. It's used widely because it solves real problems. This deadlock is from real servers, not from some sanitizer wannabe. Hence we need to fix it as cleanly and as quickly as possible. s/in_nmi/true/ is certainly an option. I'm worried about the overhead of doing irq_work_queue() all the time, but I'm not familiar enough with the mechanism to justify the concern. Would it make sense to do s/in_nmi/irqs_disabled/ instead?
On Thu, Oct 10, 2019 at 06:06:14PM +0000, Alexei Starovoitov wrote:
On 10/10/19 10:46 AM, Peter Zijlstra wrote:
All of stack_map_get_build_id_offset() is just disgusting games; I did tell you guys how to do lockless vma lookups a few years ago -- and yes, that is invasive core mm surgery. But this is just disgusting hacks for not wanting to do it right.
You mean the speculative page fault stuff? That was my hope as well, and I offered Laurent all the help to land it. Yet a year after we talked, the patches are no closer to landing. Any other 'invasive mm surgery' you have in mind?
Indeed that series. It had RCU managed VMAs and lockless VMA lookups, which is exactly what you need here.
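For reference, with such a series the build-id lookup could avoid mmap_sem entirely, roughly along the lines below; find_vma_rcu() is a hypothetical helper in the spirit of that (unmerged) work, not an existing mainline API:

	/*
	 * Hypothetical sketch only: assumes an RCU-safe VMA lookup
	 * (find_vma_rcu()) such as the one proposed in the speculative
	 * page fault series; nothing like this exists in mainline today.
	 */
	rcu_read_lock();
	vma = find_vma_rcu(current->mm, ips[i]);	/* hypothetical API */
	if (vma && stack_map_get_build_id(vma, id_offs[i].build_id) == 0) {
		id_offs[i].offset = ips[i] - vma->vm_start;
		id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
	} else {
		id_offs[i].status = BPF_STACK_BUILD_ID_IP;
	}
	rcu_read_unlock();

The point being that there is then no up_read(), and hence no wakeup to perform, from tracing context.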
Basically the only semi-sane thing to do with that trainwreck is s/in_nmi()/true/ and pray.
On top of that I just hate buildids in general.
Emotions aside... build_id is useful and used in production. It's used widely because it solves real problems.
AFAIU it solves the problem of you not knowing what version of the binary runs where; which I was hoping your cloud infrastructure thing would actually know already.
Anyway, I know what it does, I just don't necessarily agree it is the right way around that particular problem (also, the way I'm personally affected is that perf-record is dead slow by default due to build-id post-processing).
And it obviously leads to horrible hacks like the code currently under discussion :/
This deadlock is from real servers, not from some sanitizer wannabe.
If you enable CFS bandwidth control and run this function on the trace_hrtimer_start() tracepoint, you should be able to trigger a real AB-BA lockup.
Hence we need to fix it as cleanly and as quickly as possible. s/in_nmi/true/ is certainly an option.
That is the best option; because tracepoints / perf-overflow handlers really should not be taking any locks.
I'm worried about the overhead of doing irq_work_queue() all the time, but I'm not familiar enough with the mechanism to justify the concern. Would it make sense to do s/in_nmi/irqs_disabled/ instead?
irqs_disabled() should work in this particular case because rq->lock, and therefore all of its nested locks, are IRQ-safe.
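In diff form the suggestion would presumably look like this (a sketch of the idea only, not the actual v2 patch):

-	if (in_nmi()) {
+	if (irqs_disabled()) {
 		work = this_cpu_ptr(&up_read_work);
 		if (work->irq_work.flags & IRQ_WORK_BUSY)
 			/* cannot queue more up_read, fallback */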
Thanks Peter!
On Oct 14, 2019, at 2:09 AM, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Oct 10, 2019 at 06:06:14PM +0000, Alexei Starovoitov wrote:
On 10/10/19 10:46 AM, Peter Zijlstra wrote:
All of stack_map_get_build_id_offset() is just disgusting games; I did tell you guys how to do lockless vma lookups a few years ago -- and yes, that is invasive core mm surgery. But this is just disgusting hacks for not wanting to do it right.
You mean the speculative page fault stuff? That was my hope as well, and I offered Laurent all the help to land it. Yet a year after we talked, the patches are no closer to landing. Any other 'invasive mm surgery' you have in mind?
Indeed that series. It had RCU managed VMAs and lockless VMA lookups, which is exactly what you need here.
Lockless VMA lookups would be really useful. They would resolve all the pain we are having here.
I remember Google folks also mentioned at LPC that they would like a better mechanism to confirm build-ids in perf.
Basically the only semi-sane thing to do with that trainwreck is s/in_nmi()/true/ and pray.
On top of that I just hate buildids in general.
Emotions aside... build_id is useful and used in production. It's used widely because it solves real problems.
AFAIU it solves the problem of you not knowing what version of the binary runs where; which I was hoping your cloud infrastructure thing would actually know already.
Anyway, I know what it does, I just don't necessarily agree it is the right way around that particular problem (also, the way I'm personally affected is that perf-record is dead slow by default due to build-id post-processing).
And it obviously leads to horrible hacks like the code currently under discussion :/
This deadlock is from real servers, not from some sanitizer wannabe.
If you enable CFS bandwidth control and run this function on the trace_hrtimer_start() tracepoint, you should be able to trigger a real AB-BA lockup.
Hence we need to fix it as cleanly and as quickly as possible. s/in_nmi/true/ is certainly an option.
That is the best option; because tracepoints / perf-overflow handlers really should not be taking any locks.
I'm worried about the overhead of doing irq_work_queue() all the time, but I'm not familiar enough with the mechanism to justify the concern. Would it make sense to do s/in_nmi/irqs_disabled/ instead?
irqs_disabled() should work in this particular case because rq->lock, and therefore all of its nested locks, are IRQ-safe.
We worry about the overhead of using irq_work for every single stackmap lookup, so we would like to go with the irqs_disabled() check instead. I just sent v2 of the patch.
Thanks again, Song