On 6/13/23 11:49 AM, Bhatnagar, Rishabh wrote:
Hi Sebastian/Greg
We are seeing RCU stall warnings from recent stable tree updates: 5.4.243, 5.10.180, 5.15.113, 6.1.31 onwards. This is seen in the upstream stable trees without any downstream patches.
The issue is seen few minutes after booting without any workload. We launch hundred's of virtual instances and this shows up in 1-2 instances, so its hard to reproduce. Attaching a few stack traces below.
The issue can be seen on virtual and baremetal instances. Another interesting point is we only see this on x86 based instances. We also did test this on linux-mainline but were not able to reproduce the issue. So maybe there's a fixup or related commit that has gone in?
We tried bisecting the stable trees and found that after reverting the below commit we couldn't reproduce this in any of the kernels consistently.
tick/common: Align tick period with the HZ tick. [ Upstream commit e9523a0d81899361214d118ad60ef76f0e92f71d ]
Not exactly sure how this commit is affecting all stable kernels. Can you take a look at this issue and share your insight?
A few example stack traces of the observed RCU stall.
Seen in kernel-5.10.181:
rcu: INFO: rcu_sched self-detected stall on CPU rcu: 37-....: (1818 ticks this GP) idle=cf2/1/0x4000000000000000 softirq=151/151 fqs=7245 (t=14809 jiffies g=-759 q=3290) NMI backtrace for cpu 37 Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006 Call Trace: <IRQ> dump_stack+0x57/0x70 ? lapic_can_unplug_cpu.cold+0x3a/0x3a nmi_cpu_backtrace.cold+0x32/0x68 nmi_trigger_cpumask_backtrace+0xdf/0xe6 rcu_dump_cpu_stacks+0xa5/0xd7 print_cpu_stall.cold+0xa4/0x149 check_cpu_stall+0xee/0x210 rcu_pending+0x26/0xc0 rcu_sched_clock_irq+0x43/0x110 update_process_times+0x8c/0xc0 tick_periodic+0x27/0x80 tick_handle_periodic+0x20/0x70 xen_timer_interrupt+0x1e/0x30 __handle_irq_event_percpu+0x3d/0x160 handle_irq_event_percpu+0x31/0x80 handle_percpu_irq+0x37/0x60 generic_handle_irq+0x4b/0x60 evtchn_2l_handle_events+0x26d/0x280 __xen_evtchn_do_upcall+0x66/0xb0 __sysvec_xen_hvm_callback+0x22/0x30 asm_call_irq_on_stack+0x12/0x20 </IRQ> sysvec_xen_hvm_callback+0x72/0x80 asm_sysvec_xen_hvm_callback+0x12/0x20 RIP: 0010:_raw_spin_unlock_irqrestore+0xe/0x20 RSP: 0000:ffffc900188ffa28 EFLAGS: 00000246 pci_conf1_read+0xa4/0x100 pci_bus_read_config_byte+0x3f/0x70 __pci_find_next_cap_ttl+0x3b/0xd0 pci_find_capability+0x71/0xa0 pci_pm_init+0x6c/0x1c0 ? pci_allocate_vc_save_buffers+0x6d/0x8b pci_device_add+0xb0/0x1b0 pci_scan_single_device+0xbd/0xf0 pci_scan_slot+0x53/0x120 pci_scan_child_bus_extend+0x3a/0x2a0 acpi_pci_root_create+0x20c/0x26a pci_acpi_scan_root+0x19a/0x1c0 ? negotiate_os_control+0xdd/0x2b9 acpi_pci_root_add.cold+0x59/0x1b0 acpi_bus_attach+0xee/0x210 acpi_bus_attach+0x63/0x210 acpi_bus_attach+0x63/0x210 acpi_bus_scan+0x43/0x90 ? acpi_bus_init+0x15f/0x15f acpi_scan_init+0x10e/0x1b4 acpi_init+0xba/0x11d ? intel_idle_init+0x2a2/0x2a2 do_one_initcall+0x44/0x1c4 do_initcalls+0xc6/0xdf kernel_init_freeable+0x14d/0x198 ? rest_init+0xb4/0xb4 kernel_init+0xa/0x11c ret_from_fork+0x22/0x30
Seen in kernel-5.4.246:
NMI backtrace for cpu 7 CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.4.246 #1 Hardware name: Amazon EC2 m5.2xlarge/, BIOS 1.0 10/16/2017 RIP: 0010:lapic_next_deadline+0x26/0x40 RSP: 0000:ffffc900001ecf98 EFLAGS: 00000002 Call Trace: <NMI> ? nmi_cpu_backtrace+0x4c/0x90 ? nmi_cpu_backtrace_handler+0xd/0x20 ? nmi_handle+0x71/0x160 ? default_do_nmi+0x4e/0x100 ? do_nmi+0x15b/0x190 ? end_repeat_nmi+0x16/0x62 ? lapic_next_deadline+0x26/0x40 ? lapic_next_deadline+0x26/0x40 ? lapic_next_deadline+0x26/0x40 </NMI> <IRQ> clockevents_program_event+0xcf/0x100 tick_handle_periodic+0x48/0x60 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> ? vprintk_emit+0x19b/0x280 ? printk+0x52/0x6e ? pci_hp_add+0x189/0x300 ? acpiphp_register_hotplug_slot+0xe0/0xf0 ? acpiphp_add_context+0x3b1/0x460 ? acpi_ns_walk_namespace+0x108/0x220 ? acpiphp_put_context.part.12+0x30/0x30 ? acpi_walk_namespace+0xc8/0xf0 ? kmem_cache_alloc_trace+0x1f8/0x210 ? acpiphp_enumerate_slots+0x172/0x250 ? acpi_pci_add_bus+0x50/0xd0 ? pci_register_host_bridge+0x22b/0x460 ? pci_create_root_bus+0x87/0xb0 ? acpi_pci_root_create+0x128/0x290 ? pci_acpi_scan_root+0x14a/0x1b0 ? acpi_pci_root_add+0x1ff/0x580 ? acpi_evaluate_integer+0x52/0x90 ? set_debug_rodata+0x2f/0x2f ? acpi_bus_attach+0x144/0x1d0 ? acpi_bus_attach+0x80/0x1d0 ? acpi_bus_attach+0x80/0x1d0 ? acpi_sleep_proc_init+0x24/0x24 ? acpi_sleep_proc_init+0x24/0x24 ? acpi_bus_scan+0x43/0x90 ? acpi_scan_init+0xf1/0x235 ? acpi_sleep_proc_init+0x24/0x24 ? acpi_init+0x2e9/0x349 ? do_one_initcall+0x46/0x200 ? kernel_init_freeable+0x1db/0x286 ? rest_init+0xb0/0xb0 ? kernel_init+0xa/0x110 ? ret_from_fork+0x35/0x40
Seen in 5.10.183:
rcu: INFO: rcu_sched detected stalls on CPUs/tasks: acpiphp: Slot [19] registered rcu: 3-....: (349 ticks this GP) idle=936/1/0x4000000000000000 softirq=94/94 fqs=151 (detected by 0, t=14752 jiffies, g=-1063, q=591) Sending NMI from CPU 0 to CPUs 3: NMI backtrace for cpu 3 CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.10.183 #1 Hardware name: Amazon EC2 m5.2xlarge/, BIOS 1.0 10/16/2017 RIP: 0010:io_serial_in+0x14/0x20 RSP: 0000:ffffbc9640017720 EFLAGS: 00000006 PKRU: 55555554 Call Trace: <NMI> ? nmi_cpu_backtrace+0x57/0xb0 ? nmi_cpu_backtrace_handler+0xd/0x20 ? nmi_handle+0x62/0x130 ? default_do_nmi+0x49/0x100 ? exc_nmi+0x152/0x170 ? end_repeat_nmi+0x16/0x67 ? mem16_serial_in+0x10/0x20 ? io_serial_in+0x14/0x20 ? io_serial_in+0x14/0x20 ? io_serial_in+0x14/0x20 </NMI> serial8250_console_write+0x8e/0x310 ? record_print_text+0xbf/0x150 console_unlock+0x485/0x4c0 vprintk_emit+0xe4/0x240 printk+0x52/0x72 ? pci_hp_add+0x18d/0x300 acpiphp_register_hotplug_slot+0xe4/0x100 acpiphp_add_context+0x3b5/0x460 acpi_ns_walk_namespace+0x10c/0x220 ? acpiphp_put_context.part.13+0x30/0x30 ? acpiphp_put_context.part.13+0x30/0x30 acpi_walk_namespace+0xd4/0xf0 ? kmem_cache_alloc_trace+0x3be/0x400 acpiphp_enumerate_slots+0x175/0x260 acpi_pci_add_bus+0x50/0xe0 pci_register_host_bridge+0x233/0x4d0 ? complete_all+0x2a/0x40 pci_create_root_bus+0x87/0xc0 acpi_pci_root_create+0x128/0x2a0 pci_acpi_scan_root+0x14e/0x1b0 acpi_pci_root_add+0x203/0x580 ? acpi_evaluate_integer+0x52/0x90 ? rdinit_setup+0x2f/0x2f acpi_bus_attach+0x148/0x1d0 acpi_bus_attach+0x84/0x1d0 acpi_bus_attach+0x84/0x1d0 ? acpi_sleep_proc_init+0x24/0x24 ? acpi_sleep_proc_init+0x24/0x24 acpi_bus_scan+0x43/0x90 acpi_scan_init+0xff/0x247 ? acpi_sleep_proc_init+0x24/0x24 acpi_init+0x2e9/0x34d do_one_initcall+0x44/0x1e0 kernel_init_freeable+0x229/0x286 ? rest_init+0xc0/0xc0 kernel_init+0xa/0x120 ret_from_fork+0x22/0x30
Thanks Rishabh
CCing stable kernel list.