On 8/20/20 5:42 PM, Ashok Raj wrote:
When offlining CPUs, fixup_irqs() migrates all interrupts away from the outgoing CPU to an online CPU. It is always possible that the device has already sent an interrupt to the previous CPU destination. The pending interrupt bit in the LAPIC's IRR identifies such interrupts. However, once apic_soft_disable() has been called, the LAPIC no longer captures new interrupts in IRR. This causes interrupts from the device to be lost during CPU offline.

The issue was found when explicitly setting MSI affinity to a CPU and immediately offlining it. It was simple to recreate with a USB ethernet device by doing I/O to it while the CPU is offlined. Lost interrupts happen even when Interrupt Remapping is enabled.
Current code does apic_soft_disable() before migrating interrupts.
native_cpu_disable()
{
	...
	apic_soft_disable();
	cpu_disable_common();
	  --> fixup_irqs(); // Too late to capture anything in IRR.
}
Just flipping the above call sequence, so that fixup_irqs() runs before apic_soft_disable(), seems to hit the IRR checks, and the lost interrupt is fixed both for legacy MSI and when interrupt remapping is enabled.
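
For illustration only, the ordering problem can be sketched as a small userspace toy model. This is not kernel code: struct toy_lapic and every toy_* function are hypothetical names invented for this sketch, and the model assumes, as the description above states, that a soft-disabled APIC does not latch new interrupts in IRR.

```c
#include <stdbool.h>

/* Toy model, NOT kernel code. All names here are hypothetical. */
struct toy_lapic {
	bool soft_enabled;	/* models the APIC software-enable state */
	bool irr_pending;	/* models one vector's pending bit in IRR */
};

/* A device interrupt arrives at the old CPU destination. */
static void toy_device_raise(struct toy_lapic *apic)
{
	/* Assumption from the text: a soft-disabled APIC latches nothing. */
	if (apic->soft_enabled)
		apic->irr_pending = true;
}

static void toy_soft_disable(struct toy_lapic *apic)
{
	apic->soft_enabled = false;
}

/*
 * Models the IRR check done while migrating interrupts: returns true
 * if a pending interrupt was found and could be re-sent to the new CPU.
 */
static bool toy_fixup_irqs(struct toy_lapic *apic)
{
	bool found = apic->irr_pending;

	apic->irr_pending = false;
	return found;
}

/* Returns true if the in-flight interrupt survives the offline path. */
bool toy_offline(bool disable_apic_first)
{
	struct toy_lapic apic = { .soft_enabled = true, .irr_pending = false };
	bool delivered;

	if (disable_apic_first)
		toy_soft_disable(&apic);	/* old order */

	toy_device_raise(&apic);		/* interrupt aimed at old CPU */
	delivered = toy_fixup_irqs(&apic);

	if (!disable_apic_first)
		toy_soft_disable(&apic);	/* flipped order */

	return delivered;
}
```

In this model, toy_offline(true) follows the current call order and loses the interrupt, while toy_offline(false) follows the flipped order and finds it pending during the fixup step.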
Fixes: 60dcaad5736f ("x86/hotplug: Silence APIC and NMI when CPU is dead")
Link: https://lore.kernel.org/lkml/875zdarr4h.fsf@nanos.tec.linutronix.de/
Reported-by: Evan Green <evgreen@chromium.org>
Tested-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Tested-by: Evan Green <evgreen@chromium.org>
Reviewed-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
v2:
- Typos and fixes suggested by Randy Dunlap
Those all look good now. Thanks for the update.
To: linux-kernel@vger.kernel.org
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Sukumar Ghorai <sukumar.ghorai@intel.com>
Cc: Srikanth Nandamuri <srikanth.nandamuri@intel.com>
Cc: Evan Green <evgreen@chromium.org>
Cc: Mathias Nyman <mathias.nyman@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
 arch/x86/kernel/smpboot.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)