On 8/14/20 2:38 PM, Ashok Raj wrote:
When offlining CPU's, fixup_irqs() migrates all interrupts away from the
CPUs,
outgoing CPU to an online CPU. Its always possible the device sent an
It's
interrupt to the previous CPU destination. Pending interrupt bit in IRR in lapic identifies such interrupts. apic_soft_disable() will not capture any
LAPIC
new interrupts in IRR. This causes interrupts from device to be lost during cpu offline. The issue was found when explicitly setting MSI affinity to a
CPU
CPU and immediately offlining it. It was simple to recreate with a USB ethernet device and doing I/O to it while the CPU is offlined. Lost interrupts happen even when Interrupt Remapping is enabled.
Current code does apic_soft_disable() before migrating interrupts.
native_cpu_disable() { ... apic_soft_disable(); cpu_disable_common(); --> fixup_irqs(); // Too late to capture anything in IRR. }
Just fliping the above call sequence seems to hit the IRR checks
flipping
and the lost interrupt is fixed for both legacy MSI and when interrupt remapping is enabled.
Fixes: 60dcaad5736f ("x86/hotplug: Silence APIC and NMI when CPU is dead") Link: https://lore.kernel.org/lkml/875zdarr4h.fsf@nanos.tec.linutronix.de/ Signed-off-by: Ashok Raj ashok.raj@intel.com
To: linux-kernel@vger.kernel.org To: Thomas Gleixner tglx@linutronix.de Cc: Sukumar Ghorai sukumar.ghorai@intel.com Cc: Srikanth Nandamuri srikanth.nandamuri@intel.com Cc: Evan Green evgreen@chromium.org Cc: Mathias Nyman mathias.nyman@linux.intel.com Cc: Bjorn Helgaas bhelgaas@google.com Cc: stable@vger.kernel.org
arch/x86/kernel/smpboot.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index ffbd9a3d78d8..278cc9f92f2f 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1603,13 +1603,20 @@ int native_cpu_disable(void) if (ret) return ret;
- cpu_disable_common(); /*
- Disable the local APIC. Otherwise IPI broadcasts will reach
- it. It still responds normally to INIT, NMI, SMI, and SIPI
* messages.
* messages. Its important to do apic_soft_disable() after
It's
* fixup_irqs(), because fixup_irqs() called from cpu_disable_common()
* depends on IRR being set. After apic_soft_disable() CPU preserves
* currently set IRR/ISR but new interrupts will not set IRR.
* This causes interrupts sent to outgoing cpu before completion
CPU
* of irq migration to be lost. Check SDM Vol 3 "10.4.7.2 Local
IRQ
* APIC State after It Has been Software Disabled" section for more
*/ apic_soft_disable();* details.
- cpu_disable_common();
return 0; }
thanks.