There is a vector setup race which overwrites the interrupt descriptor in the per-CPU vector array, resulting in a dysfunctional device.
    CPU0                                  CPU1
                                          interrupt is raised in APIC IRR
                                          but not handled
                                          free_irq()
                                            per_cpu(vector_irq, CPU1)[vector] = VECTOR_SHUTDOWN;

    request_irq()                         common_interrupt()
                                            d = this_cpu_read(vector_irq[vector]);

    per_cpu(vector_irq, CPU1)[vector] = desc;

                                            if (d == VECTOR_SHUTDOWN)
                                              this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
free_irq() cannot observe the pending vector in the CPU1 APIC as there is no way to query the remote CPU's APIC IRR.
This requires that request_irq() uses the same vector/CPU as the one which was freed, but it can also be triggered by a spurious interrupt.
Interestingly enough this problem managed to be hidden for more than a decade.
Prevent this by reevaluating vector_irq under the vector lock, which is held by the interrupt activation code when vector_irq is updated.
The first patch provides context for the subsequent real bugfix patch.
Fixes: 9345005f4eed ("x86/irq: Fix do_IRQ() interrupt warning for cpu hotplug retriggered irqs")
Cc: stable@vger.kernel.org # 6.6.x
Cc: gregkh@linuxfoundation.org
v1 -> RESEND:
 - Add upstream commit ID.
Jacob Pan (1):
  x86/irq: Factor out handler invocation from common_interrupt()

Thomas Gleixner (1):
  x86/irq: Plug vector setup race
 arch/x86/kernel/irq.c | 70 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 56 insertions(+), 14 deletions(-)
From: Jacob Pan <jacob.jun.pan@linux.intel.com>
commit 6087c7f36ab293a06bc0bcf3857ed4d7eb1f9905 upstream.
Prepare for calling external interrupt handlers directly from the posted MSI demultiplexing loop. Extract the common code from common_interrupt() to avoid code duplication.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240423174114.526704-8-jacob.jun.pan@linux.intel....
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 arch/x86/kernel/irq.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 6573678c4bf4..1f066268ec29 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -242,18 +242,10 @@ static __always_inline void handle_irq(struct irq_desc *desc,
 		__handle_irq(desc, regs);
 }
 
-/*
- * common_interrupt() handles all normal device IRQ's (the special SMP
- * cross-CPU interrupts have their own entry points).
- */
-DEFINE_IDTENTRY_IRQ(common_interrupt)
+static __always_inline void call_irq_handler(int vector, struct pt_regs *regs)
 {
-	struct pt_regs *old_regs = set_irq_regs(regs);
 	struct irq_desc *desc;
 
-	/* entry code tells RCU that we're not quiescent.  Check it. */
-	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
-
 	desc = __this_cpu_read(vector_irq[vector]);
 	if (likely(!IS_ERR_OR_NULL(desc))) {
 		handle_irq(desc, regs);
@@ -268,7 +260,20 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
 			__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
 		}
 	}
+}
+
+/*
+ * common_interrupt() handles all normal device IRQ's (the special SMP
+ * cross-CPU interrupts have their own entry points).
+ */
+DEFINE_IDTENTRY_IRQ(common_interrupt)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
+	/* entry code tells RCU that we're not quiescent.  Check it. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
+
+	call_irq_handler(vector, regs);
 
 	set_irq_regs(old_regs);
 }
From: Thomas Gleixner <tglx@linutronix.de>
commit ce0b5eedcb753697d43f61dd2e27d68eb5d3150f upstream.
Hogan reported a vector setup race, which overwrites the interrupt descriptor in the per-CPU vector array resulting in a dysfunctional device.
    CPU0                                  CPU1
                                          interrupt is raised in APIC IRR
                                          but not handled
                                          free_irq()
                                            per_cpu(vector_irq, CPU1)[vector] = VECTOR_SHUTDOWN;

    request_irq()                         common_interrupt()
                                            d = this_cpu_read(vector_irq[vector]);

    per_cpu(vector_irq, CPU1)[vector] = desc;

                                            if (d == VECTOR_SHUTDOWN)
                                              this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
free_irq() cannot observe the pending vector in the CPU1 APIC as there is no way to query the remote CPU's APIC IRR.
This requires that request_irq() uses the same vector/CPU as the one which was freed, but it can also be triggered by a spurious interrupt.
Interestingly enough this problem managed to be hidden for more than a decade.
Prevent this by reevaluating vector_irq under the vector lock, which is held by the interrupt activation code when vector_irq is updated.
To avoid ifdeffery or IS_ENABLED() nonsense, move the [un]lock_vector_lock() declarations out from under the CONFIG_IRQ_DOMAIN_HIERARCHY guard, as the vector lock is only provided when CONFIG_X86_LOCAL_APIC=y.
CONFIG_IRQ_DOMAIN_HIERARCHY is selected by CONFIG_X86_LOCAL_APIC, but it can also be selected by other parts of the Kconfig system, which makes 32-bit UP builds with CONFIG_X86_LOCAL_APIC=n fail.
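A hypothetical, abbreviated sketch of the resulting header layout (the real arch/x86/include/asm/hw_irq.h contains many more declarations; this only illustrates the guard move described above):

```c
/* hw_irq.h (abbreviated sketch, not the real header) */

#ifdef CONFIG_X86_LOCAL_APIC
/*
 * Moved here: the vector lock exists whenever the local APIC code is
 * built, so the interrupt entry code can take it without ifdeffery.
 */
extern void lock_vector_lock(void);
extern void unlock_vector_lock(void);
#else
static inline void lock_vector_lock(void) { }
static inline void unlock_vector_lock(void) { }
#endif

#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
/*
 * The declarations above previously lived in this block. Other Kconfig
 * symbols can select CONFIG_IRQ_DOMAIN_HIERARCHY even when
 * CONFIG_X86_LOCAL_APIC=n, which broke 32-bit UP builds.
 */
#endif
```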
Can we just get rid of this !APIC nonsense once and forever?
Fixes: 9345005f4eed ("x86/irq: Fix do_IRQ() interrupt warning for cpu hotplug retriggered irqs")
Cc: stable@vger.kernel.org # 6.6.x
Cc: gregkh@linuxfoundation.org
Reported-by: Hogan Wang <hogan.wang@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Hogan Wang <hogan.wang@huawei.com>
Link: https://lore.kernel.org/all/draft-87ikjhrhhh.ffs@tglx
[ Conflicts in arch/x86/kernel/irq.c because call_irq_handler() has been
  refactored to do apic_eoi() according to the return value.
  Conflicts in arch/x86/include/asm/hw_irq.h because (un)lock_vector_lock()
  are already controlled by CONFIG_X86_LOCAL_APIC. ]
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 arch/x86/kernel/irq.c | 65 +++++++++++++++++++++++++++++++++----------
 1 file changed, 51 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 1f066268ec29..29d0fc94232e 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -242,24 +242,59 @@ static __always_inline void handle_irq(struct irq_desc *desc,
 		__handle_irq(desc, regs);
 }
 
-static __always_inline void call_irq_handler(int vector, struct pt_regs *regs)
+static struct irq_desc *reevaluate_vector(int vector)
 {
-	struct irq_desc *desc;
+	struct irq_desc *desc = __this_cpu_read(vector_irq[vector]);
+
+	if (!IS_ERR_OR_NULL(desc))
+		return desc;
+
+	if (desc == VECTOR_UNUSED)
+		pr_emerg_ratelimited("No irq handler for %d.%u\n", smp_processor_id(), vector);
+	else
+		__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
+	return NULL;
+}
+
+static __always_inline bool call_irq_handler(int vector, struct pt_regs *regs)
+{
+	struct irq_desc *desc = __this_cpu_read(vector_irq[vector]);
 
-	desc = __this_cpu_read(vector_irq[vector]);
 	if (likely(!IS_ERR_OR_NULL(desc))) {
 		handle_irq(desc, regs);
-	} else {
-		apic_eoi();
-
-		if (desc == VECTOR_UNUSED) {
-			pr_emerg_ratelimited("%s: %d.%u No irq handler for vector\n",
-					     __func__, smp_processor_id(),
-					     vector);
-		} else {
-			__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
-		}
+		return true;
 	}
+
+	/*
+	 * Reevaluate with vector_lock held to prevent a race against
+	 * request_irq() setting up the vector:
+	 *
+	 * CPU0                                  CPU1
+	 *                                       interrupt is raised in APIC IRR
+	 *                                       but not handled
+	 *                                       free_irq()
+	 *                                         per_cpu(vector_irq, CPU1)[vector] = VECTOR_SHUTDOWN;
+	 *
+	 * request_irq()                         common_interrupt()
+	 *                                         d = this_cpu_read(vector_irq[vector]);
+	 *
+	 * per_cpu(vector_irq, CPU1)[vector] = desc;
+	 *
+	 *                                         if (d == VECTOR_SHUTDOWN)
+	 *                                           this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
+	 *
+	 * This requires that the same vector on the same target CPU is
+	 * handed out or that a spurious interrupt hits that CPU/vector.
+	 */
+	lock_vector_lock();
+	desc = reevaluate_vector(vector);
+	unlock_vector_lock();
+
+	if (!desc)
+		return false;
+
+	handle_irq(desc, regs);
+	return true;
 }
 
 /*
@@ -273,7 +308,9 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
 	/* entry code tells RCU that we're not quiescent.  Check it. */
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
 
-	call_irq_handler(vector, regs);
+	if (unlikely(!call_irq_handler(vector, regs)))
+		apic_eoi();
+
 	set_irq_regs(old_regs);
 }
On Fri, Aug 22, 2025 at 03:38:25AM +0000, Jinjie Ruan wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> commit ce0b5eedcb753697d43f61dd2e27d68eb5d3150f upstream.
>
> Hogan reported a vector setup race, which overwrites the interrupt
> descriptor in the per-CPU vector array resulting in a dysfunctional device.
>
>     CPU0                                  CPU1
>                                           interrupt is raised in APIC IRR
>                                           but not handled
>                                           free_irq()
>                                             per_cpu(vector_irq, CPU1)[vector] = VECTOR_SHUTDOWN;
>
>     request_irq()                         common_interrupt()
>                                             d = this_cpu_read(vector_irq[vector]);
>
>     per_cpu(vector_irq, CPU1)[vector] = desc;
>
>                                             if (d == VECTOR_SHUTDOWN)
>                                               this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
>
> free_irq() cannot observe the pending vector in the CPU1 APIC as there
> is no way to query the remote CPU's APIC IRR.
>
> This requires that request_irq() uses the same vector/CPU as the one
> which was freed, but it can also be triggered by a spurious interrupt.
>
> Interestingly enough this problem managed to be hidden for more than a
> decade.
>
> Prevent this by reevaluating vector_irq under the vector lock, which is
> held by the interrupt activation code when vector_irq is updated.
>
> To avoid ifdeffery or IS_ENABLED() nonsense, move the
> [un]lock_vector_lock() declarations out from under the
> CONFIG_IRQ_DOMAIN_HIERARCHY guard, as the vector lock is only provided
> when CONFIG_X86_LOCAL_APIC=y.
>
> CONFIG_IRQ_DOMAIN_HIERARCHY is selected by CONFIG_X86_LOCAL_APIC, but
> it can also be selected by other parts of the Kconfig system, which
> makes 32-bit UP builds with CONFIG_X86_LOCAL_APIC=n fail.
>
> Can we just get rid of this !APIC nonsense once and forever?
>
> Fixes: 9345005f4eed ("x86/irq: Fix do_IRQ() interrupt warning for cpu hotplug retriggered irqs")
> Cc: stable@vger.kernel.org # 6.6.x
> Cc: gregkh@linuxfoundation.org
> Reported-by: Hogan Wang <hogan.wang@huawei.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Tested-by: Hogan Wang <hogan.wang@huawei.com>
> Link: https://lore.kernel.org/all/draft-87ikjhrhhh.ffs@tglx
> [ Conflicts in arch/x86/kernel/irq.c because call_irq_handler() has been
>   refactored to do apic_eoi() according to the return value.
>   Conflicts in arch/x86/include/asm/hw_irq.h because (un)lock_vector_lock()
>   are already controlled by CONFIG_X86_LOCAL_APIC. ]
> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
> ---
>  arch/x86/kernel/irq.c | 65 +++++++++++++++++++++++++++++++++----------
>  1 file changed, 51 insertions(+), 14 deletions(-)
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index 1f066268ec29..29d0fc94232e 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -242,24 +242,59 @@ static __always_inline void handle_irq(struct irq_desc *desc,
>  		__handle_irq(desc, regs);
>  }
>  
> -static __always_inline void call_irq_handler(int vector, struct pt_regs *regs)
> +static struct irq_desc *reevaluate_vector(int vector)
>  {
> -	struct irq_desc *desc;
> +	struct irq_desc *desc = __this_cpu_read(vector_irq[vector]);
> +
> +	if (!IS_ERR_OR_NULL(desc))
> +		return desc;
> +
> +	if (desc == VECTOR_UNUSED)
> +		pr_emerg_ratelimited("No irq handler for %d.%u\n", smp_processor_id(), vector);
> +	else
> +		__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
> +	return NULL;
> +}
> +
> +static __always_inline bool call_irq_handler(int vector, struct pt_regs *regs)
> +{
> +	struct irq_desc *desc = __this_cpu_read(vector_irq[vector]);
>  
> -	desc = __this_cpu_read(vector_irq[vector]);
>  	if (likely(!IS_ERR_OR_NULL(desc))) {
>  		handle_irq(desc, regs);
> -	} else {
> -		apic_eoi();
> -
> -		if (desc == VECTOR_UNUSED) {
> -			pr_emerg_ratelimited("%s: %d.%u No irq handler for vector\n",
> -					     __func__, smp_processor_id(),
> -					     vector);
> -		} else {
> -			__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
> -		}
> +		return true;
>  	}
> +
> +	/*
> +	 * Reevaluate with vector_lock held to prevent a race against
> +	 * request_irq() setting up the vector:
> +	 *
> +	 * CPU0                                  CPU1
> +	 *                                       interrupt is raised in APIC IRR
> +	 *                                       but not handled
> +	 *                                       free_irq()
> +	 *                                         per_cpu(vector_irq, CPU1)[vector] = VECTOR_SHUTDOWN;
> +	 *
> +	 * request_irq()                         common_interrupt()
> +	 *                                         d = this_cpu_read(vector_irq[vector]);
> +	 *
> +	 * per_cpu(vector_irq, CPU1)[vector] = desc;
> +	 *
> +	 *                                         if (d == VECTOR_SHUTDOWN)
> +	 *                                           this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
> +	 *
> +	 * This requires that the same vector on the same target CPU is
> +	 * handed out or that a spurious interrupt hits that CPU/vector.
> +	 */
> +	lock_vector_lock();
> +	desc = reevaluate_vector(vector);
> +	unlock_vector_lock();
> +
> +	if (!desc)
> +		return false;
> +
> +	handle_irq(desc, regs);
> +	return true;
>  }
>  
>  /*
> @@ -273,7 +308,9 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
>  	/* entry code tells RCU that we're not quiescent.  Check it. */
>  	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
>  
> -	call_irq_handler(vector, regs);
> +	if (unlikely(!call_irq_handler(vector, regs)))
> +		apic_eoi();
> +
>  	set_irq_regs(old_regs);
>  }
This chunk does not look correct. The original commit did not have this, so why add it here? Where did it come from?
The original patch said:

-	if (unlikely(call_irq_handler(vector, regs)))
+	if (unlikely(!call_irq_handler(vector, regs)))
And was not an if statement.
So did you forget to backport something else here? Why is this not identical to what the original was?
thanks,
greg k-h
On 2025/8/22 16:57, Greg KH wrote:
> On Fri, Aug 22, 2025 at 03:38:25AM +0000, Jinjie Ruan wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>>
>> commit ce0b5eedcb753697d43f61dd2e27d68eb5d3150f upstream.
>>
>> [...]
>>
>>  /*
>> @@ -273,7 +308,9 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
>>  	/* entry code tells RCU that we're not quiescent.  Check it. */
>>  	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
>>
>> -	call_irq_handler(vector, regs);
>> +	if (unlikely(!call_irq_handler(vector, regs)))
>> +		apic_eoi();
>
> This chunk does not look correct.  The original commit did not have this,
> so why add it here?  Where did it come from?
>
> The original patch said:
>
> -	if (unlikely(call_irq_handler(vector, regs)))
> +	if (unlikely(!call_irq_handler(vector, regs)))
>
> And was not an if statement.
>
> So did you forget to backport something else here?  Why is this not
> identical to what the original was?
The if statement was introduced by commit 1b03d82ba15e ("x86/irq: Install posted MSI notification handler"), which is part of the patch set at https://lore.kernel.org/all/20240423174114.526704-1-jacob.jun.pan@linux.inte..., but that appears to be a performance optimization series and the patch includes additional modifications. The context conflict here is merely a simple refactoring, while the cost of backporting the entire series would be too high.
thanks,
greg k-h
On Fri, Aug 22, 2025 at 05:25:56PM +0800, Jinjie Ruan wrote:
> On 2025/8/22 16:57, Greg KH wrote:
>> On Fri, Aug 22, 2025 at 03:38:25AM +0000, Jinjie Ruan wrote:
>>> From: Thomas Gleixner <tglx@linutronix.de>
>>>
>>> commit ce0b5eedcb753697d43f61dd2e27d68eb5d3150f upstream.
>>>
>>> [...]
>>>
>>>  /*
>>> @@ -273,7 +308,9 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
>>>  	/* entry code tells RCU that we're not quiescent.  Check it. */
>>>  	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
>>>
>>> -	call_irq_handler(vector, regs);
>>> +	if (unlikely(!call_irq_handler(vector, regs)))
>>> +		apic_eoi();
>>
>> This chunk does not look correct.  The original commit did not have this,
>> so why add it here?  Where did it come from?
>>
>> The original patch said:
>>
>> -	if (unlikely(call_irq_handler(vector, regs)))
>> +	if (unlikely(!call_irq_handler(vector, regs)))
>>
>> And was not an if statement.
>>
>> So did you forget to backport something else here?  Why is this not
>> identical to what the original was?
>
> The if statement was introduced by commit 1b03d82ba15e ("x86/irq: Install
> posted MSI notification handler"), which is part of the patch set at
> https://lore.kernel.org/all/20240423174114.526704-1-jacob.jun.pan@linux.inte...,
> but that appears to be a performance optimization series and the patch
> includes additional modifications. The context conflict here is merely a
> simple refactoring, while the cost of backporting the entire series would
> be too high.
Why is it "too high"?  We almost NEVER want to deviate from what is in mainline, as every time we do that it adds the potential for bugs AND it increases our maintenance burden (i.e. later patches will not apply.)
For a kernel that has to live as long as this one does, we need to try to stick to what is in mainline as close as possible. Otherwise it becomes unworkable.
thanks,
greg k-h
On Fri, Aug 22, 2025 at 11:48:56AM +0200, Greg KH wrote:
> On Fri, Aug 22, 2025 at 05:25:56PM +0800, Jinjie Ruan wrote:
>> On 2025/8/22 16:57, Greg KH wrote:
>>> On Fri, Aug 22, 2025 at 03:38:25AM +0000, Jinjie Ruan wrote:
>>>> From: Thomas Gleixner <tglx@linutronix.de>
>>>>
>>>> commit ce0b5eedcb753697d43f61dd2e27d68eb5d3150f upstream.
>>>>
>>>> [...]
>>>>
>>>>  /*
>>>> @@ -273,7 +308,9 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
>>>>  	/* entry code tells RCU that we're not quiescent.  Check it. */
>>>>  	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
>>>>
>>>> -	call_irq_handler(vector, regs);
>>>> +	if (unlikely(!call_irq_handler(vector, regs)))
>>>> +		apic_eoi();
>>>
>>> This chunk does not look correct.  The original commit did not have this,
>>> so why add it here?  Where did it come from?
>>>
>>> The original patch said:
>>>
>>> -	if (unlikely(call_irq_handler(vector, regs)))
>>> +	if (unlikely(!call_irq_handler(vector, regs)))
>>>
>>> And was not an if statement.
>>>
>>> So did you forget to backport something else here?  Why is this not
>>> identical to what the original was?
>>
>> The if statement was introduced by commit 1b03d82ba15e ("x86/irq: Install
>> posted MSI notification handler"), which is part of the patch set at
>> https://lore.kernel.org/all/20240423174114.526704-1-jacob.jun.pan@linux.inte...,
>> but that appears to be a performance optimization series and the patch
>> includes additional modifications. The context conflict here is merely a
>> simple refactoring, while the cost of backporting the entire series would
>> be too high.
>
> Why is it "too high"?  We almost NEVER want to deviate from what is in
> mainline, as every time we do that it adds the potential for bugs AND it
> increases our maintenance burden (i.e. later patches will not apply.)
>
> For a kernel that has to live as long as this one does, we need to try
> to stick to what is in mainline as close as possible.  Otherwise it
> becomes unworkable.
Also, I will push back, why not just use 6.12.y to resolve this issue instead? Why are you stuck on 6.6.y for now, and what are you going to do with those systems once 6.6.y goes end-of-life? Why postpone the inevitable?
thanks,
greg k-h