Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called when dev->aer_info is NULL. Add a NULL check before proceeding to avoid calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which does not rate limit, given this is fatal.
This prevents a kernel crash triggered by dereferencing a NULL pointer in aer_ratelimit(), ensuring safer handling of PCI devices that lack AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr() which already performs this NULL check.
Cc: stable@vger.kernel.org Fixes: a57f2bfb4a5863 ("PCI/AER: Ratelimit correctable and non-fatal error logging") Signed-off-by: Breno Leitao leitao@debian.org --- - This problem is still happening in upstream, and unfortunately no action was done in the previous discussion. - Link to previous post: https://lore.kernel.org/r/20250804-aer_crash_2-v1-1-fd06562c18a4@debian.org --- drivers/pci/pcie/aer.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index e286c197d7167..55abc5e17b8b1 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -786,6 +786,9 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
static int aer_ratelimit(struct pci_dev *dev, unsigned int severity) { + if (!dev->aer_info) + return 1; + switch (severity) { case AER_NONFATAL: return __ratelimit(&dev->aer_info->nonfatal_ratelimit);
--- base-commit: e5f0a698b34ed76002dc5cff3804a61c80233a7a change-id: 20250801-aer_crash_2-b21cc2ef0d00
Best regards, -- Breno Leitao leitao@debian.org
On 9/29/25 2:15 AM, Breno Leitao wrote:
Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called when dev->aer_info is NULL. Add a NULL check before proceeding to avoid calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which does not rate limit, given this is fatal.
This prevents a kernel crash triggered by dereferencing a NULL pointer in aer_ratelimit(), ensuring safer handling of PCI devices that lack AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr() which already performs this NULL check.
Cc: stable@vger.kernel.org Fixes: a57f2bfb4a5863 ("PCI/AER: Ratelimit correctable and non-fatal error logging") Signed-off-by: Breno Leitao leitao@debian.org
- This problem is still happening in upstream, and unfortunately no action was done in the previous discussion.
- Link to previous post: https://lore.kernel.org/r/20250804-aer_crash_2-v1-1-fd06562c18a4@debian.org
Although we haven't identified the path that triggers this issue, adding this check is harmless.
Reviewed-by: Kuppuswamy Sathyanarayanan sathyanarayanan.kuppuswamy@linux.intel.com
drivers/pci/pcie/aer.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index e286c197d7167..55abc5e17b8b1 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -786,6 +786,9 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev, static int aer_ratelimit(struct pci_dev *dev, unsigned int severity) {
- if (!dev->aer_info)
return 1;
- switch (severity) { case AER_NONFATAL: return __ratelimit(&dev->aer_info->nonfatal_ratelimit);
base-commit: e5f0a698b34ed76002dc5cff3804a61c80233a7a change-id: 20250801-aer_crash_2-b21cc2ef0d00
Best regards,
Breno Leitao leitao@debian.org
Le 29/09/2025 à 11:15, Breno Leitao a écrit :
Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called when dev->aer_info is NULL. Add a NULL check before proceeding to avoid calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which does not rate limit, given this is fatal.
This prevents a kernel crash triggered by dereferencing a NULL pointer in aer_ratelimit(), ensuring safer handling of PCI devices that lack AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr() which already performs this NULL check.
Cc: stable@vger.kernel.org Fixes: a57f2bfb4a5863 ("PCI/AER: Ratelimit correctable and non-fatal error logging") Signed-off-by: Breno Leitao leitao@debian.org
- This problem is still happening in upstream, and unfortunately no action was done in the previous discussion.
- Link to previous post: https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kerne...
drivers/pci/pcie/aer.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index e286c197d7167..55abc5e17b8b1 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -786,6 +786,9 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev, static int aer_ratelimit(struct pci_dev *dev, unsigned int severity) {
- if (!dev->aer_info)
return 1;
This is a static function, it cannot be called from outside aer.c . Why do you need such a check ?
I a check was to be made it should be in pci_aer_init() and in fact if kmalloc fails then all the probe should be made to fail.
switch (severity) { case AER_NONFATAL: return __ratelimit(&dev->aer_info->nonfatal_ratelimit);
base-commit: e5f0a698b34ed76002dc5cff3804a61c80233a7a change-id: 20250801-aer_crash_2-b21cc2ef0d00
Best regards,
Breno Leitao leitao@debian.org
Hello Christophe,
On Mon, Sep 29, 2025 at 07:01:43PM +0200, Christophe Leroy wrote:
Le 29/09/2025 à 11:15, Breno Leitao a écrit :
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index e286c197d7167..55abc5e17b8b1 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -786,6 +786,9 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev, static int aer_ratelimit(struct pci_dev *dev, unsigned int severity) {
- if (!dev->aer_info)
return 1;
This is a static function, it cannot be called from outside aer.c . Why do you need such a check ?
I have more than 5 hosts crash in recent kernel due to this issue. This is causing real crashes.
[30745.821301] [ T964809] BUG: kernel NULL pointer dereference, address: 0000000000000264 [30745.835249] [ T964809] #PF: supervisor read access in kernel mode [30745.845541] [ T964809] #PF: error_code(0x0000) - not-present page [30745.855831] [ T964809] PGD 7a26aa067 P4D 7a26aa067 PUD 2a807a067 PMD 0 [30745.867267] [ T964809] Oops: Oops: 0000 [#1] SMP [30745.900469] [ T964809] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE [30745.911999] [ T964809] Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A23 12/08/2020 [30745.929557] [ T964809] Workqueue: events aer_recover_work_func [30745.939339] [ T964809] RIP: 0010:___ratelimit (lib/ratelimit.c:33) [30745.947896] [ T964809] Code: 89 06 48 8d 45 10 48 89 46 08 48 89 e0 48 89 46 10 48 89 df e8 18 48 0d 00 48 8d 65 f8 5b 5d c3 cc 55 41 57 41 56 41 54 53 50 <4c> 63 77 04 4d 85 f6 0f 9e c0 8b 5f 08 85 db 0f 9e c1 08 c1 80 f9 All code ======== 0: 89 06 mov %eax,(%rsi) 2: 48 8d 45 10 lea 0x10(%rbp),%rax 6: 48 89 46 08 mov %rax,0x8(%rsi) a: 48 89 e0 mov %rsp,%rax d: 48 89 46 10 mov %rax,0x10(%rsi) 11: 48 89 df mov %rbx,%rdi 14: e8 18 48 0d 00 call 0xd4831 19: 48 8d 65 f8 lea -0x8(%rbp),%rsp 1d: 5b pop %rbx 1e: 5d pop %rbp 1f: c3 ret 20: cc int3 21: 55 push %rbp 22: 41 57 push %r15 24: 41 56 push %r14 26: 41 54 push %r12 28: 53 push %rbx 29: 50 push %rax 2a:* 4c 63 77 04 movslq 0x4(%rdi),%r14 <-- trapping instruction 2e: 4d 85 f6 test %r14,%r14 31: 0f 9e c0 setle %al 34: 8b 5f 08 mov 0x8(%rdi),%ebx 37: 85 db test %ebx,%ebx 39: 0f 9e c1 setle %cl 3c: 08 c1 or %al,%cl 3e: 80 .byte 0x80 3f: f9 stc
Code starting with the faulting instruction =========================================== 0: 4c 63 77 04 movslq 0x4(%rdi),%r14 4: 4d 85 f6 test %r14,%r14 7: 0f 9e c0 setle %al a: 8b 5f 08 mov 0x8(%rdi),%ebx d: 85 db test %ebx,%ebx f: 0f 9e c1 setle %cl 12: 08 c1 or %al,%cl 14: 80 .byte 0x80 15: f9 stc [30745.985517] [ T964809] RSP: 0018:ffffc9002a10fc88 EFLAGS: 00010206 [30745.996094] [ T964809] RAX: 0000000000000002 RBX: 0000000000000002 RCX: ffffffff8260c509 [30746.010396] [ T964809] RDX: 0000000000000000 RSI: ffffffff825f37b7 RDI: 0000000000000260 [30746.024710] [ T964809] RBP: 0000000000000000 R08: 8080808080808080 R09: 0000000000000000 [30746.039044] [ T964809] R10: ffffc9002a10fcf0 R11: 8080000000000000 R12: 0000000000000000 [30746.053349] [ T964809] R13: 0000000000000000 R14: ffff8882851f4000 R15: ffffc90000c312e0 [30746.067649] [ T964809] FS: 0000000000000000(0000) GS:ffff8890fa91e000(0000) knlGS:0000000000000000 [30746.083875] [ T964809] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [30746.095394] [ T964809] CR2: 0000000000000264 CR3: 00000008150b6002 CR4: 00000000007726f0 [30746.109695] [ T964809] PKRU: 55555554 [30746.115123] [ T964809] Call Trace: [30746.120028] [ T964809] <TASK> [30746.124253] [ T964809] pci_print_aer (drivers/pci/pcie/aer.c:929) [30746.140335] [ T964809] aer_recover_work_func (drivers/pci/pcie/aer.c:1181) [30746.149070] [ T964809] process_scheduled_works (kernel/workqueue.c:3243 kernel/workqueue.c:3321) [30746.158326] [ T964809] ? worker_thread (kernel/workqueue.c:3355) [30746.166195] [ T964809] worker_thread (./include/linux/list.h:373 kernel/workqueue.c:946 kernel/workqueue.c:3403) [30746.173718] [ T964809] kthread (kernel/kthread.c:466) [30746.180018] [ T964809] ? kick_pool (kernel/workqueue.c:3348) [30746.187183] [ T964809] ? finish_task_switch (./include/linux/perf_event.h:? kernel/sched/core.c:5260) [30746.195911] [ T964809] ? housekeeping_affine (kernel/kthread.c:413) [30746.204502] [ T964809] ret_from_fork (arch/x86/kernel/process.c:154) [30746.211842] [ T964809] ? housekeeping_affine (kernel/kthread.c:413) [30746.220400] [ T964809] ret_from_fork_asm (arch/x86/entry/entry_64.S:258)
This stack is based on commit 038d61fd64227 ("Linux 6.16").
I a check was to be made it should be in pci_aer_init() and in fact if kmalloc fails then all the probe should be made to fail.
Ok, the stack is not going through pci_aer_init(), but, through aer_recover_work_func() -> pci_print_aer().
On Mon, Sep 29, 2025 at 02:15:47AM -0700, Breno Leitao wrote:
Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called when dev->aer_info is NULL. Add a NULL check before proceeding to avoid calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which does not rate limit, given this is fatal.
This prevents a kernel crash triggered by dereferencing a NULL pointer in aer_ratelimit(), ensuring safer handling of PCI devices that lack AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr() which already performs this NULL check.
Cc: stable@vger.kernel.org Fixes: a57f2bfb4a5863 ("PCI/AER: Ratelimit correctable and non-fatal error logging") Signed-off-by: Breno Leitao leitao@debian.org
Thanks, Breno, I applied this to pci/aer for v6.18. I added a little more detail to the commit log because the path where we hit this is a bit obscure. Please take a look and see if it makes sense:
https://git.kernel.org/cgit/linux/kernel/git/pci/pci.git/commit/?id=451f30b9...
- This problem is still happening in upstream, and unfortunately no action was done in the previous discussion.
- Link to previous post: https://lore.kernel.org/r/20250804-aer_crash_2-v1-1-fd06562c18a4@debian.org
drivers/pci/pcie/aer.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index e286c197d7167..55abc5e17b8b1 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -786,6 +786,9 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev, static int aer_ratelimit(struct pci_dev *dev, unsigned int severity) {
- if (!dev->aer_info)
return 1;
- switch (severity) { case AER_NONFATAL: return __ratelimit(&dev->aer_info->nonfatal_ratelimit);
base-commit: e5f0a698b34ed76002dc5cff3804a61c80233a7a change-id: 20250801-aer_crash_2-b21cc2ef0d00
Best regards,
Breno Leitao leitao@debian.org
linux-stable-mirror@lists.linaro.org