[REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()


Alexey Klimov

15 Apr 2025

6:28 p.m.

#regzbot introduced: v6.12..v6.13

I use an RX6600 on an arm64 Orion O6 board, and it seems that amdgpu is broken on recent kernels; it fails on boot:

[drm] amdgpu: 7886M of GTT memory ready.
[drm] GART: num cpu pages 131072, num gpu pages 131072
SError Interrupt on CPU11, code 0x00000000be000011 -- SError
CPU: 11 UID: 0 PID: 255 Comm: (udev-worker) Tainted: G S 6.15.0-rc2+ #1 VOLUNTARY
Tainted: [S]=CPU_OUT_OF_SPEC
Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.0 Jan 1 1980
pstate: 83400009 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : amdgpu_device_rreg+0x60/0xe4 [amdgpu]
lr : hdp_v5_0_flush_hdp+0x6c/0x80 [amdgpu]
sp : ffffffc08321b490
x29: ffffffc08321b490 x28: ffffff80b8b80000 x27: ffffff80b8bd0178
x26: ffffff80b8b8fe88 x25: 0000000000000001 x24: ffffff8081647000
x23: ffffffc079d6e000 x22: ffffff80b8bd5000 x21: 000000000007f000
x20: 000000000001fc00 x19: 00000000ffffffff x18: 00000000000015fc
x17: 00000000000015fc x16: 00000000000015cf x15: 00000000000015ce
x14: 00000000000015d0 x13: 00000000000015d1 x12: 00000000000015d2
x11: 00000000000015d3 x10: 000000000000ec00 x9 : 00000000000015fd
x8 : 00000000000015fd x7 : 0000000000001689 x6 : 0000000000555401
x5 : 0000000000000001 x4 : 0000000000100000 x3 : 0000000000100000
x2 : 0000000000000000 x1 : 000000000007f000 x0 : 0000000000000000
Kernel panic - not syncing: Asynchronous SError Interrupt
CPU: 11 UID: 0 PID: 255 Comm: (udev-worker) Tainted: G S 6.15.0-rc2+ #1 VOLUNTARY
Tainted: [S]=CPU_OUT_OF_SPEC
Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.0 Jan 1 1980
Call trace:
 show_stack+0x2c/0x84 (C)
 dump_stack_lvl+0x60/0x80
 dump_stack+0x18/0x24
 panic+0x148/0x330
 add_taint+0x0/0xbc
 arm64_serror_panic+0x64/0x7c
 do_serror+0x28/0x68
 el1h_64_error_handler+0x30/0x48
 el1h_64_error+0x6c/0x70
 amdgpu_device_rreg+0x60/0xe4 [amdgpu] (P)
 hdp_v5_0_flush_hdp+0x6c/0x80 [amdgpu]
 gmc_v10_0_hw_init+0xec/0x1fc [amdgpu]
 amdgpu_device_init+0x19f8/0x2480 [amdgpu]
 amdgpu_driver_load_kms+0x20/0xb0 [amdgpu]
 amdgpu_pci_probe+0x1b8/0x5d4 [amdgpu]
 pci_device_probe+0xbc/0x1a8
 really_probe+0xc0/0x39c
 __driver_probe_device+0x7c/0x14c
 driver_probe_device+0x3c/0x120
 __driver_attach+0xc4/0x200
 bus_for_each_dev+0x68/0xb4
 driver_attach+0x24/0x30
 bus_add_driver+0x110/0x240
 driver_register+0x68/0x124
 __pci_register_driver+0x44/0x50
 amdgpu_init+0x84/0xf94 [amdgpu]
 do_one_initcall+0x60/0x1e0
 do_init_module+0x54/0x200
 load_module+0x18f8/0x1e68
 init_module_from_file+0x74/0xa0
 __arm64_sys_finit_module+0x1e0/0x3f0
 invoke_syscall+0x64/0xe4
 el0_svc_common.constprop.0+0x40/0xe0
 do_el0_svc+0x1c/0x28
 el0_svc+0x34/0xd0
 el0t_64_sync_handler+0x10c/0x138
 el0t_64_sync+0x198/0x19c
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x1000,000000e0,f169a650,9b7ff667
Memory Limit: none
---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---

(bios version seems to be 45 years old but that is the state of the board when I received it)

I also saw this crash with an RX6700. Older Radeons such as the HD5450, and an NVIDIA GT1030, work fine on that board.

A little testing showed that the regression was introduced between 6.12 and 6.13. It also seems that some distro kernels have already picked up the change: several ISO images I tried failed to boot before I found one with kernel 6.8 that worked just fine.

The only change related to hdp_v5_0_flush_hdp() was cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP

Reverting that commit ^^ did help and resolved that problem. Before sending the revert as-is, I wanted to know whether there is supposed to be a proper fix for this, or whether someone is interested in debugging this or has any suggestions.

In theory I also need to confirm that exactly that change introduced the regression.

Thanks, Alexey


Fugang Duan

16 Apr

3:12 a.m.

New subject: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

From: Alexey Klimov alexey.klimov@linaro.org Sent: April 16, 2025 2:28

...


Can you revert the change and try again https://gitlab.com/linux-kernel/linux/-/commit/cf424020e040be35df05b682b546b...

Thanks, Fugang

Alexey Klimov

11:25 a.m.

New subject: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

On Wed Apr 16, 2025 at 4:12 AM BST, Fugang Duan wrote:


> Can you revert the change and try again https://gitlab.com/linux-kernel/linux/-/commit/cf424020e040be35df05b682b546b...

Please read my email in the first place. Let me quote just in case:

> The only change related to hdp_v5_0_flush_hdp() was cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP
> ...
> Reverting that commit ^^ did help and resolved that problem.

Thanks, Alexey

Alex Deucher

2:49 p.m.

New subject: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

On Wed, Apr 16, 2025 at 9:48 AM Alexey Klimov alexey.klimov@linaro.org wrote:

> Reverting that commit ^^ did help and resolved that problem.

We can't really revert the change as that will lead to coherency problems. What is the page size on your system? Does the attached patch fix it?

Alex


Fugang Duan

17 Apr

12:42 a.m.

New subject: Re: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

From: Alex Deucher alexdeucher@gmail.com Sent: April 16, 2025 22:49


> We can't really revert the change as that will lead to coherency problems. What is the page size on your system? Does the attached patch fix it?
>
> Alex

4K page size. We can try the fix if we got the environment.

Fugang

Alex Deucher

1:08 p.m.

New subject: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

On Wed, Apr 16, 2025 at 8:43 PM Fugang Duan fugang.duan@cixtech.com wrote:


> 4K page size. We can try the fix if we got the environment.

OK. that patch won't change anything then. Can you try this patch instead?

Alex

...

Fugang

This email (including its attachments) is intended only for the person or entity to which it is addressed and may contain information that is privileged, confidential or otherwise protected from disclosure. Unauthorized use, dissemination, distribution or copying of this email or the information herein or taking any action in reliance on the contents of this email or the information herein, by anyone other than the intended recipient, or an employee or agent responsible for delivering the message to the intended recipient, is strictly prohibited. If you are not the intended recipient, please do not read, copy, use or disclose any part of this e-mail to others. Please notify the sender immediately and permanently delete this e-mail and any attachments if you received it in error. Internet communications cannot be guaranteed to be timely, secure, error-free or virus-free. The sender does not accept liability for any errors or omissions.

Fugang Duan

18 Apr

12:30 a.m.

New subject: Re: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

From: Alex Deucher alexdeucher@gmail.com Sent: April 17, 2025 21:08


> OK. that patch won't change anything then. Can you try this patch instead?
>
> Alex

Alex, we are very sorry, but our team doesn't have the GPU card on hand. It would be better to ask the AMD gfx team to help try the fixes.


Alex Deucher

1:10 a.m.

New subject: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

On Thu, Apr 17, 2025 at 8:30 PM Fugang Duan fugang.duan@cixtech.com wrote:


> Alex, we are very sorry, but our team doesn't have the GPU card on hand. It would be better to ask the AMD gfx team to help try the fixes.

Sorry, we don't have the problematic arm board. This code works as expected on x86.

Alex


Alexey Klimov

22 Apr

2:20 a.m.

New subject: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

On Thu Apr 17, 2025 at 2:08 PM BST, Alex Deucher wrote:


> OK. that patch won't change anything then. Can you try this patch instead?

Config I am using is basically defconfig wrt memory parameters, yeah, i use 4k.

So I tested that patch, thank you, and some other different configurations -- nothing helped. Exactly the same behaviour with the same backtrace.

So it seems that it is firmware problem after all?

Thanks, Alexey

Alex Deucher

1 p.m.

New subject: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

On Mon, Apr 21, 2025 at 10:21 PM Alexey Klimov alexey.klimov@linaro.org wrote:


> Config I am using is basically defconfig wrt memory parameters, yeah, i use 4k.
>
> So I tested that patch, thank you, and some other different configurations -- nothing helped. Exactly the same behaviour with the same backtrace.

Did you test the first (4k check) or the second (don't remap on ARM) patch?

...

> So it seems that it is firmware problem after all?

There is no GPU firmware involved in this operation. It's just a posted write. E.g., we write to a register to flush the HDP write queue and then read the register back to make sure the write posted. If the second patch didn't help, then perhaps there is some issue with MMIO access on your platform?

Alex


Alexey Klimov

3:59 p.m.

New subject: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

On Tue Apr 22, 2025 at 2:00 PM BST, Alex Deucher wrote:


> Did you test the first (4k check) or the second (don't remap on ARM) patch?

The second one. I think you mentioned that first one won't help for 4k pages.

> > So it seems that it is firmware problem after all?
>
> There is no GPU firmware involved in this operation. It's just a posted write. E.g., we write to a register to flush the HDP write queue and then read the register back to make sure the write posted. If the second patch didn't help, then perhaps there is some issue with MMIO access on your platform?

I didn't mean GPU firmware at all. I only had uefi/EL3 firmwares in mind.

Completely out of the blue, based on nothing, do you think that adding delay/some mem barrier between write and read might help? If the host data path code is also executed during ordinary desktop usage, I wonder why it doesn't break later. But yeah, I also think this is a problem with this motherboard. Thank you.

Thanks, Alexey

Christian König

23 Apr

2:32 p.m.

New subject: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

On 4/22/25 17:59, Alexey Klimov wrote:


> Completely out of the blue, based on nothing, do you think that adding delay/some mem barrier between write and read might help?

That would still be quite some platform bug.

...

> If the host data path code is also executed during ordinary desktop usage, I wonder why it doesn't break later.

Maybe it's some kind of write/read re-ordering issue.

> But yeah, I also think this is a problem with this motherboard. Thank you.

You should probably ping some ARM guys to figure out what the fault code actually means.

Regards, Christian.


Alex Deucher

24 Apr

3:44 p.m.

New subject: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

On Tue, Apr 22, 2025 at 11:59 AM Alexey Klimov alexey.klimov@linaro.org wrote:

...

On Tue Apr 22, 2025 at 2:00 PM BST, Alex Deucher wrote:

...
On Mon, Apr 21, 2025 at 10:21 PM Alexey Klimov alexey.klimov@linaro.org wrote:

...
On Thu Apr 17, 2025 at 2:08 PM BST, Alex Deucher wrote:

...
On Wed, Apr 16, 2025 at 8:43 PM Fugang Duan fugang.duan@cixtech.com wrote:

...
发件人: Alex Deucher alexdeucher@gmail.com 发送时间: 2025年4月16日 22:49

...
收件人: Alexey Klimov alexey.klimov@linaro.org On Wed, Apr 16, 2025 at 9:48 AM Alexey Klimov alexey.klimov@linaro.org wrote: > > On Wed Apr 16, 2025 at 4:12 AM BST, Fugang Duan wrote: > > 发件人: Alexey Klimov alexey.klimov@linaro.org 发送时间: 2025年4月16 日 2:28 > >>#regzbot introduced: v6.12..v6.13 > >>The only change related to hdp_v5_0_flush_hdp() was > >>cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP > >> > >>Reverting that commit ^^ did help and resolved that problem. Before

[..]

...
...
...
OK. that patch won't change anything then. Can you try this patch instead?

Config I am using is basically defconfig wrt memory parameters, yeah, i use 4k.

So I tested that patch, thank you, and some other different configurations -- nothing helped. Exactly the same behaviour with the same backtrace.

Did you test the first (4k check) or the second (don't remap on ARM) patch?

The second one. I think you mentioned that the first one won't help for 4k pages.

...
...
So it seems that it is a firmware problem after all?

There is no GPU firmware involved in this operation. It's just a posted write. E.g., we write to a register to flush the HDP write queue and then read the register back to make sure the write posted. If the second patch didn't help, then perhaps there is some issue with MMIO access on your platform?
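The write-then-read-back pattern described above can be illustrated with a toy model. This is only a sketch in Python, not amdgpu code: the `ToyPostedBus` class, the `flush_with_readback` helper and the register offset are all hypothetical, and the model assumes nothing more than the usual bus ordering rule that a read may not pass earlier posted writes to the same device.

```python
from collections import deque

# Hypothetical register offset, chosen only for illustration.
HYPOTHETICAL_FLUSH_REG = 0x1000


class ToyPostedBus:
    """Toy model of a bus where writes are posted (queued) and reads are not."""

    def __init__(self):
        self.pending = deque()   # writes the CPU has issued but the device has not seen
        self.registers = {}      # device-side register state

    def write(self, offset, value):
        # A posted write returns immediately; the device has not received it yet.
        self.pending.append((offset, value))

    def read(self, offset):
        # A read must not pass earlier posted writes, so the queue drains first.
        while self.pending:
            off, val = self.pending.popleft()
            self.registers[off] = val
        return self.registers.get(offset, 0)


def flush_with_readback(bus, flush_reg):
    """Write the flush register, then read it back so the write is known to have landed."""
    bus.write(flush_reg, 0)
    return bus.read(flush_reg)


if __name__ == "__main__":
    bus = ToyPostedBus()
    bus.write(HYPOTHETICAL_FLUSH_REG, 0)
    print("pending after write only:", len(bus.pending))
    flush_with_readback(bus, HYPOTHETICAL_FLUSH_REG)
    print("pending after read-back:", len(bus.pending))
```

In this model the read is what forces the queued write out to the device, which is also why a platform that mishandles the read can fault at exactly that point.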

I didn't mean GPU firmware at all. I only had uefi/EL3 firmwares in mind.

Completely out of the blue, based on nothing, do you think that adding a delay or some memory barrier between the write and the read might help? I wonder: if the host data path code is executed during ordinary desktop usage by an ordinary user, then why doesn't it break later? But yeah, I also think this is a problem with this motherboard. Thank you.

I think I found the problem. The previous patch wasn't doing what I expected. Please try this patch instead.

Thanks,

Alex

...

Thanks, Alexey

Alexey Klimov

27 Apr

1:01 a.m.

New subject: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

On Thu Apr 24, 2025 at 4:44 PM BST, Alex Deucher wrote:

...

On Tue, Apr 22, 2025 at 11:59 AM Alexey Klimov alexey.klimov@linaro.org wrote:

...
On Tue Apr 22, 2025 at 2:00 PM BST, Alex Deucher wrote:

...
On Mon, Apr 21, 2025 at 10:21 PM Alexey Klimov alexey.klimov@linaro.org wrote:

...
On Thu Apr 17, 2025 at 2:08 PM BST, Alex Deucher wrote:

...
On Wed, Apr 16, 2025 at 8:43 PM Fugang Duan fugang.duan@cixtech.com wrote:

...
From: Alex Deucher alexdeucher@gmail.com Sent: 16 April 2025 22:49
> To: Alexey Klimov alexey.klimov@linaro.org
> On Wed, Apr 16, 2025 at 9:48 AM Alexey Klimov alexey.klimov@linaro.org wrote:
>>
>> On Wed Apr 16, 2025 at 4:12 AM BST, Fugang Duan wrote:
>> > From: Alexey Klimov alexey.klimov@linaro.org Sent: 16 April 2025 2:28
>> >> #regzbot introduced: v6.12..v6.13
>> >> The only change related to hdp_v5_0_flush_hdp() was
>> >> cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP
>> >>
>> >> Reverting that commit ^^ did help and resolved that problem. Before

[..]

...
...
...
OK. that patch won't change anything then. Can you try this patch instead?

The config I am using is basically defconfig wrt memory parameters, yeah, I use 4k.

So I tested that patch, thank you, and some other different configurations -- nothing helped. Exactly the same behaviour with the same backtrace.

Did you test the first (4k check) or the second (don't remap on ARM) patch?

The second one. I think you mentioned that the first one won't help for 4k pages.

...
...
So it seems that it is a firmware problem after all?

There is no GPU firmware involved in this operation. It's just a posted write. E.g., we write to a register to flush the HDP write queue and then read the register back to make sure the write posted. If the second patch didn't help, then perhaps there is some issue with MMIO access on your platform?

I didn't mean GPU firmware at all. I only had uefi/EL3 firmwares in mind.

Completely out of the blue, based on nothing, do you think that adding a delay or some memory barrier between the write and the read might help? I wonder: if the host data path code is executed during ordinary desktop usage by an ordinary user, then why doesn't it break later? But yeah, I also think this is a problem with this motherboard. Thank you.

I think I found the problem. The previous patch wasn't doing what I expected. Please try this patch instead.

This one works!

[ 4.483750] [drm] amdgpu kernel modesetting enabled.
[ 4.491985] amdgpu: IO link not available for non x86 platforms
[ 4.497189] amdgpu: Virtual CRAT table created for CPU
[ 4.497559] amdgpu: Topology: Add CPU node
[ 4.509623] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 0 <nv_common>
[ 4.512905] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 1 <gmc_v10_0>
[ 4.513254] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 2 <navi10_ih>
[ 4.513595] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 3 <psp>
[ 4.513932] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 4 <smu>
[ 4.514278] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 5 <dm>
[ 4.514625] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 6 <gfx_v10_0>
[ 4.514980] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 7 <sdma_v5_2>
[ 4.515334] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 8 <vcn_v3_0>
[ 4.515699] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 9 <jpeg_v3_0>
[ 4.516087] amdgpu 0000:c3:00.0: amdgpu: Fetched VBIOS from VFCT
[ 4.516466] amdgpu: ATOM BIOS: 113-V502MECH-0OC
[ 4.749748] amdgpu 0000:c3:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 4.777435] amdgpu 0000:c3:00.0: BAR 2 [mem 0x1810000000-0x18101fffff 64bit pref]: releasing
[ 4.793256] amdgpu 0000:c3:00.0: BAR 0 [mem 0x1800000000-0x180fffffff 64bit pref]: releasing
[ 4.844639] amdgpu 0000:c3:00.0: BAR 0 [mem 0x1800000000-0x19ffffffff 64bit pref]: assigned
[ 4.849774] amdgpu 0000:c3:00.0: BAR 2 [mem 0x1a00000000-0x1a001fffff 64bit pref]: assigned
[ 4.957411] amdgpu 0000:c3:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[ 4.967618] amdgpu 0000:c3:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 4.992963] [drm] amdgpu: 8176M of VRAM memory ready
[ 5.004032] [drm] amdgpu: 7888M of GTT memory ready.
[ 6.224159] amdgpu 0000:c3:00.0: amdgpu: STB initialized to 2048 entries
[ 6.284328] amdgpu 0000:c3:00.0: amdgpu: Found VCN firmware Version ENC: 1.33 DEC: 4 VEP: 0 Revision: 3
[ 6.361142] amdgpu 0000:c3:00.0: amdgpu: reserve 0xa00000 from 0x81fd000000 for PSP TMR
[ 6.471231] amdgpu 0000:c3:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 6.492967] amdgpu 0000:c3:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 6.492993] amdgpu 0000:c3:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw program = 0, version = 0x003b3100 (59.49.0)
[ 6.513659] amdgpu 0000:c3:00.0: amdgpu: SMU driver if version not matched
[ 6.513699] amdgpu 0000:c3:00.0: amdgpu: use vbios provided pptable
[ 6.588418] amdgpu 0000:c3:00.0: amdgpu: SMU is initialized successfully!
[ 6.800975] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 6.806709] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[ 6.813516] amdgpu: Virtual CRAT table created for GPU
[ 6.819229] amdgpu: Topology: Add dGPU node [0x73ff:0x1002]
[ 6.824865] kfd kfd: amdgpu: added device 1002:73ff
[ 6.829821] amdgpu 0000:c3:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 8, active_cu_number 28
[ 6.838355] amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 6.846007] amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[ 6.853658] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[ 6.861398] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[ 6.869137] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 6.876877] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 6.884615] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 6.892356] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 6.900094] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 6.907921] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 6.915748] amdgpu 0000:c3:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
[ 6.923663] amdgpu 0000:c3:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
[ 6.931050] amdgpu 0000:c3:00.0: amdgpu: ring sdma1 uses VM inv eng 14 on hub 0
[ 6.938439] amdgpu 0000:c3:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[ 6.946089] amdgpu 0000:c3:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[ 6.953916] amdgpu 0000:c3:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[ 6.961742] amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 6.970485] amdgpu 0000:c3:00.0: amdgpu: Using BACO for runtime pm
[ 6.977167] [drm] Initialized amdgpu 3.63.0 for 0000:c3:00.0 on minor 0
[ 7.234638] amdgpu 0000:c3:00.0: [drm] fb0: amdgpudrmfb frame buffer device

root@orion:~ # uname -a
Linux orion 6.15.0-rc3test6+ #1 SMP Sun Apr 27 01:12:10 BST 2025 aarch64 GNU/Linux

Thank you for taking a look into this.

Best regards, Alexey

Alex Deucher

30 Apr

4:55 p.m.

New subject: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

I think I have a better solution. Please try these patches instead. Thanks!

For the RX6600, you only need patch 0003. The rest of the series fixes up other chips.

Thanks,

Alex


Alexey Klimov

11 May

11:24 p.m.

New subject: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

On Wed, 30 Apr 2025 at 17:55, Alex Deucher alexdeucher@gmail.com wrote:

...

I think I have a better solution. Please try these patches instead. Thanks!

For the RX6600, you only need patch 0003. The rest of the series fixes up other chips.

Sorry for the delay. Finally managed to find some time to test it. It seems that patches are merged in the current -rc tree so I just re-tested -rc5. All works. Thank you.

A slightly annoying thing is this repeating message:
[drm] Unknown EDID CEA parser results
and I also didn't observe messages like this before on -rc2 or -rc3:
amdgpu 0000:c3:00.0: amdgpu: [drm] amdgpu: DP AUX transfer fail:4

dmesg is in attachment. But I don't think that these are related to hdp_v5_0_flush_hdp() issue.

Best regards, Alexey

Alex Deucher

12 May

2:46 p.m.

New subject: Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

On Sun, May 11, 2025 at 7:25 PM Alexey Klimov alexey.klimov@linaro.org wrote:

...

On Wed, 30 Apr 2025 at 17:55, Alex Deucher alexdeucher@gmail.com wrote:

...
I think I have a better solution. Please try these patches instead. Thanks!

For the RX6600, you only need patch 0003. The rest of the series fixes up other chips.

Sorry for the delay. Finally managed to find some time to test it. It seems that patches are merged in the current -rc tree so I just re-tested -rc5. All works. Thank you.

Thanks.

...

A slightly annoying thing is this repeating message:
[drm] Unknown EDID CEA parser results
and I also didn't observe messages like this before on -rc2 or -rc3:
amdgpu 0000:c3:00.0: amdgpu: [drm] amdgpu: DP AUX transfer fail:4

dmesg is in attachment. But I don't think that these are related to hdp_v5_0_flush_hdp() issue.

Correct. There was a DP AUX fix that also landed that was a bit too chatty in some cases. There will be a patch to quiet that down.

Alex

...

Best regards, Alexey

Christian König

16 Apr

11:44 a.m.

Am 15.04.25 um 20:28 schrieb Alexey Klimov:

...

#regzbot introduced: v6.12..v6.13

I use RX6600 on arm64 Orion o6 board and it seems that amdgpu is broken on recent kernels, fails on boot:

Well in general we already had tons of problems with low end ARM64 boards. So first question of all is that board SBSA certified?

If not then the chances of that board actually working correctly are very low unfortunately.

...

[drm] amdgpu: 7886M of GTT memory ready.
[drm] GART: num cpu pages 131072, num gpu pages 131072
SError Interrupt on CPU11, code 0x00000000be000011 -- SError

Any idea what that error code means?

Thanks, Christian.

...

CPU: 11 UID: 0 PID: 255 Comm: (udev-worker) Tainted: G S 6.15.0-rc2+ #1 VOLUNTARY
Tainted: [S]=CPU_OUT_OF_SPEC
Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.0 Jan 1 1980
pstate: 83400009 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : amdgpu_device_rreg+0x60/0xe4 [amdgpu]
lr : hdp_v5_0_flush_hdp+0x6c/0x80 [amdgpu]
sp : ffffffc08321b490
x29: ffffffc08321b490 x28: ffffff80b8b80000 x27: ffffff80b8bd0178
x26: ffffff80b8b8fe88 x25: 0000000000000001 x24: ffffff8081647000
x23: ffffffc079d6e000 x22: ffffff80b8bd5000 x21: 000000000007f000
x20: 000000000001fc00 x19: 00000000ffffffff x18: 00000000000015fc
x17: 00000000000015fc x16: 00000000000015cf x15: 00000000000015ce
x14: 00000000000015d0 x13: 00000000000015d1 x12: 00000000000015d2
x11: 00000000000015d3 x10: 000000000000ec00 x9 : 00000000000015fd
x8 : 00000000000015fd x7 : 0000000000001689 x6 : 0000000000555401
x5 : 0000000000000001 x4 : 0000000000100000 x3 : 0000000000100000
x2 : 0000000000000000 x1 : 000000000007f000 x0 : 0000000000000000
Kernel panic - not syncing: Asynchronous SError Interrupt
CPU: 11 UID: 0 PID: 255 Comm: (udev-worker) Tainted: G S 6.15.0-rc2+ #1 VOLUNTARY
Tainted: [S]=CPU_OUT_OF_SPEC
Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.0 Jan 1 1980
Call trace:
 show_stack+0x2c/0x84 (C)
 dump_stack_lvl+0x60/0x80
 dump_stack+0x18/0x24
 panic+0x148/0x330
 add_taint+0x0/0xbc
 arm64_serror_panic+0x64/0x7c
 do_serror+0x28/0x68
 el1h_64_error_handler+0x30/0x48
 el1h_64_error+0x6c/0x70
 amdgpu_device_rreg+0x60/0xe4 [amdgpu] (P)
 hdp_v5_0_flush_hdp+0x6c/0x80 [amdgpu]
 gmc_v10_0_hw_init+0xec/0x1fc [amdgpu]
 amdgpu_device_init+0x19f8/0x2480 [amdgpu]
 amdgpu_driver_load_kms+0x20/0xb0 [amdgpu]
 amdgpu_pci_probe+0x1b8/0x5d4 [amdgpu]
 pci_device_probe+0xbc/0x1a8
 really_probe+0xc0/0x39c
 __driver_probe_device+0x7c/0x14c
 driver_probe_device+0x3c/0x120
 __driver_attach+0xc4/0x200
 bus_for_each_dev+0x68/0xb4
 driver_attach+0x24/0x30
 bus_add_driver+0x110/0x240
 driver_register+0x68/0x124
 __pci_register_driver+0x44/0x50
 amdgpu_init+0x84/0xf94 [amdgpu]
 do_one_initcall+0x60/0x1e0
 do_init_module+0x54/0x200
 load_module+0x18f8/0x1e68
 init_module_from_file+0x74/0xa0
 __arm64_sys_finit_module+0x1e0/0x3f0
 invoke_syscall+0x64/0xe4
 el0_svc_common.constprop.0+0x40/0xe0
 do_el0_svc+0x1c/0x28
 el0_svc+0x34/0xd0
 el0t_64_sync_handler+0x10c/0x138
 el0t_64_sync+0x198/0x19c
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x1000,000000e0,f169a650,9b7ff667
Memory Limit: none
---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---

(bios version seems to be 45 years old but that is the state of the board when I received it)

Also saw this crash with RX6700. Old radeons like HD5450 and nvidia gt1030 work fine on that board.

A little bit of testing showed that it was introduced between 6.12 and 6.13. Also it seems that changes were taken by some distro kernels already and different iso images I tried failed to boot before I bumped into some iso with kernel 6.8 that worked just fine.

The only change related to hdp_v5_0_flush_hdp() was cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP

Reverting that commit ^^ did help and resolved that problem. Before sending revert as-is I was interested to know if there supposed to be a proper fix for this or maybe someone is interested to debug this or have any suggestions.

In theory I also need to confirm that exactly that change introduced the regression.

Thanks, Alexey

Alexey Klimov

22 Apr

2:49 a.m.

On Wed Apr 16, 2025 at 12:44 PM BST, Christian König wrote:

...

Am 15.04.25 um 20:28 schrieb Alexey Klimov:

...
#regzbot introduced: v6.12..v6.13

I use RX6600 on arm64 Orion o6 board and it seems that amdgpu is broken on recent kernels, fails on boot:

Well in general we already had tons of problems with low end ARM64 boards. So first question of all is that board SBSA certified?

Yeah, I can imagine. I can't find any info about SBSA certification for that board, hence I'd say that its state is unknown, hence most likely "no". At least that's what I think. It is a good question for cix or for people with cixtech.com email addresses.

They have some updated potentially unstable UEFI firmwares to test though.

...

If not then the chances of that board actually working correctly are very low unfortunately.

...
[drm] amdgpu: 7886M of GTT memory ready.
[drm] GART: num cpu pages 131072, num gpu pages 131072
SError Interrupt on CPU11, code 0x00000000be000011 -- SError

Any idea what that error code means?

Well, the current thinking is that it means:
-- bits 31:26: system error interrupt;
-- bit 25 indicates that it was a 32-bit instruction;
-- 0x11 in the LSBs is probably implementation-defined, which can be anything like bus errors, parity, access violations, etc.

That's probably not very helpful here.
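For reference, the bit split discussed above can be reproduced with a short Python sketch. The top-level layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0) matches the breakdown in this message; the sub-fields pulled out of the ISS for an SError (IDS, AET, EA, DFSC) and their bit positions are my reading of the ESR_ELx layout and should be treated as assumptions to check against the Arm Architecture Reference Manual. The function name `decode_esr` is made up for this example.

```python
# Exception class value for SError, as implied by "bits 31:26" of 0xbe000011.
EC_SERROR = 0x2F


def decode_esr(esr: int) -> dict:
    """Split an arm64 ESR value into its top-level fields."""
    fields = {
        "ec": (esr >> 26) & 0x3F,   # exception class, bits 31:26
        "il": (esr >> 25) & 0x1,    # instruction length bit, bit 25
        "iss": esr & 0x1FFFFFF,     # instruction specific syndrome, bits 24:0
    }
    if fields["ec"] == EC_SERROR:
        iss = fields["iss"]
        # Assumed SError ISS sub-fields; verify against the Arm ARM.
        fields["ids"] = (iss >> 24) & 0x1    # implementation-defined syndrome flag
        fields["aet"] = (iss >> 10) & 0x7    # asynchronous error type
        fields["ea"] = (iss >> 9) & 0x1      # external abort type
        fields["dfsc"] = iss & 0x3F          # fault status code
    return fields


if __name__ == "__main__":
    for name, value in decode_esr(0xBE000011).items():
        print(f"{name}: {value:#x}")
```

Under those assumptions the code from the log has the IDS bit clear and a fault status code of 0x11; what that implies for this platform is still a question for the Arm documentation or the board vendor.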

Best regards, Alexey

Peter Chen

24 Apr

11:41 a.m.

On 25-04-22 03:49:17, Alexey Klimov wrote:

...


On Wed Apr 16, 2025 at 12:44 PM BST, Christian König wrote:

...
Am 15.04.25 um 20:28 schrieb Alexey Klimov:

...
#regzbot introduced: v6.12..v6.13

I use RX6600 on arm64 Orion o6 board and it seems that amdgpu is broken on recent kernels, fails on boot:

Well in general we already had tons of problems with low end ARM64 boards. So first question of all is that board SBSA certified?

Yeah, I can imagine. I can't find any info about SBSA certification for that board, hence I'd say that its state is unknown, hence most likely "no". At least that's what I think. It is a good question for cix or for people with cixtech.com email addresses.

Hi Alexey,

This board has just received the Arm SystemReady SR v2.5 certificate; see the attachment. Arm is in the process of updating the list, so you may not be able to find it on the website yet.

Peter

...

They have some updated potentially unstable UEFI firmwares to test though.

...
If not then the chances of that board actually working correctly are very low unfortunately.

...
[drm] amdgpu: 7886M of GTT memory ready.
[drm] GART: num cpu pages 131072, num gpu pages 131072
SError Interrupt on CPU11, code 0x00000000be000011 -- SError

Any idea what that error code means?

Well, the current thinking is that it means:
-- bits 31:26: system error interrupt;
-- bit 25 indicates that it was a 32-bit instruction;
-- 0x11 in the LSBs is probably implementation-defined, which can be anything like bus errors, parity, access violations, etc.

That's probably not very helpful here.

Best regards, Alexey

-- Best regards, Peter


linux-stable-mirror@lists.linaro.org


participants (5)

Alex Deucher
Alexey Klimov
Christian König
Fugang Duan
Peter Chen