#regzbot introduced: v6.12..v6.13
I use an RX6600 on an arm64 Orion O6 board, and amdgpu seems to be broken on recent kernels; it fails on boot:
[drm] amdgpu: 7886M of GTT memory ready.
[drm] GART: num cpu pages 131072, num gpu pages 131072
SError Interrupt on CPU11, code 0x00000000be000011 -- SError
CPU: 11 UID: 0 PID: 255 Comm: (udev-worker) Tainted: G S 6.15.0-rc2+ #1 VOLUNTARY
Tainted: [S]=CPU_OUT_OF_SPEC
Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.0 Jan 1 1980
pstate: 83400009 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : amdgpu_device_rreg+0x60/0xe4 [amdgpu]
lr : hdp_v5_0_flush_hdp+0x6c/0x80 [amdgpu]
sp : ffffffc08321b490
x29: ffffffc08321b490 x28: ffffff80b8b80000 x27: ffffff80b8bd0178
x26: ffffff80b8b8fe88 x25: 0000000000000001 x24: ffffff8081647000
x23: ffffffc079d6e000 x22: ffffff80b8bd5000 x21: 000000000007f000
x20: 000000000001fc00 x19: 00000000ffffffff x18: 00000000000015fc
x17: 00000000000015fc x16: 00000000000015cf x15: 00000000000015ce
x14: 00000000000015d0 x13: 00000000000015d1 x12: 00000000000015d2
x11: 00000000000015d3 x10: 000000000000ec00 x9 : 00000000000015fd
x8 : 00000000000015fd x7 : 0000000000001689 x6 : 0000000000555401
x5 : 0000000000000001 x4 : 0000000000100000 x3 : 0000000000100000
x2 : 0000000000000000 x1 : 000000000007f000 x0 : 0000000000000000
Kernel panic - not syncing: Asynchronous SError Interrupt
CPU: 11 UID: 0 PID: 255 Comm: (udev-worker) Tainted: G S 6.15.0-rc2+ #1 VOLUNTARY
Tainted: [S]=CPU_OUT_OF_SPEC
Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.0 Jan 1 1980
Call trace:
 show_stack+0x2c/0x84 (C)
 dump_stack_lvl+0x60/0x80
 dump_stack+0x18/0x24
 panic+0x148/0x330
 add_taint+0x0/0xbc
 arm64_serror_panic+0x64/0x7c
 do_serror+0x28/0x68
 el1h_64_error_handler+0x30/0x48
 el1h_64_error+0x6c/0x70
 amdgpu_device_rreg+0x60/0xe4 [amdgpu] (P)
 hdp_v5_0_flush_hdp+0x6c/0x80 [amdgpu]
 gmc_v10_0_hw_init+0xec/0x1fc [amdgpu]
 amdgpu_device_init+0x19f8/0x2480 [amdgpu]
 amdgpu_driver_load_kms+0x20/0xb0 [amdgpu]
 amdgpu_pci_probe+0x1b8/0x5d4 [amdgpu]
 pci_device_probe+0xbc/0x1a8
 really_probe+0xc0/0x39c
 __driver_probe_device+0x7c/0x14c
 driver_probe_device+0x3c/0x120
 __driver_attach+0xc4/0x200
 bus_for_each_dev+0x68/0xb4
 driver_attach+0x24/0x30
 bus_add_driver+0x110/0x240
 driver_register+0x68/0x124
 __pci_register_driver+0x44/0x50
 amdgpu_init+0x84/0xf94 [amdgpu]
 do_one_initcall+0x60/0x1e0
 do_init_module+0x54/0x200
 load_module+0x18f8/0x1e68
 init_module_from_file+0x74/0xa0
 __arm64_sys_finit_module+0x1e0/0x3f0
 invoke_syscall+0x64/0xe4
 el0_svc_common.constprop.0+0x40/0xe0
 do_el0_svc+0x1c/0x28
 el0_svc+0x34/0xd0
 el0t_64_sync_handler+0x10c/0x138
 el0t_64_sync+0x198/0x19c
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x1000,000000e0,f169a650,9b7ff667
Memory Limit: none
---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
(The BIOS version seems to be 45 years old, but that is the state of the board when I received it.)
I also saw this crash with an RX6700. Older Radeons like the HD5450 and the NVIDIA GT1030 work fine on that board.
A little bit of testing showed that the regression was introduced between 6.12 and 6.13. It also seems that the change has already been picked up by some distro kernels: the different ISO images I tried failed to boot, until I bumped into an ISO with kernel 6.8 that worked just fine.
The only change related to hdp_v5_0_flush_hdp() was cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP
Reverting that commit ^^ did help and resolved the problem. Before sending the revert as-is, I wanted to know whether there is supposed to be a proper fix for this, or whether someone is interested in debugging this or has any suggestions.
In theory I also need to confirm that exactly that change introduced the regression.
Thanks, Alexey
From: Alexey Klimov alexey.klimov@linaro.org Sent: April 16, 2025 2:28
Can you revert the change and try again https://gitlab.com/linux-kernel/linux/-/commit/cf424020e040be35df05b682b546b...
Thanks, Fugang
On Wed Apr 16, 2025 at 4:12 AM BST, Fugang Duan wrote:
Can you revert the change and try again https://gitlab.com/linux-kernel/linux/-/commit/cf424020e040be35df05b682b546b...
Please read my email in the first place. Let me quote just in case:
The only change related to hdp_v5_0_flush_hdp() was cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP
Reverting that commit ^^ did help and resolved that problem.
Thanks, Alexey
On Wed, Apr 16, 2025 at 9:48 AM Alexey Klimov alexey.klimov@linaro.org wrote:
Reverting that commit ^^ did help and resolved that problem.
We can't really revert the change as that will lead to coherency problems. What is the page size on your system? Does the attached patch fix it?
Alex
From: Alex Deucher alexdeucher@gmail.com Sent: April 16, 2025 22:49
To: Alexey Klimov alexey.klimov@linaro.org
We can't really revert the change as that will lead to coherency problems. What is the page size on your system? Does the attached patch fix it?
Alex
4K page size. We can try the fix if we get the environment.
Fugang
On Wed, Apr 16, 2025 at 8:43 PM Fugang Duan fugang.duan@cixtech.com wrote:
4K page size. We can try the fix if we got the environment.
OK, that patch won't change anything then. Can you try this patch instead?
Alex
Fugang
This email (including its attachments) is intended only for the person or entity to which it is addressed and may contain information that is privileged, confidential or otherwise protected from disclosure. Unauthorized use, dissemination, distribution or copying of this email or the information herein or taking any action in reliance on the contents of this email or the information herein, by anyone other than the intended recipient, or an employee or agent responsible for delivering the message to the intended recipient, is strictly prohibited. If you are not the intended recipient, please do not read, copy, use or disclose any part of this e-mail to others. Please notify the sender immediately and permanently delete this e-mail and any attachments if you received it in error. Internet communications cannot be guaranteed to be timely, secure, error-free or virus-free. The sender does not accept liability for any errors or omissions.
From: Alex Deucher alexdeucher@gmail.com Sent: April 17, 2025 21:08
OK. that patch won't change anything then. Can you try this patch instead?
Alex
Alex, we are very sorry, but our team doesn't have the GPU card on hand. It would be better to ask the AMD gfx team to help try the fixes.
Fugang
On Thu, Apr 17, 2025 at 8:30 PM Fugang Duan fugang.duan@cixtech.com wrote:
Alex, it is very sorry that our team don't have the GPU card in hands. It is better to ask amd gfx team help to try the fixes.
Sorry, we don't have the problematic arm board. This code works as expected on x86.
Alex
On Thu Apr 17, 2025 at 2:08 PM BST, Alex Deucher wrote:
OK. that patch won't change anything then. Can you try this patch instead?
The config I am using is basically defconfig with respect to memory parameters, so yes, I use 4K pages.
So I tested that patch, thank you, along with some other configurations, and nothing helped. Exactly the same behaviour with the same backtrace.
So it seems that it is a firmware problem after all?
Thanks, Alexey
On Mon, Apr 21, 2025 at 10:21 PM Alexey Klimov alexey.klimov@linaro.org wrote:
Config I am using is basically defconfig wrt memory parameters, yeah, i use 4k.
So I tested that patch, thank you, and some other different configurations -- nothing helped. Exactly the same behaviour with the same backtrace.
Did you test the first (4k check) or the second (don't remap on ARM) patch?
So it seems that it is firmware problem after all?
There is no GPU firmware involved in this operation. It's just a posted write. E.g., we write to a register to flush the HDP write queue and then read the register back to make sure the write posted. If the second patch didn't help, then perhaps there is some issue with MMIO access on your platform?
Alex
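A toy model of the write-then-read-back sequence described above might look like the sketch below. It is not amdgpu code: `ToyDevice`, `FLUSH_REG` and the pending queue are made-up names for illustration, and the only behaviour modelled is the assumption that a read cannot pass earlier posted writes, so reading the register back forces the queued write to land.

```python
from collections import deque


class ToyDevice:
    """Hypothetical device behind a bus that queues (posts) writes."""

    def __init__(self):
        self.regs = {}          # register values the device has actually seen
        self.pending = deque()  # posted writes still in flight

    def write(self, reg, value):
        # A posted write returns immediately; the value is only queued.
        self.pending.append((reg, value))

    def read(self, reg):
        # A read may not pass earlier writes, so drain the queue first.
        while self.pending:
            queued_reg, queued_value = self.pending.popleft()
            self.regs[queued_reg] = queued_value
        return self.regs.get(reg, 0)


dev = ToyDevice()
dev.write("FLUSH_REG", 1)                 # "flush" request, still in flight
landed_before = "FLUSH_REG" in dev.regs   # device has not seen it yet
value = dev.read("FLUSH_REG")             # posting read forces it through
landed_after = "FLUSH_REG" in dev.regs
```

In this model the write is only guaranteed to have reached the device once the read returns, which is the ordering the posting read is meant to provide.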
On Tue Apr 22, 2025 at 2:00 PM BST, Alex Deucher wrote:
Did you test the first (4k check) or the second (don't remap on ARM) patch?
The second one. I think you mentioned that the first one won't help for 4K pages.
So it seems that it is firmware problem after all?
There is no GPU firmware involved in this operation. It's just a posted write. E.g., we write to a register to flush the HDP write queue and then read the register back to make sure the write posted. If the second patch didn't help, then perhaps there is some issue with MMIO access on your platform?
I didn't mean GPU firmware at all. I only had UEFI/EL3 firmware in mind.
Completely out of the blue, based on nothing, do you think that adding a delay or some memory barrier between the write and the read might help? I also wonder: if the host data path code is executed during ordinary desktop usage, why doesn't it break later? But yeah, I also think this is a problem with this motherboard. Thank you.
Thanks, Alexey
On 4/22/25 17:59, Alexey Klimov wrote:
Completely out of the blue, based on nothing, do you think that adding delay/some mem barrier between write and read might help?
That would still be quite a platform bug.
I wonder if host data path code should be executed during common desktop usage as a common user then why it doesn't break later.
Maybe it's some kind of write/read re-ordering issue.
But yeah, I also think this is this motherboard problem. Thank you.
You should probably ping some ARM guys to figure out what the fault code actually means.
Regards, Christian.
On Tue, Apr 22, 2025 at 11:59 AM Alexey Klimov alexey.klimov@linaro.org wrote:
Completely out of the blue, based on nothing, do you think that adding delay/some mem barrier between write and read might help? I wonder if host data path code should be executed during common desktop usage as a common user then why it doesn't break later. But yeah, I also think this is this motherboard problem. Thank you.
I think I found the problem. The previous patch wasn't doing what I expected. Please try this patch instead.
Thanks,
Alex
On 15.04.25 at 20:28, Alexey Klimov wrote:
#regzbot introduced: v6.12..v6.13
I use RX6600 on arm64 Orion o6 board and it seems that amdgpu is broken on recent kernels, fails on boot:
Well, in general we have already had tons of problems with low-end ARM64 boards. So the first question is: is that board SBSA certified?
If not, then the chances of that board actually working correctly are unfortunately very low.
[drm] amdgpu: 7886M of GTT memory ready.
[drm] GART: num cpu pages 131072, num gpu pages 131072
SError Interrupt on CPU11, code 0x00000000be000011 -- SError
Any idea what that error code means?
Thanks, Christian.
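On the question of what the code means: as a rough sketch, the value from the log can be split into fields, assuming it is an arm64 ESR_ELx value with the usual layout (EC in bits [31:26], IL in bit [25], ISS in bits [24:0], and for an SError ISS: IDS in bit [24], AET in bits [12:10], EA in bit [9], DFSC in bits [5:0]). The helper names below are made up for illustration and are not kernel functions.

```python
def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into its top-level fields (assumed layout)."""
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,    # instruction length bit, bit [25]
        "iss": esr & 0x1FFFFFF,     # instruction specific syndrome, bits [24:0]
    }


def decode_serror_iss(iss: int) -> dict:
    """Split the ISS of an SError exception into its fields (assumed layout)."""
    return {
        "ids": (iss >> 24) & 0x1,   # 1 = implementation defined syndrome
        "aet": (iss >> 10) & 0x7,   # asynchronous error type (RAS extension)
        "ea": (iss >> 9) & 0x1,     # external abort type
        "dfsc": iss & 0x3F,         # data fault status code
    }


fields = decode_esr(0x00000000BE000011)   # code taken from the log
serror = decode_serror_iss(fields["iss"])
```

Under those assumptions, EC comes out as 0x2f (SError interrupt), IDS is 0, and DFSC is 0x11 (asynchronous SError interrupt), so the code by itself appears to say little beyond "an asynchronous external error was reported" and does not identify the source.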
CPU: 11 UID: 0 PID: 255 Comm: (udev-worker) Tainted: G S 6.15.0-rc2+ #1 VOLUNTARY
Tainted: [S]=CPU_OUT_OF_SPEC
Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.0 Jan 1 1980
pstate: 83400009 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : amdgpu_device_rreg+0x60/0xe4 [amdgpu]
lr : hdp_v5_0_flush_hdp+0x6c/0x80 [amdgpu]
sp : ffffffc08321b490
x29: ffffffc08321b490 x28: ffffff80b8b80000 x27: ffffff80b8bd0178
x26: ffffff80b8b8fe88 x25: 0000000000000001 x24: ffffff8081647000
x23: ffffffc079d6e000 x22: ffffff80b8bd5000 x21: 000000000007f000
x20: 000000000001fc00 x19: 00000000ffffffff x18: 00000000000015fc
x17: 00000000000015fc x16: 00000000000015cf x15: 00000000000015ce
x14: 00000000000015d0 x13: 00000000000015d1 x12: 00000000000015d2
x11: 00000000000015d3 x10: 000000000000ec00 x9 : 00000000000015fd
x8 : 00000000000015fd x7 : 0000000000001689 x6 : 0000000000555401
x5 : 0000000000000001 x4 : 0000000000100000 x3 : 0000000000100000
x2 : 0000000000000000 x1 : 000000000007f000 x0 : 0000000000000000
Kernel panic - not syncing: Asynchronous SError Interrupt
CPU: 11 UID: 0 PID: 255 Comm: (udev-worker) Tainted: G S 6.15.0-rc2+ #1 VOLUNTARY
Tainted: [S]=CPU_OUT_OF_SPEC
Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.0 Jan 1 1980
Call trace:
 show_stack+0x2c/0x84 (C)
 dump_stack_lvl+0x60/0x80
 dump_stack+0x18/0x24
 panic+0x148/0x330
 add_taint+0x0/0xbc
 arm64_serror_panic+0x64/0x7c
 do_serror+0x28/0x68
 el1h_64_error_handler+0x30/0x48
 el1h_64_error+0x6c/0x70
 amdgpu_device_rreg+0x60/0xe4 [amdgpu] (P)
 hdp_v5_0_flush_hdp+0x6c/0x80 [amdgpu]
 gmc_v10_0_hw_init+0xec/0x1fc [amdgpu]
 amdgpu_device_init+0x19f8/0x2480 [amdgpu]
 amdgpu_driver_load_kms+0x20/0xb0 [amdgpu]
 amdgpu_pci_probe+0x1b8/0x5d4 [amdgpu]
 pci_device_probe+0xbc/0x1a8
 really_probe+0xc0/0x39c
 __driver_probe_device+0x7c/0x14c
 driver_probe_device+0x3c/0x120
 __driver_attach+0xc4/0x200
 bus_for_each_dev+0x68/0xb4
 driver_attach+0x24/0x30
 bus_add_driver+0x110/0x240
 driver_register+0x68/0x124
 __pci_register_driver+0x44/0x50
 amdgpu_init+0x84/0xf94 [amdgpu]
 do_one_initcall+0x60/0x1e0
 do_init_module+0x54/0x200
 load_module+0x18f8/0x1e68
 init_module_from_file+0x74/0xa0
 __arm64_sys_finit_module+0x1e0/0x3f0
 invoke_syscall+0x64/0xe4
 el0_svc_common.constprop.0+0x40/0xe0
 do_el0_svc+0x1c/0x28
 el0_svc+0x34/0xd0
 el0t_64_sync_handler+0x10c/0x138
 el0t_64_sync+0x198/0x19c
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x1000,000000e0,f169a650,9b7ff667
Memory Limit: none
---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
(The BIOS date appears to be 45 years old, but that is the state the board was in when I received it.)
I also saw this crash with an RX6700. Older cards such as the Radeon HD5450 and the NVIDIA GT1030 work fine on that board.
A little testing showed that the regression was introduced between 6.12 and 6.13. It also seems that the change has already been picked up by some distro kernels: several ISO images I tried failed to boot before I came across one with kernel 6.8 that worked just fine.
The only change related to hdp_v5_0_flush_hdp() was cf424020e040 ("drm/amdgpu/hdp5.0: do a posting read when flushing HDP").
Reverting that commit did help and resolved the problem. Before sending the revert as-is, I wanted to know whether there is supposed to be a proper fix for this, or whether someone is interested in debugging it or has any suggestions.
In theory, I also still need to confirm that exactly that change introduced the regression.
Thanks, Alexey
On Wed Apr 16, 2025 at 12:44 PM BST, Christian König wrote:
On 15.04.25 at 20:28, Alexey Klimov wrote:
#regzbot introduced: v6.12..v6.13
I use RX6600 on arm64 Orion o6 board and it seems that amdgpu is broken on recent kernels, fails on boot:
Well, in general we have already had tons of problems with low-end ARM64 boards. So the first question is: is that board SBSA certified?
Yeah, I can imagine. I can't find any information about SBSA certification for that board, so I'd say its status is unknown and therefore most likely "no". At least that's what I think. It is a good question for Cix, or for people with cixtech.com email addresses.
They do have some updated, potentially unstable UEFI firmware builds to test, though.
If not then the chances of that board actually working correctly are very low unfortunately.
[drm] amdgpu: 7886M of GTT memory ready.
[drm] GART: num cpu pages 131072, num gpu pages 131072
SError Interrupt on CPU11, code 0x00000000be000011 -- SError
Any idea what that error code means?
Well, my current thinking is that it means:
-- bits 31:26: system error interrupt (SError);
-- bit 25: indicates that it was a 32-bit instruction;
-- 0x11 in the low bits: probably implementation-defined, which can be anything like bus errors, parity errors, access violations, etc.
That's probably not very helpful here.
Best regards, Alexey
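The field breakdown of the syndrome value 0x00000000be000011 given above can be checked mechanically. The following is only a sketch: the bit positions (EC in bits 31:26, IL in bit 25, and, for an SError with IDS clear, AET in bits 12:10, EA in bit 9 and DFSC in bits 5:0) are an assumption based on the Arm ESR_ELx layout as commonly documented, and the helper name decode_serror_esr is hypothetical, not something from the kernel or from this thread.

```python
# Sketch: split an arm64 ESR_ELx value into its top-level fields.
# Bit positions are assumed from the Arm ESR_ELx layout; verify them
# against the Arm ARM before relying on the result.

EC_SERROR = 0x2F  # exception class assumed to denote an SError interrupt


def decode_serror_esr(esr: int) -> dict:
    """Return the raw fields of a syndrome value (hypothetical helper)."""
    fields = {
        "EC": (esr >> 26) & 0x3F,   # exception class, bits 31:26
        "IL": (esr >> 25) & 0x1,    # 1 = 32-bit instruction length
        "ISS": esr & 0x1FFFFFF,     # instruction-specific syndrome, bits 24:0
    }
    fields["IDS"] = (fields["ISS"] >> 24) & 0x1  # 1 = implementation-defined ISS
    if fields["EC"] == EC_SERROR and fields["IDS"] == 0:
        # These sub-fields are only meaningful when IDS is clear.
        fields["AET"] = (fields["ISS"] >> 10) & 0x7   # asynchronous error type
        fields["EA"] = (fields["ISS"] >> 9) & 0x1     # external abort type
        fields["DFSC"] = fields["ISS"] & 0x3F         # fault status code
    return fields


if __name__ == "__main__":
    for name, value in decode_serror_esr(0xBE000011).items():
        print(f"{name}: {value:#x}")
```

If that layout is right, bit 24 (IDS) is clear for this value, which would suggest the low bits follow the architected encoding, with 0x11 in DFSC being the generic "asynchronous SError" code. Either way, the value on its own does not seem to identify which access triggered the error.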
On 25-04-22 03:49:17, Alexey Klimov wrote:
Yeah, I can imagine. I can't find any information about SBSA certification for that board, so I'd say its status is unknown and therefore most likely "no". At least that's what I think. It is a good question for Cix, or for people with cixtech.com email addresses.
Hi Alexey,
This board has just received the Arm SystemReady SR v2.5 certificate; see the attachment. Arm is in the process of updating its list, so you may not be able to find it on the website yet.
Peter