On Thu, May 08, 2025 at 12:47:18PM +0900, Sergey Senozhatsky wrote:
On (25/03/28 00:03), Sergey Senozhatsky wrote:
Some of our devices crash in tb_cfg_request_dequeue():
general protection fault, probably for non-canonical address 0xdead000000000122
CPU: 6 PID: 91007 Comm: kworker/6:2 Tainted: G U W 6.6.65 RIP: 0010:tb_cfg_request_dequeue+0x2d/0xa0 Call Trace:
<TASK> ? tb_cfg_request_dequeue+0x2d/0xa0 tb_cfg_request_work+0x33/0x80 worker_thread+0x386/0x8f0 kthread+0xed/0x110 ret_from_fork+0x38/0x50 ret_from_fork_asm+0x1b/0x30
The circumstances are unclear, however, the theory is that tb_cfg_request_work() can be scheduled twice for a request: first time via frame.callback from ring_work() and second time from tb_cfg_request(). Both times kworkers will execute tb_cfg_request_dequeue(), which results in double list_del() from the ctl->request_queue (the list poison deference hints at it: 0xdead000000000122).
Do not dequeue requests that don't have TB_CFG_REQUEST_ACTIVE bit set.
Mika, as was discussed in [1] thread we rolled out the fix to our fleet and we don't see the crashes anymore. So it's tested and verified.
Cool, thanks! Applied to thunderbolt.git/fixes.