Hi,
On Thu, Mar 27, 2025 at 11:02:04PM +0900, Sergey Senozhatsky wrote:
Hi,
On (25/03/27 15:37), Mika Westerberg wrote:
Another possibility can be tb_cfg_request_sync():
tb_cfg_request_sync() tb_cfg_request() schedule_work(&req->work) -> tb_cfg_request_dequeue() tb_cfg_request_cancel() schedule_work(&req->work) -> tb_cfg_request_dequeue()
Not sure about this one because &req->work will only be scheduled once the second schedule_work() should not queue it again (as far as I can tell).
If the second schedule_work() happens after a timeout, that's what !wait_for_completion_timeout() does, then the first schedule_work() can already execute the work by that time, and then we can schedule the work again (but the request is already dequeued). Am I missing something?
schedule_work() does not schedule the work again if it is already scheduled.
To address the issue, do not dequeue requests that don't have TB_CFG_REQUEST_ACTIVE bit set.
Just to be sure. After this change you have not seen the issue anymore with your testing?
Haven't tried it yet.
We just found it today, it usually takes several weeks before we can roll out the fix to our fleet and we prefer patches from upstream/subsystem git, so that's why we reach out to the upstream.
Makes sense.
The 0xdead000000000122 deference is a LIST_POISON on x86_64, which is set explicitly in list_del(), so I'd say I'm fairly confident that we have a double list_del() in tb_cfg_request_dequeue().
Yes, I agree but since I have not seen any similar reports (sans what I saw ages ago), I would like to be sure the issue you see is actually fixed with the patch (and that there are no unexpected side-effects). ;-)