On 4/27/21 10:52 AM, mwilck@suse.com wrote:
From: Martin Wilck <mwilck@suse.com>
We have observed a few crashes in run_timer_softirq(), where a broken timer_list struct belonging to an anatt_timer was encountered. The broken structures look like this, and we actually see multiple ones attached to the same timer base:
crash> struct timer_list 0xffff92471bcfdc90
struct timer_list {
  entry = {
    next = 0xdead000000000122,  // LIST_POISON2
    pprev = 0x0
  },
  expires = 4296022933,
  function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
  flags = 20
}
If such a timer is encountered in run_timer_softirq(), the kernel crashes. The test scenario was an I/O load test with lots of NVMe controllers, some of which were removed and re-added on the storage side.
I think this may happen if the rdma recovery_work starts, in this call chain:
nvme_rdma_error_recovery_work()
  /* this stops all sorts of activity for the controller, but not
     the multipath-related work queue and timer */
  nvme_rdma_reconnect_or_remove(ctrl)
    => kicks reconnect_work

work queue: reconnect_work

nvme_rdma_reconnect_ctrl_work()
  nvme_rdma_setup_ctrl()
    nvme_rdma_configure_admin_queue()
      nvme_init_identify()
        nvme_mpath_init()
          # this sets some fields of the timer_list without taking a lock
          timer_setup()
          nvme_read_ana_log()
            mod_timer() or del_timer_sync()
Similar for TCP. The idea for the patch is based on the observation that nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(), whereas nvme_rdma_error_recovery_work() stops only the keepalive timer, but not the anatt timer.
I admit that the root cause analysis isn't rock solid yet. In particular, I can't explain why we see LIST_POISON2 in the "next" pointer, which would indicate that the timer has been detached before; yet we find it linked to the timer base when the crash occurs.
OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above) and nvme_mpath_stop(), so the hypothesis that modifying active timers may cause the issue isn't entirely implausible. I suspect that the LIST_POISON2 may come to pass in multiple steps.
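The multi-step hypothesis can be modeled in userspace. The following is a minimal sketch, not kernel code: the hnode/hhead types and the model_*() helpers are hypothetical stand-ins for the kernel's hlist and for timer_setup(), the enqueue path of mod_timer()/add_timer(), and detach_timer(). It assumes that re-initializing a timer clears only pprev without unlinking the entry. It shows one sequence that ends with a node still reachable from a bucket while carrying the poisoned "next" and NULL "pprev" seen in the dump. It does not establish that this is what happened in the reported crashes.

```c
#include <stddef.h>
#include <stdint.h>

/* Value seen in the crash dump (LIST_POISON2 on that x86_64 kernel). */
#define POISON2 ((struct hnode *)(uintptr_t)0xdead000000000122ULL)

/* Hypothetical userspace stand-ins for hlist_node / hlist_head. */
struct hnode { struct hnode *next, **pprev; };
struct hhead { struct hnode *first; };

struct result { int still_linked; int looks_detached; };

/* Models timer_pending(): queued if and only if pprev is set. */
static int model_pending(const struct hnode *n)
{
    return n->pprev != NULL;
}

/* Models re-initialization (assumption): clears pprev, does not unlink. */
static void model_setup(struct hnode *n)
{
    n->pprev = NULL;
}

/* Models enqueueing at the head of a timer bucket. */
static void model_enqueue(struct hnode *n, struct hhead *h)
{
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Models detaching: unlink through pprev, then poison the entry. */
static void model_detach(struct hnode *n)
{
    *n->pprev = n->next;
    if (n->next)
        n->next->pprev = n->pprev;
    n->pprev = NULL;
    n->next = POISON2;
}

struct result run_scenario(void)
{
    struct hhead bucket_a = { NULL }, bucket_b = { NULL };
    struct hnode timer = { NULL, NULL };
    struct result r;

    model_enqueue(&timer, &bucket_a);     /* 1: timer is armed */
    model_setup(&timer);                  /* 2: re-initialized while armed */
    if (!model_pending(&timer))           /* 3: looks idle, so it is */
        model_enqueue(&timer, &bucket_b); /*    queued a second time */
    model_detach(&timer);                 /* 4: removed from bucket_b only */

    /* bucket_a was never updated, so it still points at the node. */
    r.still_linked = (bucket_a.first == &timer);
    r.looks_detached = (timer.next == POISON2 && timer.pprev == NULL);
    return r;
}
```

In this model run_scenario() reports both conditions as true: the node looks detached, yet the first bucket still references it. Walking that bucket afterwards would dereference the poisoned pointer.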
If anyone has better ideas, please advise. The issue occurs very sporadically; verifying this by testing will be difficult.
Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Cc: stable@vger.kernel.org
As indicated in my previous mail, please change the description. We have since established an actual reason (duplicate calls to add_timer()), so please list it here.
Cheers,
Hannes