From: Martin Wilck <mwilck@suse.com>
We have observed a few crashes in run_timer_softirq(), where a broken timer_list struct belonging to an anatt_timer was encountered. The broken structures look like this, and we actually see multiple ones attached to the same timer base:
crash> struct timer_list 0xffff92471bcfdc90
struct timer_list {
  entry = {
    next = 0xdead000000000122,   // LIST_POISON2
    pprev = 0x0
  },
  expires = 4296022933,
  function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
  flags = 20
}
If such a timer is encountered in run_timer_softirq(), the kernel crashes. The test scenario was an I/O load test with lots of NVMe controllers, some of which were removed and re-added on the storage side.
I think this may happen if the rdma recovery_work starts, in this call chain:
nvme_rdma_error_recovery_work()
    /* this stops all sorts of activity for the controller, but not
       the multipath-related work queue and timer */
    nvme_rdma_reconnect_or_remove(ctrl)
        => kicks reconnect_work
work queue: reconnect_work
nvme_rdma_reconnect_ctrl_work()
  nvme_rdma_setup_ctrl()
    nvme_rdma_configure_admin_queue()
      nvme_init_identify()
        nvme_mpath_init()
          # this sets some fields of the timer_list without taking a lock
          timer_setup()
          nvme_read_ana_log()
            mod_timer() or del_timer_sync()
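For reference, the anatt_timer handling in that path is roughly the following (paraphrased from drivers/nvme/host/multipath.c, details elided, not a verbatim quote):

int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
{
	...
	/* re-initializes the timer even if it is still pending */
	timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
	...
	error = nvme_read_ana_log(ctrl);
	...
}

static int nvme_read_ana_log(struct nvme_ctrl *ctrl)
{
	...
	/* (re)arm or stop the ANATT timer depending on the log contents */
	if (nr_change_groups)
		mod_timer(&ctrl->anatt_timer, ctrl->anatt * HZ * 2 + jiffies);
	else
		del_timer_sync(&ctrl->anatt_timer);
	...
}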
Similar for TCP. The idea for the patch is based on the observation that nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(), whereas nvme_rdma_error_recovery_work() stops only the keepalive timer, but not the anatt timer.
I admit that the root cause analysis isn't rock solid yet. In particular, I can't explain why we see LIST_POISON2 in the "next" pointer, which would indicate that the timer has been detached before; yet we find it linked to the timer base when the crash occurs.
OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above) and nvme_mpath_stop(), so the hypothesis that modifying active timers may cause the issue isn't far-fetched. I suspect that the LIST_POISON2 state may come about in multiple steps.
If anyone has better ideas, please advise. The issue occurs very sporadically; verifying this by testing will be difficult.
Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Cc: stable@vger.kernel.org
---
Changes in v3: Changed the subject line, as suggested by Sagi Grimberg

Changes in v2: Moved call to nvme_mpath_stop() further down, directly before
    the call of nvme_rdma_reconnect_or_remove() (Chao Leng)
---
 drivers/nvme/host/multipath.c | 1 +
 drivers/nvme/host/rdma.c      | 1 +
 drivers/nvme/host/tcp.c       | 1 +
 3 files changed, 3 insertions(+)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac02..c63dd5dfa7ff 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -586,6 +586,7 @@ void nvme_mpath_stop(struct nvme_ctrl *ctrl)
 	del_timer_sync(&ctrl->anatt_timer);
 	cancel_work_sync(&ctrl->ana_work);
 }
+EXPORT_SYMBOL_GPL(nvme_mpath_stop);
 
 #define SUBSYS_ATTR_RW(_name, _mode, _show, _store)	\
 	struct device_attribute subsys_attr_##_name =	\
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index be905d4fdb47..fc07a7b0dc1d 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1202,6 +1202,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 		return;
 	}
 
+	nvme_mpath_stop(&ctrl->ctrl);
 	nvme_rdma_reconnect_or_remove(ctrl);
 }
 
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a0f00cb8f9f3..46287b4f4d10 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2068,6 +2068,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
 		return;
 	}
 
+	nvme_mpath_stop(ctrl);
 	nvme_tcp_reconnect_or_remove(ctrl);
 }
On 4/27/21 10:52 AM, mwilck@suse.com wrote:
> From: Martin Wilck <mwilck@suse.com>
>
> We have observed a few crashes in run_timer_softirq(), where a broken timer_list struct belonging to an anatt_timer was encountered. The broken structures look like this, and we actually see multiple ones attached to the same timer base:
>
> crash> struct timer_list 0xffff92471bcfdc90
> struct timer_list {
>   entry = {
>     next = 0xdead000000000122,   // LIST_POISON2
>     pprev = 0x0
>   },
>   expires = 4296022933,
>   function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>   flags = 20
> }
>
> If such a timer is encountered in run_timer_softirq(), the kernel crashes. The test scenario was an I/O load test with lots of NVMe controllers, some of which were removed and re-added on the storage side.
>
> I think this may happen if the rdma recovery_work starts, in this call chain:
>
> nvme_rdma_error_recovery_work()
>     /* this stops all sorts of activity for the controller, but not
>        the multipath-related work queue and timer */
>     nvme_rdma_reconnect_or_remove(ctrl)
>         => kicks reconnect_work
>
> work queue: reconnect_work
>
> nvme_rdma_reconnect_ctrl_work()
>   nvme_rdma_setup_ctrl()
>     nvme_rdma_configure_admin_queue()
>       nvme_init_identify()
>         nvme_mpath_init()
>           # this sets some fields of the timer_list without taking a lock
>           timer_setup()
>           nvme_read_ana_log()
>             mod_timer() or del_timer_sync()
>
> Similar for TCP. The idea for the patch is based on the observation that nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(), whereas nvme_rdma_error_recovery_work() stops only the keepalive timer, but not the anatt timer.
>
> I admit that the root cause analysis isn't rock solid yet. In particular, I can't explain why we see LIST_POISON2 in the "next" pointer, which would indicate that the timer has been detached before; yet we find it linked to the timer base when the crash occurs.
>
> OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above) and nvme_mpath_stop(), so the hypothesis that modifying active timers may cause the issue isn't far-fetched. I suspect that the LIST_POISON2 state may come about in multiple steps.
>
> If anyone has better ideas, please advise. The issue occurs very sporadically; verifying this by testing will be difficult.
>
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> Reviewed-by: Chao Leng <lengchao@huawei.com>
> Cc: stable@vger.kernel.org
As indicated in my previous mail, please change the description. We have since established the actual reason (duplicate calls to add_timer()), so please list it here.
Cheers,
Hannes
On Tue, Apr 27, 2021 at 11:33:04AM +0200, Hannes Reinecke wrote:
> As indicated in my previous mail, please change the description. We have since established the actual reason (duplicate calls to add_timer()), so please list it here.
So what happens if the offending add_timer is changed to mod_timer?
On 4/27/21 6:25 PM, Christoph Hellwig wrote:
> On Tue, Apr 27, 2021 at 11:33:04AM +0200, Hannes Reinecke wrote:
>> As indicated in my previous mail, please change the description. We have since established the actual reason (duplicate calls to add_timer()), so please list it here.
>
> So what happens if the offending add_timer is changed to mod_timer?
I guess that should be fine, as the boilerplate said it can act as a safe version of add_timer.
But that would just solve the crash upon add_timer(). We still have the problem that the anatt timer might trigger just _after_ eg nvme_tcp_teardown_admin_queue(), causing it to hit an invalid admin queue.
Cheers,
Hannes
On Tue, Apr 27, 2021 at 08:05:41PM +0200, Hannes Reinecke wrote:
> On 4/27/21 6:25 PM, Christoph Hellwig wrote:
>> On Tue, Apr 27, 2021 at 11:33:04AM +0200, Hannes Reinecke wrote:
>>> As indicated in my previous mail, please change the description. We have since established the actual reason (duplicate calls to add_timer()), so please list it here.
>>
>> So what happens if the offending add_timer is changed to mod_timer?
>
> I guess that should be fine, as the boilerplate said it can act as a safe version of add_timer.
>
> But that would just solve the crash upon add_timer(). We still have the problem that the anatt timer might trigger just _after_ eg nvme_tcp_teardown_admin_queue(), causing it to hit an invalid admin queue.
Yeah. nvme_mpath_init isn't really suitable for being called on an already initialized controller for a few reasons. I have an idea how to fix it, let me give it a spin.
On Tue, 2021-04-27 at 20:05 +0200, Hannes Reinecke wrote:
> On 4/27/21 6:25 PM, Christoph Hellwig wrote:
>> On Tue, Apr 27, 2021 at 11:33:04AM +0200, Hannes Reinecke wrote:
>>> As indicated in my previous mail, please change the description. We have since established the actual reason (duplicate calls to add_timer()), so please list it here.
>>
>> So what happens if the offending add_timer is changed to mod_timer?
>
> I guess that should be fine, as the boilerplate said it can act as a safe version of add_timer.
>
> But that would just solve the crash upon add_timer().
The code doesn't use add_timer(), only mod_timer() and del_timer_sync(). And we didn't observe a crash upon add_timer(). What we observed was that a timer had been enqueued multiple times, and the kernel crashes in expire_timers()->detach_timer(), when it encounters an already detached entry in the timer list.
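For reference, detach_timer() looks roughly like this (paraphrased from kernel/time/timer.c and include/linux/list.h, not verbatim); the fault occurs when __hlist_del() runs on an entry whose next pointer was already set to LIST_POISON2 by an earlier detach:

static inline void detach_timer(struct timer_list *timer, bool clear_pending)
{
	struct hlist_node *entry = &timer->entry;
	...
	__hlist_del(entry);
	if (clear_pending)
		entry->pprev = NULL;
	entry->next = LIST_POISON2;          /* the value seen in the dump */
}

static inline void __hlist_del(struct hlist_node *n)
{
	struct hlist_node *next = n->next;
	struct hlist_node **pprev = n->pprev;

	WRITE_ONCE(*pprev, next);
	if (next)
		next->pprev = pprev;         /* faults if next == LIST_POISON2 */
}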
Regards, Martin
On 4/27/21 9:54 PM, Martin Wilck wrote:
> On Tue, 2021-04-27 at 20:05 +0200, Hannes Reinecke wrote:
>> On 4/27/21 6:25 PM, Christoph Hellwig wrote:
>>> On Tue, Apr 27, 2021 at 11:33:04AM +0200, Hannes Reinecke wrote:
>>>> As indicated in my previous mail, please change the description. We have since established the actual reason (duplicate calls to add_timer()), so please list it here.
>>>
>>> So what happens if the offending add_timer is changed to mod_timer?
>>
>> I guess that should be fine, as the boilerplate said it can act as a safe version of add_timer.
>>
>> But that would just solve the crash upon add_timer().
>
> The code doesn't use add_timer(), only mod_timer() and del_timer_sync(). And we didn't observe a crash upon add_timer(). What we observed was that a timer had been enqueued multiple times, and the kernel crashes in expire_timers()->detach_timer(), when it encounters an already detached entry in the timer list.
nvme_mpath_init() doesn't use add_timer, but it uses timer_setup(). And calling that on an already pending timer is even worse :-)
And my point is that the anatt timer is not stopped at the end of nvme_init_identify() if any of the calls to
  nvme_configure_apst()
  nvme_configure_timestamp()
  nvme_configure_directives()
  nvme_configure_acre()

returns with an error. If any of them does, the controller is reset, causing e.g. nvme_tcp_configure_admin_queue() to be called, which will then call timer_setup() with the original timer still running. If the (original) timer triggers _after_ that point, we get the crash.
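A rough sketch of the tail of nvme_init_identify() as described above (order and error handling reconstructed from this description, not copied verbatim from the source):

	ret = nvme_mpath_init(ctrl, id);	/* may arm anatt_timer via the ANA log read */
	if (ret < 0)
		return ret;
	...
	ret = nvme_configure_apst(ctrl);
	if (ret < 0)
		return ret;		/* anatt_timer is left running */

	ret = nvme_configure_timestamp(ctrl);
	if (ret < 0)
		return ret;		/* ditto */

	ret = nvme_configure_directives(ctrl);
	if (ret < 0)
		return ret;

	ret = nvme_configure_acre(ctrl);
	if (ret < 0)
		return ret;
	...
	return 0;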
Cheers,
Hannes
On Wed, 2021-04-28 at 08:35 +0200, Hannes Reinecke wrote:
> On 4/27/21 9:54 PM, Martin Wilck wrote:
>> On Tue, 2021-04-27 at 20:05 +0200, Hannes Reinecke wrote:
>>> On 4/27/21 6:25 PM, Christoph Hellwig wrote:
>>>> On Tue, Apr 27, 2021 at 11:33:04AM +0200, Hannes Reinecke wrote:
>>>>> As indicated in my previous mail, please change the description. We have since established the actual reason (duplicate calls to add_timer()), so please list it here.
>>>>
>>>> So what happens if the offending add_timer is changed to mod_timer?
>>>
>>> I guess that should be fine, as the boilerplate said it can act as a safe version of add_timer.
>>>
>>> But that would just solve the crash upon add_timer().
>>
>> The code doesn't use add_timer(), only mod_timer() and del_timer_sync(). And we didn't observe a crash upon add_timer(). What we observed was that a timer had been enqueued multiple times, and the kernel crashes in expire_timers()->detach_timer(), when it encounters an already detached entry in the timer list.
>
> nvme_mpath_init() doesn't use add_timer, but it uses timer_setup().
Yes. The notion that this is wrong was the idea behind my patch.
> And calling that on an already pending timer is even worse :-)
>
> And my point is that the anatt timer is not stopped at the end of nvme_init_identify() if any of the calls to
>
>   nvme_configure_apst()
>   nvme_configure_timestamp()
>   nvme_configure_directives()
>   nvme_configure_acre()
>
> returns with an error. If any of them does, the controller is reset, causing e.g. nvme_tcp_configure_admin_queue() to be called, which will then call timer_setup() with the original timer still running. If the (original) timer triggers _after_ that point, we get the crash.
You are right. But afaics the problem doesn't have to originate in these 4 function calls. The same applies even after nvme_init_identify() has returned. Any error that would trigger the error recovery work after the anatt timer has started would have this effect.
Regards, Martin
On Tue, Apr 27, 2021 at 22:02, Martin Wilck <mwilck@suse.com> wrote:
> The code doesn't use add_timer(), only mod_timer() and del_timer_sync(). And we didn't observe a crash upon add_timer(). What we observed was that a timer had been enqueued multiple times, and the kernel crashes in expire_timers()->detach_timer(), when it encounters an already detached entry in the timer list.
How can a timer end up enqueued multiple times? It's safe to call mod_timer() against both active and inactive timers, and mod_timer() is thread-safe also.

IMO the problem is due to the fact that timer_setup() could end up being called against an active, pending timer. timer_setup() doesn't take any lock and modifies the pprev pointer and the timer's flags.
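For reference, timer_setup() ends up in do_init_timer() (paraphrased from kernel/time/timer.c, lockdep arguments elided), which indeed modifies the entry and the flags without taking any timer base lock:

static void do_init_timer(struct timer_list *timer,
                          void (*func)(struct timer_list *),
                          unsigned int flags, ...)
{
	timer->entry.pprev = NULL;     /* timer_pending() returns false from now on */
	timer->function = func;
	timer->flags = flags | raw_smp_processor_id();   /* resets the base/CPU bits */
	...
}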
Maurizio
On Wed, 2021-04-28 at 08:39 +0200, Maurizio Lombardi wrote:
> On Tue, Apr 27, 2021 at 22:02, Martin Wilck <mwilck@suse.com> wrote:
>> The code doesn't use add_timer(), only mod_timer() and del_timer_sync(). And we didn't observe a crash upon add_timer(). What we observed was that a timer had been enqueued multiple times, and the kernel crashes in expire_timers()->detach_timer(), when it encounters an already detached entry in the timer list.
>
> How can a timer end up enqueued multiple times?
I observed in a dump that this can happen. More precisely, I observed the following:
[ 4389.978732] nvme nvme36: Successfully reconnected (1 attempt)
...
[ 4441.445655] nvme nvme36: Successfully reconnected (1 attempt)
...
[ 4510.891400] nvme nvme36: ANATT timeout, resetting controller.
...
[ 4517.035411] general protection fault: 0000 [#1] SMP PTI
(kernel crash)
In the crash dump, I could see an anatt_timer belonging to nvme36 pending in the timer list, with a remaining expiry time of 44s, suggesting it had been created around time 4441.4s. That means that at the time the ANATT timeout was seen (which corresponds to a timer added around 4389.9s), the timer must have been pending twice. (*)
> It's safe to call mod_timer() against both active and inactive timers, and mod_timer() is thread-safe also.
>
> IMO the problem is due to the fact that timer_setup() could end up being called against an active, pending timer. timer_setup() doesn't take any lock and modifies the pprev pointer and the timer's flags.
Yes, that's what I think has happened. timer_setup() doesn't clear any pointers in the list of pending timers pointing to this entry. If the newly-initialized timer is then added with mod_timer(), it becomes linked in a second timer list. When the first one expires, the timer will be detached, but only from one of the lists it's pending in. In a scenario like the one we faced, this could actually happen multiple times. If the detached timer remains linked into a timer list, once that list is traversed, the kernel dereferences a pointer with value LIST_POISON2, and crashes.
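Spelled out as a hypothetical interleaving (function names as in the call chain earlier in the thread; the effects in the comments are my interpretation):

/* reconnect #1 */
timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
nvme_read_ana_log(ctrl);        /* mod_timer(): timer linked into a bucket */

/* error recovery + reconnect #2 while anatt_timer is still pending: */
timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
                                /* entry.pprev = NULL, so timer_pending() is
                                   now false, but the old bucket still points
                                   at this entry */
nvme_read_ana_log(ctrl);        /* mod_timer(): timer linked a second time */

/* First expiry: detach_timer() unlinks the entry once and sets
   entry.next = LIST_POISON2; the other link to the entry survives. */
/* Later traversal: expire_timers() reaches the entry again through the
   surviving link, and __hlist_del() dereferences LIST_POISON2 -> GPF. */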
Anybody: correct me if this is nonsense.
Best, Martin
(*) Note that the crash had been caused by another wrongly linked anatt timer, not this one.
On Wed, Apr 28, 2021 at 9:06, Martin Wilck <mwilck@suse.com> wrote:
> Yes, that's what I think has happened. timer_setup() doesn't clear any pointers in the list of pending timers pointing to this entry. If the newly-initialized timer is then added with mod_timer(), it becomes linked in a second timer list. When the first one expires, the timer will be detached, but only from one of the lists it's pending in. In a scenario like the one we faced, this could actually happen multiple times. If the detached timer remains linked into a timer list, once that list is traversed, the kernel dereferences a pointer with value LIST_POISON2, and crashes.
Yes, I think it makes sense. timer_setup() modifies the timer's base in the "flags" field, then mod_timer() could add the timer to the wrong base structure.
Maurizio