On Wed, 2021-04-28 at 08:35 +0200, Hannes Reinecke wrote:
On 4/27/21 9:54 PM, Martin Wilck wrote:
On Tue, 2021-04-27 at 20:05 +0200, Hannes Reinecke wrote:
On 4/27/21 6:25 PM, Christoph Hellwig wrote:
On Tue, Apr 27, 2021 at 11:33:04AM +0200, Hannes Reinecke wrote:
As indicated in my previous mail, please change the description. We have since established a actual reason (duplicate calls to add_timer()), so please list it here.
So what happens if the offending add_timer is changed to mod_timer?
I guess that should be fine, as the boilerplate said it can act as a safe version of add_timer.
But that would just solve the crash upon add_timer().
The code doesn't use add_timer(), only mod_timer() and del_timer_sync(). And we didn't observe a crash upon add_timer(). What we observed was that a timer had been enqueued multiple times, and the kernel crashes in expire_timers()->detach_timer(), when it encounters an already detached entry in the timer list.
nvme_mpath_init() doesn't use add_timer, but it uses timer_setup().
Yes. The notion that this is wrong was the idea behind my patch.
And calling that on an already pending timer is even worse :-)
And my point is that the anatt timer is not stopped at the end of nvme_init_identify() if any of the calls to
nvme_configure_apst() nvme_configure_timestamp() nvme_configure_directives() nvme_configure_acre()
returns with an error. If they do the controller is reset, causing eg nvme_tcp_configure_admin_queue() to be called, which will be calling timer_setup() with the original timer still running. If the (original) timer triggers _after_ that time we have the crash.
You are right. But afaics the problem doesn't have to originate in these 4 function calls. The same applies even after nvme_init_identify() has returned. Any error that would trigger the error recovery work after the anatt timer has started would have this effect.
Regards, Martin