On Tue, May 12, 2020 at 11:52:34AM +0000, Wan, Kaike wrote:
-----Original Message-----
From: linux-rdma-owner@vger.kernel.org <linux-rdma-owner@vger.kernel.org> On Behalf Of Leon Romanovsky
Sent: Tuesday, May 12, 2020 1:55 AM
To: Dalessandro, Dennis <dennis.dalessandro@intel.com>
Cc: jgg@ziepe.ca; dledford@redhat.com; linux-rdma@vger.kernel.org; Marciniszyn, Mike <mike.marciniszyn@intel.com>; stable@vger.kernel.org; Wan, Kaike <kaike.wan@intel.com>
Subject: Re: [PATCH for-rc or next 1/3] IB/hfi1: Do not destroy hfi1_wq when the device is shut down
On Mon, May 11, 2020 at 11:13:15PM -0400, Dennis Dalessandro wrote:
From: Kaike Wan <kaike.wan@intel.com>
The workqueue hfi1_wq is destroyed in function shutdown_device(), which is called by either shutdown_one() or remove_one(). The function shutdown_one() is called when the kernel is rebooted while remove_one() is called when the hfi1 driver is unloaded. When the kernel is rebooted, hfi1_wq is destroyed while all qps are still active, leading to a kernel crash:
I was under the impression that a kernel reboot should follow the same logic as module removal. This is what a graceful reboot will do anyway. Can you please give me a link where I can read about the difference between those flows?
I used to think the same. However, by adding traces to the hfi1 driver, I found out that the shutdown function of the pci_driver is called when typing "reboot", while the remove function of the pci_driver is called when typing "modprobe -r hfi1".
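To make the two paths concrete, here is a minimal userspace sketch of that dispatch. It is only a model, not kernel code: struct model_pci_driver, model_dispatch(), the enum names, and the cb_* callbacks are all hypothetical stand-ins for the .remove and .shutdown members of struct pci_driver and for hfi1's remove_one() and shutdown_one().

```c
#include <assert.h>
#include <stddef.h>

/* What triggered the teardown (hypothetical model, not a kernel type). */
enum model_event { MODEL_REBOOT, MODEL_MODULE_UNLOAD };

/* Which driver callback ended up running. */
enum model_callback { CB_REMOVE_ONE, CB_SHUTDOWN_ONE };

/* Userspace stand-in for the two struct pci_driver callbacks in question. */
struct model_pci_driver {
    const char *name;
    enum model_callback (*remove)(void);   /* models .remove: driver unload   */
    enum model_callback (*shutdown)(void); /* models .shutdown: system reboot */
};

/* Stand-ins for hfi1's remove_one() and shutdown_one(). */
enum model_callback cb_remove_one(void)   { return CB_REMOVE_ONE; }
enum model_callback cb_shutdown_one(void) { return CB_SHUTDOWN_ONE; }

const struct model_pci_driver model_hfi1 = {
    .name     = "hfi1",
    .remove   = cb_remove_one,
    .shutdown = cb_shutdown_one,
};

/* Pick the callback the way the traces suggest: reboot goes to shutdown,
 * module unload goes to remove. Returns which callback ran. */
enum model_callback model_dispatch(const struct model_pci_driver *drv,
                                   enum model_event ev)
{
    assert(drv != NULL);
    return ev == MODEL_REBOOT ? drv->shutdown() : drv->remove();
}
```

In this model, model_dispatch(&model_hfi1, MODEL_REBOOT) runs cb_shutdown_one(), and MODEL_MODULE_UNLOAD runs cb_remove_one(), so the two events never share a teardown path unless the driver makes both callbacks do the same work.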
I took a look at what mlx5_core is doing in its shutdown flow, and it can be summarized as follows:
1. Drain workqueues
2. Close PCI
3. Don't release anything.
So maybe you didn't flush the hfi1_wq?
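As a rough illustration of "drain, but don't release", here is a small userspace model of the difference between flushing and destroying a workqueue in the shutdown path. It is a sketch under stated assumptions, not the mlx5_core or hfi1 code: struct model_wq, struct model_dev, and every model_* function are hypothetical, and a failed model_queue_work() stands in for what would be a crash in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a driver workqueue. */
struct model_wq {
    int  pending;   /* queued work items that have not run yet */
    bool destroyed; /* true once the queue has been torn down  */
};

/* Hypothetical device state: one workqueue plus the PCI enable flag. */
struct model_dev {
    struct model_wq wq;
    bool pci_enabled;
};

/* Drain: run everything that is queued; the queue stays usable. */
void model_flush_wq(struct model_wq *wq)
{
    assert(wq != NULL && !wq->destroyed);
    wq->pending = 0;
}

/* Destroy: drain, then release the queue for good. */
void model_destroy_wq(struct model_wq *wq)
{
    model_flush_wq(wq);
    wq->destroyed = true;
}

/* Queueing on a destroyed queue fails here; in the kernel this would be
 * a use of freed memory rather than a clean error. */
bool model_queue_work(struct model_wq *wq)
{
    if (wq->destroyed)
        return false;
    wq->pending++;
    return true;
}

/* Reboot path: drain the workqueue, close PCI, release nothing. */
void model_shutdown_one(struct model_dev *dev)
{
    model_flush_wq(&dev->wq);
    dev->pci_enabled = false;
}

/* Unload path: nothing will queue work afterwards, so destroy the queue. */
void model_remove_one(struct model_dev *dev)
{
    model_destroy_wq(&dev->wq);
    dev->pci_enabled = false;
}
```

With this split, anything still active after model_shutdown_one() can keep queueing work without touching a released queue, while model_remove_one() is the only place the queue is actually freed.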
I am not an expert on kernel reboot. Can someone give some hints?
Kaike