On 05.09.23 16:35, Keith Busch wrote:
On Tue, Sep 05, 2023 at 01:37:36PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
On 04.09.23 13:07, Bagas Sanjaya wrote:
I notice a regression report on Bugzilla [1]. Quoting from it:
I bought a new 4 TB Lexar NM790 and I was using kernel 6.3.13 at the time. It wasn't recognized, with these messages in dmesg:
[ 358.950147] nvme nvme0: pci function 0000:06:00.0
[ 358.958327] nvme nvme0: Device not ready; aborting initialisation, CSTS=0x0
My other NVMe appears correctly in the nvme list though.
So I tried using other kernels I had installed at the time: 6.3.7, 6.4.10, 6.5.0rc6, 6.5.0, 6.5.1 and none of these recognized the disk. I installed the 6.1.50 lts kernel from arch repositories (I can compile my own too if this would be an issue) and then the device was correctly recognized:
[ 4.654613] nvme 0000:06:00.0: platform quirk: setting simple suspend
[ 4.654632] nvme nvme0: pci function 0000:06:00.0
[ 4.667290] nvme nvme0: allocated 40 MiB host memory buffer.
[ 4.709473] nvme nvme0: 16/0/0 default/read/poll queues
FWIW, the quoted mail missed one crucial detail:
"""
Claudio Sampaio 2023-09-02 19:04:29 UTC

Adding the two lines

	{ PCI_DEVICE(0x1d97, 0x1602), /* Lexar NM790 */
		.driver_data = NVME_QUIRK_BOGUS_NID, },

in file drivers/nvme/host/pci.c made my NVMe work correctly. Compiled a new 6.5.1 kernel and everything works.
"""
@NVME maintainers: is there anything more you need from Claudio at this point?
Yes: it doesn't really make any sense. The report says the device stopped showing up with the message:
nvme nvme0: Device not ready; aborting initialisation, CSTS=0x0
(a) That happens long before the mentioned quirk is considered by the driver, and (b) the "quirk" behavior is now the default in 6.5 and several of the listed stable kernels anyway.
It sounds more likely that the device is flaky and either never becomes ready due to some unspecified internal firmware condition, or inaccurately reports how long it actually needs to become ready in the worst-case scenario.
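To make the second possibility concrete: per the NVMe base specification, a controller advertises its worst-case time to become ready in the CAP.TO field (bits 31:24 of the Controller Capabilities register, in units of 500 ms), and signals readiness through the RDY bit (bit 0) of CSTS. The following is only a minimal standalone sketch of that arithmetic. The helper names are made up for illustration and this is not the driver's actual code, which may round or bound the wait differently.

```c
#include <stdint.h>

/*
 * CAP.TO: bits 31:24 of the 64-bit Controller Capabilities register.
 * Worst-case time for the controller to become ready, in 500 ms units.
 */
static inline uint32_t cap_timeout_units(uint64_t cap)
{
	return (uint32_t)((cap >> 24) & 0xff);
}

/*
 * Hypothetical helper: the wait, in milliseconds, that a host trusting
 * CAP.TO would allow before giving up on the controller.
 */
static inline uint32_t ready_wait_ms(uint64_t cap)
{
	return cap_timeout_units(cap) * 500u;
}

/* CSTS.RDY is bit 0 of the Controller Status register. */
static inline int csts_ready(uint32_t csts)
{
	return (csts & 0x1u) != 0;
}
```

With CSTS=0x0 as in the reported message, csts_ready() returns 0: the RDY bit was still clear when the wait expired. If the device advertises a shorter time than it really needs, a host relying on that value gives up too early.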
Thx, I kinda suspected something like that, but I kept my mouth shut, as I feared comments from the cheap seats might be more harmful than helpful.
But what can Claudio do to find the root cause? Check hardware (especially the connectors), update firmware, ...? And if that doesn't lead to anything, bisect the issue?
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.