Hi,
I notice a regression report on Bugzilla [1]. Quoting from it:
I bought a new 4 TB Lexar NM790 and I was using kernel 6.3.13 at the time. It wasn't recognized, with these messages in dmesg:
[ 358.950147] nvme nvme0: pci function 0000:06:00.0
[ 358.958327] nvme nvme0: Device not ready; aborting initialisation, CSTS=0x0
My other NVMe appears correctly in the nvme list though.
So I tried using other kernels I had installed at the time: 6.3.7, 6.4.10, 6.5.0rc6, 6.5.0, 6.5.1 and none of these recognized the disk. I installed the 6.1.50 lts kernel from arch repositories (I can compile my own too if this would be an issue) and then the device was correctly recognized:
[ 4.654613] nvme 0000:06:00.0: platform quirk: setting simple suspend
[ 4.654632] nvme nvme0: pci function 0000:06:00.0
[ 4.667290] nvme nvme0: allocated 40 MiB host memory buffer.
[ 4.709473] nvme nvme0: 16/0/0 default/read/poll queues
And then it appears alongside the other nvme:
[15:58] [6836] [patola@risadinha patola]% sudo nvme list
Node          Generic     SN                  Model                Namespace  Usage              Format       FW Rev
/dev/nvme1n1  /dev/ng1n1  2K36292CEKD9        XPG GAMMIX S11 Pro   0x1        1.39 TB / 2.05 TB  512 B + 0 B  42B4S9NA
/dev/nvme0n1  /dev/ng0n1  NF9755R000057P2202  Lexar SSD NM790 4TB  0x1        4.10 TB / 4.10 TB  512 B + 0 B  12237
And I was able to read and write from it, pvcreate and so on, so it's working. But I can't use a higher kernel version so apparently this is a regression.
There are other people with the same NVMe model (although different capacities) reporting the same issue on this reddit thread: https://www.reddit.com/r/archlinux/comments/15xbxeo/nvme_device_not_ready_ab...
I am not sure, but I think this issue might have been introduced after this patch: https://bugzilla.kernel.org/show_bug.cgi?id=215742
See Bugzilla for the full thread and proposed quirk fix.
Anyway, I'm adding this regression to be tracked by regzbot:
#regzbot introduced: v6.1.50..v6.3.13 https://bugzilla.kernel.org/show_bug.cgi?id=217863
#regzbot link: https://www.reddit.com/r/archlinux/comments/15xbxeo/nvme_device_not_ready_ab...
Thanks.
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217863
On 04.09.23 13:07, Bagas Sanjaya wrote:
I notice a regression report on Bugzilla [1]. Quoting from it:
FWIW, the quoted mail missed one crucial detail:
"""
Claudio Sampaio 2023-09-02 19:04:29 UTC
Adding the two lines
	{ PCI_DEVICE(0x1d97, 0x1602),   /* Lexar NM790 */
		.driver_data = NVME_QUIRK_BOGUS_NID, },
in file drivers/nvme/host/pci.c made my NVMe work correctly. Compiled a new 6.5.1 kernel and everything works.
"""
@NVME maintainers: is there anything more you need from Claudio at this point?
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
On Tue, Sep 05, 2023 at 01:37:36PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
@NVME maintainers: is there anything more you need from Claudio at this point?
Yes: it doesn't really make any sense. The report says the device stopped showing up with message:
nvme nvme0: Device not ready; aborting initialisation, CSTS=0x0
That (a) happens long before the mentioned quirk is considered by the driver, and (b) the "quirk" behavior is now the default in 6.5 and several of the listed stable kernels anyway.
It more likely sounds like the device is flaky and either never becomes ready due to some unspecified internal firmware condition, or inaccurately reports how long it actually needs to become ready in worst-case-scenario.
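To illustrate the second possibility, here is a minimal sketch in C of a readiness poll against a simulated controller. It is not the driver's code: `sim_ctrl`, `sim_read_csts` and `wait_ready` are hypothetical names, and all timings are made up. It only shows that a device which needs longer than the timeout the host derived for it ends the wait with CSTS still reading 0x0, which matches the reported message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CSTS_RDY 0x1u	/* bit 0 of the Controller Status register */

/* Hypothetical simulated controller, for illustration only. */
struct sim_ctrl {
	uint32_t now_ms;		/* simulated clock */
	uint32_t ready_after_ms;	/* time the device really needs */
};

/* Simulated CSTS read: RDY is set once the device has had enough time. */
static uint32_t sim_read_csts(const struct sim_ctrl *c)
{
	return c->now_ms >= c->ready_after_ms ? CSTS_RDY : 0;
}

/*
 * Poll CSTS.RDY every poll_ms until timeout_ms has elapsed.
 * Returns true once the device is ready. On failure, *last_csts holds
 * the last value read, which is what an error message would report.
 */
static bool wait_ready(struct sim_ctrl *c, uint32_t timeout_ms,
		       uint32_t poll_ms, uint32_t *last_csts)
{
	uint32_t deadline = c->now_ms + timeout_ms;

	assert(poll_ms > 0);	/* a zero step would never reach the deadline */

	for (;;) {
		*last_csts = sim_read_csts(c);
		if (*last_csts & CSTS_RDY)
			return true;
		if (c->now_ms >= deadline)
			return false;
		c->now_ms += poll_ms;	/* stands in for sleeping */
	}
}
```

With `ready_after_ms = 3000` and a 1000 ms timeout, `wait_ready` returns false and leaves `*last_csts` at 0; with a 5000 ms timeout the same simulated device is reported ready.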
On 05.09.23 16:35, Keith Busch wrote:
Yes: it doesn't really make any sense. The report says the device stopped showing up with message:
nvme nvme0: Device not ready; aborting initialisation, CSTS=0x0
That (a) happens long before the mentioned quirk is considered by the driver, and (b) the "quirk" behavior is now the default in 6.5 and several of the listed stable kernels anyway.
It more likely sounds like the device is flaky and either never becomes ready due to some unspecified internal firmware condition, or inaccurately reports how long it actually needs to become ready in worst-case-scenario.
Thx, I kinda suspected something like that, but I kept my mouth shut, as I feared comments from the cheap seats might be more harmful than helpful.
But what can Claudio do to find the root cause? Check hardware (especially the connectors), update firmware, ...? And if that doesn't lead to anything, bisect the issue?
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
On Tue, Sep 05, 2023 at 04:49:11PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
But what can Claudio do to find the root cause? Check hardware (especially the connectors), update firmware, ...? And if that doesn't lead to anything, bisect the issue?
Try the current 6.5.1 as-is where the failure was previously seen and verify if this observation is indeed 100% reproducible with the "device not ready" kernel message. Full system power cycle between tests, too.
If this is truly a regression, my only guess is some platform power setting that a newer kernel changed. I am suspicious of that right now, though, since 6.5.1 was reported to fail as-is but to succeed with a "quirk" that doesn't accomplish anything. I'm leaning more toward my "device is not reliable" theory.
Hi Keith, just to give you all a response on this, and sorry for the late reply (too much work). But yes, you are correct. Because I tried patching the kernel on different days, with too much going on at the same time, I applied this two-line patch to the same source tree where I had already applied the other patch, the one that multiplies the timeout by 2 and takes effect earlier during activation. I thought I had an unpatched kernel at the time and ended up compiling it this way. Sorry for the mistake, but I also saw that there is now a better patch for the issue.
On 9/4/23 14:07, Bagas Sanjaya wrote:
Hi,
I notice a regression report on Bugzilla [1]. Quoting from it:
I bought a new 4 TB Lexar NM790 and I was using kernel 6.3.13 at the time. It wasn't recognized, with these messages in dmesg:
[ 358.950147] nvme nvme0: pci function 0000:06:00.0
[ 358.958327] nvme nvme0: Device not ready; aborting initialisation, CSTS=0x0
Hi,
This looks very much the same as the other MAXIO MAP1602 issue mentioned in: https://lore.kernel.org/lkml/7cd693dd-a6d7-4aab-aef0-76a8366ceee6@archlinux.... and Lexar NM790 is indeed also using the same controller.
And as I have mentioned there, 6.1 LTS kernels work without a problem because of differences in how the resulting timeout value is calculated. The latest kernels, including the 6.5.x branch, make the end result zero, which breaks all 4 TiB SSDs with this controller as far as I know.
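A rough sketch of how such a zero could arise, in C. Assumptions, none of which are spelled out in this thread: that the newer kernels prefer a second timeout field (CRTO.CRWMT) over CAP.TO when the controller advertises it, and that the affected controller reports zero there. The field positions follow the NVMe base specification, where both fields count in 500 ms units; the function names and the register values in the note below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* CAP.TO: bits 31:24 of the 64-bit CAP register, in 500 ms units. */
static uint32_t cap_to_units(uint64_t cap)
{
	return (uint32_t)((cap >> 24) & 0xff);
}

/* CRTO.CRWMT: bits 15:0 of the 32-bit CRTO register, in 500 ms units. */
static uint32_t crto_crwmt_units(uint32_t crto)
{
	return crto & 0xffff;
}

static uint32_t units_to_ms(uint32_t units)
{
	assert(units <= 0xffff);	/* widest field used here is 16 bits */
	return units * 500u;
}

/* Naive variant (assumed): trust CRTO whenever it is advertised. */
static uint32_t ready_timeout_naive_ms(uint64_t cap, uint32_t crto,
				       int has_crto)
{
	uint32_t units = has_crto ? crto_crwmt_units(crto)
				  : cap_to_units(cap);

	return units_to_ms(units);
}

/* Defensive variant (one possible mitigation): never go below CAP.TO. */
static uint32_t ready_timeout_defensive_ms(uint64_t cap, uint32_t crto,
					   int has_crto)
{
	uint32_t units = cap_to_units(cap);

	if (has_crto && crto_crwmt_units(crto) > units)
		units = crto_crwmt_units(crto);

	return units_to_ms(units);
}
```

For a hypothetical controller with CAP.TO = 20 (10 s) and CRTO = 0, the naive variant yields a 0 ms timeout while the defensive one yields 10000 ms. A 0 ms timeout means the readiness wait gives up almost immediately, so the device would fail with CSTS=0x0 no matter how healthy it is.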