Hello Tim,
On Tue, May 28, 2024 at 01:17:51PM -0600, Jens Axboe wrote:
(Adding Damien, he's the ATA guy these days - leaving the below intact)
On 5/28/24 1:15 PM, Thomas Gleixner wrote:
Tim!
On Tue, May 28 2024 at 17:43, Tim Teichmann wrote:
On 24/05/27 07:17pm, Thomas Gleixner wrote:
I've just tested the fix you've provided in the previous email. The exact patches are attached to the ticket in the archlinux bugtracker[0].
Thanks! I will write a proper changelog and ship it.
The error regarding CPU scheduling disappeared for both kernel versions[0]. However, the ATA bus error still occurs.
Also, I suppose that the ATA bus error is the same as the previous one, because the only value that changes in the exception message is SAct.
This is the message of the ATA error before the patch:
May 23 23:36:49 archlinux kernel: smpboot: x86: Booting SMP configuration:
May 23 23:36:49 archlinux kernel: .... node #0, CPUs: #2 #4 #6
May 23 23:36:49 archlinux kernel: __common_interrupt: 2.55 No irq handler for vector
May 23 23:36:49 archlinux kernel: __common_interrupt: 4.55 No irq handler for vector
May 23 23:36:49 archlinux kernel: __common_interrupt: 6.55 No irq handler for vector
ATA stuff:
May 23 23:36:59 archlinux kernel: ata2.00: exception Emask 0x10 SAct 0x1fffe000 SErr 0x40d0002 action 0xe frozen
That's probably just the fallout of the above.
It's in reality not related and I saw some other AHCI fallout fly by.
And that's the message after the patch:
[ 4.877584] ata2.00: exception Emask 0x10 SAct 0x80000000 SErr 0x40d0002 action 0xe frozen
The full dmesg outputs are in the attachments.
Cc'ed the AHCI people and left the info around for them.
We recently (kernel v6.9) enabled LPM for all AHCI controllers if:
-The AHCI controller reports that it supports LPM, and
-The drive reports that it supports LPM (DIPM), and
-CONFIG_SATA_MOBILE_LPM_POLICY=3, and
-The port is not defined as external in the per port PxCMD register, and
-The port is not defined as hotplug capable in the per port PxCMD register.
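
As a rough illustration of how the conditions above combine, here is a minimal sketch in plain C. The struct and function names are hypothetical and are not the kernel's actual data structures; the real checks are spread across libata and the AHCI driver.

```c
#include <stdbool.h>

/*
 * Hypothetical, simplified view of the inputs listed above.
 * None of these field names exist in the kernel.
 */
struct lpm_inputs {
	bool hba_supports_lpm;    /* controller reports LPM support */
	bool drive_supports_dipm; /* drive reports DIPM support */
	int  mobile_lpm_policy;   /* CONFIG_SATA_MOBILE_LPM_POLICY value */
	bool port_external;       /* port marked external in PxCMD */
	bool port_hotplug;        /* port marked hotplug capable in PxCMD */
};

/* All five conditions must hold for LPM to be enabled on the port. */
static bool lpm_would_be_enabled(const struct lpm_inputs *in)
{
	return in->hba_supports_lpm &&
	       in->drive_supports_dipm &&
	       in->mobile_lpm_policy == 3 &&
	       !in->port_external &&
	       !in->port_hotplug;
}
```

Note that the drive's own DIPM claim is one of the inputs, so a drive that misreports its capabilities passes this check.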
However, there appear to be some drives (usually cheap ones that we've never heard about) that report that they support DIPM, but stop working when it is actually turned on.
Looking at the dmesg, you seem to have two SATA drives:
[ 0.957220] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 0.957984] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 0.958027] ata3.00: ATA-8: TOSHIBA HDWD110, MS2OA8J0, max UDMA/133
[ 0.958069] ata2.00: ATA-11: Apacer AS340 120GB, AP612PE0, max UDMA/133
ata3 (TOSHIBA HDWD110) appears to work correctly.
ata2 (Apacer AS340 120GB) results in command timeouts and "a change in device presence has been detected" being set in PxSERR.DIAG.X.
[ 2.964262] ata2.00: exception Emask 0x10 SAct 0x80 SErr 0x40d0002 action 0xe frozen
[ 2.964274] ata2.00: irq_stat 0x00000040, connection status changed
[ 2.964279] ata2: SError: { RecovComm PHYRdyChg CommWake 10B8B DevExch }
[ 2.964288] ata2.00: failed command: READ FPDMA QUEUED
[ 2.964291] ata2.00: cmd 60/08:38:80:ff:f1/00:00:0d:00:00/40 tag 7 ncq dma 4096 in
                      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[ 2.964307] ata2.00: status: { DRDY }
[ 2.964318] ata2: hard resetting link
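
For anyone reading along, the SErr and SAct values in these messages are bitmasks. The following is a minimal userspace sketch, not kernel code, that decodes them. The bit positions follow my reading of the SATA SError register layout, the names mirror the tokens printed in the SError line above, and the helper names are made up for this example.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct serr_bit {
	uint32_t mask;
	const char *name;
};

/* SError bits, named after the tokens seen in the dmesg output. */
static const struct serr_bit serr_bits[] = {
	{ 1u << 0,  "RecovData" },   { 1u << 1,  "RecovComm" },
	{ 1u << 8,  "UnrecovData" }, { 1u << 9,  "Persist" },
	{ 1u << 10, "Proto" },       { 1u << 11, "HostInt" },
	{ 1u << 16, "PHYRdyChg" },   { 1u << 17, "PHYInt" },
	{ 1u << 18, "CommWake" },    { 1u << 19, "10B8B" },
	{ 1u << 20, "Dispar" },      { 1u << 21, "BadCRC" },
	{ 1u << 22, "Handshk" },     { 1u << 23, "LinkSeq" },
	{ 1u << 24, "TrStaTrns" },   { 1u << 25, "UnrecFIS" },
	{ 1u << 26, "DevExch" },
};

/* Write the space-separated names of the set bits into buf (len > 0). */
static void decode_serr(uint32_t serr, char *buf, size_t len)
{
	size_t i;

	buf[0] = '\0';
	for (i = 0; i < sizeof(serr_bits) / sizeof(serr_bits[0]); i++) {
		if (!(serr & serr_bits[i].mask))
			continue;
		if (buf[0] != '\0')
			strncat(buf, " ", len - strlen(buf) - 1);
		strncat(buf, serr_bits[i].name, len - strlen(buf) - 1);
	}
}

/* Lowest NCQ tag set in an SAct bitmask, or -1 if no bit is set. */
static int lowest_active_tag(uint32_t sact)
{
	int tag;

	for (tag = 0; tag < 32; tag++)
		if (sact & (1u << tag))
			return tag;
	return -1;
}

/* Number of NCQ tags marked active in an SAct bitmask. */
static int count_active_tags(uint32_t sact)
{
	int tag, n = 0;

	for (tag = 0; tag < 32; tag++)
		if (sact & (1u << tag))
			n++;
	return n;
}
```

Calling decode_serr(0x40d0002, ...) yields the same five tokens as the SError line, and lowest_active_tag(0x80) gives 7, matching the "tag 7" of the failed command. Since SAct only reflects which NCQ commands happened to be outstanding, it is expected to differ between boots even when the underlying failure is the same.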
Could you please try the following patch (quirk):
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index c449d60d9bb9..24ebcad65b65 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4199,6 +4199,9 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
 						ATA_HORKAGE_ZERO_AFTER_TRIM |
 						ATA_HORKAGE_NOLPM },
 
+	/* Apacer models with LPM issues */
+	{ "Apacer AS340*",		NULL,	ATA_HORKAGE_NOLPM },
+
 	/* These specific Samsung models/firmware-revs do not handle LPM well */
 	{ "SAMSUNG MZMPC128HBFU-000MV", "CXM14M1Q", ATA_HORKAGE_NOLPM },
 	{ "SAMSUNG SSD PM830 mSATA *",  "CXM13D1Q", ATA_HORKAGE_NOLPM },
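
To show what the new entry is expected to match, here is a small userspace sketch. It assumes the model and firmware strings are compared as glob patterns and that a NULL firmware revision means "any revision". It uses POSIX fnmatch() as a stand-in for the kernel's own matcher, and the struct and function names are hypothetical.

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one table entry (flags omitted). */
struct quirk_entry {
	const char *model;  /* glob pattern for the model string */
	const char *fw_rev; /* glob pattern for the firmware rev, or NULL */
};

/* True if the entry applies to a drive with this model and firmware. */
static bool quirk_matches(const struct quirk_entry *q,
			  const char *model, const char *fw_rev)
{
	if (fnmatch(q->model, model, 0) != 0)
		return false;
	if (q->fw_rev == NULL)
		return true; /* no firmware restriction */
	return fnmatch(q->fw_rev, fw_rev, 0) == 0;
}
```

With the entry { "Apacer AS340*", NULL }, the drive reported in the dmesg above ("Apacer AS340 120GB", firmware "AP612PE0") matches regardless of firmware, while "TOSHIBA HDWD110" does not.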
Kind regards,
Niklas