New subject: [PATCH] scsi: core: Handle devices which return an unusually large VPD page count

6 May 2024

Hi all,
I am running a dual-Xeon machine as my personal virtualization server at home, using
Proxmox VE. Their latest update, 8.2, brings kernel 6.8.4-2-pve, and with it I am seeing
a serious regression that breaks my setup: the machine does not boot any more. The last
message displayed during boot is "Timed out for waiting the udev queue being empty.",
and then it hangs indefinitely.
The previous kernel, 6.5.13-5-pve, worked fine, with one caveat: I had similar problems
with earlier kernels too, so from the very beginning of running PVE on this machine I
have had to set the kernel parameter rootdelay=60 in GRUB. With that, everything was
fine: the buses settled, the RAID controller and the root device were found, and the
system booted. With the newer 6.8.4 kernel this no longer helps, even after I increased
rootdelay to 120.
I was able to reproduce and bisect this regression with mainline kernels as well
(including stable 6.8.8 and 6.9-rc), so I thought it would be a good idea to report it
upstream to you guys.
This is an older server machine: 2-socket Ivy Bridge Xeon E5-2697 v2 (24C/48T) in an Asus
Z9PE-D16/2L motherboard (Intel C-602A chipset); BIOS patched to the latest available from
Asus. All memory slots are occupied, for 256 GB RAM in total. It also has an Asus ASMB6
iKVM BMC, which provides virtual storage devices (see the dmesg output below) to which
ISO images can be attached over the network to boot or install an OS from.
Storage config:
I have two single M4 256 GB SATA SSDs attached to the internal mainboard SATA ports;
one of them is my root device and PVE installation drive, the other I use for storing
ISO images. My main VM storage is attached to a battery-backed Adaptec 5805 SATA/SAS
RAID controller (with the latest firmware, build 18948), connected to the SATA/SAS
enclosure of my Supermicro server case with eight disk drives in total: one RAID1 array
of two Samsung 1 TB SATA SSDs for VM root disk images, and one RAID5 array of six
Hitachi 1 TB HDDs for VM data disk images. On both arrays I use an LVM thin pool as the
PVE storage location. Once everything has booted, the system runs smoothly with ~15 VMs
at the same time (and has for years!). Although this is "only" a homelab server, I love
it dearly and use it for many private project VMs, among them a Windows Server VM
running MS SQL Server and Linux server VMs running Oracle Database Server (I'm a
database guy).
I attach the dmesg output of the previous working kernel 6.5.13-5-pve, my git bisect log
and the output of lspci -v. The last kernel messages I see from the failing kernel
versions are these:
...
[    5.540424] usb-storage 1-1.3.4:1.0: USB Mass Storage device detected
[    5.540670] scsi host10: usb-storage 1-1.3.4:1.0
[    5.947794] scsi 8:0:0:0: CD-ROM            AMI      Virtual CDROM0   1.00 PQ: 0 ANSI: 0 CCS
[    6.267830] scsi 9:0:0:0: Direct-Access     AMI      Virtual Floppy0  1.00 PQ: 0 ANSI: 0 CCS
[    6.555845] scsi 10:0:0:0: Direct-Access     AMI      Virtual HDISK0   1.00 PQ: 0 ANSI: 0 CCS
followed by the error message "Timed out for waiting the udev queue being empty.", after
which the system hangs. With working kernels, the boot process continues like this:
...
[    5.947794] scsi 8:0:0:0: CD-ROM            AMI      Virtual CDROM0   1.00 PQ: 0 ANSI: 0 CCS
[    6.267830] scsi 9:0:0:0: Direct-Access     AMI      Virtual Floppy0  1.00 PQ: 0 ANSI: 0 CCS
[    6.555845] scsi 10:0:0:0: Direct-Access     AMI      Virtual HDISK0   1.00 PQ: 0 ANSI: 0 CCS
[   32.592054] scsi 0:3:1:0: Enclosure         ADAPTEC  Virtual SGPIO  1 0001 PQ: 0 ANSI: 5
[   61.536097] sd 0:0:0:0: Attached scsi generic sg0 type 0
[   61.536215] sd 0:0:0:0: [sda] 1998565376 512-byte logical blocks: (1.02 TB/953 GiB)
[   61.536236] sd 0:0:1:0: Attached scsi generic sg1 type 0
[   61.536239] sd 0:0:0:0: [sda] Write Protect is off
[   61.536246] sd 0:0:0:0: [sda] Mode Sense: 12 00 10 08
[   61.536283] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[   61.536340] scsi 0:1:0:0: Attached scsi generic sg2 type 0
[   61.536383] sd 0:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
[   61.536400] sd 0:0:1:0: [sdb] 9762222080 512-byte logical blocks: (5.00 TB/4.54 TiB)
[   61.536414] sd 0:0:1:0: [sdb] Write Protect is off
[   61.536418] sd 0:0:1:0: [sdb] Mode Sense: 12 00 10 08
[   61.536439] sd 0:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[   61.536455] scsi 0:1:1:0: Attached scsi generic sg3 type 0
[   61.536616] scsi 0:1:2:0: Attached scsi generic sg4 type 0
[   61.536750] scsi 0:1:3:0: Attached scsi generic sg5 type 0
[   61.536840] scsi 0:1:4:0: Attached scsi generic sg6 type 0
[   61.536930] scsi 0:1:5:0: Attached scsi generic sg7 type 0
[   61.537027] scsi 0:1:6:0: Attached scsi generic sg8 type 0
[   61.537122] scsi 0:1:7:0: Attached scsi generic sg9 type 0
[   61.537248] sd 0:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
[   61.537274] scsi 0:3:0:0: Attached scsi generic sg10 type 13
[   61.537390] scsi 0:3:1:0: Attached scsi generic sg11 type 13
[   61.537558] scsi 1:0:0:0: Direct-Access     ATA      M4-CT256M4SSD2   0309 PQ: 0 ANSI: 5
[   61.537851] sd 1:0:0:0: Attached scsi generic sg12 type 0
[   61.537919] scsi: waiting for bus probes to complete ...
[   61.537973] sd 1:0:0:0: [sdc] 500118192 512-byte logical blocks: (256 GB/238 GiB)
[   61.537986] sd 1:0:0:0: [sdc] Write Protect is off
[   61.537989] sd 1:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[   61.538002] sd 1:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   61.538022] sd 1:0:0:0: [sdc] Preferred minimum I/O size 512 bytes
[   61.538924]  sdc: sdc1 sdc2 < sdc5 >
...
So it seems to me that the initialization of the Adaptec controller is the culprit.
I have tested and reproduced the regression with mainline kernels according to the
following list (please excuse me if it's too long ;-).
See the very bottom for the first bad commit I found this way. I always built with "make
olddefconfig", using the 6.5.13-5-pve config as the starting point.
-------------------------------------------------------------------
Proxmox Virtual Environment (PVE) Kernels
=========================================
6.5.13-5-pve    WORKS   last working PVE (8.1) kernel; 5.15-pve and 6.2-pve work too
6.8.4-2-pve     NOPE    PVE release 8.2
Mainline Kernels
================
6.9.0-rc6+      NOPE    Most recent (2024-05-01)
6.9.0-rc5+      NOPE    Most recent (2024-04-27)
6.8.8           NOPE    Most recent released (2024-04-29)
6.8.7           NOPE    Most recent released (2024-04-27)
6.8.4           NOPE    Same version as most recent released PVE 8.2 Kernel
6.5.13          WORKS
My tests, reverts on top of 6.8.8
=================================
6.8.8+          WORKS   Revert "Merge tag 'scsi-fixes' of
                        git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi"
                        (reverts commit 6d20acbf3e3a32d331947dbc3802cf2d1a399e7d, reversing
                        changes made to fef85269a19d277f23fc5ff08a3c356beeb54cb3)
6.8.8+          WORKS   Revert "scsi: core: Consult supported VPD page list prior to
                        fetching page" (reverts commit
                        b5fc07a5fb56216a49e6c1d0b172d5464d99a89b; this is the first bad
                        commit of my bisect session, see below, and a single patch within
                        the merged tag 'scsi-fixes' above)
Bisecting, starting from 6.9.0-rc5 (bad) and 6.5.13 (good)
==========================================================
root@linus:/usr/src/linux# git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.
root@linus:/usr/src/linux# git log
commit 9d1ddab261f3e2af7c384dc02238784ce0cf9f98 (HEAD -> master, origin/master, origin/HEAD)
Merge: 71b1543c83d6 77d8aa79ecfb
Author: Linus Torvalds torvalds@linux-foundation.org
Date:   Tue Apr 23 09:37:32 2024 -0700
Merge tag '6.9-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
root@linus:/usr/src/linux# cp /boot/config-6.5.13-5-pve .config
root@linus:/usr/src/linux# git bisect start
status: waiting for both good and bad commits
root@linus:/usr/src/linux# git bisect bad
status: waiting for good commit(s), bad commit known
root@linus:/usr/src/linux# git bisect good v6.5.13
Bisecting: a merge base must be tested
[2dde18cd1d8fac735875f2e4987f11817cc0bc2c] Linux 6.5
root@linus:/usr/src/linux# make olddefconfig
.config:10571:warning: symbol value 'm' invalid for ANDROID_BINDER_IPC
.config:10572:warning: symbol value 'm' invalid for ANDROID_BINDERFS
#
# configuration written to .config
#
root@linus:/usr/src/linux# make -j 48
=> 6.5.0 (Merge Base)                       WORKS
root@linus:/usr/src/linux# git bisect good
Bisecting: 32111 revisions left to test after this (roughly 15 steps)
[0f5cc96c367f2e780eb492cc9cab84e3b2ca88da] Merge tag 's390-6.7-3' of 
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
root@linus:/usr/src/linux# make -j 48
=> 6.7.0-rc2+                               WORKS
root@linus:/usr/src/linux# git bisect good
Bisecting: 16056 revisions left to test after this (roughly 14 steps)
[ee138217c32ccbfa75d5ea6b766158148e98f6fa] Merge tag 'btree-remove-btnum-6.9_2024-02-23' 
of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.9-mergeC
=> 6.8.0-rc4+                               WORKS
root@linus:/usr/src/linux# git bisect good
Bisecting: 8214 revisions left to test after this (roughly 13 steps)
[e5e038b7ae9da96b93974bf072ca1876899a01a3] Merge tag 'fs_for_v6.9-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
=> 6.8.0+                                   NOPE => does not find root device, does not boot;
                                             message: "BUG: arch topology borken the CPU domain not a subset of the NUMA domain"
                                             message: "Timed out for waiting the udev queue being empty."
root@linus:/usr/src/linux# git bisect bad
Bisecting: 3954 revisions left to test after this (roughly 12 steps)
[f153fbe1ea11939e2514ba4b3b62bbd946e2892c] Merge tag 'erofs-for-6.9-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
=> 6.8.0+ (HEAD detached at f153fbe1ea11) NOPE => same as above
root@linus:/usr/src/linux# git bisect bad
Bisecting: 1945 revisions left to test after this (roughly 11 steps)
[1ddeeb2a058d7b2a58ed9e820396b4ceb715d529] Merge tag 'for-6.9/block-20240310' of 
git://git.kernel.dk/linux
=> 6.8.0+ (HEAD detached at 1ddeeb2a058d) NOPE => same as above
root@linus:/usr/src/linux# git bisect bad
Bisecting: 970 revisions left to test after this (roughly 10 steps)
[2652b99e43403dc464f3648483ffb38e48872fe4] ice: virtchnl: stop pretending to support RSS 
over AQ or registers
=> 6.8.0-rc6+ (2652b99e4340)                NOPE => same
root@linus:/usr/src/linux# git bisect bad
Bisecting: 506 revisions left to test after this (roughly 9 steps)
[efa80dcbb7a3ecc4a1b2f54624c49b5a612f92b3] Merge tag 'trace-v6.8-rc5' of 
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
=> 6.8.0-rc5+ (efa80dcbb7a3)                WORKS
root@linus:/usr/src/linux# git bisect good
Bisecting: 251 revisions left to test after this (roughly 8 steps)
[c6a597fcc7ad7335a3ecf8f5287a0459f793a257] Merge tag 'loongarch-fixes-6.8-3' of 
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
=> 6.8.0-rc5+ (c6a597fcc7ad)                WORKS
root@linus:/usr/src/linux# git bisect good
Bisecting: 126 revisions left to test after this (roughly 7 steps)
[cf1182944c7cc9f1c21a8a44e0d29abe12527412] Merge tag 'lsm-pr-20240227' of 
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
=> 6.8.0-rc6+ (cf1182944c7c)                NOPE
root@linus:/usr/src/linux# git bisect bad
Bisecting: 62 revisions left to test after this (roughly 6 steps)
[4ca0d9894fd517a2f2c0c10d26ebe99ab4396fe3] Merge tag 'erofs-for-6.8-rc6-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
=> 6.8.0-rc5+ (4ca0d9894fd5)                NOPE
root@linus:/usr/src/linux# git bisect bad
Bisecting: 36 revisions left to test after this (roughly 5 steps)
[ac389bc0ca56e1a2f92b2a17e58298390a3879a8] Merge tag 'cxl-fixes-6.8-rc6' of 
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
=> 6.8.0-rc5+ (ac389bc0ca56)                NOPE
root@linus:/usr/src/linux# git bisect bad
Bisecting: 12 revisions left to test after this (roughly 4 steps)
[40de53fd002c6ba087a623722915e8006ed68a02] Merge branch 'for-6.8/cxl-cper' into for-6.8/cxl
=> 6.8.0-rc5+ (40de53fd002c)                WORKS
root@linus:/usr/src/linux# git bisect good
Bisecting: 6 revisions left to test after this (roughly 3 steps)
[9ddf190a7df77b77817f955fdb9c2ae9d1c9c9a3] scsi: jazz_esp: Only build if SCSI core is builtin
=> 6.8.0-rc1+ (9ddf190a7df7)                NOPE
root@linus:/usr/src/linux# git bisect bad
Bisecting: 2 revisions left to test after this (roughly 2 steps)
[de959094eb2197636f7c803af0943cb9d3b35804] scsi: target: pscsi: Fix bio_put() for error case
=> 6.8.0-rc1+ (de959094eb21)                NOPE
root@linus:/usr/src/linux# git bisect bad
Bisecting: 0 revisions left to test after this (roughly 1 step)
[b5fc07a5fb56216a49e6c1d0b172d5464d99a89b] scsi: core: Consult supported VPD page list 
prior to fetching page
=> 6.8.0-rc1+ (b5fc07a5fb56)                NOPE
root@linus:/usr/src/linux# git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[321da3dc1f3c92a12e3c5da934090d2992a8814c] scsi: sd: usb_storage: uas: Access media prior 
to querying device properties
=> 6.8.0-rc1+ (321da3dc1f3c)                WORKS
root@linus:/usr/src/linux# git bisect good
b5fc07a5fb56216a49e6c1d0b172d5464d99a89b is the first bad commit
commit b5fc07a5fb56216a49e6c1d0b172d5464d99a89b
Author: Martin K. Petersen martin.petersen@oracle.com
Date:   Wed Feb 14 17:14:11 2024 -0500

     scsi: core: Consult supported VPD page list prior to fetching page

     Commit c92a6b5d6335 ("scsi: core: Query VPD size before getting full
     page") removed the logic which checks whether a VPD page is present on
     the supported pages list before asking for the page itself. That was
     done because SPC helpfully states "The Supported VPD Pages VPD page
     list may or may not include all the VPD pages that are able to be
     returned by the device server". Testing had revealed a few devices
     that supported some of the 0xBn pages but didn't actually list them in
     page 0.

     Julian Sikorski bisected a problem with his drive resetting during
     discovery to the commit above. As it turns out, this particular drive
     firmware will crash if we attempt to fetch page 0xB9.

     Various approaches were attempted to work around this. In the end,
     reinstating the logic that consults VPD page 0 before fetching any
     other page was the path of least resistance. A firmware update for the
     devices which originally compelled us to remove the check has since
     been released.

     Link: https://lore.kernel.org/r/20240214221411.2888112-1-martin.petersen@oracle.co...
     Fixes: c92a6b5d6335 ("scsi: core: Query VPD size before getting full page")
     Cc: stable@vger.kernel.org
     Cc: Bart Van Assche bvanassche@acm.org
     Reported-by: Julian Sikorski belegdol@gmail.com
     Tested-by: Julian Sikorski belegdol@gmail.com
     Reviewed-by: Lee Duncan lee.duncan@suse.com
     Reviewed-by: Bart Van Assche bvanassche@acm.org
     Signed-off-by: Martin K. Petersen martin.petersen@oracle.com

  drivers/scsi/scsi.c        | 22 ++++++++++++++++++++--
  include/scsi/scsi_device.h |  4 ----
  2 files changed, 20 insertions(+), 6 deletions(-)
root@linus:/usr/src/linux#
-------------------------------------------------------------------
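For readers unfamiliar with the first bad commit: going by its commit message, the kernel
again looks a page up in the device's Supported VPD Pages list (page 0) before fetching
that page. Below is a minimal Python sketch of such a lookup; it is not the kernel's C
code. It assumes the usual page 0 layout (a 4-byte header with the list length in bytes
2-3, followed by one byte per supported page code). The function name and the sample
buffers are made up for illustration, and clamping the claimed length to the bytes
actually received is a defensive assumption of this sketch, not something taken from the
commit.

```python
def page_is_listed(vpd_page0: bytes, page: int) -> bool:
    """Return True if `page` appears in a Supported VPD Pages (page 0) response.

    Hypothetical helper for illustration only; not a kernel or library API.
    """
    # A valid response has a 4-byte header whose second byte is the page code 0x00.
    if len(vpd_page0) < 4 or vpd_page0[1] != 0x00:
        return False
    # Bytes 2-3: big-endian length of the page-code list that follows the header.
    claimed = int.from_bytes(vpd_page0[2:4], "big")
    # Do not trust the device: never read past the bytes that were actually received.
    usable = min(claimed, len(vpd_page0) - 4)
    return page in vpd_page0[4:4 + usable]


# Made-up sample responses (not captured from real hardware).
SANE = bytes([0x00, 0x00, 0x00, 0x03, 0x00, 0x80, 0x83])   # lists pages 0x00, 0x80, 0x83
BOGUS = bytes([0x00, 0x00, 0xFF, 0xFF, 0x00, 0x80])        # claims 65535 entries, sends 2

if __name__ == "__main__":
    for name, buf in (("SANE", SANE), ("BOGUS", BOGUS)):
        for page in (0x80, 0x83, 0xB9):
            print(f"{name}: page 0x{page:02X} listed: {page_is_listed(buf, page)}")
```

With a lookup like this, a page such as 0xB9 is only requested when the device itself
advertises it, and a device that reports a nonsensical list length cannot make the
lookup read beyond the received buffer.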
Best regards,
Peter Schneider
-- 
Climb the mountain not to plant your flag, but to embrace the challenge,
enjoy the air and behold the view. Climb it so you can see the world,
not so the world can see you.                    -- David McCullough Jr.

OpenPGP:  0xA3828BD796CCE11A8CADE8866E3A92C92C3FF244
Download: https://www.peters-netzplatz.de/download/pschneider1968_pub.asc
https://keys.mailvelope.com/pks/lookup?op=get&search=pschneider1968@goog...
https://keys.mailvelope.com/pks/lookup?op=get&search=pschneider1968@gmai...

Kernel 6.8.4 regression: aacraid controller not initialized any more, system boot hangs