On 5/14/25 11:54 AM, Jason Gunthorpe wrote:
On Wed, May 14, 2025 at 09:23:49AM +0000, Ankit Soni wrote:
I am experiencing a system hang on boot with the 5-level v2 page table mode. The NVMe boot drive is not initializing. Below are the relevant dmesg logs, with some prints I had added:
[ 6.386439] AMD-Vi v2 domain init
[ 6.390132] AMD-Vi v2 pt init
[ 6.390133] AMD-Vi aperture end last va ffffffffffffff
...
[ 10.315372] AMD-Vi gen pt MAP PAGES iova ffffffffffffe000 paddr 19351b000
...
[ 72.171930] nvme nvme0: I/O tag 0 (0000) QID 0 timeout, disable controller
[ 72.179618] nvme nvme1: I/O tag 24 (0018) QID 0 timeout, disable controller
[ 72.197176] nvme nvme0: Identify Controller failed (-4)
[ 72.203063] nvme nvme1: Identify Controller failed (-4)
[ 72.209237] nvme 0000:05:00.0: probe with driver nvme failed with error -5
[ 72.209336] nvme 0000:44:00.0: probe with driver nvme failed with error -5
...
Timed out waiting for the udev queue to be empty.
According to the dmesg logs above, the IOVA for the v2 page table appears incorrect: it does not fall within domain->geometry.aperture_end. Fixing this requires domain->geometry.force_aperture = true; to be added at the appropriate location. Probably here!
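To illustrate the reasoning, here is a minimal user-space sketch, not kernel code. It assumes the IOVA allocator starts from the device's DMA mask and only clamps that limit to aperture_end when force_aperture is set, which is roughly how the DMA-IOMMU layer behaves. The names struct geometry, iova_limit() and iova_in_aperture() are hypothetical and exist only for this example; the numeric values are the ones from the dmesg output above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the geometry fields discussed in this thread. */
struct geometry {
	uint64_t aperture_start;
	uint64_t aperture_end;   /* last valid IOVA, inclusive */
	bool force_aperture;
};

/*
 * Upper bound handed to the IOVA allocator: the device's DMA mask,
 * clamped to the aperture only when force_aperture is set (assumption).
 */
static uint64_t iova_limit(const struct geometry *g, uint64_t dma_mask)
{
	uint64_t limit = dma_mask;

	assert(g->aperture_start <= g->aperture_end);
	if (g->force_aperture && g->aperture_end < limit)
		limit = g->aperture_end;
	return limit;
}

/* True if the IOVA is something the page table can actually translate. */
static bool iova_in_aperture(const struct geometry *g, uint64_t iova)
{
	return iova >= g->aperture_start && iova <= g->aperture_end;
}
```

With aperture_end = 0xffffffffffffff and force_aperture unset, a driver that sets a 64-bit DMA mask gets a limit of 0xffffffffffffffff, so an IOVA such as 0xffffffffffffe000 can be handed out even though iova_in_aperture() rejects it. With force_aperture set, the limit drops to aperture_end. A driver with a 32-bit DMA mask stays below the aperture in both cases, which would be consistent with the hang not showing up on every device.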
Thank you for pointing out this issue and its cause. I originally tested on a host with SCSI storage, and after your report I tried but couldn't reproduce the hang on a Zen4 host with an NVMe boot drive. I wanted to see if it was a pattern common to NVMe, but I suppose it depends on the DMA mask chosen by the specific driver.
Alejandro
Yes! It got lost, thanks a lot!
Jason