From: Adrian Huang ahuang12@lenovo.com
When enabling VMD and IOMMU scalable mode, the following kernel panic call trace/kernel log is shown in Eagle Stream platform (Sapphire Rapids CPU) during booting:
pci 0000:59:00.5: Adding to iommu group 42 ... vmd 0000:59:00.5: PCI host bridge to bus 10000:80 pci 10000:80:01.0: [8086:352a] type 01 class 0x060400 pci 10000:80:01.0: reg 0x10: [mem 0x00000000-0x0001ffff 64bit] pci 10000:80:01.0: enabling Extended Tags pci 10000:80:01.0: PME# supported from D0 D3hot D3cold pci 10000:80:01.0: DMAR: Setup RID2PASID failed pci 10000:80:01.0: Failed to add to iommu group 42: -16 pci 10000:80:03.0: [8086:352b] type 01 class 0x060400 pci 10000:80:03.0: reg 0x10: [mem 0x00000000-0x0001ffff 64bit] pci 10000:80:03.0: enabling Extended Tags pci 10000:80:03.0: PME# supported from D0 D3hot D3cold list_add double add: new=ff4d61160b74b8a0, prev=ff4d611d8e245c10, next=ff4d61160b74b8a0. ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:29! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 7 Comm: kworker/0:1 Not tainted 5.17.0-rc3+ #7 Hardware name: Lenovo ThinkSystem SR650V3/SB27A86647, BIOS ESE101Y-1.00 01/13/2022 Workqueue: events work_for_cpu_fn RIP: 0010:__list_add_valid.cold+0x26/0x3f Code: 9a 4a ab ff 4c 89 c1 48 c7 c7 40 0c d9 9e e8 b9 b1 fe ff 0f 0b 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 f0 0c d9 9e e8 a2 b1 fe ff <0f> 0b 48 89 d1 4c 89 c6 4c 89 ca 48 c7 c7 98 0c d9 9e e8 8b b1 fe RSP: 0000:ff5ad434865b3a40 EFLAGS: 00010246 RAX: 0000000000000058 RBX: ff4d61160b74b880 RCX: ff4d61255e1fffa8 RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ffffffff9fd34f20 RBP: ff4d611d8e245c00 R08: 0000000000000000 R09: ff5ad434865b3888 R10: ff5ad434865b3880 R11: ff4d61257fdc6fe8 R12: ff4d61160b74b8a0 R13: ff4d61160b74b8a0 R14: ff4d611d8e245c10 R15: ff4d611d8001ba70 FS: 0000000000000000(0000) GS:ff4d611d5ea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ff4d611fa1401000 CR3: 0000000aa0210001 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> intel_pasid_alloc_table+0x9c/0x1d0 dmar_insert_one_dev_info+0x423/0x540 ? device_to_iommu+0x12d/0x2f0 intel_iommu_attach_device+0x116/0x290 __iommu_attach_device+0x1a/0x90 iommu_group_add_device+0x190/0x2c0 __iommu_probe_device+0x13e/0x250 iommu_probe_device+0x24/0x150 iommu_bus_notifier+0x69/0x90 blocking_notifier_call_chain+0x5a/0x80 device_add+0x3db/0x7b0 ? arch_memremap_can_ram_remap+0x19/0x50 ? memremap+0x75/0x140 pci_device_add+0x193/0x1d0 pci_scan_single_device+0xb9/0xf0 pci_scan_slot+0x4c/0x110 pci_scan_child_bus_extend+0x3a/0x290 vmd_enable_domain.constprop.0+0x63e/0x820 vmd_probe+0x163/0x190 local_pci_probe+0x42/0x80 work_for_cpu_fn+0x13/0x20 process_one_work+0x1e2/0x3b0 worker_thread+0x1c4/0x3a0 ? rescuer_thread+0x370/0x370 kthread+0xc7/0xf0 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- ... Kernel panic - not syncing: Fatal exception Kernel Offset: 0x1ca00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]---
The following 'lspci' output shows devices '10000:80:*' are subdevices of the VMD device 0000:59:00.5:
$ lspci ... 0000:59:00.5 RAID bus controller: Intel Corporation Volume Management Device NVMe RAID Controller (rev 20) ... 10000:80:01.0 PCI bridge: Intel Corporation Device 352a (rev 03) 10000:80:03.0 PCI bridge: Intel Corporation Device 352b (rev 03) 10000:80:05.0 PCI bridge: Intel Corporation Device 352c (rev 03) 10000:80:07.0 PCI bridge: Intel Corporation Device 352d (rev 03) 10000:81:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] 10000:82:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
The symptom 'list_add double add' is caused by the following failure message:
pci 10000:80:01.0: DMAR: Setup RID2PASID failed pci 10000:80:01.0: Failed to add to iommu group 42: -16 pci 10000:80:03.0: [8086:352b] type 01 class 0x060400
Device 10000:80:01.0 is the subdevice of the VMD device 0000:59:00.5, so invoking intel_pasid_alloc_table() gets the pasid_table of the VMD device 0000:59:00.5. Here is call path:
intel_pasid_alloc_table pci_for_each_dma_alias get_alias_pasid_table search_pasid_table
pci_real_dma_dev() in pci_for_each_dma_alias() gets the real dma device which is the VMD device 0000:59:00.5. However, pte of the VMD device 0000:59:00.5 has been configured during this message "pci 0000:59:00.5: Adding to iommu group 42". So, the status -EBUSY is returned when configuring pasid entry for device 10000:80:01.0.
It then invokes dmar_remove_one_dev_info() to release 'struct device_domain_info *' from iommu_devinfo_cache. But, the pasid table is not released because of the following statement in __dmar_remove_one_dev_info():
if (info->dev && !dev_is_real_dma_subdevice(info->dev)) { ... intel_pasid_free_table(info->dev); }
The subsequent dmar_insert_one_dev_info() operation of device 10000:80:03.0 allocates 'struct device_domain_info *' from iommu_devinfo_cache. The allocated address is the same address that is released previously for device 10000:80:01.0. Finally, invoking device_attach_pasid_table() causes the issue.
`git bisect` points to the offending commit 474dd1c65064 ("iommu/vt-d: Fix clearing real DMA device's scalable-mode context entries"), which releases the pasid table if the device is not the subdevice by checking the returned status of dev_is_real_dma_subdevice(). Reverting the offending commit can work around the issue.
The solution is to prevent from allocating pasid table if those devices are subdevices of the VMD device.
Fixes: 474dd1c65064 ("iommu/vt-d: Fix clearing real DMA device's scalable-mode context entries") Cc: stable@vger.kernel.org # v5.14+ Signed-off-by: Adrian Huang ahuang12@lenovo.com ---
Changes since v2: * Add Fixes tag and stable kernel CC, per Baolu
drivers/iommu/intel/iommu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 92fea3fbbb11..5b196cfe9ed2 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -2738,7 +2738,7 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu, spin_unlock_irqrestore(&device_domain_lock, flags);
/* PASID table is mandatory for a PCI device in scalable mode. */ - if (dev && dev_is_pci(dev) && sm_supported(iommu)) { + if (sm_supported(iommu) && !dev_is_real_dma_subdevice(dev)) { ret = intel_pasid_alloc_table(dev); if (ret) { dev_err(dev, "PASID table allocation failed\n");
On 2/16/22 5:13 PM, Adrian Huang wrote:
pci_real_dma_dev() in pci_for_each_dma_alias() gets the real dma device which is the VMD device 0000:59:00.5. However, pte of the VMD device 0000:59:00.5 has been configured during this message "pci 0000:59:00.5: Adding to iommu group 42". So, the status -EBUSY is returned when configuring pasid entry for device 10000:80:01.0.
So the VMD subdevice (pci 10000:80:01.0) is an alias device of the "pci 0000:59:00.5", and it uses the Source-ID of "pci 0000:59:00.5" for DMA transactions. Do I understand it right? If so, it makes sense to skip setting up pasid entry for VMD subdevices.
Another thing I am still concerned is about the context entry setup. What does the context entries look like for both VMD and subdevices after domain_context_mapping() being called?
Best regards, baolu
On Fri, Feb 18, 2022 at 10:30 AM Lu Baolu baolu.lu@linux.intel.com wrote:
On 2/16/22 5:13 PM, Adrian Huang wrote:
pci_real_dma_dev() in pci_for_each_dma_alias() gets the real dma device which is the VMD device 0000:59:00.5. However, pte of the VMD device 0000:59:00.5 has been configured during this message "pci 0000:59:00.5: Adding to iommu group 42". So, the status -EBUSY is returned when configuring pasid entry for device 10000:80:01.0.
So the VMD subdevice (pci 10000:80:01.0) is an alias device of the "pci 0000:59:00.5", and it uses the Source-ID of "pci 0000:59:00.5" for DMA transactions. Do I understand it right? If so, it makes sense to skip setting up pasid entry for VMD subdevices.
Yes, your understanding is correct.
Another thing I am still concerned is about the context entry setup. What does the context entries look like for both VMD and subdevices after domain_context_mapping() being called?
pasid_table in struct device_domain_info is NULL because the field pasid_table is configured in intel_pasid_alloc_table().
The following statement in domain_context_mapping_one() is true for subdevices because the context is configured by the real VMD device 0000:59:00.5. So, domain_context_mapping() does nothing for subdevices. if (context_present(context)) goto out_unlock;
Here is the log for your reference with pr_debug() enabled.
[ 19.063445] pci 0000:59:00.5: Adding to iommu group 42 ... [ 22.673502] DMAR: Set context mapping for 59:00.5 .. [ 32.089696] vmd 0000:59:00.5: PCI host bridge to bus 10000:80 [ 32.119452] pci 10000:80:01.0: [8086:352a] type 01 class 0x060400 [ 32.126302] pci 10000:80:01.0: reg 0x10: [mem 0x00000000-0x0001ffff 64bit] [ 32.134023] pci 10000:80:01.0: enabling Extended Tags [ 32.139730] pci 10000:80:01.0: PME# supported from D0 D3hot D3cold [ 32.160526] DMAR: Set context mapping for 59:00.5 [ 32.171090] pci 10000:80:01.0: Adding to iommu group 42 ...
-- Adrian
On 2/18/22 4:21 PM, Huang Adrian wrote:
Another thing I am still concerned is about the context entry setup. What does the context entries look like for both VMD and subdevices after domain_context_mapping() being called?
pasid_table in struct device_domain_info is NULL because the field pasid_table is configured in intel_pasid_alloc_table().
The following statement in domain_context_mapping_one() is true for subdevices because the context is configured by the real VMD device 0000:59:00.5. So, domain_context_mapping() does nothing for subdevices. if (context_present(context)) goto out_unlock;
Here is the log for your reference with pr_debug() enabled.
[ 19.063445] pci 0000:59:00.5: Adding to iommu group 42 ... [ 22.673502] DMAR: Set context mapping for 59:00.5 .. [ 32.089696] vmd 0000:59:00.5: PCI host bridge to bus 10000:80 [ 32.119452] pci 10000:80:01.0: [8086:352a] type 01 class 0x060400 [ 32.126302] pci 10000:80:01.0: reg 0x10: [mem 0x00000000-0x0001ffff 64bit] [ 32.134023] pci 10000:80:01.0: enabling Extended Tags [ 32.139730] pci 10000:80:01.0: PME# supported from D0 D3hot D3cold [ 32.160526] DMAR: Set context mapping for 59:00.5 [ 32.171090] pci 10000:80:01.0: Adding to iommu group 42 ...
That's clear to me now. Thank you very much!
Best regards, baolu
linux-stable-mirror@lists.linaro.org