On Mon, Nov 11, 2024 at 03:21:36PM +0800, Joseph Jang wrote:
On 2024/10/19 3:34 AM, Bjorn Helgaas wrote:
On Tue, Sep 03, 2024 at 06:44:26PM -0700, Joseph Jang wrote:
Validate that there are no duplicate hwirqs per chip name in the irq debug file system /sys/kernel/debug/irq/irqs/*.
One example log shows two duplicated hwirqs in the irq debug file system:
$ sudo cat /sys/kernel/debug/irq/irqs/163
handler: handle_fasteoi_irq
device: 0019:00:00.0
<SNIP>
node: 1
affinity: 72-143
effectiv: 76
domain: irqchip@0x0000100022040000-3
hwirq: 0xc8000000
chip: ITS-MSI
flags: 0x20
$ sudo cat /sys/kernel/debug/irq/irqs/174
handler: handle_fasteoi_irq
device: 0039:00:00.0
<SNIP>
node: 3
affinity: 216-287
effectiv: 221
domain: irqchip@0x0000300022040000-3
hwirq: 0xc8000000
chip: ITS-MSI
flags: 0x20
irq-check.sh collects the hwirq and chip name from /sys/kernel/debug/irq/irqs/* and prints an error when it finds a duplicate hwirq for the same chip name.
Kernel patch ("PCI/MSI: Fix MSI hwirq truncation") [1] fixes the above issue.
[1]: https://lore.kernel.org/all/20240115135649.708536-1-vidyas@nvidia.com/
I don't know enough about this issue to understand the details. It seems like you look for duplicate hwirqs in chips with the same name, e.g., "ITS-MSI" in this case? That name seems too generic to me (might there be several instances of "ITS-MSI" in a system?)
As far as I know, each PCIe device typically has only one ITS-MSI controller. Having multiple ITS-MSI instances for the same device would lead to confusion and potential conflicts in interrupt routing.
Also, the name may come from chip->irq_print_chip(), so it apparently relies on irqchip drivers to make the names unique if there are multiple instances?
I would have expected looking for duplicates inside something more specific, like "irqchip@0x0000300022040000-3". But again, I don't know enough about the problem to speak confidently here.
In our case, suppose we look for duplicates within each irq domain, such as "irqchip@0x0000100022040000-3" and "irqchip@0x0000300022040000-3", as in the following example:
$ sudo cat /sys/kernel/debug/irq/irqs/163
handler: handle_fasteoi_irq
device: 0019:00:00.0
<SNIP>
node: 1
affinity: 72-143
effectiv: 76
domain: irqchip@0x0000100022040000-3
hwirq: 0xc8000000
chip: ITS-MSI
flags: 0x20

$ sudo cat /sys/kernel/debug/irq/irqs/174
handler: handle_fasteoi_irq
device: 0039:00:00.0
<SNIP>
node: 3
affinity: 216-287
effectiv: 221
domain: irqchip@0x0000300022040000-3
hwirq: 0xc8000000
chip: ITS-MSI
flags: 0x20
We would not detect the duplicated hwirq number (0xc8000000) in this case.
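To make the difference between the two grouping keys concrete, here is a minimal sketch of a duplicate check that can group by either "chip" or "domain". It is not the actual irq-check.sh from the patch; the function name find_dup_hwirq is hypothetical. It assumes each debugfs file prints one "field: value" pair per line, and it only looks at the first "hwirq:" and the first occurrence of the chosen key in each file, so nested parent domains are ignored.

```shell
#!/bin/sh
# Hypothetical helper, not the irq-check.sh from the patch.
# Usage: find_dup_hwirq KEY FILE...
#   KEY is the field to group by, e.g. "chip" or "domain".
# Prints "<key value> <hwirq>" once for every pair that occurs in more
# than one file. Assumes one "field: value" pair per line in each file.
find_dup_hwirq() {
    key=$1
    shift
    awk -v key="$key" '
        # Record the (key, hwirq) pair collected from the file just read.
        function flush() {
            if (k != "" && h != "") {
                id = k " " h
                if (++count[id] == 2)
                    print id
            }
            k = ""
            h = ""
        }
        FNR == 1 { flush() }                        # a new file starts
        $1 == (key ":") && k == "" { k = $2 }       # first key value only
        $1 == "hwirq:"  && h == "" { h = $2 }       # first hwirq only
        END { flush() }                             # last file
    ' "$@"
}
```

Run as root, for example, as "find_dup_hwirq chip /sys/kernel/debug/irq/irqs/*". For the two entries above, grouping by chip reports "ITS-MSI 0xc8000000", while grouping by domain reports nothing, because the two entries belong to different domains.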
Again, this is really out of my area, but based on Documentation/core-api/irq/irq-domain.rst, I assumed the point of hwirq was that hwirq numbers were local to an interrupt controller, i.e., to an irq_domain.
If that's the case, it should not be a problem if hwirq number 0xc8000000 is used in two separate irq_domains.
Bjorn