On Tue, 13 Nov 2018 00:41:24 +0000, Gustavo Pimentel gustavo.pimentel@synopsys.com wrote:
On 09/11/2018 11:34, Marc Zyngier wrote:
On 08/11/18 19:49, Trent Piepho wrote:
On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote:
On 07/11/18 20:17, Trent Piepho wrote:
On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote:
On 06/11/18 19:40, Trent Piepho wrote: > > What about stable kernels that don't have the hierarchical API?
My goal is to fix mainline first. Once we have something that works on mainline, we can look at propagating the fix to other versions. But mainline always comes first.
This is a regression that went into 4.14. Wouldn't the appropriate action for the stable series be to undo the regression?
This is not how stable works. Stable kernels *only* contain patches that are backported from mainline, and do not take standalone patch.
Furthermore, your fix is to actually undo someone else's fix. Who is right? In the absence of any documentation, the answer is "nobody".
Little more history to this bug. The code was originally the way it is now, but this same bug was fixed in 2013 in https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.or&... g/patch/3333681/
Then that lasted four years until it was changed Aug 2017 in https://urldefense.proofpoint.com/v2/url?u=https-3A__pa&d=DwICaQ&c=D... tchwork.kernel.org/patch/9893303/
That lasted just six months until someone tried to revert it, https://urldefense.proofpoint.com/v2/url?u=https-3A__p&d=DwICaQ&c=DP... atchwork.kernel.org/patch/9893303/
Seems pretty clear the way it is now is much worse than the way it was before, even if the previous design may have had another flaw. Though I've yet to see anyone point out something makes the previous design broken. Sub-optimal yes, but not broken.
This is not what was reported by the previous submitter. I guess they changed this for a reason, no? I'm prepared to admit this is a end-point driver bug, but we need to understand what is happening and stop changing this driver randomly.
Anything can be backported to stable once we understand the issue. At the moment, we're just playing games moving stuff around and hope nothing else will break. That's not a sustainable way of maintaining this driver. At the moment, the only patch I'm inclined to propose until we get an actual interrupt handling flow from Synopsys is to mark this driver as "BROKEN".
It feels like you're using this bug to hold designware hostage in a broken kernel, and me along with them. I don't have the documentation, no one does, there's no way for me to give you want you want. But I've got hardware that doesn't work in the mainline kernel.
I take it as a personal offence that I'd be holding anything or anyone hostage. I think I have a long enough track record working with the Linux kernel not to take any of this nonsense. What's my interest in keeping anything in this sorry state? Think about it for a minute.
When I'm asking for documentation, I'm not asking you. I perfectly understood that you don't have any. We need Synopsys to step up and give us a simple description of what the MSI workflow is. Because no matter how you randomly mutate this code, it will still be broken until we get a clear answer from *Synopsys*.
Gustavo, here's one simple ask. Please reply to this email with a step by step, register by register description of how an MSI must be handled on this HW. We do need it, and we need it now.
Hi Marc, I thought that I did respond to your question on Nov 8. However I will repeat it and add some extra information to it now.
About the registers:
PCIE_MSI_INTR0_MASK When an MSI is received for a masked interrupt, the corresponding status bit gets set in the interrupt status register but the controller will not signal it. As soon as the masked interrupt is unmasked and assuming the status bit is still set the controller will signal it.
So this is *really* a mask.
PCIE_MSI_INTR0_ENABLE Enables/disables every interrupt. When an MSI is received from a disabled interrupt, no status bit gets set in MSI controller interrupt status register.
And that's not a mask at all. This allows the interrupt to be simply lost. It is just a bit worrying that this is exactly what the DWC driver is using, and faces all kind of interrupt loss when enabling/disabling interrupts (which do have a mask/unmask semantic -- yes, the language is confusing).
About this new callbacks:
The dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks were added on Linux kernel v4.17 on PCI: dwc: Move MSI IRQs allocation to IRQ domains hierarchical API patch [1] and based on what I've seen so far, before this patch there were no interrupt masking been performed.
Based on the information provided, its very likely that I should use the PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE on the dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As soon as I return from Linux plumbers conference, I will test the system using the PCIE_MSI_INTR0_MASK register.
I'm at Plumbers as well, so feel free to grab me and we'll sort this out. We've wasted enough time on this broken driver.
Thanks,
M.