On Fri, Jun 19, 2020 at 03:30:32PM -0400, Felix Kuehling wrote:
On 2020-06-19 at 3:11 p.m., Alex Deucher wrote:
On Fri, Jun 19, 2020 at 2:09 PM Jerome Glisse jglisse@redhat.com wrote:
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes, which in turn has to wait for device A to finish first.
So, it sounds like, fundamentally, you've got this graph of operations across an unknown set of drivers, and the kernel cannot insert itself into dma_fence hand-offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is the right answer for this kind of workflow. I think converting pinning to notifiers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases.
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence issue only applies to userptr buffers, which predate any HMM work and thus were using mmu notifiers already. You need the mmu notifier there because of fork and other corner cases.
For nouveau the notifier does not need to wait for anything; it can update the GPU page table right away. Modulo needing to write to GPU memory using the DMA engine if the GPU page table is in GPU memory that is not accessible from the CPU, but that has never been the case for nouveau so far (though I expect it will be at some point).
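To make that concrete, here is a rough sketch (not nouveau's actual code; the my_* names and structures are made up) of a synchronous invalidate callback on the mmu_interval_notifier API: bump the sequence number and tear down the GPU mapping directly, without waiting on any fence.

#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

/* Hypothetical per-range state and unmap helper, for illustration only. */
struct my_gpu;
void my_gpu_pt_unmap(struct my_gpu *gpu, unsigned long start, unsigned long end);

struct my_svm_range {
	struct mmu_interval_notifier notifier;
	struct mutex mutex;
	struct my_gpu *gpu;
};

static bool my_svm_range_invalidate(struct mmu_interval_notifier *mni,
				    const struct mmu_notifier_range *range,
				    unsigned long cur_seq)
{
	struct my_svm_range *r =
		container_of(mni, struct my_svm_range, notifier);

	if (mmu_notifier_range_blockable(range))
		mutex_lock(&r->mutex);
	else if (!mutex_trylock(&r->mutex))
		return false;

	mmu_interval_set_seq(mni, cur_seq);

	/* Tear down the GPU PTEs for this range right away, no fence wait. */
	my_gpu_pt_unmap(r->gpu, range->start, range->end);

	mutex_unlock(&r->mutex);
	return true;
}

static const struct mmu_interval_notifier_ops my_svm_notifier_ops = {
	.invalidate = my_svm_range_invalidate,
};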
So I see this as two different cases: the userptr case, which does pin pages by the way, where things are synchronous, versus the HMM case, where everything is asynchronous.
I probably need to warn AMD folks again that using HMM means that you must be able to update the GPU page table asynchronously without a fence wait. The issue for AMD is that they already update their GPU page table using a DMA engine. I believe this is still doable if they use a kernel-only DMA engine context, where only the kernel can queue up jobs, so that you do not need to wait for unrelated things and you can prioritize GPU page table updates; that should translate into fast GPU page table updates without DMA fence waits.
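As a rough illustration of that kernel-only context idea (the my_* names are hypothetical, and this is not a claim about amdgpu's actual code), a driver built on the DRM GPU scheduler could reserve an entity at kernel priority that only the driver itself submits to:

#include <drm/gpu_scheduler.h>

/* Hypothetical device structure, for illustration only. */
struct my_device {
	struct drm_gpu_scheduler sdma_sched;	/* paging SDMA ring */
	struct drm_sched_entity pt_entity;	/* kernel-only entity */
};

static int my_init_pt_entity(struct my_device *adev)
{
	struct drm_gpu_scheduler *sched = &adev->sdma_sched;

	/*
	 * Only the kernel submits page table updates to this entity, so
	 * those jobs never pick up dependencies on user work, and kernel
	 * priority lets them run ahead of other entities on the ring.
	 */
	return drm_sched_entity_init(&adev->pt_entity,
				     DRM_SCHED_PRIORITY_KERNEL,
				     &sched, 1, NULL);
}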
All devices which support recoverable page faults also have a dedicated paging engine for the kernel driver which the driver already makes use of. We can also update the GPU page tables with the CPU.
We have a potential problem with the CPU updating page tables while the GPU is retrying on page table entries, because 64-bit CPU transactions don't arrive in device memory atomically.
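One way around that kind of tearing, sketched here purely for illustration (the PTE layout and my_write_pte helper are made up, and this is not necessarily how amdgpu handles it), is to make sure a half-written entry can never look valid: split the 64-bit write explicitly and order the word that carries the valid bit last, so a retrying walker only ever sees either the old entry, an invalid entry, or the complete new entry.

#include <linux/bits.h>
#include <linux/io.h>

#define MY_PTE_VALID	BIT_ULL(0)	/* hypothetical valid bit, in the low word */

static void my_write_pte(void __iomem *pte, u64 value)
{
	/* Make the entry invalid first; a retrying walker just faults again. */
	writel(0, pte);
	wmb();

	/* The high word carries no valid bit, so it can go next. */
	writel(upper_32_bits(value), pte + 4);
	wmb();

	/* The low word, including the valid bit, lands last. */
	writel(lower_32_bits(value), pte);
}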
We are using SDMA for page table updates. This currently goes through the DRM GPU scheduler to a special SDMA queue that's used by kernel mode only. But since it's based on the DRM GPU scheduler, we do use dma-fence to wait for completion.
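Roughly, that flow looks like the sketch below, where my_sdma_submit_pt_update is a hypothetical stand-in for the real job submission path on the kernel-only SDMA queue:

#include <linux/dma-fence.h>
#include <linux/err.h>

struct my_device;

/* Hypothetical submission helper on the kernel-only SDMA entity. */
struct dma_fence *my_sdma_submit_pt_update(struct my_device *adev,
					    u64 start, u64 end, u64 flags);

static int my_update_pt_and_wait(struct my_device *adev,
				 u64 start, u64 end, u64 flags)
{
	struct dma_fence *fence;
	long r;

	fence = my_sdma_submit_pt_update(adev, start, end, flags);
	if (IS_ERR(fence))
		return PTR_ERR(fence);

	/*
	 * Safe only because nothing else queues on the kernel SDMA entity,
	 * so this fence cannot depend on foreign fences.
	 */
	r = dma_fence_wait(fence, false);
	dma_fence_put(fence);

	return r;
}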
Yeah, my worry is mostly that some cross dma fence dependency leaks into it, but that should never really happen; maybe there is a way to catch it if it does and print a warning.
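For the warning part, one option is the dma-fence signalling annotations, dma_fence_begin_signalling()/dma_fence_end_signalling(): mark the code that must signal the page table update fence as a signalling critical section, and lockdep will complain if anything inside it can end up waiting on an unrelated fence. A rough sketch, with my_* names made up:

#include <linux/dma-fence.h>

/* Hypothetical job structure and SDMA emit helper, for illustration only. */
struct my_pt_job {
	struct dma_fence *fence;
};
void my_emit_pt_update(struct my_pt_job *job);

static void my_pt_update_work(struct my_pt_job *job)
{
	/*
	 * Everything between begin/end must be able to finish without
	 * waiting on unrelated fences; lockdep warns if it cannot.
	 */
	bool cookie = dma_fence_begin_signalling();

	my_emit_pt_update(job);
	dma_fence_signal(job->fence);

	dma_fence_end_signalling(cookie);
}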
So yes, you can use dma fences, as long as they do not have cross-dependencies. Another expectation is that they complete quickly, and usually page table updates do.
Cheers, Jérôme