Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2

21 Mar 2018


      On 03/21/2018 03:46 PM, Jerome Glisse wrote:
...
On Wed, Mar 21, 2018 at 03:16:04PM -0700, John Hubbard wrote:
...
On 03/21/2018 11:03 AM, Jerome Glisse wrote:
...
On Tue, Mar 20, 2018 at 09:14:34PM -0700, John Hubbard wrote:
...
On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
...
From: Ralph Campbell rcampbell@nvidia.com
<snip>
...
...
Hi Jerome,
This presents a deadlock problem (details below). As for solution ideas, 
Mark Hairgrove points out that the MMU notifiers had to solve the
same sort of problem, and part of the solution involves "avoid
holding locks when issuing these callbacks". That's not an entire 
solution description, of course, but it seems like a good start.
Anyway, for the deadlock problem:
Each of these ->release callbacks potentially has to wait for the 
hmm_invalidate_range() callbacks to finish. That is not shown in any
code directly, but it's because: when a device driver is processing 
the above ->release callback, it has to allow any in-progress operations 
to finish up (as specified clearly in your comment documentation above).
Some of those operations will invariably need to do things that result 
in page invalidations, thus triggering the hmm_invalidate_range() callback.
Then, the hmm_invalidate_range() callback tries to acquire the same 
hmm->mirrors_sem lock, thus leading to deadlock:
hmm_invalidate_range():
// ...
   down_read(&hmm->mirrors_sem);
   list_for_each_entry(mirror, &hmm->mirrors, list)
   	mirror->ops->sync_cpu_device_pagetables(mirror, action,
   						start, end);
   up_read(&hmm->mirrors_sem);
That is just illegal, the release callback is not allowed to trigger
invalidation all it does is kill all device's threads and stop device
page fault from happening. So there is no deadlock issues. I can re-
inforce the comment some more (see [1] for example on what it should
be).
That rule is fine, and it is true that the .release callback will not 
directly trigger any invalidations. However, the problem is in letting 
any *existing* outstanding operations finish up. We have to let 
existing operations "drain", in order to meet the requirement that 
everything is done when .release returns.
For example, if a device driver thread is in the middle of working through
its fault buffer, it will call migrate_vma(), which will in turn unmap
pages. That will cause an hmm_invalidate_range() callback, which tries
to take hmm->mirrors_sems, and we deadlock.
There's no way to "kill" such a thread while it's in the middle of
migrate_vma(), you have to let it finish up.
...
Also it is illegal for the sync callback to trigger any mmu_notifier
callback. I thought this was obvious. The sync callback should only
update device page table and do _nothing else_. No way to make this
re-entrant.
That is obvious, yes. I am not trying to say there is any problem with
that rule. It's the "drain outstanding operations during .release", 
above, that is the real problem.
Maybe just relax the release callback wording, it should stop any
more processing of fault buffer but not wait for it to finish. In
nouveau code i kill thing but i do not wait hence i don't deadlock.
But you may crash, because that approach allows .release to finish
up, thus removing the mm entirely, out from under (for example)
a migrate_vma call--or any other call that refers to the mm.
It doesn't seem too hard to avoid the problem, though: maybe we
can just drop the lock while doing the mirror->ops->release callback.
There are a few ways to do this, but one example is:
-- take the lock,
        -- copy the list to a local list, deleting entries as you go,
    -- drop the lock, 
    -- iterate through the local list copy and 
        -- issue the mirror->ops->release callbacks.
At this point, more items could have been added to the list, so repeat
the above until the original list is empty.
This is subject to a limited starvation case if mirror keep getting 
registered, but I think we can ignore that, because it only lasts as long as 
mirrors keep getting added, and then it finishes up.
...
What matter is to stop any further processing. Yes some fault might
be in flight but they will serialize on various lock.
Those faults in flight could already be at a point where they have taken
whatever locks they need, so we don't dare let the mm get destroyed while
such fault handling is in progress.
So just do not
...
wait in the release callback, kill thing. I might have a bug where i
still fill in GPU page table in nouveau, i will check nouveau code
for that.
Again, we can't "kill" a thread of execution (this would often be an
interrupt bottom half context, btw) while it is, for example,
in the middle of migrate_vma.
I really don't believe there is a safe way to do this without draining
the existing operations before .release returns, and for that, we'll need to 
issue the .release callbacks while not holding locks.
thanks,
-- 
John Hubbard
NVIDIA

> 
> Kill thing should also kill the channel (i don't do that in nouveau
> because i am waiting on some channel patchset) but i am not sure if
> hardware like it if we kill channel before stoping fault notification.
> 
> Cheers,
> Jérôme
>

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2