On Thu, Nov 29, 2018 at 9:07 AM Logan Gunthorpe logang@deltatee.com wrote:
On 2018-11-28 8:10 p.m., Dan Williams wrote:
Yes, please send a proper patch.
Ok, I'll send one shortly.
Although, I'm still not sure I see the problem with the order of the percpu-ref kill. It's likely more efficient to put the kill after the put_page() loop, because the percpu-ref will still be in "fast" per-cpu mode during the loop, but the kernel panic should not be possible as long as there is a wait_for_completion() before the exit, unless something else is wrong.
The series of events looks something like this:
1) Some p2pdma user calls pci_alloc_p2pmem() to get some memory to DMA to, taking a reference to the pgmap.
2) Another process unbinds the underlying p2pdma driver and the devm chain starts to unwind.
3) devm_memremap_pages_release() is called; it kills the reference and drops its last reference.
4) arch_remove_memory() is called, which removes all the struct pages.
5) We eventually get to pci_p2pdma_release(), where we wait for the completion indicating all the pages have been freed.
6) The user in (1) tries to use the page that has been removed, typically by calling pci_p2pdma_map_sg(), but the page doesn't exist, so the kernel panics.
So we really need the wait in (5) to occur before (4) but after (3), so that the pages continue to exist until the last reference is dropped.
Oh! Yes, nice find. We need to wait for the percpu-ref to be dead and all outstanding references dropped before we can proceed to arch_remove_memory(), and I think this problem has been there since day one, because the final exit was always after the devm_memremap_pages() release, which means arch_remove_memory() was always racing any final put_page(). I'll take a look; it seems the arch_remove_memory() call needs to be moved out of line to its own context so it can wait for the final exit of the percpu-ref.
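The ordering problem can be sketched with a tiny sequential model in userspace C. All names here are hypothetical stand-ins, not the kernel code: `refs` plays the percpu-ref, `pages_valid` plays the existence of the struct pages, and the concurrency is collapsed into a fixed event order so the effect of each ordering is deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Sequential model of the event list above (hypothetical names,
 * not kernel code). */
static int refs;          /* stands in for the percpu-ref count */
static bool pages_valid;  /* stands in for the struct pages existing */

/* Returns true if the user's access in step (6) saw valid pages. */
static bool demo(bool wait_before_remove)
{
    pages_valid = true;
    refs = 1;                      /* setup reference */
    refs++;                        /* (1) pci_alloc_p2pmem() user takes a ref */
    refs--;                        /* (2)-(3) unbind; release drops the setup ref */

    if (wait_before_remove) {
        /* (5) moved before (4): wait for the completion; in this
         * sequential model the user's access and final put happen
         * during the wait */
        bool ok = pages_valid;     /* user touches the page: still there */
        refs--;                    /* final put_page() fires the completion */
        pages_valid = false;       /* (4) arch_remove_memory() runs last */
        return ok;
    }

    pages_valid = false;           /* (4) pages torn down too early */
    bool ok = pages_valid;         /* (6) pci_p2pdma_map_sg(): page gone, "panic" */
    refs--;
    return ok;
}
```

With the buggy order, `demo(false)` returns false (the user hit a removed page, i.e. the panic case); with the wait moved before the removal, `demo(true)` returns true.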
Certainly you can't move the wait_for_completion() into your ->kill() callback without switching the ordering, but I'm not on board with that change until I understand a bit more about why you think device-dax might be broken.
I took a look at the p2pdma shutdown path and the:
    if (percpu_ref_is_dying(ref))
        return;
...looks fishy. If multiple agents can overlap their requests for the same range why not track that simply as additional refs? Could it be the crash that you are seeing is a result of mis-accounting when it is safe to assume the page allocation can be freed?
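That suggestion can be sketched as a userspace model with C11 atomics. The names are hypothetical, and the real percpu_ref closes the get-vs-kill race that this toy version ignores; the point is only that each overlapping request takes its own reference via a tryget that fails once the ref is dying, so the release path never needs a percpu_ref_is_dying() guard.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of percpu_ref_tryget_live()-style accounting
 * (illustrative names, not the real p2pdma API). */
static atomic_int live_refs = 1;  /* one "setup" ref, plus one per user */
static atomic_bool dying;

static bool mapping_get(void)
{
    /* Like percpu_ref_tryget_live(): refuse new users once dying.
     * (The real implementation makes this check-and-get atomic.) */
    if (atomic_load(&dying))
        return false;
    atomic_fetch_add(&live_refs, 1);
    return true;
}

static void mapping_put(void)
{
    atomic_fetch_sub(&live_refs, 1);
}

static void range_kill(void)
{
    atomic_store(&dying, true);        /* like percpu_ref_kill() */
    atomic_fetch_sub(&live_refs, 1);   /* drop the setup reference */
}
```

A caller that raced the kill simply fails its get instead of the release path having to detect a second kill; once `live_refs` reaches zero the range can be torn down.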
Yeah, someone else mentioned the same thing during review, but if I remove it, there can be a double kill() on a hypothetical driver that might call pci_p2pdma_add_resource() twice. The issue is we only have one percpu_ref per device, not one per range/BAR.
Though, now that I look at it, the current change in question will be wrong if there are two devm_memremap_pages_release()s to call. Both need to drop their references before we can wait_for_completion() ;(. I guess I need multiple percpu_refs or more complex changes to devm_memremap_pages_release().
Can you just have a normal device-level kref for this case? On final device-level kref_put then kill the percpu_ref? I guess the problem is devm semantics where p2pdma only gets one callback on a driver ->remove() event. I'm not sure how to support multiple references of the same pages without creating a non-devm version of devm_memremap_pages(). I'm not opposed to that, but afaiu I don't think p2pdma is compatible with devm as long as it supports N>1:1 mappings of the same range.
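The device-level kref idea can be sketched like this (a userspace model with an atomic counter standing in for the kref; all names are illustrative, not the kernel API): every pci_p2pdma_add_resource() call takes one device-level reference, each per-range release drops one, and only the final put kills the single shared percpu-ref.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a device-level kref gating the one
 * percpu-ref shared by all ranges/BARs of a p2pdma device. */
struct p2pdma_dev {
    atomic_int kref;         /* one count per registered range/BAR */
    bool percpu_ref_killed;  /* stands in for percpu_ref_kill() having run */
};

static void add_resource(struct p2pdma_dev *d)
{
    atomic_fetch_add(&d->kref, 1);      /* kref_get() per registered range */
}

static void range_release(struct p2pdma_dev *d)
{
    /* Final kref_put is the only place the percpu-ref is killed,
     * so two releases can never double-kill it. */
    if (atomic_fetch_sub(&d->kref, 1) == 1)
        d->percpu_ref_killed = true;
}
```

With two registered BARs, the first range_release() leaves the percpu-ref alive and only the second one kills it, which is exactly the N:1 accounting the devm callback alone can't express.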