On Thu, Nov 29, 2018 at 9:07 AM Logan Gunthorpe logang@deltatee.com wrote:
On 2018-11-28 8:10 p.m., Dan Williams wrote:
Yes, please send a proper patch.
Ok, I'll send one shortly.
Although, I'm still not sure I see the problem with the order of the percpu-ref kill. It's likely more efficient to put the kill after the put_page() loop, because the percpu-ref will still be in "fast" per-cpu mode during the loop, but the kernel panic should not be possible as long as there is a wait_for_completion() before the exit, unless something else is wrong.
The series of events looks something like this:
1) Some p2pdma user calls pci_alloc_p2pmem() to get some memory to DMA to, taking a reference to the pgmap.
2) Another process unbinds the underlying p2pdma driver and the devm chain starts to unwind.
3) devm_memremap_pages_release() is called; it kills the reference and drops its last reference.
4) arch_remove_memory() is called, which removes all the struct pages.
5) We eventually get to pci_p2pdma_release(), where we wait for the completion indicating all the pages have been freed.
6) The user in (1) tries to use the page that has been removed, typically by calling pci_p2pdma_map_sg(), but the page doesn't exist, so the kernel panics.
So we really need the wait in (5) to occur before (4) but after (3), so that the pages continue to exist until the last reference is dropped.
Oh! Yes, nice find. We need to wait for the percpu-ref to be dead and all outstanding references dropped before we can proceed to arch_remove_memory(), and I think this problem has been there since day one, because the final exit was always after the devm_memremap_pages() release, which means arch_remove_memory() was always racing any final put_page(). I'll take a look; it seems the arch_remove_memory() call needs to be moved out of line to its own context so it can wait for the final exit of the percpu-ref.
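The ordering problem can be sketched with a tiny sequential model in userspace C. All names here are hypothetical stand-ins, not the kernel code: `refs` plays the percpu-ref, `pages_valid` plays the existence of the struct pages, and the concurrency is collapsed into a fixed event order so the effect of each ordering is deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Sequential model of the event list above (hypothetical names,
 * not kernel code). */
static int refs;          /* stands in for the percpu-ref count */
static bool pages_valid;  /* stands in for the struct pages existing */

/* Returns true if the user's access in step (6) saw valid pages. */
static bool demo(bool wait_before_remove)
{
    pages_valid = true;
    refs = 1;                      /* setup reference */
    refs++;                        /* (1) pci_alloc_p2pmem() user takes a ref */
    refs--;                        /* (2)-(3) unbind; release drops the setup ref */

    if (wait_before_remove) {
        /* (5) moved before (4): wait for the completion; in this
         * sequential model the user's access and final put happen
         * during the wait */
        bool ok = pages_valid;     /* user touches the page: still there */
        refs--;                    /* final put_page() fires the completion */
        pages_valid = false;       /* (4) arch_remove_memory() runs last */
        return ok;
    }

    pages_valid = false;           /* (4) pages torn down too early */
    bool ok = pages_valid;         /* (6) pci_p2pdma_map_sg(): page gone, "panic" */
    refs--;
    return ok;
}
```

With the buggy order, `demo(false)` returns false (the user hit a removed page, i.e. the panic case); with the wait moved before the removal, `demo(true)` returns true.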
Certainly you can't move the wait_for_completion() into your ->kill() callback without switching the ordering, but I'm not on board with that change until I understand a bit more about why you think device-dax might be broken.
I took a look at the p2pdma shutdown path and the:
    if (percpu_ref_is_dying(ref))
        return;
...looks fishy. If multiple agents can overlap their requests for the same range why not track that simply as additional refs? Could it be the crash that you are seeing is a result of mis-accounting when it is safe to assume the page allocation can be freed?
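That suggestion can be sketched as a userspace model with C11 atomics. The names are hypothetical, and the real percpu_ref closes the get-vs-kill race that this toy version ignores; the point is only that each overlapping request takes its own reference via a tryget that fails once the ref is dying, so the release path never needs a percpu_ref_is_dying() guard.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of percpu_ref_tryget_live()-style accounting
 * (illustrative names, not the real p2pdma API). */
static atomic_int live_refs = 1;  /* one "setup" ref, plus one per user */
static atomic_bool dying;

static bool mapping_get(void)
{
    /* Like percpu_ref_tryget_live(): refuse new users once dying.
     * (The real implementation makes this check-and-get atomic.) */
    if (atomic_load(&dying))
        return false;
    atomic_fetch_add(&live_refs, 1);
    return true;
}

static void mapping_put(void)
{
    atomic_fetch_sub(&live_refs, 1);
}

static void range_kill(void)
{
    atomic_store(&dying, true);        /* like percpu_ref_kill() */
    atomic_fetch_sub(&live_refs, 1);   /* drop the setup reference */
}
```

A caller that raced the kill simply fails its get instead of the release path having to detect a second kill; once `live_refs` reaches zero the range can be torn down.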
Yeah, someone else mentioned the same thing during review, but if I remove it, there can be a double kill() on a hypothetical driver that might call pci_p2pdma_add_resource() twice. The issue is we only have one percpu_ref per device, not one per range/BAR.
Though, now that I look at it, the current change in question will be wrong if there are two devm_memremap_pages_release()s to call. Both need to drop their references before we can wait_for_completion() ;(. I guess I need multiple percpu_refs or more complex changes to devm_memremap_pages_release().
Can you just have a normal device-level kref for this case? On final device-level kref_put then kill the percpu_ref? I guess the problem is devm semantics where p2pdma only gets one callback on a driver ->remove() event. I'm not sure how to support multiple references of the same pages without creating a non-devm version of devm_memremap_pages(). I'm not opposed to that, but afaiu I don't think p2pdma is compatible with devm as long as it supports N>1:1 mappings of the same range.
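The device-level kref idea can be sketched like this (a userspace model with an atomic counter standing in for the kref; all names are illustrative, not the kernel API): every pci_p2pdma_add_resource() call takes one device-level reference, each per-range release drops one, and only the final put kills the single shared percpu-ref.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a device-level kref gating the one
 * percpu-ref shared by all ranges/BARs of a p2pdma device. */
struct p2pdma_dev {
    atomic_int kref;         /* one count per registered range/BAR */
    bool percpu_ref_killed;  /* stands in for percpu_ref_kill() having run */
};

static void add_resource(struct p2pdma_dev *d)
{
    atomic_fetch_add(&d->kref, 1);      /* kref_get() per registered range */
}

static void range_release(struct p2pdma_dev *d)
{
    /* Final kref_put is the only place the percpu-ref is killed,
     * so two releases can never double-kill it. */
    if (atomic_fetch_sub(&d->kref, 1) == 1)
        d->percpu_ref_killed = true;
}
```

With two registered BARs, the first range_release() leaves the percpu-ref alive and only the second one kills it, which is exactly the N:1 accounting the devm callback alone can't express.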