On 2018-11-28 8:10 p.m., Dan Williams wrote:
Yes, please send a proper patch.
Ok, I'll send one shortly.
Although, I'm still not sure I see the problem with the order of the percpu-ref kill. It's likely more efficient to put the kill after the put_page() loop because the percpu-ref will still be in "fast" per-cpu mode, but the kernel panic should not be possible as long as there is a wait_for_completion() before the exit, unless something else is wrong.
The series of events looks something like this:
1) Some p2pdma user calls pci_alloc_p2pmem() to get some memory to DMA to, taking a reference to the pgmap.
2) Another process unbinds the underlying p2pdma driver and the devm chain starts to unwind.
3) devm_memremap_pages_release() is called; it kills the reference and drops its last reference.
4) arch_remove_memory() is called, which removes all the struct pages.
5) We eventually get to pci_p2pdma_release(), where we wait for the completion indicating all the pages have been freed.
6) The user in (1) tries to use the page that has been removed, typically by calling pci_p2pdma_map_sg(), but the page doesn't exist so the kernel panics.
So we really need the wait in (5) to occur before (4) but after (3) so that the pages continue to exist until the last reference is dropped.
Certainly you can't move the wait_for_completion() into your ->kill() callback without switching the ordering, but I'm not on board with that change until I understand a bit more about why you think device-dax might be broken.
I took a look at the p2pdma shutdown path and the:
	if (percpu_ref_is_dying(ref))
		return;
...looks fishy. If multiple agents can overlap their requests for the same range why not track that simply as additional refs? Could it be the crash that you are seeing is a result of mis-accounting when it is safe to assume the page allocation can be freed?
Yeah, someone else mentioned the same thing during review, but if I remove it, a hypothetical driver that calls pci_p2pdma_add_resource() twice could trigger a double kill(). The issue is we only have one percpu_ref per device, not one per range/BAR.
Though, now that I look at it, the current change in question will be wrong if there are two devm_memremap_pages_release() calls: both need to drop their references before we can wait_for_completion() ;(. I guess I need multiple percpu_refs or more complex changes to devm_memremap_pages_release().
Thanks
Logan