On Sat, Jun 29, 2019 at 9:03 AM Matthew Wilcox <willy@infradead.org> wrote:
On Thu, Jun 27, 2019 at 07:39:37PM -0700, Dan Williams wrote:
On Thu, Jun 27, 2019 at 12:59 PM Matthew Wilcox <willy@infradead.org> wrote:
On Thu, Jun 27, 2019 at 12:09:29PM -0700, Dan Williams wrote:
This bug feels like we failed to unlock, or unlocked the wrong entry and this hunk in the bisected commit looks suspect to me. Why do we still need to drop the lock now that the radix_tree_preload() calls are gone?
Never mind, unmap_mapping_pages() takes a sleeping lock. But then I wonder why we don't restart the lookup as the old implementation did.
We have the entry locked:
		/*
		 * Make sure 'entry' remains valid while we drop
		 * the i_pages lock.
		 */
		dax_lock_entry(xas, entry);

		/*
		 * Besides huge zero pages the only other thing that gets
		 * downgraded are empty entries which don't need to be
		 * unmapped.
		 */
		if (dax_is_zero_entry(entry)) {
			xas_unlock_irq(xas);
			unmap_mapping_pages(mapping,
					xas->xa_index & ~PG_PMD_COLOUR,
					PG_PMD_NR, false);
			xas_reset(xas);
			xas_lock_irq(xas);
		}
If something can remove a locked entry, then that would seem like the real bug. Might be worth inserting a lookup there to make sure that it hasn't happened, I suppose?
Nope, added a check, we do in fact get the same locked entry back after dropping the lock.
The deadlock revolves around the mmap_sem. One thread holds it for read and then gets stuck indefinitely in get_unlocked_entry(). Once that happens another rocksdb thread tries to mmap and gets stuck trying to take the mmap_sem for write. All new readers, including ps and top when they try to access a remote vma, then get queued behind that writer.
It could also be the case that we're missing a wake up.
OK, I have a Theory.
get_unlocked_entry() doesn't check the size of the entry being waited for. So dax_iomap_pmd_fault() can end up sleeping waiting for a PTE entry, which is (a) foolish, because we know it's going to fall back, and (b) can lead to a missed wakeup because it's going to sleep waiting for the PMD entry to come unlocked. Which it won't, unless there's a happy accident that happens to map to the same hash bucket.
Let's see if I can steal some time this weekend to whip up a patch.
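The missed-wakeup theory above can be modeled in userspace. The sketch below is not the fs/dax.c code: wait_bucket() and wakeup_reaches_waiter() are hypothetical names, and the page geometry (4 KiB pages, 2 MiB PMDs), the 12-bit table size, and the multiplicative hash are all assumptions chosen for illustration. It only shows that a sleeper who keys its wait on the PMD-aligned index and a waker who keys on the PTE index generally pick different buckets. It does not model any key comparison done inside a bucket.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 4 KiB pages and 2 MiB PMDs, so 512 pages per PMD. */
#define PAGE_SHIFT		12
#define PMD_SHIFT		21
#define PG_PMD_COLOUR		((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1)

/* Assumed table size and multiplicative hash; stand-ins, not the kernel's. */
#define WAIT_TABLE_BITS		12
#define GOLDEN_RATIO_64		0x61C8864680B583EBull

/*
 * Hypothetical model of picking a wait-queue bucket for an entry.
 * A caller that treats the entry as a PMD rounds the index down to the
 * PMD boundary before hashing; a caller that treats it as a PTE does not.
 */
static unsigned int wait_bucket(uint64_t mapping_id, uint64_t index,
				int treat_as_pmd)
{
	if (treat_as_pmd)
		index &= ~(uint64_t)PG_PMD_COLOUR;
	return (unsigned int)(((mapping_id ^ index) * GOLDEN_RATIO_64)
			      >> (64 - WAIT_TABLE_BITS));
}

/*
 * Returns 1 if a wakeup keyed on the PTE index lands in the same bucket
 * as a sleeper keyed on the PMD-aligned index, 0 otherwise.
 */
static int wakeup_reaches_waiter(uint64_t mapping_id, uint64_t index)
{
	unsigned int waiter = wait_bucket(mapping_id, index, 1); /* PMD fault */
	unsigned int waker = wait_bucket(mapping_id, index, 0);  /* PTE unlock */

	printf("index %llu: waiter bucket %u, waker bucket %u\n",
	       (unsigned long long)index, waiter, waker);
	return waiter == waker;
}
```

Under these assumptions, wakeup_reaches_waiter(0, 5) returns 0 because the sleeper hashed index 0 while the waker hashed index 5. An index that is already PMD-aligned, such as 0, returns 1 because both sides hash the same value.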
Theory seems to have some evidence... I instrumented fs/dax.c to track outstanding 'lock' entries and 'wait' events. At the time of the hang we see no locks held and the waiter is waiting on a pmd entry:
[ 4001.354334] fs/dax locked entries: 0
[ 4001.358425] fs/dax wait entries: 1
[ 4001.362227] db_bench/2445 index: 0x0 shift: 6
[ 4001.367099]  grab_mapping_entry+0x17a/0x260
[ 4001.371773]  dax_iomap_pmd_fault.isra.43+0x168/0x7a0
[ 4001.377316]  ext4_dax_huge_fault+0x16f/0x1f0
[ 4001.382086]  __handle_mm_fault+0x411/0x1390
[ 4001.386756]  handle_mm_fault+0x172/0x360