Ever since the conversion of DAX to the Xarray a RocksDB benchmark has been encountering intermittent lockups. The backtraces always include the filesystem-DAX PMD path, multi-order entries have been a source of bugs in the past, and disabling the PMD path allows a test that fails in minutes to run for an hour.
The regression has been bisected to "dax: Convert page fault handlers to XArray", but little progress has been made on the root cause debug. Unless / until root cause can be identified mark CONFIG_FS_DAX_PMD broken to preclude intermittent lockups. Reverting the Xarray conversion also works, but that change is too big to backport. The implementation is committed to Xarray at this point.
Link: https://lore.kernel.org/linux-fsdevel/CAPcyv4hwHpX-MkUEqxwdTj7wCCZCN4RV-L4js... Fixes: b15cd800682f ("dax: Convert page fault handlers to XArray") Cc: Matthew Wilcox willy@infradead.org Cc: Jan Kara jack@suse.cz Cc: stable@vger.kernel.org Reported-by: Robert Barror robert.barror@intel.com Reported-by: Seema Pandit seema.pandit@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com --- fs/Kconfig | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/Kconfig b/fs/Kconfig index f1046cf6ad85..85eecd0d4c5d 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -66,6 +66,9 @@ config FS_DAX_PMD depends on FS_DAX depends on ZONE_DEVICE depends on TRANSPARENT_HUGEPAGE + # intermittent lockups since commit b15cd800682f "dax: Convert + # page fault handlers to XArray" + depends on BROKEN
# Selected by DAX drivers that do not expect filesystem DAX to support # get_user_pages() of DAX mappings. I.e. "limited" indicates no support
On Wed, Jun 26, 2019 at 05:15:45PM -0700, Dan Williams wrote:
Ever since the conversion of DAX to the Xarray a RocksDB benchmark has been encountering intermittent lockups. The backtraces always include the filesystem-DAX PMD path, multi-order entries have been a source of bugs in the past, and disabling the PMD path allows a test that fails in minutes to run for an hour.
On May 4th, I asked you:
Since this is provoked by a fatal signal, it must have something to do with a killable or interruptible sleep. There's only one of those in the DAX code; fatal_signal_pending() in dax_iomap_actor(). Does rocksdb do I/O with write() or through a writable mmap()? I'd like to know before I chase too far down this fault tree analysis.
On Thu, Jun 27, 2019 at 5:34 AM Matthew Wilcox willy@infradead.org wrote:
On Wed, Jun 26, 2019 at 05:15:45PM -0700, Dan Williams wrote:
Ever since the conversion of DAX to the Xarray a RocksDB benchmark has been encountering intermittent lockups. The backtraces always include the filesystem-DAX PMD path, multi-order entries have been a source of bugs in the past, and disabling the PMD path allows a test that fails in minutes to run for an hour.
On May 4th, I asked you:
Since this is provoked by a fatal signal, it must have something to do with a killable or interruptible sleep. There's only one of those in the DAX code; fatal_signal_pending() in dax_iomap_actor(). Does rocksdb do I/O with write() or through a writable mmap()? I'd like to know before I chase too far down this fault tree analysis.
RocksDB in this case is using write() for writes and mmap() for reads.
On Thu, Jun 27, 2019 at 9:06 AM Dan Williams dan.j.williams@intel.com wrote:
On Thu, Jun 27, 2019 at 5:34 AM Matthew Wilcox willy@infradead.org wrote:
On Wed, Jun 26, 2019 at 05:15:45PM -0700, Dan Williams wrote:
Ever since the conversion of DAX to the Xarray a RocksDB benchmark has been encountering intermittent lockups. The backtraces always include the filesystem-DAX PMD path, multi-order entries have been a source of bugs in the past, and disabling the PMD path allows a test that fails in minutes to run for an hour.
On May 4th, I asked you:
Since this is provoked by a fatal signal, it must have something to do with a killable or interruptible sleep. There's only one of those in the DAX code; fatal_signal_pending() in dax_iomap_actor(). Does rocksdb do I/O with write() or through a writable mmap()? I'd like to know before I chase too far down this fault tree analysis.
RocksDB in this case is using write() for writes and mmap() for reads.
It's not clear to me that a fatal signal is a component of the failure as much as it's the way to detect that the benchmark has indeed locked up.
On Thu, Jun 27, 2019 at 11:29 AM Dan Williams dan.j.williams@intel.com wrote:
On Thu, Jun 27, 2019 at 9:06 AM Dan Williams dan.j.williams@intel.com wrote:
On Thu, Jun 27, 2019 at 5:34 AM Matthew Wilcox willy@infradead.org wrote:
On Wed, Jun 26, 2019 at 05:15:45PM -0700, Dan Williams wrote:
Ever since the conversion of DAX to the Xarray a RocksDB benchmark has been encountering intermittent lockups. The backtraces always include the filesystem-DAX PMD path, multi-order entries have been a source of bugs in the past, and disabling the PMD path allows a test that fails in minutes to run for an hour.
On May 4th, I asked you:
Since this is provoked by a fatal signal, it must have something to do with a killable or interruptible sleep. There's only one of those in the DAX code; fatal_signal_pending() in dax_iomap_actor(). Does rocksdb do I/O with write() or through a writable mmap()? I'd like to know before I chase too far down this fault tree analysis.
RocksDB in this case is using write() for writes and mmap() for reads.
It's not clear to me that a fatal signal is a component of the failure as much as it's the way to detect that the benchmark has indeed locked up.
Even though db_bench is run with the mmap_read=1 option:
cmd="${rocksdb_dir}/db_bench $params_r --benchmarks=readwhilewriting \ --use_existing_db=1 \ --mmap_read=1 \ --num=$num_keys \ --threads=$num_read_threads \
When the lockup occurs there are db_bench processes in the write fault path:
[ 1666.635212] db_bench D 0 2492 2435 0x00000000 [ 1666.641339] Call Trace: [ 1666.644072] ? __schedule+0x24f/0x680 [ 1666.648162] ? __switch_to_asm+0x34/0x70 [ 1666.652545] schedule+0x29/0x90 [ 1666.656054] get_unlocked_entry+0xcd/0x120 [ 1666.660629] ? dax_iomap_actor+0x270/0x270 [ 1666.665206] grab_mapping_entry+0x14f/0x230 [ 1666.669878] dax_iomap_pmd_fault.isra.42+0x14d/0x950 [ 1666.675425] ? futex_wait+0x122/0x230 [ 1666.679518] ext4_dax_huge_fault+0x16f/0x1f0 [ 1666.684288] __handle_mm_fault+0x411/0x1350 [ 1666.688961] ? do_futex+0xca/0xbb0 [ 1666.692760] ? __switch_to_asm+0x34/0x70 [ 1666.697144] handle_mm_fault+0xbe/0x1e0 [ 1666.701429] __do_page_fault+0x249/0x4f0 [ 1666.705811] do_page_fault+0x32/0x110 [ 1666.709903] ? page_fault+0x8/0x30 [ 1666.713702] page_fault+0x1e/0x30
...where __handle_mm_fault+0x411 is in wp_huge_pmd():
(gdb) li *(__handle_mm_fault+0x411) 0xffffffff812713d1 is in __handle_mm_fault (mm/memory.c:3800). 3795 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd) 3796 { 3797 if (vma_is_anonymous(vmf->vma)) 3798 return do_huge_pmd_wp_page(vmf, orig_pmd); 3799 if (vmf->vma->vm_ops->huge_fault) 3800 return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); 3801 3802 /* COW handled on pte level: split pmd */ 3803 VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma); 3804 __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
This bug feels like we failed to unlock, or unlocked the wrong entry and this hunk in the bisected commit looks suspect to me. Why do we still need to drop the lock now that the radix_tree_preload() calls are gone?
/* * Besides huge zero pages the only other thing that gets * downgraded are empty entries which don't need to be * unmapped. */ - if (pmd_downgrade && dax_is_zero_entry(entry)) - unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR, - PG_PMD_NR, false); - - err = radix_tree_preload( - mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM); - if (err) { - if (pmd_downgrade) - put_locked_mapping_entry(mapping, index); - return ERR_PTR(err); - } - xa_lock_irq(&mapping->i_pages); - - if (!entry) { - /* - * We needed to drop the i_pages lock while calling - * radix_tree_preload() and we didn't have an entry to - * lock. See if another thread inserted an entry at - * our index during this time. - */ - entry = __radix_tree_lookup(&mapping->i_pages, index, - NULL, &slot); - if (entry) { - radix_tree_preload_end(); - xa_unlock_irq(&mapping->i_pages); - goto restart; - } + if (dax_is_zero_entry(entry)) { + xas_unlock_irq(xas); + unmap_mapping_pages(mapping, + xas->xa_index & ~PG_PMD_COLOUR, + PG_PMD_NR, false); + xas_reset(xas); + xas_lock_irq(xas); }
- if (pmd_downgrade) { - dax_disassociate_entry(entry, mapping, false); - radix_tree_delete(&mapping->i_pages, index); - mapping->nrexceptional--; - dax_wake_mapping_entry_waiter(&mapping->i_pages, - index, entry, true); - } + dax_disassociate_entry(entry, mapping, false); + xas_store(xas, NULL); /* undo the PMD join */ + dax_wake_entry(xas, entry, true); + mapping->nrexceptional--;
On Thu, Jun 27, 2019 at 11:58 AM Dan Williams dan.j.williams@intel.com wrote:
On Thu, Jun 27, 2019 at 11:29 AM Dan Williams dan.j.williams@intel.com wrote:
On Thu, Jun 27, 2019 at 9:06 AM Dan Williams dan.j.williams@intel.com wrote:
On Thu, Jun 27, 2019 at 5:34 AM Matthew Wilcox willy@infradead.org wrote:
On Wed, Jun 26, 2019 at 05:15:45PM -0700, Dan Williams wrote:
Ever since the conversion of DAX to the Xarray a RocksDB benchmark has been encountering intermittent lockups. The backtraces always include the filesystem-DAX PMD path, multi-order entries have been a source of bugs in the past, and disabling the PMD path allows a test that fails in minutes to run for an hour.
On May 4th, I asked you:
Since this is provoked by a fatal signal, it must have something to do with a killable or interruptible sleep. There's only one of those in the DAX code; fatal_signal_pending() in dax_iomap_actor(). Does rocksdb do I/O with write() or through a writable mmap()? I'd like to know before I chase too far down this fault tree analysis.
RocksDB in this case is using write() for writes and mmap() for reads.
It's not clear to me that a fatal signal is a component of the failure as much as it's the way to detect that the benchmark has indeed locked up.
Even though db_bench is run with the mmap_read=1 option:
cmd="${rocksdb_dir}/db_bench $params_r --benchmarks=readwhilewriting \ --use_existing_db=1 \ --mmap_read=1 \ --num=$num_keys \ --threads=$num_read_threads \
When the lockup occurs there are db_bench processes in the write fault path:
[ 1666.635212] db_bench D 0 2492 2435 0x00000000 [ 1666.641339] Call Trace: [ 1666.644072] ? __schedule+0x24f/0x680 [ 1666.648162] ? __switch_to_asm+0x34/0x70 [ 1666.652545] schedule+0x29/0x90 [ 1666.656054] get_unlocked_entry+0xcd/0x120 [ 1666.660629] ? dax_iomap_actor+0x270/0x270 [ 1666.665206] grab_mapping_entry+0x14f/0x230 [ 1666.669878] dax_iomap_pmd_fault.isra.42+0x14d/0x950 [ 1666.675425] ? futex_wait+0x122/0x230 [ 1666.679518] ext4_dax_huge_fault+0x16f/0x1f0 [ 1666.684288] __handle_mm_fault+0x411/0x1350 [ 1666.688961] ? do_futex+0xca/0xbb0 [ 1666.692760] ? __switch_to_asm+0x34/0x70 [ 1666.697144] handle_mm_fault+0xbe/0x1e0 [ 1666.701429] __do_page_fault+0x249/0x4f0 [ 1666.705811] do_page_fault+0x32/0x110 [ 1666.709903] ? page_fault+0x8/0x30 [ 1666.713702] page_fault+0x1e/0x30
...where __handle_mm_fault+0x411 is in wp_huge_pmd():
(gdb) li *(__handle_mm_fault+0x411) 0xffffffff812713d1 is in __handle_mm_fault (mm/memory.c:3800). 3795 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd) 3796 { 3797 if (vma_is_anonymous(vmf->vma)) 3798 return do_huge_pmd_wp_page(vmf, orig_pmd); 3799 if (vmf->vma->vm_ops->huge_fault) 3800 return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); 3801 3802 /* COW handled on pte level: split pmd */ 3803 VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma); 3804 __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
This bug feels like we failed to unlock, or unlocked the wrong entry and this hunk in the bisected commit looks suspect to me. Why do we still need to drop the lock now that the radix_tree_preload() calls are gone?
Nevermind, unmapp_mapping_pages() takes a sleeping lock, but then I wonder why we don't restart the lookup like the old implementation.
On Thu, Jun 27, 2019 at 12:09:29PM -0700, Dan Williams wrote:
This bug feels like we failed to unlock, or unlocked the wrong entry and this hunk in the bisected commit looks suspect to me. Why do we still need to drop the lock now that the radix_tree_preload() calls are gone?
Nevermind, unmapp_mapping_pages() takes a sleeping lock, but then I wonder why we don't restart the lookup like the old implementation.
We have the entry locked:
/* * Make sure 'entry' remains valid while we drop * the i_pages lock. */ dax_lock_entry(xas, entry);
/* * Besides huge zero pages the only other thing that gets * downgraded are empty entries which don't need to be * unmapped. */ if (dax_is_zero_entry(entry)) { xas_unlock_irq(xas); unmap_mapping_pages(mapping, xas->xa_index & ~PG_PMD_COLOUR, PG_PMD_NR, false); xas_reset(xas); xas_lock_irq(xas); }
If something can remove a locked entry, then that would seem like the real bug. Might be worth inserting a lookup there to make sure that it hasn't happened, I suppose?
On Thu, Jun 27, 2019 at 12:59 PM Matthew Wilcox willy@infradead.org wrote:
On Thu, Jun 27, 2019 at 12:09:29PM -0700, Dan Williams wrote:
This bug feels like we failed to unlock, or unlocked the wrong entry and this hunk in the bisected commit looks suspect to me. Why do we still need to drop the lock now that the radix_tree_preload() calls are gone?
Nevermind, unmapp_mapping_pages() takes a sleeping lock, but then I wonder why we don't restart the lookup like the old implementation.
We have the entry locked:
/* * Make sure 'entry' remains valid while we drop * the i_pages lock. */ dax_lock_entry(xas, entry); /* * Besides huge zero pages the only other thing that gets * downgraded are empty entries which don't need to be * unmapped. */ if (dax_is_zero_entry(entry)) { xas_unlock_irq(xas); unmap_mapping_pages(mapping, xas->xa_index & ~PG_PMD_COLOUR, PG_PMD_NR, false); xas_reset(xas); xas_lock_irq(xas); }
If something can remove a locked entry, then that would seem like the real bug. Might be worth inserting a lookup there to make sure that it hasn't happened, I suppose?
Nope, added a check, we do in fact get the same locked entry back after dropping the lock.
The deadlock revolves around the mmap_sem. One thread holds it for read and then gets stuck indefinitely in get_unlocked_entry(). Once that happens another rocksdb thread tries to mmap and gets stuck trying to take the mmap_sem for write. Then all new readers, including ps and top that try to access a remote vma, then get queued behind that write.
It could also be the case that we're missing a wake up.
On Thu, Jun 27, 2019 at 07:39:37PM -0700, Dan Williams wrote:
On Thu, Jun 27, 2019 at 12:59 PM Matthew Wilcox willy@infradead.org wrote:
On Thu, Jun 27, 2019 at 12:09:29PM -0700, Dan Williams wrote:
This bug feels like we failed to unlock, or unlocked the wrong entry and this hunk in the bisected commit looks suspect to me. Why do we still need to drop the lock now that the radix_tree_preload() calls are gone?
Nevermind, unmapp_mapping_pages() takes a sleeping lock, but then I wonder why we don't restart the lookup like the old implementation.
If something can remove a locked entry, then that would seem like the real bug. Might be worth inserting a lookup there to make sure that it hasn't happened, I suppose?
Nope, added a check, we do in fact get the same locked entry back after dropping the lock.
Okay, good, glad to have ruled that out.
The deadlock revolves around the mmap_sem. One thread holds it for read and then gets stuck indefinitely in get_unlocked_entry(). Once that happens another rocksdb thread tries to mmap and gets stuck trying to take the mmap_sem for write. Then all new readers, including ps and top that try to access a remote vma, then get queued behind that write.
It could also be the case that we're missing a wake up.
That was the conclusion I came to; that one thread holding the mmap sem for read isn't being woken up when it should be. Just need to find it ... obviously it's something to do with the PMD entries.
On Fri, Jun 28, 2019 at 9:37 AM Matthew Wilcox willy@infradead.org wrote:
On Thu, Jun 27, 2019 at 07:39:37PM -0700, Dan Williams wrote:
On Thu, Jun 27, 2019 at 12:59 PM Matthew Wilcox willy@infradead.org wrote:
On Thu, Jun 27, 2019 at 12:09:29PM -0700, Dan Williams wrote:
This bug feels like we failed to unlock, or unlocked the wrong entry and this hunk in the bisected commit looks suspect to me. Why do we still need to drop the lock now that the radix_tree_preload() calls are gone?
Nevermind, unmapp_mapping_pages() takes a sleeping lock, but then I wonder why we don't restart the lookup like the old implementation.
If something can remove a locked entry, then that would seem like the real bug. Might be worth inserting a lookup there to make sure that it hasn't happened, I suppose?
Nope, added a check, we do in fact get the same locked entry back after dropping the lock.
Okay, good, glad to have ruled that out.
The deadlock revolves around the mmap_sem. One thread holds it for read and then gets stuck indefinitely in get_unlocked_entry(). Once that happens another rocksdb thread tries to mmap and gets stuck trying to take the mmap_sem for write. Then all new readers, including ps and top that try to access a remote vma, then get queued behind that write.
It could also be the case that we're missing a wake up.
That was the conclusion I came to; that one thread holding the mmap sem for read isn't being woken up when it should be. Just need to find it ... obviously it's something to do with the PMD entries.
Can you explain to me one more time, yes I'm slow on the uptake on this, the difference between xas_load() and xas_find_conflict() and why it's ok for dax_lock_page() to use xas_load() while grab_mapping_entry() uses xas_find_conflict()?
On Fri, Jun 28, 2019 at 09:39:01AM -0700, Dan Williams wrote:
On Fri, Jun 28, 2019 at 9:37 AM Matthew Wilcox willy@infradead.org wrote:
That was the conclusion I came to; that one thread holding the mmap sem for read isn't being woken up when it should be. Just need to find it ... obviously it's something to do with the PMD entries.
Can you explain to me one more time, yes I'm slow on the uptake on this, the difference between xas_load() and xas_find_conflict() and why it's ok for dax_lock_page() to use xas_load() while grab_mapping_entry() uses xas_find_conflict()?
When used with a non-zero 'order', xas_find_conflict() will return an entry whereas xas_load() might return a pointer to a node. dax_lock_page() always uses a zero order, so they would always do the same thing (xas_find_conflict() would be less efficient).
On Thu, Jun 27, 2019 at 07:39:37PM -0700, Dan Williams wrote:
On Thu, Jun 27, 2019 at 12:59 PM Matthew Wilcox willy@infradead.org wrote:
On Thu, Jun 27, 2019 at 12:09:29PM -0700, Dan Williams wrote:
This bug feels like we failed to unlock, or unlocked the wrong entry and this hunk in the bisected commit looks suspect to me. Why do we still need to drop the lock now that the radix_tree_preload() calls are gone?
Nevermind, unmapp_mapping_pages() takes a sleeping lock, but then I wonder why we don't restart the lookup like the old implementation.
We have the entry locked:
/* * Make sure 'entry' remains valid while we drop * the i_pages lock. */ dax_lock_entry(xas, entry); /* * Besides huge zero pages the only other thing that gets * downgraded are empty entries which don't need to be * unmapped. */ if (dax_is_zero_entry(entry)) { xas_unlock_irq(xas); unmap_mapping_pages(mapping, xas->xa_index & ~PG_PMD_COLOUR, PG_PMD_NR, false); xas_reset(xas); xas_lock_irq(xas); }
If something can remove a locked entry, then that would seem like the real bug. Might be worth inserting a lookup there to make sure that it hasn't happened, I suppose?
Nope, added a check, we do in fact get the same locked entry back after dropping the lock.
The deadlock revolves around the mmap_sem. One thread holds it for read and then gets stuck indefinitely in get_unlocked_entry(). Once that happens another rocksdb thread tries to mmap and gets stuck trying to take the mmap_sem for write. Then all new readers, including ps and top that try to access a remote vma, then get queued behind that write.
It could also be the case that we're missing a wake up.
OK, I have a Theory.
get_unlocked_entry() doesn't check the size of the entry being waited for. So dax_iomap_pmd_fault() can end up sleeping waiting for a PTE entry, which is (a) foolish, because we know it's going to fall back, and (b) can lead to a missed wakeup because it's going to sleep waiting for the PMD entry to come unlocked. Which it won't, unless there's a happy accident that happens to map to the same hash bucket.
Let's see if I can steal some time this weekend to whip up a patch.
On Sat, Jun 29, 2019 at 9:03 AM Matthew Wilcox willy@infradead.org wrote:
On Thu, Jun 27, 2019 at 07:39:37PM -0700, Dan Williams wrote:
On Thu, Jun 27, 2019 at 12:59 PM Matthew Wilcox willy@infradead.org wrote:
On Thu, Jun 27, 2019 at 12:09:29PM -0700, Dan Williams wrote:
This bug feels like we failed to unlock, or unlocked the wrong entry and this hunk in the bisected commit looks suspect to me. Why do we still need to drop the lock now that the radix_tree_preload() calls are gone?
Nevermind, unmapp_mapping_pages() takes a sleeping lock, but then I wonder why we don't restart the lookup like the old implementation.
We have the entry locked:
/* * Make sure 'entry' remains valid while we drop * the i_pages lock. */ dax_lock_entry(xas, entry); /* * Besides huge zero pages the only other thing that gets * downgraded are empty entries which don't need to be * unmapped. */ if (dax_is_zero_entry(entry)) { xas_unlock_irq(xas); unmap_mapping_pages(mapping, xas->xa_index & ~PG_PMD_COLOUR, PG_PMD_NR, false); xas_reset(xas); xas_lock_irq(xas); }
If something can remove a locked entry, then that would seem like the real bug. Might be worth inserting a lookup there to make sure that it hasn't happened, I suppose?
Nope, added a check, we do in fact get the same locked entry back after dropping the lock.
The deadlock revolves around the mmap_sem. One thread holds it for read and then gets stuck indefinitely in get_unlocked_entry(). Once that happens another rocksdb thread tries to mmap and gets stuck trying to take the mmap_sem for write. Then all new readers, including ps and top that try to access a remote vma, then get queued behind that write.
It could also be the case that we're missing a wake up.
OK, I have a Theory.
get_unlocked_entry() doesn't check the size of the entry being waited for. So dax_iomap_pmd_fault() can end up sleeping waiting for a PTE entry, which is (a) foolish, because we know it's going to fall back, and (b) can lead to a missed wakeup because it's going to sleep waiting for the PMD entry to come unlocked. Which it won't, unless there's a happy accident that happens to map to the same hash bucket.
Let's see if I can steal some time this weekend to whip up a patch.
Theory seems to have some evidence... I instrumented fs/dax.c to track outstanding 'lock' entries and 'wait' events. At the time of the hang we see no locks held and the waiter is waiting on a pmd entry:
[ 4001.354334] fs/dax locked entries: 0 [ 4001.358425] fs/dax wait entries: 1 [ 4001.362227] db_bench/2445 index: 0x0 shift: 6 [ 4001.367099] grab_mapping_entry+0x17a/0x260 [ 4001.371773] dax_iomap_pmd_fault.isra.43+0x168/0x7a0 [ 4001.377316] ext4_dax_huge_fault+0x16f/0x1f0 [ 4001.382086] __handle_mm_fault+0x411/0x1390 [ 4001.386756] handle_mm_fault+0x172/0x360
On Sun, Jun 30, 2019 at 12:27 AM Dan Williams dan.j.williams@intel.com wrote:
On Sat, Jun 29, 2019 at 9:03 AM Matthew Wilcox willy@infradead.org wrote:
On Thu, Jun 27, 2019 at 07:39:37PM -0700, Dan Williams wrote:
On Thu, Jun 27, 2019 at 12:59 PM Matthew Wilcox willy@infradead.org wrote:
On Thu, Jun 27, 2019 at 12:09:29PM -0700, Dan Williams wrote:
This bug feels like we failed to unlock, or unlocked the wrong entry and this hunk in the bisected commit looks suspect to me. Why do we still need to drop the lock now that the radix_tree_preload() calls are gone?
Nevermind, unmapp_mapping_pages() takes a sleeping lock, but then I wonder why we don't restart the lookup like the old implementation.
We have the entry locked:
/* * Make sure 'entry' remains valid while we drop * the i_pages lock. */ dax_lock_entry(xas, entry); /* * Besides huge zero pages the only other thing that gets * downgraded are empty entries which don't need to be * unmapped. */ if (dax_is_zero_entry(entry)) { xas_unlock_irq(xas); unmap_mapping_pages(mapping, xas->xa_index & ~PG_PMD_COLOUR, PG_PMD_NR, false); xas_reset(xas); xas_lock_irq(xas); }
If something can remove a locked entry, then that would seem like the real bug. Might be worth inserting a lookup there to make sure that it hasn't happened, I suppose?
Nope, added a check, we do in fact get the same locked entry back after dropping the lock.
The deadlock revolves around the mmap_sem. One thread holds it for read and then gets stuck indefinitely in get_unlocked_entry(). Once that happens another rocksdb thread tries to mmap and gets stuck trying to take the mmap_sem for write. Then all new readers, including ps and top that try to access a remote vma, then get queued behind that write.
It could also be the case that we're missing a wake up.
OK, I have a Theory.
get_unlocked_entry() doesn't check the size of the entry being waited for. So dax_iomap_pmd_fault() can end up sleeping waiting for a PTE entry, which is (a) foolish, because we know it's going to fall back, and (b) can lead to a missed wakeup because it's going to sleep waiting for the PMD entry to come unlocked. Which it won't, unless there's a happy accident that happens to map to the same hash bucket.
Let's see if I can steal some time this weekend to whip up a patch.
Theory seems to have some evidence... I instrumented fs/dax.c to track outstanding 'lock' entries and 'wait' events. At the time of the hang we see no locks held and the waiter is waiting on a pmd entry:
[ 4001.354334] fs/dax locked entries: 0 [ 4001.358425] fs/dax wait entries: 1 [ 4001.362227] db_bench/2445 index: 0x0 shift: 6 [ 4001.367099] grab_mapping_entry+0x17a/0x260 [ 4001.371773] dax_iomap_pmd_fault.isra.43+0x168/0x7a0 [ 4001.377316] ext4_dax_huge_fault+0x16f/0x1f0 [ 4001.382086] __handle_mm_fault+0x411/0x1390 [ 4001.386756] handle_mm_fault+0x172/0x360
In fact, this naive fix is holding up so far:
@@ -215,7 +216,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas, * queue to the start of that PMD. This ensures that all offsets in * the range covered by the PMD map to the same bit lock. */ - if (dax_is_pmd_entry(entry)) + //if (dax_is_pmd_entry(entry)) index &= ~PG_PMD_COLOUR; key->xa = xas->xa; key->entry_start = index;
On Sun, Jun 30, 2019 at 01:01:04AM -0700, Dan Williams wrote:
On Sun, Jun 30, 2019 at 12:27 AM Dan Williams dan.j.williams@intel.com wrote:
On Sat, Jun 29, 2019 at 9:03 AM Matthew Wilcox willy@infradead.org wrote:
On Thu, Jun 27, 2019 at 07:39:37PM -0700, Dan Williams wrote:
On Thu, Jun 27, 2019 at 12:59 PM Matthew Wilcox willy@infradead.org wrote:
On Thu, Jun 27, 2019 at 12:09:29PM -0700, Dan Williams wrote:
> This bug feels like we failed to unlock, or unlocked the wrong entry > and this hunk in the bisected commit looks suspect to me. Why do we > still need to drop the lock now that the radix_tree_preload() calls > are gone?
Nevermind, unmapp_mapping_pages() takes a sleeping lock, but then I wonder why we don't restart the lookup like the old implementation.
We have the entry locked:
/* * Make sure 'entry' remains valid while we drop * the i_pages lock. */ dax_lock_entry(xas, entry); /* * Besides huge zero pages the only other thing that gets * downgraded are empty entries which don't need to be * unmapped. */ if (dax_is_zero_entry(entry)) { xas_unlock_irq(xas); unmap_mapping_pages(mapping, xas->xa_index & ~PG_PMD_COLOUR, PG_PMD_NR, false); xas_reset(xas); xas_lock_irq(xas); }
If something can remove a locked entry, then that would seem like the real bug. Might be worth inserting a lookup there to make sure that it hasn't happened, I suppose?
Nope, added a check, we do in fact get the same locked entry back after dropping the lock.
The deadlock revolves around the mmap_sem. One thread holds it for read and then gets stuck indefinitely in get_unlocked_entry(). Once that happens another rocksdb thread tries to mmap and gets stuck trying to take the mmap_sem for write. Then all new readers, including ps and top that try to access a remote vma, then get queued behind that write.
It could also be the case that we're missing a wake up.
OK, I have a Theory.
get_unlocked_entry() doesn't check the size of the entry being waited for. So dax_iomap_pmd_fault() can end up sleeping waiting for a PTE entry, which is (a) foolish, because we know it's going to fall back, and (b) can lead to a missed wakeup because it's going to sleep waiting for the PMD entry to come unlocked. Which it won't, unless there's a happy accident that happens to map to the same hash bucket.
Let's see if I can steal some time this weekend to whip up a patch.
Theory seems to have some evidence... I instrumented fs/dax.c to track outstanding 'lock' entries and 'wait' events. At the time of the hang we see no locks held and the waiter is waiting on a pmd entry:
[ 4001.354334] fs/dax locked entries: 0 [ 4001.358425] fs/dax wait entries: 1 [ 4001.362227] db_bench/2445 index: 0x0 shift: 6 [ 4001.367099] grab_mapping_entry+0x17a/0x260 [ 4001.371773] dax_iomap_pmd_fault.isra.43+0x168/0x7a0 [ 4001.377316] ext4_dax_huge_fault+0x16f/0x1f0 [ 4001.382086] __handle_mm_fault+0x411/0x1390 [ 4001.386756] handle_mm_fault+0x172/0x360
In fact, this naive fix is holding up so far:
@@ -215,7 +216,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas, * queue to the start of that PMD. This ensures that all offsets in * the range covered by the PMD map to the same bit lock. */
if (dax_is_pmd_entry(entry))
//if (dax_is_pmd_entry(entry)) index &= ~PG_PMD_COLOUR; key->xa = xas->xa; key->entry_start = index;
Hah, that's a great naive fix! Thanks for trying that out.
I think my theory was slightly mistaken, but your fix has the effect of fixing the actual problem too.
The xas->xa_index for a PMD is going to be PMD-aligned (ie a multiple of 512), but xas_find_conflict() does _not_ adjust xa_index (... which I really should have mentioned in the documentation). So we go to sleep on the PMD-aligned index instead of the index of the PTE. Your patch fixes this by using the PMD-aligned index for PTEs too.
I'm trying to come up with a clean fix for this. Clearly we shouldn't wait for a PTE entry if we're looking for a PMD entry. But what should get_unlocked_entry() return if it detects that case? We could have it return an error code encoded as an internal entry, like grab_mapping_entry() does. Or we could have it return the _locked_ PTE entry, and have callers interpret that.
At least get_unlocked_entry() is static, but it's got quite a few callers. Trying to discern which ones might ask for a PMD entry is a bit tricky. So this seems like a large patch which might have bugs.
Thoughts?
On Sun, Jun 30, 2019 at 8:23 AM Matthew Wilcox willy@infradead.org wrote:
On Sun, Jun 30, 2019 at 01:01:04AM -0700, Dan Williams wrote:
On Sun, Jun 30, 2019 at 12:27 AM Dan Williams dan.j.williams@intel.com wrote:
On Sat, Jun 29, 2019 at 9:03 AM Matthew Wilcox willy@infradead.org wrote:
On Thu, Jun 27, 2019 at 07:39:37PM -0700, Dan Williams wrote:
On Thu, Jun 27, 2019 at 12:59 PM Matthew Wilcox willy@infradead.org wrote:
On Thu, Jun 27, 2019 at 12:09:29PM -0700, Dan Williams wrote: > > This bug feels like we failed to unlock, or unlocked the wrong entry > > and this hunk in the bisected commit looks suspect to me. Why do we > > still need to drop the lock now that the radix_tree_preload() calls > > are gone? > > Nevermind, unmapp_mapping_pages() takes a sleeping lock, but then I > wonder why we don't restart the lookup like the old implementation.
We have the entry locked:
/* * Make sure 'entry' remains valid while we drop * the i_pages lock. */ dax_lock_entry(xas, entry); /* * Besides huge zero pages the only other thing that gets * downgraded are empty entries which don't need to be * unmapped. */ if (dax_is_zero_entry(entry)) { xas_unlock_irq(xas); unmap_mapping_pages(mapping, xas->xa_index & ~PG_PMD_COLOUR, PG_PMD_NR, false); xas_reset(xas); xas_lock_irq(xas); }
If something can remove a locked entry, then that would seem like the real bug. Might be worth inserting a lookup there to make sure that it hasn't happened, I suppose?
Nope, added a check, we do in fact get the same locked entry back after dropping the lock.
The deadlock revolves around the mmap_sem. One thread holds it for read and then gets stuck indefinitely in get_unlocked_entry(). Once that happens another rocksdb thread tries to mmap and gets stuck trying to take the mmap_sem for write. Then all new readers, including ps and top that try to access a remote vma, then get queued behind that write.
It could also be the case that we're missing a wake up.
OK, I have a Theory.
get_unlocked_entry() doesn't check the size of the entry being waited for. So dax_iomap_pmd_fault() can end up sleeping waiting for a PTE entry, which is (a) foolish, because we know it's going to fall back, and (b) can lead to a missed wakeup because it's going to sleep waiting for the PMD entry to come unlocked. Which it won't, unless there's a happy accident that happens to map to the same hash bucket.
Let's see if I can steal some time this weekend to whip up a patch.
Theory seems to have some evidence... I instrumented fs/dax.c to track outstanding 'lock' entries and 'wait' events. At the time of the hang we see no locks held and the waiter is waiting on a pmd entry:
[ 4001.354334] fs/dax locked entries: 0 [ 4001.358425] fs/dax wait entries: 1 [ 4001.362227] db_bench/2445 index: 0x0 shift: 6 [ 4001.367099] grab_mapping_entry+0x17a/0x260 [ 4001.371773] dax_iomap_pmd_fault.isra.43+0x168/0x7a0 [ 4001.377316] ext4_dax_huge_fault+0x16f/0x1f0 [ 4001.382086] __handle_mm_fault+0x411/0x1390 [ 4001.386756] handle_mm_fault+0x172/0x360
In fact, this naive fix is holding up so far:
@@ -215,7 +216,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas, * queue to the start of that PMD. This ensures that all offsets in * the range covered by the PMD map to the same bit lock. */
if (dax_is_pmd_entry(entry))
//if (dax_is_pmd_entry(entry)) index &= ~PG_PMD_COLOUR; key->xa = xas->xa; key->entry_start = index;
Hah, that's a great naive fix! Thanks for trying that out.
I think my theory was slightly mistaken, but your fix has the effect of fixing the actual problem too.
The xas->xa_index for a PMD is going to be PMD-aligned (ie a multiple of 512), but xas_find_conflict() does _not_ adjust xa_index (... which I really should have mentioned in the documentation). So we go to sleep on the PMD-aligned index instead of the index of the PTE. Your patch fixes this by using the PMD-aligned index for PTEs too.
I'm trying to come up with a clean fix for this. Clearly we shouldn't wait for a PTE entry if we're looking for a PMD entry. But what should get_unlocked_entry() return if it detects that case? We could have it return an error code encoded as an internal entry, like grab_mapping_entry() does. Or we could have it return the _locked_ PTE entry, and have callers interpret that.
At least get_unlocked_entry() is static, but it's got quite a few callers. Trying to discern which ones might ask for a PMD entry is a bit tricky. So this seems like a large patch which might have bugs.
Thoughts?
...but if it was a problem of just mismatched waitqueue's I would have expected it to trigger prior to commit b15cd800682f "dax: Convert page fault handlers to XArray". This hunk, if I'm reading it correctly, looks suspicious: @index in this case is coming directly from vm->pgoff without pmd alignment adjustment whereas after the conversion it's always pmd aligned from the xas->xa_index. So perhaps the issue is that the lock happens at pte granularity. I expect it would cause the old put_locked_mapping_entry() to WARN, but maybe that avoids the lockup and was missed in the bisect.
@@ -884,21 +711,18 @@ static void *dax_insert_entry(struct address_space *mapping, * existing entry is a PMD, we will just leave the PMD in the * tree and dirty it if necessary. */ - struct radix_tree_node *node; - void **slot; - void *ret; - - ret = __radix_tree_lookup(pages, index, &node, &slot); - WARN_ON_ONCE(ret != entry); - __radix_tree_replace(pages, node, slot, - new_entry, NULL); + void *old = dax_lock_entry(xas, new_entry); + WARN_ON_ONCE(old != xa_mk_value(xa_to_value(entry) | + DAX_LOCKED)); entry = new_entry; + } else { + xas_load(xas); /* Walk the xa_state */ }
if (dirty) - radix_tree_tag_set(pages, index, PAGECACHE_TAG_DIRTY); + xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
- xa_unlock_irq(pages); + xas_unlock_irq(xas); return entry; }
On Sun, Jun 30, 2019 at 02:37:32PM -0700, Dan Williams wrote:
On Sun, Jun 30, 2019 at 8:23 AM Matthew Wilcox willy@infradead.org wrote:
I think my theory was slightly mistaken, but your fix has the effect of fixing the actual problem too.
The xas->xa_index for a PMD is going to be PMD-aligned (ie a multiple of 512), but xas_find_conflict() does _not_ adjust xa_index (... which I really should have mentioned in the documentation). So we go to sleep on the PMD-aligned index instead of the index of the PTE. Your patch fixes this by using the PMD-aligned index for PTEs too.
I'm trying to come up with a clean fix for this. Clearly we shouldn't wait for a PTE entry if we're looking for a PMD entry. But what should get_unlocked_entry() return if it detects that case? We could have it return an error code encoded as an internal entry, like grab_mapping_entry() does. Or we could have it return the _locked_ PTE entry, and have callers interpret that.
At least get_unlocked_entry() is static, but it's got quite a few callers. Trying to discern which ones might ask for a PMD entry is a bit tricky. So this seems like a large patch which might have bugs.
Thoughts?
...but if it was a problem of just mismatched waitqueue's I would have expected it to trigger prior to commit b15cd800682f "dax: Convert page fault handlers to XArray".
That commit converts grab_mapping_entry() (called by dax_iomap_pmd_fault()) from calling get_unlocked_mapping_entry() to calling get_unlocked_entry(). get_unlocked_mapping_entry() (eventually) called __radix_tree_lookup() instead of dax_find_conflict().
This hunk, if I'm reading it correctly, looks suspicious: @index in this case is coming directly from vm->pgoff without pmd alignment adjustment whereas after the conversion it's always pmd aligned from the xas->xa_index. So perhaps the issue is that the lock happens at pte granularity. I expect it would cause the old put_locked_mapping_entry() to WARN, but maybe that avoids the lockup and was missed in the bisect.
I don't think that hunk is the problem. The __radix_tree_lookup() is going to return a 'slot' which points to the canonical slot, no matter which of the 512 indices corresponding to that slot is chosen. So I think it's going to do essentially the same thing.
@@ -884,21 +711,18 @@ static void *dax_insert_entry(struct address_space *mapping, * existing entry is a PMD, we will just leave the PMD in the * tree and dirty it if necessary. */
struct radix_tree_node *node;
void **slot;
void *ret;
ret = __radix_tree_lookup(pages, index, &node, &slot);
WARN_ON_ONCE(ret != entry);
__radix_tree_replace(pages, node, slot,
new_entry, NULL);
void *old = dax_lock_entry(xas, new_entry);
WARN_ON_ONCE(old != xa_mk_value(xa_to_value(entry) |
DAX_LOCKED)); entry = new_entry;
} else {
xas_load(xas); /* Walk the xa_state */ } if (dirty)
radix_tree_tag_set(pages, index, PAGECACHE_TAG_DIRTY);
xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
xa_unlock_irq(pages);
xas_unlock_irq(xas); return entry;
}
On Mon, Jul 1, 2019 at 8:34 PM Matthew Wilcox willy@infradead.org wrote:
On Sun, Jun 30, 2019 at 02:37:32PM -0700, Dan Williams wrote:
On Sun, Jun 30, 2019 at 8:23 AM Matthew Wilcox willy@infradead.org wrote:
I think my theory was slightly mistaken, but your fix has the effect of fixing the actual problem too.
The xas->xa_index for a PMD is going to be PMD-aligned (ie a multiple of 512), but xas_find_conflict() does _not_ adjust xa_index (... which I really should have mentioned in the documentation). So we go to sleep on the PMD-aligned index instead of the index of the PTE. Your patch fixes this by using the PMD-aligned index for PTEs too.
I'm trying to come up with a clean fix for this. Clearly we shouldn't wait for a PTE entry if we're looking for a PMD entry. But what should get_unlocked_entry() return if it detects that case? We could have it return an error code encoded as an internal entry, like grab_mapping_entry() does. Or we could have it return the _locked_ PTE entry, and have callers interpret that.
At least get_unlocked_entry() is static, but it's got quite a few callers. Trying to discern which ones might ask for a PMD entry is a bit tricky. So this seems like a large patch which might have bugs.
Thoughts?
...but if it was a problem of just mismatched waitqueue's I would have expected it to trigger prior to commit b15cd800682f "dax: Convert page fault handlers to XArray".
That commit converts grab_mapping_entry() (called by dax_iomap_pmd_fault()) from calling get_unlocked_mapping_entry() to calling get_unlocked_entry(). get_unlocked_mapping_entry() (eventually) called __radix_tree_lookup() instead of dax_find_conflict().
This hunk, if I'm reading it correctly, looks suspicious: @index in this case is coming directly from vm->pgoff without pmd alignment adjustment whereas after the conversion it's always pmd aligned from the xas->xa_index. So perhaps the issue is that the lock happens at pte granularity. I expect it would cause the old put_locked_mapping_entry() to WARN, but maybe that avoids the lockup and was missed in the bisect.
I don't think that hunk is the problem. The __radix_tree_lookup() is going to return a 'slot' which points to the canonical slot, no matter which of the 512 indices corresponding to that slot is chosen. So I think it's going to do essentially the same thing.
Yeah, no warnings on the parent commit for the regression.
I'd be inclined to do the brute force fix of not trying to get fancy with separate PTE/PMD waitqueues and then follow on with a more clever performance enhancement later. Thoughts about that?
On 02/07/2019 18:37, Dan Williams wrote: <>
I'd be inclined to do the brute force fix of not trying to get fancy with separate PTE/PMD waitqueues and then follow on with a more clever performance enhancement later. Thoughts about that?
Sir Dan
I do not understand how separate waitqueues are any performance enhancement? The all point of the waitqueues is that there is enough of them and the hash function does a good radomization spread to effectively grab a single locker per waitqueue unless the system is very contended and waitqueues are shared. Which is good because it means you effectively need a back pressure to the app. (Because pmem IO is mostly CPU bound with no long term sleeps I do not think you will ever get to that situation)
So the way I understand it having twice as many waitqueues serving two types will be better performance over all then, segregating the types each with half the number of queues.
(Regardless of the above problem of where the segregation is not race clean)
Thanks Boaz
On Tue, Jul 2, 2019 at 5:23 PM Boaz Harrosh openosd@gmail.com wrote:
On 02/07/2019 18:37, Dan Williams wrote: <>
I'd be inclined to do the brute force fix of not trying to get fancy with separate PTE/PMD waitqueues and then follow on with a more clever performance enhancement later. Thoughts about that?
Sir Dan
I do not understand how separate waitqueues are any performance enhancement? The all point of the waitqueues is that there is enough of them and the hash function does a good radomization spread to effectively grab a single locker per waitqueue unless the system is very contended and waitqueues are shared.
Right, the fix in question limits the input to the hash calculation by masking the input to always be 2MB aligned.
Which is good because it means you effectively need a back pressure to the app. (Because pmem IO is mostly CPU bound with no long term sleeps I do not think you will ever get to that situation)
So the way I understand it having twice as many waitqueues serving two types will be better performance over all then, segregating the types each with half the number of queues.
Yes, but the trick is how to manage cases where someone waiting on one type needs to be woken up by an event on the other. So all I'm saying it lets live with more hash collisions until we can figure out a race free way to better scale waitqueue usage.
(Regardless of the above problem of where the segregation is not race clean)
Thanks Boaz
On 03/07/2019 03:42, Dan Williams wrote:
On Tue, Jul 2, 2019 at 5:23 PM Boaz Harrosh openosd@gmail.com wrote:
<>
Yes, but the trick is how to manage cases where someone waiting on one type needs to be woken up by an event on the other.
Exactly I'm totally with you on this.
So all I'm saying it lets live with more hash collisions until we can figure out a race free way to better scale waitqueue usage.
Yes and lets actually do real measurements to see if this really hurts needlessly. Maybe not so much
Thanks Boaz
<>
On Sun 30-06-19 08:23:24, Matthew Wilcox wrote:
On Sun, Jun 30, 2019 at 01:01:04AM -0700, Dan Williams wrote:
@@ -215,7 +216,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas, * queue to the start of that PMD. This ensures that all offsets in * the range covered by the PMD map to the same bit lock. */
if (dax_is_pmd_entry(entry))
//if (dax_is_pmd_entry(entry)) index &= ~PG_PMD_COLOUR; key->xa = xas->xa; key->entry_start = index;
Hah, that's a great naive fix! Thanks for trying that out.
I think my theory was slightly mistaken, but your fix has the effect of fixing the actual problem too.
The xas->xa_index for a PMD is going to be PMD-aligned (ie a multiple of 512), but xas_find_conflict() does _not_ adjust xa_index (... which I really should have mentioned in the documentation). So we go to sleep on the PMD-aligned index instead of the index of the PTE. Your patch fixes this by using the PMD-aligned index for PTEs too.
I'm trying to come up with a clean fix for this. Clearly we shouldn't wait for a PTE entry if we're looking for a PMD entry. But what should get_unlocked_entry() return if it detects that case? We could have it return an error code encoded as an internal entry, like grab_mapping_entry() does. Or we could have it return the _locked_ PTE entry, and have callers interpret that.
At least get_unlocked_entry() is static, but it's got quite a few callers. Trying to discern which ones might ask for a PMD entry is a bit tricky. So this seems like a large patch which might have bugs.
Yeah. So get_unlocked_entry() is used in several cases:
1) Case where we already have entry at given index but it is locked and we need it unlocked so that we can do our thing `(dax_writeback_one(), dax_layout_busy_page()).
2) Case where we want any entry covering given index (in __dax_invalidate_entry()). This is essentially the same as case 1) since we have already looked up the entry (just didn't propagate that information from mm/truncate.c) - we want any unlocked entry covering given index.
3) Cases where we really want entry at given index and we have some entry order constraints (dax_insert_pfn_mkwrite(), grab_mapping_entry()).
Honestly I'd make the rule that get_unlocked_entry() returns entry of any order that is covering given index. I agree it may be unnecessarily waiting for PTE entry lock for the case where in case 3) we are really looking only for PMD entry but that seems like a relatively small cost for the simplicity of the interface.
BTW, looking into the xarray code, I think I found another difference between the old radix tree code and the new xarray code that could cause issues. In the old radix tree code if we tried to insert PMD entry but there was some PTE entry in the covered range, we'd get EEXIST error back and the DAX fault code relies on this. I don't see how similar behavior is achieved by xas_store()...
Honza
On Mon, Jul 01, 2019 at 02:11:19PM +0200, Jan Kara wrote:
BTW, looking into the xarray code, I think I found another difference between the old radix tree code and the new xarray code that could cause issues. In the old radix tree code if we tried to insert PMD entry but there was some PTE entry in the covered range, we'd get EEXIST error back and the DAX fault code relies on this. I don't see how similar behavior is achieved by xas_store()...
Are you referring to this?
- entry = dax_make_locked(0, size_flag | DAX_EMPTY); - - err = __radix_tree_insert(&mapping->i_pages, index, - dax_entry_order(entry), entry); - radix_tree_preload_end(); - if (err) { - xa_unlock_irq(&mapping->i_pages); - /* - * Our insertion of a DAX entry failed, most likely - * because we were inserting a PMD entry and it - * collided with a PTE sized entry at a different - * index in the PMD range. We haven't inserted - * anything into the radix tree and have no waiters to - * wake. - */ - return ERR_PTR(err); - }
If so, that can't happen any more because we no longer drop the i_pages lock while the entry is NULL, so the entry is always locked while the i_pages lock is dropped.
On Wed 03-07-19 08:47:00, Matthew Wilcox wrote:
On Mon, Jul 01, 2019 at 02:11:19PM +0200, Jan Kara wrote:
BTW, looking into the xarray code, I think I found another difference between the old radix tree code and the new xarray code that could cause issues. In the old radix tree code if we tried to insert PMD entry but there was some PTE entry in the covered range, we'd get EEXIST error back and the DAX fault code relies on this. I don't see how similar behavior is achieved by xas_store()...
Are you referring to this?
entry = dax_make_locked(0, size_flag | DAX_EMPTY);
err = __radix_tree_insert(&mapping->i_pages, index,
dax_entry_order(entry), entry);
radix_tree_preload_end();
if (err) {
xa_unlock_irq(&mapping->i_pages);
/*
* Our insertion of a DAX entry failed, most likely
* because we were inserting a PMD entry and it
* collided with a PTE sized entry at a different
* index in the PMD range. We haven't inserted
* anything into the radix tree and have no waiters to
* wake.
*/
return ERR_PTR(err);
}
Mostly yes.
If so, that can't happen any more because we no longer drop the i_pages lock while the entry is NULL, so the entry is always locked while the i_pages lock is dropped.
Ah, I have misinterpretted what xas_find_conflict() does. I'm sorry for the noise. I find it somewhat unfortunate that xas_find_conflict() will not return in any way the index where it has found the conflicting entry. We could then use it for the wait logic as well and won't have to resort to some masking tricks...
Honza
linux-stable-mirror@lists.linaro.org