From: Michal Hocko <mhocko@suse.com>
Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback

task1:
[<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
[<ffffffff811c5777>] shrink_page_list+0x907/0x960
[<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
[<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
[<ffffffff811c70a8>] shrink_node+0xd8/0x300
[<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
[<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
[<ffffffff8122df2d>] try_charge+0x14d/0x720
[<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
[<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
[<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
[<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
[<ffffffff81074247>] pte_alloc_one+0x17/0x40
[<ffffffff811e34de>] __pte_alloc+0x1e/0x110
[<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
[<ffffffff811e5d93>] do_fault+0x103/0x970
[<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
[<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
[<ffffffff8106ecb0>] do_page_fault+0x30/0x80
[<ffffffff8171bce8>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
task2:
[<ffffffff811aadc6>] __lock_page+0x86/0xa0
[<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
[<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
[<ffffffff811bbede>] do_writepages+0x1e/0x30
[<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
[<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
[<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
[<ffffffff81273568>] wb_writeback+0x268/0x300
[<ffffffff81273d24>] wb_workfn+0xb4/0x390
[<ffffffff810a2f19>] process_one_work+0x189/0x420
[<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
[<ffffffff810a9786>] kthread+0xe6/0x100
[<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
[<ffffffffffffffff>] 0xffffffffffffffff
He adds:
: task1 is waiting for the PageWriteback bit of the page that task2 has
: collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
: bit of the page which task1 has locked.
More precisely, task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg. That in turn hits a memory
limit reclaim, and the memcg reclaim for the legacy controller waits on
the writeback. That wait can never finish, because the writeback itself
is waiting for the page locked in the #PF path. So this is essentially
an ABBA deadlock:

                                        lock_page(A)
                                        SetPageWriteback(A)
                                        unlock_page(A)
lock_page(B)
                                        lock_page(B)
pte_alloc_one
  shrink_page_list
    wait_on_page_writeback(A)
                                        SetPageWriteback(B)
                                        unlock_page(B)

                                        # flush A, B to clear the writeback
This accumulation of pages to flush is used by several filesystems to
generate more optimal IO patterns.
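To make the interleaving concrete, the ABBA shape above can be reduced
to a minimal userspace C sketch. This is an illustration only, not
kernel code: the thread functions are invented, a mutex stands in for
the page lock of B, a condition variable stands in for the PageWriteback
bit of A, and the deadlock assumes the fault path wins the lock on B
first, as in the diagram.

	#include <pthread.h>
	#include <stdbool.h>

	static pthread_mutex_t lock_B = PTHREAD_MUTEX_INITIALIZER; /* "page lock" of B */
	static pthread_mutex_t wb_lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t wb_done = PTHREAD_COND_INITIALIZER;
	static bool A_under_writeback = true;	/* SetPageWriteback(A) already happened */

	/* task2 (flusher): tagged A for writeback, needs B before it flushes A+B */
	static void *flusher(void *arg)
	{
		(void)arg;
		pthread_mutex_lock(&lock_B);	/* lock_page(B): blocks on task1 */

		/* never reached: flush A and B, then clear A's writeback bit */
		pthread_mutex_lock(&wb_lock);
		A_under_writeback = false;
		pthread_cond_broadcast(&wb_done);
		pthread_mutex_unlock(&wb_lock);
		pthread_mutex_unlock(&lock_B);
		return NULL;
	}

	/* task1 (#PF path): holds B locked while "reclaim" waits on A's writeback */
	static void *fault_path(void *arg)
	{
		(void)arg;
		pthread_mutex_lock(&lock_B);	/* lock_page(B) */

		pthread_mutex_lock(&wb_lock);	/* pte_alloc_one -> reclaim -> */
		while (A_under_writeback)	/* wait_on_page_writeback(A) */
			pthread_cond_wait(&wb_done, &wb_lock);	/* sleeps forever */
		pthread_mutex_unlock(&wb_lock);

		pthread_mutex_unlock(&lock_B);
		return NULL;
	}

	int main(void)
	{
		pthread_t t1, t2;

		pthread_create(&t1, NULL, fault_path, NULL); /* must win lock_B first */
		pthread_create(&t2, NULL, flusher, NULL);
		pthread_join(t1, NULL);		/* never returns: ABBA deadlock */
		pthread_join(t2, NULL);
		return 0;
	}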
Waiting for the writeback in the legacy memcg controller is a workaround
for premature OOM killer invocations, because no dirty IO throttling is
available for that controller. There is no easy way around that,
unfortunately. Therefore fix this specific issue by pre-allocating the
page table outside of the page lock. We already have the infrastructure
for that, so simply reuse the fault-around pattern, which does the same
pre-allocation.
There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a locked fs page, but they should be really rare. I am not
aware of a better solution, unfortunately.
Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
Cc: stable
Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..bb78e90a9b70 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
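Reduced to its shape, this is the classic "allocate before you lock"
pattern. A minimal userspace C sketch under that reading follows; the
context struct and function names are invented and merely mirror
vmf->prealloc_pte, and the diff above is the authoritative change.

	#include <errno.h>
	#include <pthread.h>
	#include <stdlib.h>

	/* Hypothetical fault context mirroring vmf->prealloc_pte. */
	struct fault_ctx {
		void *prealloc_pte;
	};

	static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

	static int do_fault_prealloc(struct fault_ctx *ctx)
	{
		/*
		 * Step 1: allocate while holding no page lock. If this
		 * allocation has to wait for reclaim/writeback, no page is
		 * pinned by us yet, so the flusher cannot block behind us.
		 */
		if (!ctx->prealloc_pte) {
			ctx->prealloc_pte = calloc(512, sizeof(unsigned long));
			if (!ctx->prealloc_pte)
				return -ENOMEM;	/* kernel: VM_FAULT_OOM */
		}

		/*
		 * Step 2: take the page lock only after the allocation has
		 * succeeded; nothing under the lock recurses into reclaim.
		 */
		pthread_mutex_lock(&page_lock);
		/* ... install ctx->prealloc_pte and finish the fault ... */
		pthread_mutex_unlock(&page_lock);
		return 0;
	}

	int main(void)
	{
		struct fault_ctx ctx = { 0 };
		int ret = do_fault_prealloc(&ctx);

		free(ctx.prealloc_pte);
		return ret ? 1 : 0;
	}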
On Thu, Dec 13, 2018 at 10:22:21AM +0100, Michal Hocko wrote:
[...]
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Will you take care of converting vmf_insert_* to use the pre-allocated page table?
On Thu 13-12-18 13:41:47, Kirill A. Shutemov wrote:
On Thu, Dec 13, 2018 at 10:22:21AM +0100, Michal Hocko wrote:
[...]
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Thanks!
Will you take care of converting vmf_insert_* to use the pre-allocated page table?
I can try, but I would appreciate it if somebody more familiar with the code could do that. I am busy as hell and I do not want to promise something I will likely not get to soon.
On Thu, Dec 13, 2018 at 10:22:21AM +0100, Michal Hocko wrote:
[...]
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Just one nit:
@@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
Could you be more specific in the deadlock comment? git blame will work fine for a while, but it becomes a pain to find corresponding patches after stuff gets moved around for years.
In particular the race diagram between reclaim with a page lock held and the fs doing SetPageWriteback batches before kicking off IO would be useful directly in the code, IMO.
On Thu 13-12-18 17:04:00, Johannes Weiner wrote: [...]
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Thanks!
Just one nit:
Could you be more specific in the deadlock comment? git blame will work fine for a while, but it becomes a pain to find corresponding patches after stuff gets moved around for years.
In particular the race diagram between reclaim with a page lock held and the fs doing SetPageWriteback batches before kicking off IO would be useful directly in the code, IMO.
This?
diff --git a/mm/memory.c b/mm/memory.c
index bb78e90a9b70..ece221e4da6d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2995,7 +2995,18 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 
 	/*
 	 * Preallocate pte before we take page_lock because this might lead to
-	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 * deadlocks for memcg reclaim which waits for pages under writeback:
+	 *				lock_page(A)
+	 *				SetPageWriteback(A)
+	 *				unlock_page(A)
+	 * lock_page(B)
+	 *				lock_page(B)
+	 * pte_alloc_one
+	 *   shrink_page_list
+	 *     wait_on_page_writeback(A)
+	 *				SetPageWriteback(B)
+	 *				unlock_page(B)
+	 *				# flush A, B to clear the writeback
 	 */
 	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
 		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
On Thu, Dec 13, 2018 at 10:22:21AM +0100, Michal Hocko wrote:
[...]
Thanks for the update.
Looks good to me.
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
thanks, -liubo