On Fri, Mar 25, 2022 at 04:14:28PM -0400, Rik van Riel wrote:
In some cases it appears the invalidation of a hwpoisoned page fails because the page is still mapped in another process. This can cause a program to be continuously restarted and die when it page faults on the page that was not invalidated. Avoid that problem by unmapping the hwpoisoned page when we find it.
Another issue is that sometimes we end up oopsing in finish_fault, if the code tries to do something with the now-NULL vmf->page. I did not hit this error when submitting the previous patch because there are several opportunities for alloc_set_pte to bail out before accessing vmf->page, and that apparently happened on those systems, and most of the time on other systems, too.
However, across several million systems that error does occur a handful of times a day. It can be avoided by returning VM_FAULT_NOPAGE, which will cause do_read_fault to return before calling finish_fault.
Fixes: e53ac7374e64 ("mm: invalidate hwpoison page cache page in fault path")
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
 mm/memory.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index be44d0b36b18..76e3af9639d9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3918,14 +3918,18 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 		return ret;

 	if (unlikely(PageHWPoison(vmf->page))) {
+		struct page *page = vmf->page;
 		vm_fault_t poisonret = VM_FAULT_HWPOISON;
 		if (ret & VM_FAULT_LOCKED) {
+			if (page_mapped(page))
+				unmap_mapping_pages(page_mapping(page),
+						    page->index, 1, false);
 			/* Retry if a clean page was removed from the cache. */
-			if (invalidate_inode_page(vmf->page))
-				poisonret = 0;
-			unlock_page(vmf->page);
+			if (invalidate_inode_page(page))
+				poisonret = VM_FAULT_NOPAGE;
What is the effect of returning VM_FAULT_NOPAGE? I take it that we are cool because the pte has been installed and points to a new page? (I could not find where that is being done.)
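
To make the question concrete, here is a minimal userspace sketch of the control flow the commit message describes. It is a model, not kernel code: the model_* functions and struct model_vmf are hypothetical, the flag values are illustrative, and VM_FAULT_ERROR is reduced to a single bit. It assumes do_read_fault tests the return value of __do_fault against VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY before calling finish_fault.

```c
#include <stddef.h>

/* Illustrative flag values for this model; treat them as assumptions. */
#define VM_FAULT_HWPOISON 0x0010u
#define VM_FAULT_NOPAGE   0x0100u
#define VM_FAULT_RETRY    0x0400u
/* Reduced to the single error bit this model uses. */
#define VM_FAULT_ERROR    VM_FAULT_HWPOISON

/* Hypothetical stand-in for struct vm_fault, plus bookkeeping counters. */
struct model_vmf {
	void *page;             /* stands in for vmf->page */
	int finish_fault_calls; /* times finish_fault was reached */
	int null_page_uses;     /* times finish_fault saw page == NULL */
};

/*
 * Models only the poisoned-page branch of __do_fault(): the page is
 * dropped (vmf->page = NULL) and the chosen poisonret is returned.
 */
static unsigned int model_do_fault(struct model_vmf *vmf,
				   unsigned int poisonret)
{
	vmf->page = NULL;
	return poisonret;
}

/* Records whether it was reached with a NULL page instead of oopsing. */
static unsigned int model_finish_fault(struct model_vmf *vmf)
{
	vmf->finish_fault_calls++;
	if (vmf->page == NULL)
		vmf->null_page_uses++;
	return 0;
}

/* Models the early return in do_read_fault() (assumed mask, see above). */
static unsigned int model_do_read_fault(struct model_vmf *vmf,
					unsigned int poisonret)
{
	unsigned int ret = model_do_fault(vmf, poisonret);

	if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))
		return ret;

	ret |= model_finish_fault(vmf);
	return ret;
}
```

With poisonret = 0 (the old code) the model reaches finish_fault with a NULL page, which is the oops case from the commit message. With VM_FAULT_NOPAGE it returns early and finish_fault is never called. The sketch says nothing about where a new pte gets installed; presumably the access simply faults again and picks up a fresh page, but that part is an assumption.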