On Fri, Jul 12, 2019 at 01:20:56PM +0200, Jan Kara wrote:
On Fri 12-07-19 11:10:43, Mel Gorman wrote:
On Fri, Jul 12, 2019 at 11:17:46AM +0200, Jan Kara wrote:
On Thu 11-07-19 17:04:55, Andrew Morton wrote:
On Thu, 11 Jul 2019 14:58:38 +0200 Jan Kara <jack@suse.cz> wrote:
buffer_migrate_page_norefs() can race with bh users in the following way:
CPU1                                    CPU2
buffer_migrate_page_norefs()
  buffer_migrate_lock_buffers()
  checks bh refs
  spin_unlock(&mapping->private_lock)
                                        __find_get_block()
                                          spin_lock(&mapping->private_lock)
                                          grab bh ref
                                          spin_unlock(&mapping->private_lock)
  move page
                                        do bh work
This can result in various issues like lost updates to buffers (i.e. metadata corruption) or use-after-free issues for the old page.
Closing this race window is relatively difficult. We could hold mapping->private_lock in buffer_migrate_page_norefs() until we are finished with migrating the page, but the lock hold times would be rather long. So let's revert to a more careful variant of page migration that requires eviction of buffers on the migrated page. This is effectively fallback_migrate_page() that additionally invalidates bh LRUs in case try_to_free_buffers() failed.
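As an illustration, the shape of that fallback would be roughly the following. This is a minimal sketch, not the actual patch: the function name is invented, and the dirty-page writeout and async-mode checks that fallback_migrate_page() performs are omitted.

static int buffer_migrate_page_norefs_sketch(struct address_space *mapping,
		struct page *newpage, struct page *page,
		enum migrate_mode mode)
{
	bool invalidated = false;

retry:
	/* The page is already locked by the migration code at this point. */
	if (try_to_free_buffers(page))
		/* No buffers left, migrate as an ordinary page. */
		return migrate_page(mapping, newpage, page, mode);

	if (!invalidated) {
		/* Stray bh references may be held by the per-CPU bh LRUs. */
		invalidate_bh_lrus();
		invalidated = true;
		goto retry;
	}
	/* Buffers are genuinely in use, fail this migration attempt. */
	return -EAGAIN;
}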
Is this premature optimization? Holding ->private_lock while messing with the buffers would be the standard way of addressing this. The longer hold times *might* be an issue, but we don't know this, do we? If there are indeed such problems then they could be improved by, say, doing more of the newpage preparation prior to taking ->private_lock.
I didn't check how long the private_lock hold times would actually be; it just seems there's a lot of work done before the page is fully migrated and we could release the lock. And since the lock blocks bh lookup, set_page_dirty(), etc. for the whole device, it just seemed like a bad idea. I don't think much of the newpage setup can be moved outside of private_lock - in particular, page cache replacement, page copying, and page state migration all need to happen there so that the bh code doesn't get confused.
But I guess it's fair to measure at least ballpark numbers of what the lock hold times would be, to get an idea of whether the contention concern is substantiated or not.
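To make the ballpark concrete, what would be measured is roughly the shape below: a simplified, hypothetical sketch of __buffer_migrate_page() with private_lock held across the whole migration (buffer locking, old-page refcount handling, and most error paths are elided, and it assumes the 5.2-era migrate_page_move_mapping() signature).

static int migrate_bdev_page_locked_sketch(struct address_space *mapping,
		struct page *newpage, struct page *page,
		enum migrate_mode mode)
{
	struct buffer_head *bh, *head = page_buffers(page);
	int rc;

	spin_lock(&mapping->private_lock);
	/* New bh lookups are now blocked; recheck that nobody holds a ref. */
	bh = head;
	do {
		if (atomic_read(&bh->b_count)) {
			spin_unlock(&mapping->private_lock);
			return -EAGAIN;
		}
		bh = bh->b_this_page;
	} while (bh != head);

	/* Page cache replacement must happen while lookups are blocked. */
	rc = migrate_page_move_mapping(mapping, newpage, page, mode, 0);
	if (rc != MIGRATEPAGE_SUCCESS) {
		spin_unlock(&mapping->private_lock);
		return rc;
	}

	/* Move the buffers to the new page, then copy data and page state. */
	bh = head;
	do {
		set_bh_page(bh, newpage, bh_offset(bh));
		bh = bh->b_this_page;
	} while (bh != head);
	attach_page_buffers(newpage, head);
	migrate_page_copy(newpage, page);

	spin_unlock(&mapping->private_lock);
	return MIGRATEPAGE_SUCCESS;
}

The lookup side (__find_get_block(), set_page_dirty() on the block device mapping) contends on the same lock, so the hold time across the mapping replacement and the page copy is the interesting number.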
I think it would be tricky to measure and quantify how much of an issue the contention is. While it would be possible to construct a microbenchmark that should illustrate the problem, it would tell us relatively little about how much of a problem it is in general. It would be difficult to detect the contention, and stalls in block lookups due to migration would be tricky to spot. Careful use of lock_stat might help, but enabling that has consequences of its own.
However, a rise in allocation failures due to dirty pages not being migrated is relatively easy to detect, and the consequences are relatively benign -- a failed high-order allocation is usually acceptable, whereas a stall on block lookups could have a wider general impact.
On that basis, I think the patch you proposed is more appropriate as a fix for the race, which has the potential for data corruption. So:
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Thanks. And I agree with you that detecting failed migrations is generally easier than detecting private_lock contention. Anyway, out of curiosity, I ran the thpfioscale workload in mmtests with some additional metadata workload on the system to increase the proportion of bdev page cache, and added tracepoints to see how long the relevant part of __buffer_migrate_page() lasts (patch attached). The longest duration of the critical section was 311 us, which is significant, but that was an outlier by far. Most of the time the critical section lasted a couple of microseconds. The full histogram is here:
[min - 0.000006]:       2907    93.202950%
(0.000006 - 0.000011]:   105     3.366464%
(0.000011 - 0.000016]:    36     1.154216%
(0.000016 - 0.000021]:    45     1.442770%
(0.000021 - 0.000026]:    13     0.416800%
(0.000026 - 0.000031]:     4     0.128246%
(0.000031 - 0.000036]:     2     0.064123%
(0.000036 - 0.000041]:     2     0.064123%
(0.000041 - 0.000046]:     1     0.032062%
(0.000046 - 0.000050]:     1     0.032062%
(0.000050 - 0.000055]:     0     0.000000%
(0.000055 - 0.000060]:     0     0.000000%
(0.000060 - 0.000065]:     0     0.000000%
(0.000065 - 0.000070]:     0     0.000000%
(0.000070 - 0.000075]:     2     0.064123%
(0.000075 - 0.000080]:     0     0.000000%
(0.000080 - 0.000085]:     0     0.000000%
(0.000085 - 0.000090]:     0     0.000000%
(0.000090 - 0.000095]:     0     0.000000%
(0.000095 - max]:          1     0.032062%
So although I still think that just failing the migration if we cannot invalidate buffer heads is the safer choice, extending the private_lock protected section does not seem as bad as I feared.
That does not seem too bad, and your revised patch looks functionally fine. I'd leave out the tracepoints though, because a perf probe would have gathered roughly the same data and the tracepoints may be too specific to track another class of problem. Whether or not the tracepoints survive, and with a changelog added:
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Andrew, which version do you want to go with, the original version or this one that holds private_lock for slightly longer during migration?
From fd584fb6fa6d48e1fae1077d2ab0021ae9c98edf Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 10 Jul 2019 11:31:01 +0200
Subject: [PATCH] mm: migrate: Fix race with __find_get_block()

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/trace/events/migrate.h | 42 ++++++++++++++++++++++++++++++++++++++++++
 mm/migrate.c                   |  8 +++++++-
 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 705b33d1e395..15473a508216 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -70,6 +70,48 @@ TRACE_EVENT(mm_migrate_pages,
 		__print_symbolic(__entry->mode, MIGRATE_MODE),
 		__print_symbolic(__entry->reason, MIGRATE_REASON))
 );
 
+TRACE_EVENT(mm_migrate_buffers_begin,
+
+	TP_PROTO(struct address_space *mapping, unsigned long index),
+
+	TP_ARGS(mapping, index),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, mapping)
+		__field(unsigned long, index)
+	),
+
+	TP_fast_assign(
+		__entry->mapping = (unsigned long)mapping;
+		__entry->index = index;
+	),
+
+	TP_printk("mapping=%lx index=%lu", __entry->mapping, __entry->index)
+);
+
+TRACE_EVENT(mm_migrate_buffers_end,
+
+	TP_PROTO(struct address_space *mapping, unsigned long index, int rc),
+
+	TP_ARGS(mapping, index, rc),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, mapping)
+		__field(unsigned long, index)
+		__field(int, rc)
+	),
+
+	TP_fast_assign(
+		__entry->mapping = (unsigned long)mapping;
+		__entry->index = index;
+		__entry->rc = rc;
+	),
+
+	TP_printk("mapping=%lx index=%lu rc=%d", __entry->mapping, __entry->index, __entry->rc)
+);
+
 #endif /* _TRACE_MIGRATE_H */
 
 /* This part must be outside protection */
diff --git a/mm/migrate.c b/mm/migrate.c
index e9594bc0d406..bce0f1b03a60 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -763,6 +763,7 @@ static int __buffer_migrate_page(struct address_space *mapping,
 recheck_buffers:
 		busy = false;
 		spin_lock(&mapping->private_lock);
+		trace_mm_migrate_buffers_begin(mapping, page->index);
 		bh = head;
 		do {
 			if (atomic_read(&bh->b_count)) {
@@ -771,12 +772,13 @@ static int __buffer_migrate_page(struct address_space *mapping,
 			}
 			bh = bh->b_this_page;
 		} while (bh != head);
-		spin_unlock(&mapping->private_lock);
 		if (busy) {
 			if (invalidated) {
 				rc = -EAGAIN;
 				goto unlock_buffers;
 			}
+			trace_mm_migrate_buffers_end(mapping, page->index, -EAGAIN);
+			spin_unlock(&mapping->private_lock);
 			invalidate_bh_lrus();
 			invalidated = true;
 			goto recheck_buffers;
@@ -809,6 +811,10 @@ static int __buffer_migrate_page(struct address_space *mapping,
 
 	rc = MIGRATEPAGE_SUCCESS;
 unlock_buffers:
+	if (check_refs) {
+		trace_mm_migrate_buffers_end(mapping, page->index, rc);
+		spin_unlock(&mapping->private_lock);
+	}
 	bh = head;
 	do {
 		unlock_buffer(bh);
-- 
2.16.4