The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From debd1c065d2037919a7da67baf55cc683fee09f0 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <nborisov(a)suse.com>
Date: Fri, 17 May 2019 10:44:25 +0300
Subject: [PATCH] btrfs: Ensure replaced device doesn't have pending chunk
allocation
Recent FITRIM work, namely bbbf7243d62d ("btrfs: combine device update
operations during transaction commit") combined the way certain
operations are recoded in a transaction. As a result an ASSERT was added
in dev_replace_finish to ensure the new code works correctly.
Unfortunately I got reports that it's possible to trigger the assert,
meaning that during a device replace it's possible to have an unfinished
chunk allocation on the source device.
This is supposed to be prevented by the fact that a transaction is
committed before finishing the replace oepration and alter acquiring the
chunk mutex. This is not sufficient since by the time the transaction is
committed and the chunk mutex acquired it's possible to allocate a chunk
depending on the workload being executed on the replaced device. This
bug has been present ever since device replace was introduced but there
was never code which checks for it.
The correct way to fix is to ensure that there is no pending device
modification operation when the chunk mutex is acquire and if there is
repeat transaction commit. Unfortunately it's not possible to just
exclude the source device from btrfs_fs_devices::dev_alloc_list since
this causes ENOSPC to be hit in transaction commit.
Fixing that in another way would need to add special cases to handle the
last writes and forbid new ones. The looped transaction fix is more
obvious, and can be easily backported. The runtime of dev-replace is
long so there's no noticeable delay caused by that.
Reported-by: David Sterba <dsterba(a)suse.com>
Fixes: 391cd9df81ac ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Nikolay Borisov <nborisov(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 55c15f31d00d..ee0989c7e3a9 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -603,17 +603,33 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
}
btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
- trans = btrfs_start_transaction(root, 0);
- if (IS_ERR(trans)) {
- mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
- return PTR_ERR(trans);
+ /*
+ * We have to use this loop approach because at this point src_device
+ * has to be available for transaction commit to complete, yet new
+ * chunks shouldn't be allocated on the device.
+ */
+ while (1) {
+ trans = btrfs_start_transaction(root, 0);
+ if (IS_ERR(trans)) {
+ mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
+ return PTR_ERR(trans);
+ }
+ ret = btrfs_commit_transaction(trans);
+ WARN_ON(ret);
+
+ /* Prevent write_all_supers() during the finishing procedure */
+ mutex_lock(&fs_info->fs_devices->device_list_mutex);
+ /* Prevent new chunks being allocated on the source device */
+ mutex_lock(&fs_info->chunk_mutex);
+
+ if (!list_empty(&src_device->post_commit_list)) {
+ mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+ mutex_unlock(&fs_info->chunk_mutex);
+ } else {
+ break;
+ }
}
- ret = btrfs_commit_transaction(trans);
- WARN_ON(ret);
- /* keep away write_all_supers() during the finishing procedure */
- mutex_lock(&fs_info->fs_devices->device_list_mutex);
- mutex_lock(&fs_info->chunk_mutex);
down_write(&dev_replace->rwsem);
dev_replace->replace_state =
scrub_ret ? BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED
@@ -662,7 +678,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
btrfs_device_set_disk_total_bytes(tgt_device,
src_device->disk_total_bytes);
btrfs_device_set_bytes_used(tgt_device, src_device->bytes_used);
- ASSERT(list_empty(&src_device->post_commit_list));
tgt_device->commit_total_bytes = src_device->commit_total_bytes;
tgt_device->commit_bytes_used = src_device->bytes_used;
The patch below does not apply to the 5.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From debd1c065d2037919a7da67baf55cc683fee09f0 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <nborisov(a)suse.com>
Date: Fri, 17 May 2019 10:44:25 +0300
Subject: [PATCH] btrfs: Ensure replaced device doesn't have pending chunk
allocation
Recent FITRIM work, namely bbbf7243d62d ("btrfs: combine device update
operations during transaction commit") combined the way certain
operations are recoded in a transaction. As a result an ASSERT was added
in dev_replace_finish to ensure the new code works correctly.
Unfortunately I got reports that it's possible to trigger the assert,
meaning that during a device replace it's possible to have an unfinished
chunk allocation on the source device.
This is supposed to be prevented by the fact that a transaction is
committed before finishing the replace oepration and alter acquiring the
chunk mutex. This is not sufficient since by the time the transaction is
committed and the chunk mutex acquired it's possible to allocate a chunk
depending on the workload being executed on the replaced device. This
bug has been present ever since device replace was introduced but there
was never code which checks for it.
The correct way to fix is to ensure that there is no pending device
modification operation when the chunk mutex is acquire and if there is
repeat transaction commit. Unfortunately it's not possible to just
exclude the source device from btrfs_fs_devices::dev_alloc_list since
this causes ENOSPC to be hit in transaction commit.
Fixing that in another way would need to add special cases to handle the
last writes and forbid new ones. The looped transaction fix is more
obvious, and can be easily backported. The runtime of dev-replace is
long so there's no noticeable delay caused by that.
Reported-by: David Sterba <dsterba(a)suse.com>
Fixes: 391cd9df81ac ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Nikolay Borisov <nborisov(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 55c15f31d00d..ee0989c7e3a9 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -603,17 +603,33 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
}
btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
- trans = btrfs_start_transaction(root, 0);
- if (IS_ERR(trans)) {
- mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
- return PTR_ERR(trans);
+ /*
+ * We have to use this loop approach because at this point src_device
+ * has to be available for transaction commit to complete, yet new
+ * chunks shouldn't be allocated on the device.
+ */
+ while (1) {
+ trans = btrfs_start_transaction(root, 0);
+ if (IS_ERR(trans)) {
+ mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
+ return PTR_ERR(trans);
+ }
+ ret = btrfs_commit_transaction(trans);
+ WARN_ON(ret);
+
+ /* Prevent write_all_supers() during the finishing procedure */
+ mutex_lock(&fs_info->fs_devices->device_list_mutex);
+ /* Prevent new chunks being allocated on the source device */
+ mutex_lock(&fs_info->chunk_mutex);
+
+ if (!list_empty(&src_device->post_commit_list)) {
+ mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+ mutex_unlock(&fs_info->chunk_mutex);
+ } else {
+ break;
+ }
}
- ret = btrfs_commit_transaction(trans);
- WARN_ON(ret);
- /* keep away write_all_supers() during the finishing procedure */
- mutex_lock(&fs_info->fs_devices->device_list_mutex);
- mutex_lock(&fs_info->chunk_mutex);
down_write(&dev_replace->rwsem);
dev_replace->replace_state =
scrub_ret ? BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED
@@ -662,7 +678,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
btrfs_device_set_disk_total_bytes(tgt_device,
src_device->disk_total_bytes);
btrfs_device_set_bytes_used(tgt_device, src_device->bytes_used);
- ASSERT(list_empty(&src_device->post_commit_list));
tgt_device->commit_total_bytes = src_device->commit_total_bytes;
tgt_device->commit_bytes_used = src_device->bytes_used;
The patch titled
Subject: mm: mmu_gather: remove __tlb_reset_range() for force flush
has been added to the -mm tree. Its filename is
mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-mmu_gather-remove-__tlb_reset_r…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-mmu_gather-remove-__tlb_reset_r…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Yang Shi <yang.shi(a)linux.alibaba.com>
Subject: mm: mmu_gather: remove __tlb_reset_range() for force flush
A few new fields were added to mmu_gather to make TLB flush smarter for
huge page by telling what level of page table is changed.
__tlb_reset_range() is used to reset all these page table state to
unchanged, which is called by TLB flush for parallel mapping changes for
the same range under non-exclusive lock (i.e. read mmap_sem). Before
commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"),
the syscalls (e.g. MADV_DONTNEED, MADV_FREE) which may update PTEs in
parallel don't remove page tables. But, the forementioned commit may do
munmap() under read mmap_sem and free page tables. This may result in
program hang on aarch64 reported by Jan Stancek. The problem could be
reproduced by his test program with slightly modified below.
---8<---
static int map_size = 4096;
static int num_iter = 500;
static long threads_total;
static void *distant_area;
void *map_write_unmap(void *ptr)
{
int *fd = ptr;
unsigned char *map_address;
int i, j = 0;
for (i = 0; i < num_iter; i++) {
map_address = mmap(distant_area, (size_t) map_size, PROT_WRITE | PROT_READ,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (map_address == MAP_FAILED) {
perror("mmap");
exit(1);
}
for (j = 0; j < map_size; j++)
map_address[j] = 'b';
if (munmap(map_address, map_size) == -1) {
perror("munmap");
exit(1);
}
}
return NULL;
}
void *dummy(void *ptr)
{
return NULL;
}
int main(void)
{
pthread_t thid[2];
/* hint for mmap in map_write_unmap() */
distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
distant_area += DISTANT_MMAP_SIZE / 2;
while (1) {
pthread_create(&thid[0], NULL, map_write_unmap, NULL);
pthread_create(&thid[1], NULL, dummy, NULL);
pthread_join(thid[0], NULL);
pthread_join(thid[1], NULL);
}
}
---8<---
The program may bring in parallel execution like below:
t1 t2
munmap(map_address)
downgrade_write(&mm->mmap_sem);
unmap_region()
tlb_gather_mmu()
inc_tlb_flush_pending(tlb->mm);
free_pgtables()
tlb->freed_tables = 1
tlb->cleared_pmds = 1
pthread_exit()
madvise(thread_stack, 8M, MADV_DONTNEED)
zap_page_range()
tlb_gather_mmu()
inc_tlb_flush_pending(tlb->mm);
tlb_finish_mmu()
if (mm_tlb_flush_nested(tlb->mm))
__tlb_reset_range()
__tlb_reset_range() would reset freed_tables and cleared_* bits, but this
may cause inconsistency for munmap() which do free page tables. Then it
may result in some architectures, e.g. aarch64, may not flush TLB
completely as expected to have stale TLB entries remained.
Use fullmm flush since it yields much better performance on aarch64 and
non-fullmm doesn't yields significant difference on x86.
The original proposed fix came from Jan Stancek who mainly debugged this
issue, I just wrapped up everything together.
Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.…
Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Signed-off-by: Yang Shi <yang.shi(a)linux.alibaba.com>
Signed-off-by: Jan Stancek <jstancek(a)redhat.com>
Reported-by: Jan Stancek <jstancek(a)redhat.com>
Tested-by: Jan Stancek <jstancek(a)redhat.com>
Suggested-by: Will Deacon <will.deacon(a)arm.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Nick Piggin <npiggin(a)gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.ibm.com>
Cc: Nadav Amit <namit(a)vmware.com>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: <stable(a)vger.kernel.org> [4.20+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mmu_gather.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
--- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
+++ a/mm/mmu_gather.c
@@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
{
/*
* If there are parallel threads are doing PTE changes on same range
- * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
- * flush by batching, a thread has stable TLB entry can fail to flush
- * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
- * forcefully if we detect parallel PTE batching threads.
+ * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
+ * flush by batching, one thread may end up seeing inconsistent PTEs
+ * and result in having stale TLB entries. So flush TLB forcefully
+ * if we detect parallel PTE batching threads.
+ *
+ * However, some syscalls, e.g. munmap(), may free page tables, this
+ * needs force flush everything in the given range. Otherwise this
+ * may result in having stale TLB entries for some architectures,
+ * e.g. aarch64, that could specify flush what level TLB.
*/
if (mm_tlb_flush_nested(tlb->mm)) {
+ /*
+ * The aarch64 yields better performance with fullmm by
+ * avoiding multiple CPUs spamming TLBI messages at the
+ * same time.
+ *
+ * On x86 non-fullmm doesn't yield significant difference
+ * against fullmm.
+ */
+ tlb->fullmm = 1;
__tlb_reset_range(tlb);
- __tlb_adjust_range(tlb, start, end - start);
+ tlb->freed_tables = 1;
}
tlb_flush_mmu(tlb);
_
Patches currently in -mm which might be from yang.shi(a)linux.alibaba.com are
mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 6b1f72e5b82a5c2a4da4d1ebb8cc01913ddbea21 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 20 May 2019 09:55:42 +0100
Subject: [PATCH] Btrfs: incremental send, fix file corruption when no-holes
feature is enabled
When using the no-holes feature, if we have a file with prealloc extents
with a start offset beyond the file's eof, doing an incremental send can
cause corruption of the file due to incorrect hole detection. Such case
requires that the prealloc extent(s) exist in both the parent and send
snapshots, and that a hole is punched into the file that covers all its
extents that do not cross the eof boundary.
Example reproducer:
$ mkfs.btrfs -f -O no-holes /dev/sdb
$ mount /dev/sdb /mnt/sdb
$ xfs_io -f -c "pwrite -S 0xab 0 500K" /mnt/sdb/foobar
$ xfs_io -c "falloc -k 1200K 800K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
$ btrfs send -f /tmp/base.snap /mnt/sdb/base
$ xfs_io -c "fpunch 0 500K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
$ btrfs send -p /mnt/sdb/base -f /tmp/incr.snap /mnt/sdb/incr
$ md5sum /mnt/sdb/incr/foobar
816df6f64deba63b029ca19d880ee10a /mnt/sdb/incr/foobar
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ btrfs receive -f /tmp/base.snap /mnt/sdc
$ btrfs receive -f /tmp/incr.snap /mnt/sdc
$ md5sum /mnt/sdc/incr/foobar
cf2ef71f4a9e90c2f6013ba3b2257ed2 /mnt/sdc/incr/foobar
--> Different checksum, because the prealloc extent beyond the
file's eof confused the hole detection code and it assumed
a hole starting at offset 0 and ending at the offset of the
prealloc extent (1200Kb) instead of ending at the offset
500Kb (the file's size).
Fix this by ensuring we never cross the file's size when issuing the
write operations for a hole.
Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents")
CC: stable(a)vger.kernel.org # 3.14+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index dd38dfe174df..ca30b1f06e2d 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -4999,6 +4999,12 @@ static int send_hole(struct send_ctx *sctx, u64 end)
if (offset >= sctx->cur_inode_size)
return 0;
+ /*
+ * Don't go beyond the inode's i_size due to prealloc extents that start
+ * after the i_size.
+ */
+ end = min_t(u64, end, sctx->cur_inode_size);
+
if (sctx->flags & BTRFS_SEND_FLAG_NO_FILE_DATA)
return send_update_extent(sctx, offset, end - offset);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 6b1f72e5b82a5c2a4da4d1ebb8cc01913ddbea21 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 20 May 2019 09:55:42 +0100
Subject: [PATCH] Btrfs: incremental send, fix file corruption when no-holes
feature is enabled
When using the no-holes feature, if we have a file with prealloc extents
with a start offset beyond the file's eof, doing an incremental send can
cause corruption of the file due to incorrect hole detection. Such case
requires that the prealloc extent(s) exist in both the parent and send
snapshots, and that a hole is punched into the file that covers all its
extents that do not cross the eof boundary.
Example reproducer:
$ mkfs.btrfs -f -O no-holes /dev/sdb
$ mount /dev/sdb /mnt/sdb
$ xfs_io -f -c "pwrite -S 0xab 0 500K" /mnt/sdb/foobar
$ xfs_io -c "falloc -k 1200K 800K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
$ btrfs send -f /tmp/base.snap /mnt/sdb/base
$ xfs_io -c "fpunch 0 500K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
$ btrfs send -p /mnt/sdb/base -f /tmp/incr.snap /mnt/sdb/incr
$ md5sum /mnt/sdb/incr/foobar
816df6f64deba63b029ca19d880ee10a /mnt/sdb/incr/foobar
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ btrfs receive -f /tmp/base.snap /mnt/sdc
$ btrfs receive -f /tmp/incr.snap /mnt/sdc
$ md5sum /mnt/sdc/incr/foobar
cf2ef71f4a9e90c2f6013ba3b2257ed2 /mnt/sdc/incr/foobar
--> Different checksum, because the prealloc extent beyond the
file's eof confused the hole detection code and it assumed
a hole starting at offset 0 and ending at the offset of the
prealloc extent (1200Kb) instead of ending at the offset
500Kb (the file's size).
Fix this by ensuring we never cross the file's size when issuing the
write operations for a hole.
Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents")
CC: stable(a)vger.kernel.org # 3.14+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index dd38dfe174df..ca30b1f06e2d 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -4999,6 +4999,12 @@ static int send_hole(struct send_ctx *sctx, u64 end)
if (offset >= sctx->cur_inode_size)
return 0;
+ /*
+ * Don't go beyond the inode's i_size due to prealloc extents that start
+ * after the i_size.
+ */
+ end = min_t(u64, end, sctx->cur_inode_size);
+
if (sctx->flags & BTRFS_SEND_FLAG_NO_FILE_DATA)
return send_update_extent(sctx, offset, end - offset);
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 60d9f50308e5df19bc18c2fefab0eba4a843900a Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Thu, 16 May 2019 15:48:55 +0100
Subject: [PATCH] Btrfs: fix fsync not persisting changed attributes of a
directory
While logging an inode we follow its ancestors and for each one we mark
it as logged in the current transaction, even if we have not logged it.
As a consequence if we change an attribute of an ancestor, such as the
UID or GID for example, and then explicitly fsync it, we end up not
logging the inode at all despite returning success to user space, which
results in the attribute being lost if a power failure happens after
the fsync.
Sample reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir
$ chown 6007:6007 /mnt/dir
$ sync
$ chown 9003:9003 /mnt/dir
$ touch /mnt/dir/file
$ xfs_io -c fsync /mnt/dir/file
# fsync our directory after fsync'ing the new file, should persist the
# new values for the uid and gid.
$ xfs_io -c fsync /mnt/dir
<power failure>
$ mount /dev/sdb /mnt
$ stat -c %u:%g /mnt/dir
6007:6007
--> should be 9003:9003, the uid and gid were not persisted, despite
the explicit fsync on the directory prior to the power failure
Fix this by not updating the logged_trans field of ancestor inodes when
logging an inode, since we have not logged them. Let only future calls to
btrfs_log_inode() to mark inodes as logged.
This could be triggered by my recent fsync fuzz tester for fstests, for
which an fstests patch exists titled "fstests: generic, fsync fuzz tester
with fsstress".
Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 6c47f6ed3e94..de729acee738 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5478,7 +5478,6 @@ static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
{
int ret = 0;
struct dentry *old_parent = NULL;
- struct btrfs_inode *orig_inode = inode;
/*
* for regular files, if its inode is already on disk, we don't
@@ -5498,16 +5497,6 @@ static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
}
while (1) {
- /*
- * If we are logging a directory then we start with our inode,
- * not our parent's inode, so we need to skip setting the
- * logged_trans so that further down in the log code we don't
- * think this inode has already been logged.
- */
- if (inode != orig_inode)
- inode->logged_trans = trans->transid;
- smp_mb();
-
if (btrfs_must_commit_transaction(trans, inode)) {
ret = 1;
break;
@@ -6384,7 +6373,6 @@ void btrfs_record_unlink_dir(struct btrfs_trans_handle *trans,
* if this directory was already logged any new
* names for this file/dir will get recorded
*/
- smp_mb();
if (dir->logged_trans == trans->transid)
return;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 60d9f50308e5df19bc18c2fefab0eba4a843900a Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Thu, 16 May 2019 15:48:55 +0100
Subject: [PATCH] Btrfs: fix fsync not persisting changed attributes of a
directory
While logging an inode we follow its ancestors and for each one we mark
it as logged in the current transaction, even if we have not logged it.
As a consequence if we change an attribute of an ancestor, such as the
UID or GID for example, and then explicitly fsync it, we end up not
logging the inode at all despite returning success to user space, which
results in the attribute being lost if a power failure happens after
the fsync.
Sample reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir
$ chown 6007:6007 /mnt/dir
$ sync
$ chown 9003:9003 /mnt/dir
$ touch /mnt/dir/file
$ xfs_io -c fsync /mnt/dir/file
# fsync our directory after fsync'ing the new file, should persist the
# new values for the uid and gid.
$ xfs_io -c fsync /mnt/dir
<power failure>
$ mount /dev/sdb /mnt
$ stat -c %u:%g /mnt/dir
6007:6007
--> should be 9003:9003, the uid and gid were not persisted, despite
the explicit fsync on the directory prior to the power failure
Fix this by not updating the logged_trans field of ancestor inodes when
logging an inode, since we have not logged them. Let only future calls to
btrfs_log_inode() to mark inodes as logged.
This could be triggered by my recent fsync fuzz tester for fstests, for
which an fstests patch exists titled "fstests: generic, fsync fuzz tester
with fsstress".
Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 6c47f6ed3e94..de729acee738 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5478,7 +5478,6 @@ static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
{
int ret = 0;
struct dentry *old_parent = NULL;
- struct btrfs_inode *orig_inode = inode;
/*
* for regular files, if its inode is already on disk, we don't
@@ -5498,16 +5497,6 @@ static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
}
while (1) {
- /*
- * If we are logging a directory then we start with our inode,
- * not our parent's inode, so we need to skip setting the
- * logged_trans so that further down in the log code we don't
- * think this inode has already been logged.
- */
- if (inode != orig_inode)
- inode->logged_trans = trans->transid;
- smp_mb();
-
if (btrfs_must_commit_transaction(trans, inode)) {
ret = 1;
break;
@@ -6384,7 +6373,6 @@ void btrfs_record_unlink_dir(struct btrfs_trans_handle *trans,
* if this directory was already logged any new
* names for this file/dir will get recorded
*/
- smp_mb();
if (dir->logged_trans == trans->transid)
return;
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5338e43abbab13791144d37fd8846847062351c6 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Wed, 15 May 2019 16:02:47 +0100
Subject: [PATCH] Btrfs: fix wrong ctime and mtime of a directory after log
replay
When replaying a log that contains a new file or directory name that needs
to be added to its parent directory, we end up updating the mtime and the
ctime of the parent directory to the current time after we have set their
values to the correct ones (set at fsync time), efectivelly losing them.
Sample reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir
$ touch /mnt/dir/file
# fsync of the directory is optional, not needed
$ xfs_io -c fsync /mnt/dir
$ xfs_io -c fsync /mnt/dir/file
$ stat -c %Y /mnt/dir
1557856079
<power failure>
$ sleep 3
$ mount /dev/sdb /mnt
$ stat -c %Y /mnt/dir
1557856082
--> should have been 1557856079, the mtime is updated to the current
time when replaying the log
Fix this by not updating the mtime and ctime to the current time at
btrfs_add_link() when we are replaying a log tree.
This could be triggered by my recent fsync fuzz tester for fstests, for
which an fstests patch exists titled "fstests: generic, fsync fuzz tester
with fsstress".
Fixes: e02119d5a7b43 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b6d549c993f6..6bebc0ca751d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6433,8 +6433,18 @@ int btrfs_add_link(struct btrfs_trans_handle *trans,
btrfs_i_size_write(parent_inode, parent_inode->vfs_inode.i_size +
name_len * 2);
inode_inc_iversion(&parent_inode->vfs_inode);
- parent_inode->vfs_inode.i_mtime = parent_inode->vfs_inode.i_ctime =
- current_time(&parent_inode->vfs_inode);
+ /*
+ * If we are replaying a log tree, we do not want to update the mtime
+ * and ctime of the parent directory with the current time, since the
+ * log replay procedure is responsible for setting them to their correct
+ * values (the ones it had when the fsync was done).
+ */
+ if (!test_bit(BTRFS_FS_LOG_RECOVERING, &root->fs_info->flags)) {
+ struct timespec64 now = current_time(&parent_inode->vfs_inode);
+
+ parent_inode->vfs_inode.i_mtime = now;
+ parent_inode->vfs_inode.i_ctime = now;
+ }
ret = btrfs_update_inode(trans, root, &parent_inode->vfs_inode);
if (ret)
btrfs_abort_transaction(trans, ret);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5338e43abbab13791144d37fd8846847062351c6 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Wed, 15 May 2019 16:02:47 +0100
Subject: [PATCH] Btrfs: fix wrong ctime and mtime of a directory after log
replay
When replaying a log that contains a new file or directory name that needs
to be added to its parent directory, we end up updating the mtime and the
ctime of the parent directory to the current time after we have set their
values to the correct ones (set at fsync time), efectivelly losing them.
Sample reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir
$ touch /mnt/dir/file
# fsync of the directory is optional, not needed
$ xfs_io -c fsync /mnt/dir
$ xfs_io -c fsync /mnt/dir/file
$ stat -c %Y /mnt/dir
1557856079
<power failure>
$ sleep 3
$ mount /dev/sdb /mnt
$ stat -c %Y /mnt/dir
1557856082
--> should have been 1557856079, the mtime is updated to the current
time when replaying the log
Fix this by not updating the mtime and ctime to the current time at
btrfs_add_link() when we are replaying a log tree.
This could be triggered by my recent fsync fuzz tester for fstests, for
which an fstests patch exists titled "fstests: generic, fsync fuzz tester
with fsstress".
Fixes: e02119d5a7b43 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b6d549c993f6..6bebc0ca751d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6433,8 +6433,18 @@ int btrfs_add_link(struct btrfs_trans_handle *trans,
btrfs_i_size_write(parent_inode, parent_inode->vfs_inode.i_size +
name_len * 2);
inode_inc_iversion(&parent_inode->vfs_inode);
- parent_inode->vfs_inode.i_mtime = parent_inode->vfs_inode.i_ctime =
- current_time(&parent_inode->vfs_inode);
+ /*
+ * If we are replaying a log tree, we do not want to update the mtime
+ * and ctime of the parent directory with the current time, since the
+ * log replay procedure is responsible for setting them to their correct
+ * values (the ones it had when the fsync was done).
+ */
+ if (!test_bit(BTRFS_FS_LOG_RECOVERING, &root->fs_info->flags)) {
+ struct timespec64 now = current_time(&parent_inode->vfs_inode);
+
+ parent_inode->vfs_inode.i_mtime = now;
+ parent_inode->vfs_inode.i_ctime = now;
+ }
ret = btrfs_update_inode(trans, root, &parent_inode->vfs_inode);
if (ret)
btrfs_abort_transaction(trans, ret);
I'm announcing the release of the 5.1.7 kernel.
All users of the 5.1 kernel series must upgrade.
The updated 5.1.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.1.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
drivers/crypto/vmx/ghash.c | 212 ++++++-----------
drivers/net/bonding/bond_main.c | 15 -
drivers/net/dsa/mv88e6xxx/chip.c | 2
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 19 -
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 6
drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 2
drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c | 2
drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c | 5
drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 11
drivers/net/ethernet/freescale/fec_main.c | 2
drivers/net/ethernet/marvell/mvneta.c | 4
drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 10
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 13 +
drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 6
drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_erp.c | 11
drivers/net/ethernet/realtek/r8169.c | 3
drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c | 4
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 8
drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c | 3
drivers/net/phy/marvell10g.c | 13 +
drivers/net/usb/usbnet.c | 6
include/linux/siphash.h | 5
include/net/netns/ipv4.h | 2
include/uapi/linux/tipc_config.h | 10
net/core/dev.c | 2
net/core/ethtool.c | 8
net/core/skbuff.c | 6
net/ipv4/igmp.c | 47 ++-
net/ipv4/ip_output.c | 4
net/ipv4/route.c | 12
net/ipv6/ip6_output.c | 4
net/ipv6/output_core.c | 30 +-
net/ipv6/raw.c | 2
net/ipv6/route.c | 6
net/llc/llc_output.c | 2
net/sched/act_api.c | 3
net/tipc/core.c | 32 +-
net/tipc/subscr.h | 5
net/tipc/topsrv.c | 14 -
net/tls/tls_device.c | 9
net/tls/tls_sw.c | 19 -
tools/testing/selftests/net/tls.c | 34 ++
43 files changed, 359 insertions(+), 256 deletions(-)
Andy Duan (1):
net: fec: fix the clk mismatch in failed_reset path
Antoine Tenart (1):
net: mvpp2: fix bad MVPP2_TXQ_SCHED_TOKEN_CNTR_REG queue value
Chris Packham (1):
tipc: Avoid copying bytes beyond the supplied data
Daniel Axtens (1):
crypto: vmx - ghash: do nosimd fallback manually
David Ahern (1):
ipv6: Fix redirect with VRF
David S. Miller (1):
Revert "tipc: fix modprobe tipc failed after switch order of device registration"
Eric Dumazet (5):
inet: switch IP ID generator to siphash
ipv4/igmp: fix another memory leak in igmpv3_del_delrec()
ipv4/igmp: fix build error if !CONFIG_IP_MULTICAST
llc: fix skb leak in llc_build_and_send_ui_pkt()
net-gro: fix use-after-free read in napi_gro_frags()
Greg Kroah-Hartman (1):
Linux 5.1.7
Heiner Kallweit (1):
r8169: fix MAC address being lost in PCI D3
Jakub Kicinski (6):
net/tls: fix lowat calculation if some data came from previous record
selftests/tls: test for lowat overshoot with multiple records
net/tls: fix no wakeup on partial reads
selftests/tls: add test for sleeping even though there is data
net/tls: fix state removal with feature flags off
net/tls: don't ignore netdev notifications if no TLS features
Jarod Wilson (1):
bonding/802.3ad: fix slave link initialization transition states
Jiri Pirko (1):
mlxsw: spectrum_acl: Avoid warning after identical rules insertion
Jisheng Zhang (2):
net: mvneta: Fix err code path of probe
net: stmmac: fix reset gpio free missing
Junwei Hu (1):
tipc: fix modprobe tipc failed after switch order of device registration
Kloetzke Jan (1):
usbnet: fix kernel crash after disconnect
Maxime Chevallier (1):
ethtool: Check for vlan etype or vlan tci when parsing flow_rule
Michael Chan (3):
bnxt_en: Fix aggregation buffer leak under OOM condition.
bnxt_en: Fix possible BUG() condition when calling pci_disable_msix().
bnxt_en: Reduce memory usage when running in kdump kernel.
Mike Manning (1):
ipv6: Consider sk_bound_dev_if when binding a raw socket to an address
Parav Pandit (2):
net/mlx5: Avoid double free in fs init error unwinding path
net/mlx5: Allocate root ns memory using kzalloc to match kfree
Raju Rangoju (1):
cxgb4: offload VLAN flows regardless of VLAN ethtype
Rasmus Villemoes (1):
net: dsa: mv88e6xxx: fix handling of upper half of STATS_TYPE_PORT
Russell King (1):
net: phy: marvell10g: report if the PHY fails to boot firmware
Saeed Mahameed (1):
net/mlx5e: Disable rxhash when CQE compress is enabled
Tan, Tee Min (1):
net: stmmac: fix ethtool flow control not able to get/set
Vishal Kulkarni (1):
cxgb4: Revert "cxgb4: Remove SGE_HOST_PAGE_SIZE dependency on page size"
Vlad Buslov (1):
net: sched: don't use tc_action->order during action dump
Weifeng Voon (1):
net: stmmac: dma channel control register need to be init first
Willem de Bruijn (1):
net: correct zerocopy refcnt with udp MSG_MORE