From: Andreas Rohner <andreas.rohner(a)gmx.net>
Subject: nilfs2: fix race condition that causes file system corruption
There is a race condition between nilfs_dirty_inode() and
nilfs_set_file_dirty().
When a file is opened, nilfs_dirty_inode() is called to update the access
timestamp in the inode. It calls __nilfs_mark_inode_dirty() in a separate
transaction. __nilfs_mark_inode_dirty() caches the ifile buffer_head in
the i_bh field of the inode info structure and marks it as dirty.
After some data was written to the file in another transaction, the
function nilfs_set_file_dirty() is called, which adds the inode to the
ns_dirty_files list.
Then the segment construction calls nilfs_segctor_collect_dirty_files(),
which goes through the ns_dirty_files list and checks the i_bh field. If
there is a cached buffer_head in i_bh it is not marked as dirty again.
Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
transactions, it is possible that a segment construction that writes out
the ifile occurs in-between the two. If this happens the inode is not on
the ns_dirty_files list, but its ifile block is still marked as dirty and
written out.
In the next segment construction, the data for the file is written out and
nilfs_bmap_propagate() updates the b-tree. Eventually the bmap root is
written into the i_bh block, which is not dirty, because it was written
out in another segment construction.
As a result the bmap update can be lost, which leads to file system
corruption. Either the virtual block address points to an unallocated DAT
block, or the DAT entry will be reused for something different.
The error can remain undetected for a long time. A typical error message
would be one of the "bad btree" errors or a warning that a DAT entry could
not be found.
This bug can be reproduced reliably by a simple benchmark that creates and
overwrites millions of 4k files.
Link: http://lkml.kernel.org/r/1509367935-3086-2-git-send-email-konishi.ryusuke@l…
Signed-off-by: Andreas Rohner <andreas.rohner(a)gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)lab.ntt.co.jp>
Tested-by: Andreas Rohner <andreas.rohner(a)gmx.net>
Tested-by: Ryusuke Konishi <konishi.ryusuke(a)lab.ntt.co.jp>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/nilfs2/segment.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff -puN fs/nilfs2/segment.c~nilfs2-fix-race-condition-that-causes-file-system-corruption fs/nilfs2/segment.c
--- a/fs/nilfs2/segment.c~nilfs2-fix-race-condition-that-causes-file-system-corruption
+++ a/fs/nilfs2/segment.c
@@ -1954,8 +1954,6 @@ static int nilfs_segctor_collect_dirty_f
err, ii->vfs_inode.i_ino);
return err;
}
- mark_buffer_dirty(ibh);
- nilfs_mdt_mark_dirty(ifile);
spin_lock(&nilfs->ns_inode_lock);
if (likely(!ii->i_bh))
ii->i_bh = ibh;
@@ -1964,6 +1962,10 @@ static int nilfs_segctor_collect_dirty_f
goto retry;
}
+ // Always redirty the buffer to avoid race condition
+ mark_buffer_dirty(ii->i_bh);
+ nilfs_mdt_mark_dirty(ifile);
+
clear_bit(NILFS_I_QUEUED, &ii->i_state);
set_bit(NILFS_I_BUSY, &ii->i_state);
list_move_tail(&ii->i_dirty, &sci->sc_dirty_files);
_
From: NeilBrown <neilb(a)suse.com>
Subject: autofs: don't fail mount for transient error
Currently if the autofs kernel module gets an error when writing to the
pipe which links to the daemon, then it marks the whole moutpoint as
catatonic, and it will stop working.
It is possible that the error is transient. This can happen if the daemon
is slow and more than 16 requests queue up. If a subsequent process tries
to queue a request, and is then signalled, the write to the pipe will
return -ERESTARTSYS and autofs will take that as total failure.
So change the code to assess -ERESTARTSYS and -ENOMEM as transient
failures which only abort the current request, not the whole mountpoint.
It isn't a crash or a data corruption, but having autofs mountpoints
suddenly stop working is rather inconvenient.
Ian said:
: And given the problems with a half dozen (or so) user space applications
: consuming large amounts of CPU under heavy mount and umount activity this
: could happen more easily than we expect.
Link: http://lkml.kernel.org/r/87y3norvgp.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb(a)suse.com>
Acked-by: Ian Kent <raven(a)themaw.net>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/autofs4/waitq.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff -puN fs/autofs4/waitq.c~autofs-dont-fail-mount-for-transient-error fs/autofs4/waitq.c
--- a/fs/autofs4/waitq.c~autofs-dont-fail-mount-for-transient-error
+++ a/fs/autofs4/waitq.c
@@ -81,7 +81,8 @@ static int autofs4_write(struct autofs_s
spin_unlock_irqrestore(¤t->sighand->siglock, flags);
}
- return (bytes > 0);
+ /* if 'wr' returned 0 (impossible) we assume -EIO (safe) */
+ return bytes == 0 ? 0 : wr < 0 ? wr : -EIO;
}
static void autofs4_notify_daemon(struct autofs_sb_info *sbi,
@@ -95,6 +96,7 @@ static void autofs4_notify_daemon(struct
} pkt;
struct file *pipe = NULL;
size_t pktsz;
+ int ret;
pr_debug("wait id = 0x%08lx, name = %.*s, type=%d\n",
(unsigned long) wq->wait_queue_token,
@@ -169,7 +171,18 @@ static void autofs4_notify_daemon(struct
mutex_unlock(&sbi->wq_mutex);
if (autofs4_write(sbi, pipe, &pkt, pktsz))
+ switch (ret = autofs4_write(sbi, pipe, &pkt, pktsz)) {
+ case 0:
+ break;
+ case -ENOMEM:
+ case -ERESTARTSYS:
+ /* Just fail this one */
+ autofs4_wait_release(sbi, wq->wait_queue_token, ret);
+ break;
+ default:
autofs4_catatonic_mode(sbi);
+ break;
+ }
fput(pipe);
}
_
From: Vitaly Wool <vitalywool(a)gmail.com>
Subject: mm/z3fold.c: use kref to prevent page free/compact race
There is a race in the current z3fold implementation between do_compact()
called in a work queue context and the page release procedure when page's
kref goes to 0. do_compact() may be waiting for page lock, which is
released by release_z3fold_page_locked right before putting the page onto
the "stale" list, and then the page may be freed as do_compact() modifies
its contents.
The mechanism currently implemented to handle that (checking the
PAGE_STALE flag) is not reliable enough. Instead, we'll use page's kref
counter to guarantee that the page is not released if its compaction is
scheduled. It then becomes compaction function's responsibility to
decrease the counter and quit immediately if the page was actually freed.
Link: http://lkml.kernel.org/r/20171117092032.00ea56f42affbed19f4fcc6c@gmail.com
Signed-off-by: Vitaly Wool <vitaly.wool(a)sonymobile.com>
Cc: <Oleksiy.Avramchenko(a)sony.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/z3fold.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff -puN mm/z3fold.c~z3fold-use-kref-to-prevent-page-free-compact-race mm/z3fold.c
--- a/mm/z3fold.c~z3fold-use-kref-to-prevent-page-free-compact-race
+++ a/mm/z3fold.c
@@ -404,8 +404,7 @@ static void do_compact_page(struct z3fol
WARN_ON(z3fold_page_trylock(zhdr));
else
z3fold_page_lock(zhdr);
- if (test_bit(PAGE_STALE, &page->private) ||
- !test_and_clear_bit(NEEDS_COMPACTING, &page->private)) {
+ if (WARN_ON(!test_and_clear_bit(NEEDS_COMPACTING, &page->private))) {
z3fold_page_unlock(zhdr);
return;
}
@@ -413,6 +412,11 @@ static void do_compact_page(struct z3fol
list_del_init(&zhdr->buddy);
spin_unlock(&pool->lock);
+ if (kref_put(&zhdr->refcount, release_z3fold_page_locked)) {
+ atomic64_dec(&pool->pages_nr);
+ return;
+ }
+
z3fold_compact_page(zhdr);
unbuddied = get_cpu_ptr(pool->unbuddied);
fchunks = num_free_chunks(zhdr);
@@ -753,9 +757,11 @@ static void z3fold_free(struct z3fold_po
list_del_init(&zhdr->buddy);
spin_unlock(&pool->lock);
zhdr->cpu = -1;
+ kref_get(&zhdr->refcount);
do_compact_page(zhdr, true);
return;
}
+ kref_get(&zhdr->refcount);
queue_work_on(zhdr->cpu, pool->compact_wq, &zhdr->work);
z3fold_page_unlock(zhdr);
}
_
The patch titled
Subject: mm/z3fold.c: use kref to prevent page free/compact race
has been added to the -mm tree. Its filename is
z3fold-use-kref-to-prevent-page-free-compact-race.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/z3fold-use-kref-to-prevent-page-fr…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/z3fold-use-kref-to-prevent-page-fr…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/SubmitChecklist when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Vitaly Wool <vitalywool(a)gmail.com>
Subject: mm/z3fold.c: use kref to prevent page free/compact race
There is a race in the current z3fold implementation between do_compact()
called in a work queue context and the page release procedure when page's
kref goes to 0. do_compact() may be waiting for page lock, which is
released by release_z3fold_page_locked right before putting the page onto
the "stale" list, and then the page may be freed as do_compact() modifies
its contents.
The mechanism currently implemented to handle that (checking the
PAGE_STALE flag) is not reliable enough. Instead, we'll use page's kref
counter to guarantee that the page is not released if its compaction is
scheduled. It then becomes compaction function's responsibility to
decrease the counter and quit immediately if the page was actually freed.
Link: http://lkml.kernel.org/r/20171117092032.00ea56f42affbed19f4fcc6c@gmail.com
Signed-off-by: Vitaly Wool <vitaly.wool(a)sonymobile.com>
Cc: <Oleksiy.Avramchenko(a)sony.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/z3fold.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff -puN mm/z3fold.c~z3fold-use-kref-to-prevent-page-free-compact-race mm/z3fold.c
--- a/mm/z3fold.c~z3fold-use-kref-to-prevent-page-free-compact-race
+++ a/mm/z3fold.c
@@ -404,8 +404,7 @@ static void do_compact_page(struct z3fol
WARN_ON(z3fold_page_trylock(zhdr));
else
z3fold_page_lock(zhdr);
- if (test_bit(PAGE_STALE, &page->private) ||
- !test_and_clear_bit(NEEDS_COMPACTING, &page->private)) {
+ if (WARN_ON(!test_and_clear_bit(NEEDS_COMPACTING, &page->private))) {
z3fold_page_unlock(zhdr);
return;
}
@@ -413,6 +412,11 @@ static void do_compact_page(struct z3fol
list_del_init(&zhdr->buddy);
spin_unlock(&pool->lock);
+ if (kref_put(&zhdr->refcount, release_z3fold_page_locked)) {
+ atomic64_dec(&pool->pages_nr);
+ return;
+ }
+
z3fold_compact_page(zhdr);
unbuddied = get_cpu_ptr(pool->unbuddied);
fchunks = num_free_chunks(zhdr);
@@ -753,9 +757,11 @@ static void z3fold_free(struct z3fold_po
list_del_init(&zhdr->buddy);
spin_unlock(&pool->lock);
zhdr->cpu = -1;
+ kref_get(&zhdr->refcount);
do_compact_page(zhdr, true);
return;
}
+ kref_get(&zhdr->refcount);
queue_work_on(zhdr->cpu, pool->compact_wq, &zhdr->work);
z3fold_page_unlock(zhdr);
}
_
Patches currently in -mm which might be from vitalywool(a)gmail.com are
z3fold-use-kref-to-prevent-page-free-compact-race.patch
A new field was introduced in 74d46992e0d9dee7f1f376de0d56d31614c8a17a,
bi_partno, instead of using bdev->bd_contains and encoding the partition
information in the bi_bdev field. __bio_clone_fast was changed to copy
the disk information, but not the partition information. At minimum,
this regressed bcache and caused data corruption.
Signed-off-by: Michael Lyle <mlyle(a)lyle.org>
Fixes: 74d46992e0d9dee7f1f376de0d56d31614c8a17a
Reported-by: Pavel Goran <via-bcache(a)pvgoran.name>
Reported-by: Campbell Steven <casteven(a)gmail.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Jens Axboe <axboe(a)kernel.dk>
Cc: <stable(a)vger.kernel.org>
---
block/bio.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/block/bio.c b/block/bio.c
index 101c2a9b5481..33fa6b4af312 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -597,6 +597,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
* so we don't set nor calculate new physical/hw segment counts here
*/
bio->bi_disk = bio_src->bi_disk;
+ bio->bi_partno = bio_src->bi_partno;
bio_set_flag(bio, BIO_CLONED);
bio->bi_opf = bio_src->bi_opf;
bio->bi_write_hint = bio_src->bi_write_hint;
--
2.14.1
Resending as the first attempt is not showing up in the list archive.
This patch converts several network drivers to use smp_rmb
rather than read_barrier_depends. The initial issue was
discovered with ixgbe on a Power machine which resulted
in skb list corruption due to fetching a stale skb pointer.
More details can be found in the ixgbe patch description.
Brian King (7):
ixgbe: Fix skb list corruption on Power systems
i40e: Use smp_rmb rather than read_barrier_depends
ixgbevf: Use smp_rmb rather than read_barrier_depends
igbvf: Use smp_rmb rather than read_barrier_depends
igb: Use smp_rmb rather than read_barrier_depends
fm10k: Use smp_rmb rather than read_barrier_depends
i40evf: Use smp_rmb rather than read_barrier_depends
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +-
drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 2 +-
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +-
drivers/net/ethernet/intel/igb/igb_main.c | 2 +-
drivers/net/ethernet/intel/igbvf/netdev.c | 2 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 ++-
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
8 files changed, 9 insertions(+), 8 deletions(-)
--
1.8.3.1
This is the start of the stable review cycle for the 4.9.63 release.
There are 39 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat Nov 18 17:42:01 UTC 2017.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.63-rc1.gz
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.9.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.9.63-rc1
Willy Tarreau <w(a)1wt.eu>
misc: panel: properly restore atomic counter on error path
Nicholas Bellinger <nab(a)linux-iscsi.org>
qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR ABORT (v2)
Bart Van Assche <bart.vanassche(a)sandisk.com>
target/iscsi: Fix iSCSI task reassignment handling
Chi-hsien Lin <Chi-Hsien.Lin(a)cypress.com>
brcmfmac: remove setting IBSS mode when stopping AP
Bilal Amarni <bilal.amarni(a)gmail.com>
security/keys: add CONFIG_KEYS_COMPAT to Kconfig
Florian Westphal <fw(a)strlen.de>
netfilter: nat: Revert "netfilter: nat: convert nat bysrc hash to rhashtable"
Florian Westphal <fw(a)strlen.de>
netfilter: nat: avoid use of nf_conn_nat extension
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "ARM: dts: imx53-qsb-common: fix FEC pinmux config"
Takashi Iwai <tiwai(a)suse.de>
ALSA: seq: Cancel pending autoload work at unbinding device
Dmitry Torokhov <dmitry.torokhov(a)gmail.com>
Input: ims-psu - check if CDC union descriptor is sane
Alan Stern <stern(a)rowland.harvard.edu>
usb: usbtest: fix NULL pointer dereference
Johannes Berg <johannes.berg(a)intel.com>
mac80211: don't compare TKIP TX MIC key in reinstall prevention
Jason A. Donenfeld <Jason(a)zx2c4.com>
mac80211: use constant time comparison with keys
Johannes Berg <johannes.berg(a)intel.com>
mac80211: accept key reinstall without changing anything
Guillaume Nault <g.nault(a)alphalink.fr>
ppp: fix race in ppp device destruction
Cong Wang <xiyou.wangcong(a)gmail.com>
net_sched: avoid matching qdisc with zero handle
Xin Long <lucien.xin(a)gmail.com>
sctp: reset owner sk for data chunks on out queues when migrating a sock
Julien Gomes <julien(a)arista.com>
tun: allow positive return values on dev_get_valid_name() call
Xin Long <lucien.xin(a)gmail.com>
ip6_gre: update dst pmtu if dev mtu has been updated by toobig in __gre6_xmit
Xin Long <lucien.xin(a)gmail.com>
ip6_gre: only increase err_count for some certain type icmpv6 in ip6gre_err
Xin Long <lucien.xin(a)gmail.com>
ipip: only increase err_count for some certain type icmp in ipip_err
Girish Moodalbail <girish.moodalbail(a)oracle.com>
tap: double-free in error path in tap_open()
Andrei Vagin <avagin(a)openvz.org>
net/unix: don't show information about sockets from other namespaces
Eric Dumazet <edumazet(a)google.com>
tcp/dccp: fix other lockdep splats accessing ireq_opt
Eric Dumazet <edumazet(a)google.com>
tcp/dccp: fix lockdep splat in inet_csk_route_req()
Laszlo Toth <laszlth(a)gmail.com>
sctp: full support for ipv6 ip_nonlocal_bind & IP_FREEBIND
Eric Dumazet <edumazet(a)google.com>
ipv6: flowlabel: do not leave opt->tot_len with garbage
Craig Gallek <kraig(a)google.com>
soreuseport: fix initialization race
Eric Dumazet <edumazet(a)google.com>
packet: avoid panic in packet_getsockopt()
Eric Dumazet <edumazet(a)google.com>
tcp/dccp: fix ireq->opt races
Xin Long <lucien.xin(a)gmail.com>
sctp: add the missing sock_owned_by_user check in sctp_icmp_redirect
Cong Wang <xiyou.wangcong(a)gmail.com>
tun: call dev_get_valid_name() before register_netdevice()
Guillaume Nault <g.nault(a)alphalink.fr>
l2tp: check ps->sock before running pppol2tp_session_ioctl()
Eric Dumazet <edumazet(a)google.com>
tcp: fix tcp_mtu_probe() vs highest_sack
Eric Dumazet <edumazet(a)google.com>
net: call cgroup_sk_alloc() earlier in sk_clone_lock()
Jason A. Donenfeld <Jason(a)zx2c4.com>
netlink: do not set cb_running if dump's start() errs
Eric Dumazet <edumazet(a)google.com>
ipv6: addrconf: increment ifp refcount before ipv6_del_addr()
Craig Gallek <kraig(a)google.com>
tun/tap: sanitize TUNSETSNDBUF input
Alexey Kodanev <alexey.kodanev(a)oracle.com>
gso: fix payload length when gso_size is zero
-------------
Diffstat:
Makefile | 4 +-
arch/arm/boot/dts/imx53-qsb-common.dtsi | 20 +--
arch/powerpc/Kconfig | 5 -
arch/s390/Kconfig | 3 -
arch/sparc/Kconfig | 3 -
arch/x86/Kconfig | 4 -
drivers/input/misc/ims-pcu.c | 16 ++-
drivers/misc/panel.c | 23 +++-
drivers/net/macvtap.c | 20 +--
drivers/net/ppp/ppp_generic.c | 20 +++
drivers/net/tun.c | 7 +
.../broadcom/brcm80211/brcmfmac/cfg80211.c | 3 -
drivers/scsi/qla2xxx/tcm_qla2xxx.c | 33 -----
drivers/target/iscsi/iscsi_target.c | 19 +--
drivers/usb/misc/usbtest.c | 5 +-
include/linux/netdevice.h | 3 +
include/net/inet_sock.h | 8 +-
include/net/netfilter/nf_conntrack.h | 3 +-
include/net/netfilter/nf_nat.h | 1 -
include/net/tcp.h | 6 +-
include/target/target_core_base.h | 1 +
net/core/dev.c | 6 +-
net/core/sock.c | 3 +-
net/core/sock_reuseport.c | 12 +-
net/dccp/ipv4.c | 13 +-
net/ipv4/cipso_ipv4.c | 24 +---
net/ipv4/gre_offload.c | 2 +-
net/ipv4/inet_connection_sock.c | 9 +-
net/ipv4/inet_hashtables.c | 5 +-
net/ipv4/ipip.c | 59 ++++++---
net/ipv4/syncookies.c | 2 +-
net/ipv4/tcp_input.c | 2 +-
net/ipv4/tcp_ipv4.c | 21 +--
net/ipv4/tcp_output.c | 3 +-
net/ipv4/udp.c | 5 +-
net/ipv4/udp_offload.c | 2 +-
net/ipv6/addrconf.c | 1 +
net/ipv6/ip6_flowlabel.c | 1 +
net/ipv6/ip6_gre.c | 20 ++-
net/ipv6/ip6_offload.c | 2 +-
net/ipv6/ip6_output.c | 4 +-
net/l2tp/l2tp_ppp.c | 3 +
net/mac80211/key.c | 54 +++++++-
net/netfilter/nf_conntrack_core.c | 2 +-
net/netfilter/nf_nat_core.c | 146 ++++++++-------------
net/netlink/af_netlink.c | 13 +-
net/packet/af_packet.c | 24 ++--
net/sched/sch_api.c | 2 +
net/sctp/input.c | 2 +-
net/sctp/ipv6.c | 6 +-
net/sctp/socket.c | 32 +++++
net/unix/diag.c | 2 +
security/keys/Kconfig | 4 +
sound/core/seq/seq_device.c | 3 +
54 files changed, 403 insertions(+), 293 deletions(-)