This is the start of the stable review cycle for the 4.18.16 release.
There are 53 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat Oct 20 17:53:52 UTC 2018.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.18.16-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.18.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.18.16-rc1
Alexey Brodkin <abrodkin(a)synopsys.com>
ARC: build: Don't set CROSS_COMPILE in arch's Makefile
Alexey Brodkin <abrodkin(a)synopsys.com>
ARC: build: Get rid of toolchain check
Linus Torvalds <torvalds(a)linux-foundation.org>
mremap: properly flush TLB before releasing the page
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "vfs: fix freeze protection in mnt_want_write_file() for overlayfs"
Kairui Song <kasong(a)redhat.com>
x86/boot: Fix kexec booting failure in the SEV bit detection code
Arindam Nath <arindam.nath(a)amd.com>
iommu/amd: Return devid as alias for ACPI HID devices
Srikar Dronamraju <srikar(a)linux.vnet.ibm.com>
powerpc/numa: Use associativity if VPHN hcall is successful
Michael Neuling <mikey(a)neuling.org>
powerpc/tm: Avoid possible userspace r1 corruption on reclaim
Michael Neuling <mikey(a)neuling.org>
powerpc/tm: Fix userspace r13 corruption
Daniel Kurtz <djkurtz(a)chromium.org>
pinctrl/amd: poll InterruptEnable bits in amd_gpio_irq_set_type
Heiko Stuebner <heiko(a)sntech.de>
iommu/rockchip: Free irqs in shutdown handler
James Cowgill <jcowgill(a)debian.org>
RISC-V: include linux/ftrace.h in asm-prototypes.h
Selvin Xavier <selvin.xavier(a)broadcom.com>
RDMA/bnxt_re: Fix system crash during RDMA resource initialization
Tao Ren <taoren(a)fb.com>
clocksource/drivers/fttmr010: Fix set_next_event handler
Nathan Chancellor <natechancellor(a)gmail.com>
net/mlx4: Use cpumask_available for eq->affinity_mask
John Fastabend <john.fastabend(a)gmail.com>
bpf: test_maps, only support ESTABLISHED socks
John Fastabend <john.fastabend(a)gmail.com>
bpf: sockmap, fix transition through disconnect without close
John Fastabend <john.fastabend(a)gmail.com>
bpf: sockmap only allow ESTABLISHED sock state
Johannes Thumshirn <jthumshirn(a)suse.de>
scsi: sd: don't crash the host on invalid commands
Wen Xiong <wenxiong(a)linux.vnet.ibm.com>
scsi: ipr: System hung while dlpar adding primary ipr adapter back
Alexandru Gheorghe <alexandru-cosmin.gheorghe(a)arm.com>
drm: mali-dp: Call drm_crtc_vblank_reset on device init
James Smart <jsmart2021(a)gmail.com>
scsi: lpfc: Synchronize access to remoteport via rport
Majd Dibbiny <majd(a)mellanox.com>
RDMA/uverbs: Fix validity check for modify QP
Jisheng Zhang <Jisheng.Zhang(a)synaptics.com>
PCI: dwc: Fix scheduling while atomic issues
Sudarsana Reddy Kalluru <sudarsana.kalluru(a)cavium.com>
qed: Do not add VLAN 0 tag to untagged frames in multi-function mode.
Sudarsana Reddy Kalluru <sudarsana.kalluru(a)cavium.com>
qed: Fix populating the invalid stag value in multi function mode.
YueHaibing <yuehaibing(a)huawei.com>
net/smc: fix sizeof to int comparison
Ursula Braun <ubraun(a)linux.ibm.com>
net/smc: fix non-blocking connect problem
Kazuya Mizuguchi <kazuya.mizuguchi.ks(a)renesas.com>
ravb: do not write 1 to reserved bits
Christian Lamparter <chunkeey(a)gmail.com>
net: emac: fix fixed-link setup for the RTL8363SB switch
Sabrina Dubroca <sd(a)queasysnail.net>
selftests: pmtu: properly redirect stderr to /dev/null
Michael Schmitz <schmitzmic(a)gmail.com>
Input: atakbd - fix Atari CapsLock behaviour
Andreas Schwab <schwab(a)linux-m68k.org>
Input: atakbd - fix Atari keymap
Alexander Shishkin <alexander.shishkin(a)linux.intel.com>
intel_th: pci: Add Ice Lake PCH support
Laura Abbott <labbott(a)redhat.com>
scsi: ibmvscsis: Ensure partition name is properly NUL terminated
Laura Abbott <labbott(a)redhat.com>
scsi: ibmvscsis: Fix a stringop-overflow warning
Keerthy <j-keerthy(a)ti.com>
clocksource/drivers/ti-32k: Add CLOCK_SOURCE_SUSPEND_NONSTOP flag for non-am43 SoCs
Steve Wise <swise(a)opengridcomputing.com>
cxgb4: fix abort_req_rss6 struct
Marek Lindner <mareklindner(a)neomailbox.ch>
batman-adv: fix hardif_neigh refcount on queue_work() failure
Marek Lindner <mareklindner(a)neomailbox.ch>
batman-adv: fix backbone_gw refcount on queue_work() failure
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Prevent duplicated tvlv handler
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Prevent duplicated global TT entry
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Prevent duplicated softif_vlan entry
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Prevent duplicated nc_node entry
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Prevent duplicated gateway_node entry
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Fix segfault when writing to sysfs elp_interval
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Fix segfault when writing to throughput_override
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Avoid probe ELP information leak
Linus Walleij <linus.walleij(a)linaro.org>
spi: gpio: Fix copy-and-paste error
Jozef Balga <jozef.balga(a)gmail.com>
media: af9035: prevent buffer overflow on write
Sanyog Kale <sanyog.r.kale(a)intel.com>
soundwire: Fix acquiring bus lock twice during master release
Shreyas NC <shreyas.nc(a)intel.com>
soundwire: Fix incorrect exit after configuring stream
Shreyas NC <shreyas.nc(a)intel.com>
soundwire: Fix duplicate stream state assignment
-------------
Diffstat:
Makefile | 4 +-
arch/arc/Makefile | 24 +-----
arch/powerpc/kernel/tm.S | 20 ++++-
arch/powerpc/mm/numa.c | 4 +-
arch/riscv/include/asm/asm-prototypes.h | 7 ++
arch/x86/boot/compressed/mem_encrypt.S | 19 -----
drivers/clocksource/timer-fttmr010.c | 18 +++--
drivers/clocksource/timer-ti-32k.c | 3 +
drivers/gpu/drm/arm/malidp_drv.c | 1 +
drivers/hwtracing/intel_th/pci.c | 5 ++
drivers/infiniband/core/uverbs_cmd.c | 68 +++++++++++------
drivers/infiniband/hw/bnxt_re/main.c | 93 ++++++++++-------------
drivers/input/keyboard/atakbd.c | 74 +++++++------------
drivers/iommu/amd_iommu.c | 6 ++
drivers/iommu/rockchip-iommu.c | 6 ++
drivers/media/usb/dvb-usb-v2/af9035.c | 6 +-
drivers/net/ethernet/chelsio/cxgb4/t4_msg.h | 1 -
drivers/net/ethernet/ibm/emac/core.c | 15 ++--
drivers/net/ethernet/mellanox/mlx4/eq.c | 3 +-
drivers/net/ethernet/qlogic/qed/qed_dcbx.c | 9 ++-
drivers/net/ethernet/qlogic/qed/qed_dcbx.h | 1 +
drivers/net/ethernet/qlogic/qed/qed_dev.c | 15 +++-
drivers/net/ethernet/qlogic/qed/qed_hsi.h | 4 +
drivers/net/ethernet/renesas/ravb.h | 5 ++
drivers/net/ethernet/renesas/ravb_main.c | 11 +--
drivers/net/ethernet/renesas/ravb_ptp.c | 2 +-
drivers/pci/controller/dwc/pcie-designware.c | 8 +-
drivers/pci/controller/dwc/pcie-designware.h | 3 +-
drivers/pinctrl/pinctrl-amd.c | 33 ++++++---
drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c | 5 +-
drivers/scsi/ipr.c | 106 +++++++++++++++------------
drivers/scsi/ipr.h | 1 +
drivers/scsi/lpfc/lpfc_attr.c | 15 ++--
drivers/scsi/lpfc/lpfc_debugfs.c | 10 +--
drivers/scsi/lpfc/lpfc_nvme.c | 11 ++-
drivers/scsi/sd.c | 3 +-
drivers/soundwire/stream.c | 23 ++++--
drivers/spi/spi-gpio.c | 4 +-
fs/namespace.c | 7 +-
include/linux/huge_mm.h | 2 +-
kernel/bpf/sockmap.c | 91 ++++++++++++++++++-----
mm/huge_memory.c | 10 +--
mm/mremap.c | 30 ++++----
net/batman-adv/bat_v_elp.c | 10 ++-
net/batman-adv/bridge_loop_avoidance.c | 10 ++-
net/batman-adv/gateway_client.c | 11 ++-
net/batman-adv/network-coding.c | 27 ++++---
net/batman-adv/soft-interface.c | 25 +++++--
net/batman-adv/sysfs.c | 30 +++++---
net/batman-adv/translation-table.c | 6 +-
net/batman-adv/tvlv.c | 8 +-
net/smc/af_smc.c | 7 +-
net/smc/smc_clc.c | 14 ++--
tools/testing/selftests/bpf/test_maps.c | 10 ++-
tools/testing/selftests/net/pmtu.sh | 4 +-
55 files changed, 563 insertions(+), 385 deletions(-)
From: Masami Hiramatsu <mhiramat(a)kernel.org>
Fix synthetic event to allow independent semicolon at end.
The synthetic_events interface accepts a semicolon after the
last word if there is no space.
# echo "myevent u64 var;" >> synthetic_events
But if there is a space, it returns an error.
# echo "myevent u64 var ;" > synthetic_events
sh: write error: Invalid argument
This behavior is difficult for users to understand. Let's
allow the last independent semicolon too.
Link: http://lkml.kernel.org/r/153986835420.18251.2191216690677025744.stgit@devbox
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Tom Zanussi <tom.zanussi(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
Fixes: 4b147936fa50 ("tracing: Add support for 'synthetic' events")
Signed-off-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
---
kernel/trace/trace_events_hist.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 6ff83941065a..d239004aaf29 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1088,7 +1088,7 @@ static int create_synth_event(int argc, char **argv)
i += consumed - 1;
}
- if (i < argc) {
+ if (i < argc && strcmp(argv[i], ";") != 0) {
ret = -EINVAL;
goto err;
}
--
2.19.0
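
The one-line change in create_synth_event() above can be modelled in
userspace. This is a minimal sketch, not the kernel code: the helper name
trailing_args_invalid() is hypothetical, and it only mirrors the condition
added by the patch, where 'i' is the index of the first argument the field
parser did not consume.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical userspace model of the final check in create_synth_event().
 * Returns true when the leftover argument should be rejected (-EINVAL in
 * the kernel), false when the command is accepted.
 */
static bool trailing_args_invalid(int i, int argc, char **argv)
{
	assert(i >= 0 && argv != NULL);

	/* Nothing left over after the last field: accept. */
	if (i >= argc)
		return false;

	/* A leftover lone ";" is tolerated; anything else is an error. */
	return strcmp(argv[i], ";") != 0;
}
```

With the arguments split on whitespace, "myevent u64 var ;" leaves a lone
";" at index 3, which this check now accepts, while any other leftover word
is still rejected.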
This is the start of the stable review cycle for the 4.9.135 release.
There are 35 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat Oct 20 17:54:00 UTC 2018.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.135-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.9.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.9.135-rc1
Long Li <longli(a)microsoft.com>
HV: properly delay KVP packets when negotiation is in progress
Theodore Ts'o <tytso(a)mit.edu>
ext4: avoid running out of journal credits when appending to an inline file
Frederic Weisbecker <fweisbec(a)gmail.com>
sched/cputime: Fix ksoftirqd cputime accounting regression
Frederic Weisbecker <fweisbec(a)gmail.com>
sched/cputime: Increment kcpustat directly on irqtime account
Frederic Weisbecker <fweisbec(a)gmail.com>
macintosh/rack-meter: Convert cputime64_t use to u64
Frederic Weisbecker <fweisbec(a)gmail.com>
sched/cputime: Convert kcpustat to nsecs
Stephen Warren <swarren(a)nvidia.com>
usb: gadget: serial: fix oops when data rx'd after close
Natanael Copa <ncopa(a)alpinelinux.org>
HID: quirks: fix support for Apple Magic Keyboards
Alexey Brodkin <abrodkin(a)synopsys.com>
ARC: build: Don't set CROSS_COMPILE in arch's Makefile
Alexey Brodkin <abrodkin(a)synopsys.com>
ARC: build: Get rid of toolchain check
Xin Long <lucien.xin(a)gmail.com>
netfilter: check for seqadj ext existence before adding it in nf_nat_setup_info
Jan Kara <jack(a)suse.cz>
mm: Preserve _PAGE_DEVMAP across mprotect() calls
Linus Torvalds <torvalds(a)linux-foundation.org>
mremap: properly flush TLB before releasing the page
Arindam Nath <arindam.nath(a)amd.com>
iommu/amd: Return devid as alias for ACPI HID devices
Michael Neuling <mikey(a)neuling.org>
powerpc/tm: Avoid possible userspace r1 corruption on reclaim
Michael Neuling <mikey(a)neuling.org>
powerpc/tm: Fix userspace r13 corruption
James Cowgill <jcowgill(a)debian.org>
RISC-V: include linux/ftrace.h in asm-prototypes.h
Nathan Chancellor <natechancellor(a)gmail.com>
net/mlx4: Use cpumask_available for eq->affinity_mask
Johannes Thumshirn <jthumshirn(a)suse.de>
scsi: sd: don't crash the host on invalid commands
Alexandru Gheorghe <alexandru-cosmin.gheorghe(a)arm.com>
drm: mali-dp: Call drm_crtc_vblank_reset on device init
Kazuya Mizuguchi <kazuya.mizuguchi.ks(a)renesas.com>
ravb: do not write 1 to reserved bits
Michael Schmitz <schmitzmic(a)gmail.com>
Input: atakbd - fix Atari CapsLock behaviour
Andreas Schwab <schwab(a)linux-m68k.org>
Input: atakbd - fix Atari keymap
Laura Abbott <labbott(a)redhat.com>
scsi: ibmvscsis: Ensure partition name is properly NUL terminated
Laura Abbott <labbott(a)redhat.com>
scsi: ibmvscsis: Fix a stringop-overflow warning
Keerthy <j-keerthy(a)ti.com>
clocksource/drivers/ti-32k: Add CLOCK_SOURCE_SUSPEND_NONSTOP flag for non-am43 SoCs
Marek Lindner <mareklindner(a)neomailbox.ch>
batman-adv: fix hardif_neigh refcount on queue_work() failure
Marek Lindner <mareklindner(a)neomailbox.ch>
batman-adv: fix backbone_gw refcount on queue_work() failure
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Prevent duplicated tvlv handler
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Prevent duplicated global TT entry
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Prevent duplicated softif_vlan entry
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Prevent duplicated nc_node entry
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Fix segfault when writing to sysfs elp_interval
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Fix segfault when writing to throughput_override
Jozef Balga <jozef.balga(a)gmail.com>
media: af9035: prevent buffer overflow on write
-------------
Diffstat:
Makefile | 4 +-
arch/arc/Makefile | 24 +---------
arch/powerpc/kernel/tm.S | 20 +++++++--
arch/riscv/include/asm/asm-prototypes.h | 7 +++
arch/s390/appldata/appldata_os.c | 16 +++----
arch/x86/include/asm/pgtable_types.h | 2 +-
drivers/clocksource/timer-ti-32k.c | 3 ++
drivers/cpufreq/cpufreq.c | 6 +--
drivers/cpufreq/cpufreq_governor.c | 2 +-
drivers/cpufreq/cpufreq_stats.c | 1 -
drivers/gpu/drm/arm/malidp_drv.c | 1 +
drivers/hid/hid-core.c | 3 ++
drivers/hv/hv_kvp.c | 13 +++---
drivers/input/keyboard/atakbd.c | 74 ++++++++++++-------------------
drivers/iommu/amd_iommu.c | 6 +++
drivers/macintosh/rack-meter.c | 28 ++++++------
drivers/media/usb/dvb-usb-v2/af9035.c | 6 ++-
drivers/net/ethernet/mellanox/mlx4/eq.c | 3 +-
drivers/net/ethernet/renesas/ravb.h | 5 +++
drivers/net/ethernet/renesas/ravb_main.c | 11 ++---
drivers/net/ethernet/renesas/ravb_ptp.c | 2 +-
drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c | 5 +--
drivers/scsi/sd.c | 3 +-
drivers/usb/gadget/function/u_serial.c | 2 +-
fs/ext4/ext4.h | 3 --
fs/ext4/inline.c | 38 +---------------
fs/ext4/xattr.c | 18 +-------
fs/proc/stat.c | 68 ++++++++++++++---------------
fs/proc/uptime.c | 7 +--
include/linux/huge_mm.h | 2 +-
kernel/sched/cpuacct.c | 2 +-
kernel/sched/cputime.c | 75 ++++++++++++++------------------
kernel/sched/sched.h | 12 +++--
mm/huge_memory.c | 10 ++---
mm/mremap.c | 30 ++++++-------
net/batman-adv/bat_v_elp.c | 8 +++-
net/batman-adv/bridge_loop_avoidance.c | 10 ++++-
net/batman-adv/network-coding.c | 27 +++++++-----
net/batman-adv/soft-interface.c | 25 ++++++++---
net/batman-adv/sysfs.c | 30 ++++++++-----
net/batman-adv/translation-table.c | 6 ++-
net/batman-adv/tvlv.c | 8 +++-
net/netfilter/nf_nat_core.c | 2 +-
43 files changed, 303 insertions(+), 325 deletions(-)
From: Ralph Campbell <rcampbell(a)nvidia.com>
Private ZONE_DEVICE pages use a special pte entry and thus are not
present. Properly handle this case in map_pte(). It is already handled
in check_pte(); the map_pte() part was most probably lost in some rebase.
Without this patch the slow migration path cannot migrate any private
ZONE_DEVICE memory back to regular memory. This was found after stress
testing migration back to system memory. It can ultimately lead to the CPU
looping on page faults for the special swap entry.
Changes since v2:
- add comments explaining what is going on
Changes since v1:
- properly lock pte directory in map_pte()
Signed-off-by: Ralph Campbell <rcampbell(a)nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse(a)redhat.com>
Reviewed-by: Balbir Singh <bsingharora(a)gmail.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
---
mm/page_vma_mapped.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae3c2a35d61b..11df03e71288 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -21,7 +21,29 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
if (!is_swap_pte(*pvmw->pte))
return false;
} else {
- if (!pte_present(*pvmw->pte))
+ /*
+ * We get here when we are trying to unmap a private
+ * device page from the process address space. Such
+ * page is not CPU accessible and thus is mapped as
+ * a special swap entry, nonetheless it still does
+ * count as a valid regular mapping for the page (and
+ * is accounted as such in page maps count).
+ *
+ * So handle this special case as if it was a normal
+ * page mapping ie lock CPU page table and returns
+ * true.
+ *
+ * For more details on device private memory see HMM
+ * (include/linux/hmm.h or mm/hmm.c).
+ */
+ if (is_swap_pte(*pvmw->pte)) {
+ swp_entry_t entry;
+
+ /* Handle un-addressable ZONE_DEVICE memory */
+ entry = pte_to_swp_entry(*pvmw->pte);
+ if (!is_device_private_entry(entry))
+ return false;
+ } else if (!pte_present(*pvmw->pte))
return false;
}
}
--
2.17.2
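
The decision made by the patched else branch of map_pte() above can be
summarised with a small userspace model. This is only a sketch under
assumptions: enum pte_kind and map_pte_accepts() are hypothetical names that
stand in for the kernel's is_swap_pte(), is_device_private_entry() and
pte_present() tests, and no page table locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of a page table entry, for illustration. */
enum pte_kind {
	PTE_NONE,			/* empty entry */
	PTE_PRESENT,			/* normal, CPU-accessible mapping */
	PTE_SWAP_OTHER,			/* any other swap-type entry */
	PTE_SWAP_DEVICE_PRIVATE,	/* private ZONE_DEVICE special entry */
};

/*
 * Model of the non-migration branch after the patch: returns true when the
 * walk should go on to lock the page table and examine the entry.
 */
static bool map_pte_accepts(enum pte_kind kind)
{
	switch (kind) {
	case PTE_SWAP_DEVICE_PRIVATE:
		/* Not present, but still counts as a valid mapping. */
		return true;
	case PTE_SWAP_OTHER:
		/* Swap-type entry that is not device private: skip. */
		return false;
	case PTE_PRESENT:
		/* Unchanged behaviour for ordinary mappings. */
		return true;
	case PTE_NONE:
		/* Neither a swap entry nor present: skip. */
		return false;
	}
	assert(!"unknown pte_kind");
	return false;
}
```

Before the patch only the PTE_PRESENT case returned true, which is why a
device-private entry was skipped in this path.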
Detaching of mark connector from fsnotify_put_mark() can race with
unmounting of the filesystem like:
CPU1                                    CPU2
fsnotify_put_mark()
  spin_lock(&conn->lock);
  ...
  inode = fsnotify_detach_connector_from_object(conn)
  spin_unlock(&conn->lock);
                                        generic_shutdown_super()
                                          fsnotify_unmount_inodes()
                                            sees connector detached for inode
                                              -> nothing to do
                                          evict_inode()
                                            barfs on pending inode reference
  iput(inode);
Resulting in "Busy inodes after unmount" message and possible kernel
oops. Make fsnotify_unmount_inodes() properly wait for outstanding inode
references from detached connectors.
Note that the accounting of outstanding inode references in the
superblock can cause some cacheline contention on the counter. OTOH it
happens only during deletion of the last notification mark from an inode
(or during unlinking of watched inode) and that is not too bad. I have
measured time to create & delete inotify watch 100000 times from 64
processes in parallel (each process having its own inotify group and its
own file on a shared superblock) on a 64 CPU machine. Average and
standard deviation of 15 runs look like:
                Avg             Stddev
Vanilla         9.817400        0.276165
Fixed           9.710467        0.228294
So there's no statistically significant difference.
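
One way to check this claim from the figures above is a two-sample Welch
t-test. That choice is an assumption, since the message does not say which
test, if any, was used. The helper below is a hypothetical sketch; it
returns the squared statistic so that no libm call is needed.

```c
#include <assert.h>

/*
 * Squared Welch t statistic for two samples of equal size n, given their
 * means and standard deviations:
 *   t^2 = (mean_a - mean_b)^2 / ((sd_a^2 + sd_b^2) / n)
 */
static double welch_t_squared(double mean_a, double sd_a,
			      double mean_b, double sd_b, int n)
{
	double diff = mean_a - mean_b;
	double var_of_diff;

	assert(n > 1);
	var_of_diff = (sd_a * sd_a + sd_b * sd_b) / n;
	return diff * diff / var_of_diff;
}
```

For the 15 runs above this gives t^2 of about 1.3 (t of about 1.2), well
below the square of the two-sided 5% critical value, which is roughly 2.05
for this many runs.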
Fixes: 6b3f05d24d35 ("fsnotify: Detach mark from object list when last reference is dropped")
CC: stable(a)vger.kernel.org
Signed-off-by: Jan Kara <jack(a)suse.cz>
---
fs/notify/fsnotify.c | 3 +++
fs/notify/mark.c | 39 +++++++++++++++++++++++++++++++--------
include/linux/fs.h | 3 +++
3 files changed, 37 insertions(+), 8 deletions(-)
Changes since v1:
* added Fixes tag
* improved fsnotify_drop_object to take object type
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index f174397b63a0..00d4f4357724 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -96,6 +96,9 @@ void fsnotify_unmount_inodes(struct super_block *sb)
if (iput_inode)
iput(iput_inode);
+ /* Wait for outstanding inode references from connectors */
+ wait_var_event(&sb->s_fsnotify_inode_refs,
+ !atomic_long_read(&sb->s_fsnotify_inode_refs));
}
/*
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 59cdb27826de..f4e330b5b379 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -179,17 +179,20 @@ static void fsnotify_connector_destroy_workfn(struct work_struct *work)
}
}
-static struct inode *fsnotify_detach_connector_from_object(
- struct fsnotify_mark_connector *conn)
+static void *fsnotify_detach_connector_from_object(
+ struct fsnotify_mark_connector *conn,
+ unsigned int *type)
{
struct inode *inode = NULL;
+ *type = conn->type;
if (conn->type == FSNOTIFY_OBJ_TYPE_DETACHED)
return NULL;
if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
inode = fsnotify_conn_inode(conn);
inode->i_fsnotify_mask = 0;
+ atomic_long_inc(&inode->i_sb->s_fsnotify_inode_refs);
} else if (conn->type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
fsnotify_conn_mount(conn)->mnt_fsnotify_mask = 0;
}
@@ -211,10 +214,29 @@ static void fsnotify_final_mark_destroy(struct fsnotify_mark *mark)
fsnotify_put_group(group);
}
+/* Drop object reference originally held by a connector */
+static void fsnotify_drop_object(unsigned int type, void *objp)
+{
+ struct inode *inode;
+ struct super_block *sb;
+
+ if (!objp)
+ return;
+ /* Currently only inode references are passed to be dropped */
+ if (WARN_ON_ONCE(type != FSNOTIFY_OBJ_TYPE_INODE))
+ return;
+ inode = objp;
+ sb = inode->i_sb;
+ iput(inode);
+ if (atomic_long_dec_and_test(&sb->s_fsnotify_inode_refs))
+ wake_up_var(&sb->s_fsnotify_inode_refs);
+}
+
void fsnotify_put_mark(struct fsnotify_mark *mark)
{
struct fsnotify_mark_connector *conn;
- struct inode *inode = NULL;
+ void *objp = NULL;
+ unsigned int type;
bool free_conn = false;
/* Catch marks that were actually never attached to object */
@@ -234,7 +256,7 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
conn = mark->connector;
hlist_del_init_rcu(&mark->obj_list);
if (hlist_empty(&conn->list)) {
- inode = fsnotify_detach_connector_from_object(conn);
+ objp = fsnotify_detach_connector_from_object(conn, &type);
free_conn = true;
} else {
__fsnotify_recalc_mask(conn);
@@ -242,7 +264,7 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
mark->connector = NULL;
spin_unlock(&conn->lock);
- iput(inode);
+ fsnotify_drop_object(type, objp);
if (free_conn) {
spin_lock(&destroy_lock);
@@ -709,7 +731,8 @@ void fsnotify_destroy_marks(fsnotify_connp_t *connp)
{
struct fsnotify_mark_connector *conn;
struct fsnotify_mark *mark, *old_mark = NULL;
- struct inode *inode;
+ void *objp;
+ unsigned int type;
conn = fsnotify_grab_connector(connp);
if (!conn)
@@ -735,11 +758,11 @@ void fsnotify_destroy_marks(fsnotify_connp_t *connp)
* mark references get dropped. It would lead to strange results such
* as delaying inode deletion or blocking unmount.
*/
- inode = fsnotify_detach_connector_from_object(conn);
+ objp = fsnotify_detach_connector_from_object(conn, &type);
spin_unlock(&conn->lock);
if (old_mark)
fsnotify_put_mark(old_mark);
- iput(inode);
+ fsnotify_drop_object(type, objp);
}
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 33322702c910..5090f3dcec3b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1428,6 +1428,9 @@ struct super_block {
/* Number of inodes with nlink == 0 but still referenced */
atomic_long_t s_remove_count;
+ /* Pending fsnotify inode refs */
+ atomic_long_t s_fsnotify_inode_refs;
+
/* Being remounted read-only */
int s_readonly_remount;
--
2.16.4
Detaching of mark connector from fsnotify_put_mark() can race with
unmounting of the filesystem like:
CPU1                                    CPU2
fsnotify_put_mark()
  spin_lock(&conn->lock);
  ...
  inode = fsnotify_detach_connector_from_object(conn)
  spin_unlock(&conn->lock);
                                        generic_shutdown_super()
                                          fsnotify_unmount_inodes()
                                            sees connector detached for inode
                                              -> nothing to do
                                          evict_inode()
                                            barfs on pending inode reference
  iput(inode);
Resulting in "Busy inodes after unmount" message and possible kernel
oops. Make fsnotify_unmount_inodes() properly wait for outstanding inode
references from detached connectors.
Note that the accounting of outstanding inode references in the
superblock can cause some cacheline contention on the counter. OTOH it
happens only during deletion of the last notification mark from an inode
(or during unlinking of watched inode) and that is not too bad. I have
measured time to create & delete inotify watch 100000 times from 64
processes in parallel (each process having its own inotify group and its
own file on a shared superblock) on a 64 CPU machine. Average and
standard deviation of 15 runs look like:
                Avg             Stddev
Vanilla         9.817400        0.276165
Fixed           9.710467        0.228294
So there's no statistically significant difference.
Fixes: 6b3f05d24d35 ("fsnotify: Detach mark from object list when last reference is dropped")
CC: stable(a)vger.kernel.org
Signed-off-by: Jan Kara <jack(a)suse.cz>
---
fs/notify/fsnotify.c | 3 +++
fs/notify/mark.c | 39 +++++++++++++++++++++++++++++++--------
include/linux/fs.h | 3 +++
3 files changed, 37 insertions(+), 8 deletions(-)
Changes since v2:
* fixed uninitialized warning
Changes since v1:
* added Fixes tag
* fsnotify_drop_object() now takes type
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index f174397b63a0..00d4f4357724 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -96,6 +96,9 @@ void fsnotify_unmount_inodes(struct super_block *sb)
if (iput_inode)
iput(iput_inode);
+ /* Wait for outstanding inode references from connectors */
+ wait_var_event(&sb->s_fsnotify_inode_refs,
+ !atomic_long_read(&sb->s_fsnotify_inode_refs));
}
/*
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 59cdb27826de..09535f6423fc 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -179,17 +179,20 @@ static void fsnotify_connector_destroy_workfn(struct work_struct *work)
}
}
-static struct inode *fsnotify_detach_connector_from_object(
- struct fsnotify_mark_connector *conn)
+static void *fsnotify_detach_connector_from_object(
+ struct fsnotify_mark_connector *conn,
+ unsigned int *type)
{
struct inode *inode = NULL;
+ *type = conn->type;
if (conn->type == FSNOTIFY_OBJ_TYPE_DETACHED)
return NULL;
if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
inode = fsnotify_conn_inode(conn);
inode->i_fsnotify_mask = 0;
+ atomic_long_inc(&inode->i_sb->s_fsnotify_inode_refs);
} else if (conn->type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
fsnotify_conn_mount(conn)->mnt_fsnotify_mask = 0;
}
@@ -211,10 +214,29 @@ static void fsnotify_final_mark_destroy(struct fsnotify_mark *mark)
fsnotify_put_group(group);
}
+/* Drop object reference originally held by a connector */
+static void fsnotify_drop_object(unsigned int type, void *objp)
+{
+ struct inode *inode;
+ struct super_block *sb;
+
+ if (!objp)
+ return;
+ /* Currently only inode references are passed to be dropped */
+ if (WARN_ON_ONCE(type != FSNOTIFY_OBJ_TYPE_INODE))
+ return;
+ inode = objp;
+ sb = inode->i_sb;
+ iput(inode);
+ if (atomic_long_dec_and_test(&sb->s_fsnotify_inode_refs))
+ wake_up_var(&sb->s_fsnotify_inode_refs);
+}
+
void fsnotify_put_mark(struct fsnotify_mark *mark)
{
struct fsnotify_mark_connector *conn;
- struct inode *inode = NULL;
+ void *objp = NULL;
+ unsigned int type = FSNOTIFY_OBJ_TYPE_DETACHED;
bool free_conn = false;
/* Catch marks that were actually never attached to object */
@@ -234,7 +256,7 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
conn = mark->connector;
hlist_del_init_rcu(&mark->obj_list);
if (hlist_empty(&conn->list)) {
- inode = fsnotify_detach_connector_from_object(conn);
+ objp = fsnotify_detach_connector_from_object(conn, &type);
free_conn = true;
} else {
__fsnotify_recalc_mask(conn);
@@ -242,7 +264,7 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
mark->connector = NULL;
spin_unlock(&conn->lock);
- iput(inode);
+ fsnotify_drop_object(type, objp);
if (free_conn) {
spin_lock(&destroy_lock);
@@ -709,7 +731,8 @@ void fsnotify_destroy_marks(fsnotify_connp_t *connp)
{
struct fsnotify_mark_connector *conn;
struct fsnotify_mark *mark, *old_mark = NULL;
- struct inode *inode;
+ void *objp;
+ unsigned int type;
conn = fsnotify_grab_connector(connp);
if (!conn)
@@ -735,11 +758,11 @@ void fsnotify_destroy_marks(fsnotify_connp_t *connp)
* mark references get dropped. It would lead to strange results such
* as delaying inode deletion or blocking unmount.
*/
- inode = fsnotify_detach_connector_from_object(conn);
+ objp = fsnotify_detach_connector_from_object(conn, &type);
spin_unlock(&conn->lock);
if (old_mark)
fsnotify_put_mark(old_mark);
- iput(inode);
+ fsnotify_drop_object(type, objp);
}
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 33322702c910..5090f3dcec3b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1428,6 +1428,9 @@ struct super_block {
/* Number of inodes with nlink == 0 but still referenced */
atomic_long_t s_remove_count;
+ /* Pending fsnotify inode refs */
+ atomic_long_t s_fsnotify_inode_refs;
+
/* Being remounted read-only */
int s_readonly_remount;
--
2.16.4
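
The accounting added by the patch above can be illustrated with a
single-threaded userspace model built on C11 atomics. This is a sketch under
assumptions: struct sb_model and the model_*() helpers are hypothetical, the
real code uses atomic_long_t in struct super_block, and the sleeping done by
wait_var_event() and the wake-up done by wake_up_var() are reduced to
boolean results.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the new super_block field. */
struct sb_model {
	atomic_long fsnotify_inode_refs;
};

/* Detach path: the connector's inode reference becomes "pending". */
static void model_detach(struct sb_model *sb)
{
	atomic_fetch_add(&sb->fsnotify_inode_refs, 1);
}

/*
 * Drop path, called after the inode reference has been released.
 * Returns true when this was the last pending reference, which is the
 * point where the kernel code calls wake_up_var().
 */
static bool model_drop(struct sb_model *sb)
{
	long old = atomic_fetch_sub(&sb->fsnotify_inode_refs, 1);

	assert(old > 0);	/* a drop must be paired with a detach */
	return old == 1;
}

/* Unmount path: the condition that wait_var_event() sleeps on. */
static bool model_unmount_may_proceed(struct sb_model *sb)
{
	return atomic_load(&sb->fsnotify_inode_refs) == 0;
}
```

The point of the counter is that unmount no longer relies on seeing the
connector attached: a detached connector whose inode reference has not yet
been dropped keeps the count above zero, so the unmount path waits.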