On 2019/4/20 7:16 AM, Sasha Levin wrote:
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees: all
>
> The bot has tested the following trees: v5.0.8, v4.19.35, v4.14.112, v4.9.169, v4.4.178, v3.18.138.
>
> v5.0.8: Build OK!
> v4.19.35: Build OK!
> v4.14.112: Failed to apply! Possible dependencies:
> a728eacbbdd2 ("bcache: add journal statistic")
> c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
>
> v4.9.169: Failed to apply! Possible dependencies:
> a728eacbbdd2 ("bcache: add journal statistic")
> c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
>
> v4.4.178: Failed to apply! Possible dependencies:
> a728eacbbdd2 ("bcache: add journal statistic")
> c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
>
> v3.18.138: Failed to apply! Possible dependencies:
> a728eacbbdd2 ("bcache: add journal statistic")
> c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
>
>
> How should we proceed with this patch?
This patch will go into Linux v5.2. We can have it in stable after
it is upstream.
Thanks.
--
Coly Li
This is the start of the stable review cycle for the 4.9.170 release.
There are 50 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat Apr 20 16:03:22 UTC 2019.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.170-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.9.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.9.170-rc1
Lars Persson <lars.persson(a)axis.com>
net: stmmac: Set dma ring length before enabling the DMA
Arnaldo Carvalho de Melo <acme(a)redhat.com>
tools include: Adopt linux/bits.h
Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
tpm/tpm_crb: Avoid unaligned reads in crb_recv()
Pi-Hsun Shih <pihsun(a)chromium.org>
include/linux/swap.h: use offsetof() instead of custom __swapoffset macro
Stanislaw Gruszka <sgruszka(a)redhat.com>
lib/div64.c: off by one in shift
YueHaibing <yuehaibing(a)huawei.com>
appletalk: Fix use-after-free in atalk_proc_exit
Yang Shi <yang.shi(a)linaro.org>
ARM: 8839/1: kprobe: make patch_lock a raw_spinlock_t
Christophe Leroy <christophe.leroy(a)c-s.fr>
lkdtm: Add tests for NULL pointer dereference
Dmitry Osipenko <digetx(a)gmail.com>
soc/tegra: pmc: Drop locking from tegra_powergate_is_powered()
Julia Cartwright <julia(a)ni.com>
iommu/dmar: Fix buffer overflow during PCI bus notification
Ard Biesheuvel <ard.biesheuvel(a)linaro.org>
crypto: sha512/arm - fix crash bug in Thumb2 build
Ard Biesheuvel <ard.biesheuvel(a)linaro.org>
crypto: sha256/arm - fix crash bug in Thumb2 build
Vitaly Kuznetsov <vkuznets(a)redhat.com>
kernel: hung_task.c: disable on suspend
Steve French <stfrench(a)microsoft.com>
cifs: fallback to older infolevels on findfirst queryinfo retry
Ronald Tschalär <ronald(a)innovation.ch>
ACPI / SBS: Fix GPE storm on recent MacBookPro's
Bartlomiej Zolnierkiewicz <b.zolnierkie(a)samsung.com>
ARM: samsung: Limit SAMSUNG_PM_CHECK config option to non-Exynos platforms
Julian Sax <jsbc(a)gmx.de>
HID: i2c-hid: override HID descriptors for certain devices
Michal Simek <michal.simek(a)xilinx.com>
serial: uartps: console_setup() can't be placed to init section
Chao Yu <yuchao0(a)huawei.com>
f2fs: fix to do sanity check with current segment number
Dinu-Razvan Chis-Serban <justcsdr(a)gmail.com>
9p locks: add mount option for lock retry interval
Gertjan Halkes <gertjan(a)google.com>
9p: do not trust pdu content for stat item size
Siva Rebbagondla <siva.rebbagondla(a)redpinesignals.com>
rsi: improve kernel thread handling to fix kernel panic
Robert Jarzmik <robert.jarzmik(a)free.fr>
gpio: pxa: handle corner case of unprobed device
Darrick J. Wong <darrick.wong(a)oracle.com>
ext4: prohibit fstrim in norecovery mode
Steve French <stfrench(a)microsoft.com>
fix incorrect error code mapping for OBJECTID_NOT_FOUND
Nathan Chancellor <natechancellor(a)gmail.com>
x86/hw_breakpoints: Make default case in hw_breakpoint_arch_parse() return an error
Lu Baolu <baolu.lu(a)linux.intel.com>
iommu/vt-d: Check capability before disabling protected memory
Matthew Whitehead <tedheadster(a)gmail.com>
x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors
Aditya Pakki <pakki001(a)umn.edu>
x86/hpet: Prevent potential NULL pointer dereference
Jianguo Chen <chenjianguo3(a)huawei.com>
irqchip/mbigen: Don't clear eventid when freeing an MSI
Changbin Du <changbin.du(a)gmail.com>
perf tests: Fix a memory leak in test__perf_evsel__tp_sched_test()
Changbin Du <changbin.du(a)gmail.com>
perf tests: Fix a memory leak of cpu_map object in the openat_syscall_event_on_all_cpus test
Arnaldo Carvalho de Melo <acme(a)redhat.com>
perf evsel: Free evsel->counts in perf_evsel__exit()
Changbin Du <changbin.du(a)gmail.com>
perf hist: Add missing map__put() in error case
Changbin Du <changbin.du(a)gmail.com>
perf top: Fix error handling in cmd_top()
Changbin Du <changbin.du(a)gmail.com>
perf build-id: Fix memory leak in print_sdt_events()
Changbin Du <changbin.du(a)gmail.com>
perf config: Fix a memory leak in collect_config()
Changbin Du <changbin.du(a)gmail.com>
perf config: Fix an error in the config template documentation
David Arcari <darcari(a)redhat.com>
tools/power turbostat: return the exit status of a command
Matthew Garrett <matthewgarrett(a)google.com>
thermal/int340x_thermal: fix mode setting
Matthew Garrett <matthewgarrett(a)google.com>
thermal/int340x_thermal: Add additional UUIDs
Colin Ian King <colin.king(a)canonical.com>
ALSA: opl3: fix mismatch between snd_opl3_drum_switch definition and declaration
Arnd Bergmann <arnd(a)arndb.de>
mmc: davinci: remove extraneous __init annotation
Jack Morgenstein <jackm(a)dev.mellanox.co.il>
IB/mlx4: Fix race condition between catas error reset and aliasguid flows
Kangjie Lu <kjlu(a)umn.edu>
ALSA: sb8: add a check for request_region
Kangjie Lu <kjlu(a)umn.edu>
ALSA: echoaudio: add a check for ioremap_nocache
Lukas Czerner <lczerner(a)redhat.com>
ext4: report real fs size after failed resize
Lukas Czerner <lczerner(a)redhat.com>
ext4: add missing brelse() in add_new_gdb_meta_bg()
Stephane Eranian <eranian(a)google.com>
perf/core: Restore mmap record type correctly
Eugeniy Paltsev <Eugeniy.Paltsev(a)synopsys.com>
ARC: u-boot args: check that magic number is correct
-------------
Diffstat:
Makefile | 4 +-
arch/arc/kernel/head.S | 1 +
arch/arc/kernel/setup.c | 8 +
arch/arm/crypto/sha256-armv4.pl | 3 +-
arch/arm/crypto/sha256-core.S_shipped | 3 +-
arch/arm/crypto/sha512-armv4.pl | 3 +-
arch/arm/crypto/sha512-core.S_shipped | 3 +-
arch/arm/kernel/patch.c | 6 +-
arch/arm/plat-samsung/Kconfig | 2 +-
arch/x86/kernel/cpu/cyrix.c | 14 +-
arch/x86/kernel/hpet.c | 2 +
arch/x86/kernel/hw_breakpoint.c | 1 +
drivers/acpi/sbs.c | 8 +-
drivers/char/tpm/tpm_crb.c | 22 +-
drivers/gpio/gpio-pxa.c | 6 +
drivers/hid/i2c-hid/Makefile | 3 +
drivers/hid/i2c-hid/{i2c-hid.c => i2c-hid-core.c} | 56 ++--
drivers/hid/i2c-hid/i2c-hid-dmi-quirks.c | 376 ++++++++++++++++++++++
drivers/hid/i2c-hid/i2c-hid.h | 20 ++
drivers/infiniband/hw/mlx4/alias_GUID.c | 2 +-
drivers/iommu/dmar.c | 2 +-
drivers/iommu/intel-iommu.c | 3 +
drivers/irqchip/irq-mbigen.c | 3 +
drivers/misc/lkdtm.h | 2 +
drivers/misc/lkdtm_core.c | 2 +
drivers/misc/lkdtm_perms.c | 18 ++
drivers/mmc/host/davinci_mmc.c | 2 +-
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 10 +-
drivers/net/wireless/rsi/rsi_common.h | 1 -
drivers/soc/tegra/pmc.c | 8 +-
drivers/thermal/int340x_thermal/int3400_thermal.c | 21 +-
drivers/tty/serial/xilinx_uartps.c | 2 +-
fs/9p/v9fs.c | 21 ++
fs/9p/v9fs.h | 1 +
fs/9p/vfs_dir.c | 8 +-
fs/9p/vfs_file.c | 6 +-
fs/cifs/inode.c | 67 ++--
fs/cifs/smb2maperror.c | 3 +-
fs/ext4/ioctl.c | 7 +
fs/ext4/resize.c | 17 +-
fs/f2fs/super.c | 34 +-
include/linux/atalk.h | 2 +-
include/linux/swap.h | 4 +-
kernel/events/core.c | 2 +
kernel/hung_task.c | 30 +-
lib/div64.c | 4 +-
net/9p/protocol.c | 3 +-
net/appletalk/atalk_proc.c | 2 +-
net/appletalk/ddp.c | 37 ++-
net/appletalk/sysctl_net_atalk.c | 5 +-
sound/drivers/opl3/opl3_voice.h | 2 +-
sound/isa/sb/sb8.c | 4 +
sound/pci/echoaudio/echoaudio.c | 5 +
tools/include/linux/bitops.h | 6 +-
tools/include/linux/bits.h | 26 ++
tools/perf/Documentation/perf-config.txt | 2 +-
tools/perf/builtin-top.c | 5 +-
tools/perf/check-headers.sh | 1 +
tools/perf/tests/evsel-tp-sched.c | 1 +
tools/perf/tests/openat-syscall-all-cpus.c | 4 +-
tools/perf/util/build-id.c | 1 +
tools/perf/util/config.c | 3 +-
tools/perf/util/evsel.c | 1 +
tools/perf/util/hist.c | 4 +-
tools/perf/util/parse-events.c | 1 +
tools/power/x86/turbostat/turbostat.c | 3 +
66 files changed, 807 insertions(+), 132 deletions(-)
In the following commit, removing SIGKILL from each thread signal mask
and executing "goto fatal" directly will skip the call to
"trace_signal_deliver". At this point, the delivery tracking of the SIGKILL
signal will be inaccurate.
commit cf43a757fd4944 ("signal: Restore the stop PTRACE_EVENT_EXIT")
Therefore, we need to add trace_signal_deliver before "goto fatal"
after executing sigdelset.
Signed-off-by: Zhenliang Wei <weizhenliang(a)huawei.com>
---
kernel/signal.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/signal.c b/kernel/signal.c
index 227ba170298e..439b742e3229 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2441,6 +2441,8 @@ bool get_signal(struct ksignal *ksig)
if (signal_group_exit(signal)) {
ksig->info.si_signo = signr = SIGKILL;
sigdelset(&current->pending.signal, SIGKILL);
+ trace_signal_deliver(signr, &ksig->info,
+ &sighand->action[signr - 1]);
recalc_sigpending();
goto fatal;
}
--
2.14.1.windows.1
The patch titled
Subject: libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
has been removed from the -mm tree. Its filename was
libnvdimm-pfn-fix-fsdax-mode-namespace-info-block-zero-fields.patch
This patch was dropped because it was withdrawn
------------------------------------------------------
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
At namespace creation time there is the potential for the "expected to be
zero" fields of a 'pfn' info-block to be filled with indeterminate data.
While the kernel buffer is zeroed on allocation it is immediately
overwritten by nd_pfn_validate() filling it with the current contents of
the on-media info-block location. For fields like 'flags' and
'padding', it potentially means that future implementations cannot rely on
those fields being zero.
In preparation to stop using the 'start_pad' and 'end_trunc' fields for
section alignment, arrange for fields that are not explicitly initialized
to be guaranteed zero. Bump the minor version to indicate it is safe to
assume the 'padding' and 'flags' are zero. Otherwise, this corruption is
expected to be benign since all other critical fields are explicitly
initialized.
-stable note: It's not a problem unless a kernel implementation is
explicitly expecting those fields to be zero-initialized. I only
marked it for -stable in case some future kernel backports patch12.
Otherwise it's benign on older kernels that don't have patch12 since
all fields are indeed initialized.
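
As a rough userspace illustration of the ordering problem described above
(not the kernel code itself): a buffer that is zeroed at allocation stops
being zero once the validate step copies the media contents into it, so the
init path has to clear it again before filling in fields. The names
struct info_block, validate() and init_info_block() are made-up stand-ins
for struct nd_pfn_sb, nd_pfn_validate() and nd_pfn_init(), and the return
values are simplified.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, much-reduced stand-in for struct nd_pfn_sb. */
struct info_block {
	uint8_t  signature[16];
	uint32_t flags;
	uint16_t version_major;
	uint16_t version_minor;
	uint8_t  padding[32];
};

/*
 * Stand-in for nd_pfn_validate(): overwrite the buffer with whatever is
 * on the media, then check the signature. Returns 0 if a matching
 * info-block was found, -1 otherwise. On failure the buffer still holds
 * the stale media contents, i.e. it is indeterminate.
 */
int validate(struct info_block *sb, const void *media, const char *sig)
{
	memcpy(sb, media, sizeof(*sb));
	return memcmp(sb->signature, sig, strlen(sig)) == 0 ? 0 : -1;
}

/*
 * Stand-in for the init path. Returns 0 if an existing info-block was
 * accepted, 1 if a fresh one was written. The memset() is the point of
 * the sketch: without it, 'flags' and 'padding' would keep the media
 * garbage that validate() copied in.
 */
int init_info_block(struct info_block *sb, const void *media, const char *sig)
{
	assert(strlen(sig) < sizeof(sb->signature));

	if (validate(sb, media, sig) == 0)
		return 0;

	/* no info block, do init: start from a known-zero buffer */
	memset(sb, 0, sizeof(*sb));
	memcpy(sb->signature, sig, strlen(sig));
	sb->version_major = 1;
	sb->version_minor = 3;	/* minor 3: padding and flags are zero */
	return 1;
}
```

Calling init_info_block(&sb, media, "DEMO_SIG") with a media buffer full of
0xAA bytes should return 1 and leave sb.flags and sb.padding zero; dropping
the memset() would leave them at 0xAA.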
Link: http://lkml.kernel.org/r/155552639290.2015392.17304211251966796338.stgit@dw…
Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: Jérôme Glisse <jglisse(a)redhat.com>
Cc: Logan Gunthorpe <logang(a)deltatee.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Toshi Kani <toshi.kani(a)hpe.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Oscar Salvador <osalvador(a)suse.de>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/nvdimm/dax_devs.c | 2 +-
drivers/nvdimm/pfn.h | 1 +
drivers/nvdimm/pfn_devs.c | 18 +++++++++++++++---
3 files changed, 17 insertions(+), 4 deletions(-)
--- a/drivers/nvdimm/dax_devs.c~libnvdimm-pfn-fix-fsdax-mode-namespace-info-block-zero-fields
+++ a/drivers/nvdimm/dax_devs.c
@@ -126,7 +126,7 @@ int nd_dax_probe(struct device *dev, str
nvdimm_bus_unlock(&ndns->dev);
if (!dax_dev)
return -ENOMEM;
- pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
+ pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
nd_pfn->pfn_sb = pfn_sb;
rc = nd_pfn_validate(nd_pfn, DAX_SIG);
dev_dbg(dev, "dax: %s\n", rc == 0 ? dev_name(dax_dev) : "<none>");
--- a/drivers/nvdimm/pfn_devs.c~libnvdimm-pfn-fix-fsdax-mode-namespace-info-block-zero-fields
+++ a/drivers/nvdimm/pfn_devs.c
@@ -420,6 +420,15 @@ static int nd_pfn_clear_memmap_errors(st
return 0;
}
+/**
+ * nd_pfn_validate - read and validate info-block
+ * @nd_pfn: fsdax namespace runtime state / properties
+ * @sig: 'devdax' or 'fsdax' signature
+ *
+ * Upon return the info-block buffer contents (->pfn_sb) are
+ * indeterminate when validation fails, and a coherent info-block
+ * otherwise.
+ */
int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
{
u64 checksum, offset;
@@ -565,7 +574,7 @@ int nd_pfn_probe(struct device *dev, str
nvdimm_bus_unlock(&ndns->dev);
if (!pfn_dev)
return -ENOMEM;
- pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
+ pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
nd_pfn = to_nd_pfn(pfn_dev);
nd_pfn->pfn_sb = pfn_sb;
rc = nd_pfn_validate(nd_pfn, PFN_SIG);
@@ -702,7 +711,7 @@ static int nd_pfn_init(struct nd_pfn *nd
u64 checksum;
int rc;
- pfn_sb = devm_kzalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL);
+ pfn_sb = devm_kmalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL);
if (!pfn_sb)
return -ENOMEM;
@@ -711,11 +720,14 @@ static int nd_pfn_init(struct nd_pfn *nd
sig = DAX_SIG;
else
sig = PFN_SIG;
+
rc = nd_pfn_validate(nd_pfn, sig);
if (rc != -ENODEV)
return rc;
/* no info block, do init */;
+ memset(pfn_sb, 0, sizeof(*pfn_sb));
+
nd_region = to_nd_region(nd_pfn->dev.parent);
if (nd_region->ro) {
dev_info(&nd_pfn->dev,
@@ -768,7 +780,7 @@ static int nd_pfn_init(struct nd_pfn *nd
memcpy(pfn_sb->uuid, nd_pfn->uuid, 16);
memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16);
pfn_sb->version_major = cpu_to_le16(1);
- pfn_sb->version_minor = cpu_to_le16(2);
+ pfn_sb->version_minor = cpu_to_le16(3);
pfn_sb->start_pad = cpu_to_le32(start_pad);
pfn_sb->end_trunc = cpu_to_le32(end_trunc);
pfn_sb->align = cpu_to_le32(nd_pfn->align);
--- a/drivers/nvdimm/pfn.h~libnvdimm-pfn-fix-fsdax-mode-namespace-info-block-zero-fields
+++ a/drivers/nvdimm/pfn.h
@@ -36,6 +36,7 @@ struct nd_pfn_sb {
__le32 end_trunc;
/* minor-version-2 record the base alignment of the mapping */
__le32 align;
+ /* minor-version-3 guarantee the padding and flags are zero */
u8 padding[4000];
__le64 checksum;
};
_
Patches currently in -mm which might be from dan.j.williams(a)intel.com are
libnvdimm-pfn-stop-padding-pmem-namespaces-to-section-alignment.patch
mm-shuffle-initial-free-memory-to-improve-memory-side-cache-utilization.patch
mm-shuffle-initial-free-memory-to-improve-memory-side-cache-utilization-fix.patch
mm-move-buddy-list-manipulations-into-helpers.patch
mm-move-buddy-list-manipulations-into-helpers-fix.patch
mm-maintain-randomization-of-page-free-lists.patch
The patch titled
Subject: mm: do not boost watermarks to avoid fragmentation for the DISCONTIG memory model
has been added to the -mm tree. Its filename is
mm-do-not-boost-watermarks-to-avoid-fragmentation-for-the-discontig-memory-model.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-do-not-boost-watermarks-to-avoi…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-do-not-boost-watermarks-to-avoi…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Mel Gorman <mgorman(a)techsingularity.net>
Subject: mm: do not boost watermarks to avoid fragmentation for the DISCONTIG memory model
Mikulas Patocka reported that 1c30844d2dfe ("mm: reclaim small amounts of
memory when an external fragmentation event occurs") "broke" memory
management on parisc. The machine is not NUMA but the DISCONTIG model
creates three pgdats even though it's a UMA machine for the following
ranges
0) Start 0x0000000000000000 End 0x000000003fffffff Size 1024 MB
1) Start 0x0000000100000000 End 0x00000001bfdfffff Size 3070 MB
2) Start 0x0000004040000000 End 0x00000040ffffffff Size 3072 MB
Mikulas reported:
With the patch 1c30844d2, the kernel will incorrectly reclaim the
first zone when it fills up, ignoring the fact that there are two
completely free zones. Basically, it limits cache size to 1GiB.
For example, if I run:
# dd if=/dev/sda of=/dev/null bs=1M count=2048
- with the proper kernel, there should be "Buffers - 2GiB"
when this command finishes. With the patch 1c30844d2, buffers
will consume just 1GiB or slightly more, because the kernel was
incorrectly reclaiming them.
The page allocator and reclaim make assumptions that pgdats really
represent NUMA nodes and zones represent ranges and make decisions on
that basis. Watermark boosting for small pgdats leads to unexpected
results even though this would have behaved reasonably on SPARSEMEM.
DISCONTIG is essentially deprecated and even parisc plans to move to
SPARSEMEM so there is no need to be fancy; this patch simply disables
watermark boosting by default on DISCONTIGMEM.
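
For a feel of the numbers involved, here is a small arithmetic sketch of
the semantics that Documentation/sysctl/vm.txt gives for
watermark_boost_factor: the unit is fractions of 10,000, the result is
raised to at least one pageblock, and a factor of 0 disables boosting.
The helper max_watermark_boost() and its parameters are hypothetical and
only model the documented behaviour, not the actual mm/page_alloc.c code.

```c
#include <assert.h>

/*
 * Hypothetical helper: upper bound on the boost, in pages, for a zone
 * whose high watermark is 'high_wmark_pages'.
 *
 *   boost_factor == 0      -> boosting disabled (the new DISCONTIGMEM default)
 *   boost_factor == 15000  -> up to 150% of the high watermark
 */
unsigned long max_watermark_boost(unsigned long high_wmark_pages,
				  int boost_factor,
				  unsigned long pageblock_pages)
{
	unsigned long factor, boost;

	assert(boost_factor >= 0);
	if (boost_factor == 0)
		return 0;

	factor = (unsigned long)boost_factor;

	/* high * factor / 10000, split so the product cannot overflow early */
	boost = high_wmark_pages / 10000 * factor +
		high_wmark_pages % 10000 * factor / 10000;

	/* never reclaim less than one pageblock's worth of pages */
	return boost < pageblock_pages ? pageblock_pages : boost;
}
```

With a high watermark of 1000 pages and 512-page pageblocks, a factor of
15000 gives a bound of 1500 pages, while a factor of 0 gives no boost at
all. On a small pgdat such as the 1024 MB range above, a bound of that
kind is a noticeable share of the zone, which is consistent with the
premature reclaim Mikulas observed.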
Link: http://lkml.kernel.org/r/20190419094335.GJ18914@techsingularity.net
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman <mgorman(a)techsingularity.net>
Reported-by: Mikulas Patocka <mpatocka(a)redhat.com>
Tested-by: Mikulas Patocka <mpatocka(a)redhat.com>
Cc: James Bottomley <James.Bottomley(a)hansenpartnership.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
Documentation/sysctl/vm.txt | 16 ++++++++--------
mm/page_alloc.c | 13 +++++++++++++
2 files changed, 21 insertions(+), 8 deletions(-)
--- a/Documentation/sysctl/vm.txt~mm-do-not-boost-watermarks-to-avoid-fragmentation-for-the-discontig-memory-model
+++ a/Documentation/sysctl/vm.txt
@@ -866,14 +866,14 @@ The intent is that compaction has less w
increase the success rate of future high-order allocations such as SLUB
allocations, THP and hugetlbfs pages.
-To make it sensible with respect to the watermark_scale_factor parameter,
-the unit is in fractions of 10,000. The default value of 15,000 means
-that up to 150% of the high watermark will be reclaimed in the event of
-a pageblock being mixed due to fragmentation. The level of reclaim is
-determined by the number of fragmentation events that occurred in the
-recent past. If this value is smaller than a pageblock then a pageblocks
-worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
-of 0 will disable the feature.
+To make it sensible with respect to the watermark_scale_factor
+parameter, the unit is in fractions of 10,000. The default value of
+15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
+watermark will be reclaimed in the event of a pageblock being mixed due
+to fragmentation. The level of reclaim is determined by the number of
+fragmentation events that occurred in the recent past. If this value is
+smaller than a pageblock then a pageblocks worth of pages will be reclaimed
+(e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature.
=============================================================
--- a/mm/page_alloc.c~mm-do-not-boost-watermarks-to-avoid-fragmentation-for-the-discontig-memory-model
+++ a/mm/page_alloc.c
@@ -266,7 +266,20 @@ compound_page_dtor * const compound_page
int min_free_kbytes = 1024;
int user_min_free_kbytes = -1;
+#ifdef CONFIG_DISCONTIGMEM
+/*
+ * DiscontigMem defines memory ranges as separate pg_data_t even if the ranges
+ * are not on separate NUMA nodes. Functionally this works but with
+ * watermark_boost_factor, it can reclaim prematurely as the ranges can be
+ * quite small. By default, do not boost watermarks on discontigmem as in
+ * many cases very high-order allocations like THP are likely to be
+ * unsupported and the premature reclaim offsets the advantage of long-term
+ * fragmentation avoidance.
+ */
+int watermark_boost_factor __read_mostly;
+#else
int watermark_boost_factor __read_mostly = 15000;
+#endif
int watermark_scale_factor = 10;
static unsigned long nr_kernel_pages __initdata;
_
Patches currently in -mm which might be from mgorman(a)techsingularity.net are
mm-do-not-boost-watermarks-to-avoid-fragmentation-for-the-discontig-memory-model.patch
This is a note to let you know that I've just added the patch titled
USB: core: Fix bug caused by duplicate interface PM usage counter
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From c2b71462d294cf517a0bc6e4fd6424d7cee5596f Mon Sep 17 00:00:00 2001
From: Alan Stern <stern(a)rowland.harvard.edu>
Date: Fri, 19 Apr 2019 13:52:38 -0400
Subject: USB: core: Fix bug caused by duplicate interface PM usage counter
The syzkaller fuzzer reported a bug in the USB hub driver which turned
out to be caused by a negative runtime-PM usage counter. This allowed
a hub to be runtime suspended at a time when the driver did not expect
it. The symptom is a WARNING issued because the hub's status URB is
submitted while it is already active:
URB 0000000031fb463e submitted while active
WARNING: CPU: 0 PID: 2917 at drivers/usb/core/urb.c:363
The negative runtime-PM usage count was caused by an unfortunate
design decision made when runtime PM was first implemented for USB.
At that time, USB class drivers were allowed to unbind from their
interfaces without balancing the usage counter (i.e., leaving it with
a positive count). The core code would take care of setting the
counter back to 0 before allowing another driver to bind to the
interface.
Later on when runtime PM was implemented for the entire kernel, the
opposite decision was made: Drivers were required to balance their
runtime-PM get and put calls. In order to maintain backward
compatibility, however, the USB subsystem adapted to the new
implementation by keeping an independent usage counter for each
interface and using it to automatically adjust the normal usage
counter back to 0 whenever a driver was unbound.
This approach involves duplicating information, but what is worse, it
doesn't work properly in cases where a USB class driver delays
decrementing the usage counter until after the driver's disconnect()
routine has returned and the counter has been adjusted back to 0.
Doing so would cause the usage counter to become negative. There's
even a warning about this in the USB power management documentation!
As it happens, this is exactly what the hub driver does. The
kick_hub_wq() routine increments the runtime-PM usage counter, and the
corresponding decrement is carried out by hub_event() in the context
of the hub_wq work-queue thread. This work routine may sometimes run
after the driver has been unbound from its interface, and when it does
it causes the usage counter to go negative.
It is not possible for hub_disconnect() to wait for a pending
hub_event() call to finish, because hub_disconnect() is called with
the device lock held and hub_event() acquires that lock. The only
feasible fix is to reverse the original design decision: remove the
duplicate interface-specific usage counter and require USB drivers to
balance their runtime PM gets and puts. As far as I know, all
existing drivers currently do this.
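
The failure sequence described above can be modelled with a toy
userspace sketch. Everything here is hypothetical: struct toy_intf and
the toy_*() helpers only imitate the bookkeeping of the real
usb_autopm_*() functions and usb_unbind_interface(), with plain ints in
place of atomics and no actual power management.

```c
#include <assert.h>

/* Toy model of one interface's counters. */
struct toy_intf {
	int usage_count;	/* stands in for dev.power.usage_count */
	int pm_usage_cnt;	/* the duplicate counter this patch removes */
};

void toy_get(struct toy_intf *intf)
{
	intf->usage_count++;
	intf->pm_usage_cnt++;
}

void toy_put(struct toy_intf *intf)
{
	intf->pm_usage_cnt--;
	intf->usage_count--;
}

/* Old unbind behaviour: undo any residual gets, then zero the duplicate. */
void toy_unbind_old(struct toy_intf *intf)
{
	int r;

	for (r = intf->pm_usage_cnt; r > 0; --r)
		toy_put(intf);
	intf->pm_usage_cnt = 0;
}

/*
 * One get, an unbind, then a put that arrives late (as hub_event() can
 * after kick_hub_wq()). Returns the final usage count.
 */
int usage_after_late_put(int core_undoes_gets)
{
	struct toy_intf intf = { 0, 0 };

	toy_get(&intf);			/* reference taken before unbind */
	assert(intf.usage_count == 1);

	if (core_undoes_gets)
		toy_unbind_old(&intf);	/* core already dropped it for us */

	toy_put(&intf);			/* work item runs late and drops it again */
	return intf.usage_count;
}
```

In this model usage_after_late_put(1) ends at -1, the negative count
behind the syzkaller report, while usage_after_late_put(0), where the
driver alone balances its get and put, ends at 0.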
Signed-off-by: Alan Stern <stern(a)rowland.harvard.edu>
Reported-and-tested-by: syzbot+7634edaea4d0b341c625(a)syzkaller.appspotmail.com
CC: <stable(a)vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
Documentation/driver-api/usb/power-management.rst | 14 +++++++++-----
drivers/usb/core/driver.c | 13 -------------
drivers/usb/storage/realtek_cr.c | 13 +++++--------
include/linux/usb.h | 2 --
4 files changed, 14 insertions(+), 28 deletions(-)
diff --git a/Documentation/driver-api/usb/power-management.rst b/Documentation/driver-api/usb/power-management.rst
index 79beb807996b..4a74cf6f2797 100644
--- a/Documentation/driver-api/usb/power-management.rst
+++ b/Documentation/driver-api/usb/power-management.rst
@@ -370,11 +370,15 @@ autosuspend the interface's device. When the usage counter is = 0
then the interface is considered to be idle, and the kernel may
autosuspend the device.
-Drivers need not be concerned about balancing changes to the usage
-counter; the USB core will undo any remaining "get"s when a driver
-is unbound from its interface. As a corollary, drivers must not call
-any of the ``usb_autopm_*`` functions after their ``disconnect``
-routine has returned.
+Drivers must be careful to balance their overall changes to the usage
+counter. Unbalanced "get"s will remain in effect when a driver is
+unbound from its interface, preventing the device from going into
+runtime suspend should the interface be bound to a driver again. On
+the other hand, drivers are allowed to achieve this balance by calling
+the ``usb_autopm_*`` functions even after their ``disconnect`` routine
+has returned -- say from within a work-queue routine -- provided they
+retain an active reference to the interface (via ``usb_get_intf`` and
+``usb_put_intf``).
Drivers using the async routines are responsible for their own
synchronization and mutual exclusion.
diff --git a/drivers/usb/core/driver.c b/drivers/usb/core/driver.c
index 8987cec9549d..ebcadaad89d1 100644
--- a/drivers/usb/core/driver.c
+++ b/drivers/usb/core/driver.c
@@ -473,11 +473,6 @@ static int usb_unbind_interface(struct device *dev)
pm_runtime_disable(dev);
pm_runtime_set_suspended(dev);
- /* Undo any residual pm_autopm_get_interface_* calls */
- for (r = atomic_read(&intf->pm_usage_cnt); r > 0; --r)
- usb_autopm_put_interface_no_suspend(intf);
- atomic_set(&intf->pm_usage_cnt, 0);
-
if (!error)
usb_autosuspend_device(udev);
@@ -1633,7 +1628,6 @@ void usb_autopm_put_interface(struct usb_interface *intf)
int status;
usb_mark_last_busy(udev);
- atomic_dec(&intf->pm_usage_cnt);
status = pm_runtime_put_sync(&intf->dev);
dev_vdbg(&intf->dev, "%s: cnt %d -> %d\n",
__func__, atomic_read(&intf->dev.power.usage_count),
@@ -1662,7 +1656,6 @@ void usb_autopm_put_interface_async(struct usb_interface *intf)
int status;
usb_mark_last_busy(udev);
- atomic_dec(&intf->pm_usage_cnt);
status = pm_runtime_put(&intf->dev);
dev_vdbg(&intf->dev, "%s: cnt %d -> %d\n",
__func__, atomic_read(&intf->dev.power.usage_count),
@@ -1684,7 +1677,6 @@ void usb_autopm_put_interface_no_suspend(struct usb_interface *intf)
struct usb_device *udev = interface_to_usbdev(intf);
usb_mark_last_busy(udev);
- atomic_dec(&intf->pm_usage_cnt);
pm_runtime_put_noidle(&intf->dev);
}
EXPORT_SYMBOL_GPL(usb_autopm_put_interface_no_suspend);
@@ -1715,8 +1707,6 @@ int usb_autopm_get_interface(struct usb_interface *intf)
status = pm_runtime_get_sync(&intf->dev);
if (status < 0)
pm_runtime_put_sync(&intf->dev);
- else
- atomic_inc(&intf->pm_usage_cnt);
dev_vdbg(&intf->dev, "%s: cnt %d -> %d\n",
__func__, atomic_read(&intf->dev.power.usage_count),
status);
@@ -1750,8 +1740,6 @@ int usb_autopm_get_interface_async(struct usb_interface *intf)
status = pm_runtime_get(&intf->dev);
if (status < 0 && status != -EINPROGRESS)
pm_runtime_put_noidle(&intf->dev);
- else
- atomic_inc(&intf->pm_usage_cnt);
dev_vdbg(&intf->dev, "%s: cnt %d -> %d\n",
__func__, atomic_read(&intf->dev.power.usage_count),
status);
@@ -1775,7 +1763,6 @@ void usb_autopm_get_interface_no_resume(struct usb_interface *intf)
struct usb_device *udev = interface_to_usbdev(intf);
usb_mark_last_busy(udev);
- atomic_inc(&intf->pm_usage_cnt);
pm_runtime_get_noresume(&intf->dev);
}
EXPORT_SYMBOL_GPL(usb_autopm_get_interface_no_resume);
diff --git a/drivers/usb/storage/realtek_cr.c b/drivers/usb/storage/realtek_cr.c
index 31b024441938..cc794e25a0b6 100644
--- a/drivers/usb/storage/realtek_cr.c
+++ b/drivers/usb/storage/realtek_cr.c
@@ -763,18 +763,16 @@ static void rts51x_suspend_timer_fn(struct timer_list *t)
break;
case RTS51X_STAT_IDLE:
case RTS51X_STAT_SS:
- usb_stor_dbg(us, "RTS51X_STAT_SS, intf->pm_usage_cnt:%d, power.usage:%d\n",
- atomic_read(&us->pusb_intf->pm_usage_cnt),
+ usb_stor_dbg(us, "RTS51X_STAT_SS, power.usage:%d\n",
atomic_read(&us->pusb_intf->dev.power.usage_count));
- if (atomic_read(&us->pusb_intf->pm_usage_cnt) > 0) {
+ if (atomic_read(&us->pusb_intf->dev.power.usage_count) > 0) {
usb_stor_dbg(us, "Ready to enter SS state\n");
rts51x_set_stat(chip, RTS51X_STAT_SS);
/* ignore mass storage interface's children */
pm_suspend_ignore_children(&us->pusb_intf->dev, true);
usb_autopm_put_interface_async(us->pusb_intf);
- usb_stor_dbg(us, "RTS51X_STAT_SS 01, intf->pm_usage_cnt:%d, power.usage:%d\n",
- atomic_read(&us->pusb_intf->pm_usage_cnt),
+ usb_stor_dbg(us, "RTS51X_STAT_SS 01, power.usage:%d\n",
atomic_read(&us->pusb_intf->dev.power.usage_count));
}
break;
@@ -807,11 +805,10 @@ static void rts51x_invoke_transport(struct scsi_cmnd *srb, struct us_data *us)
int ret;
if (working_scsi(srb)) {
- usb_stor_dbg(us, "working scsi, intf->pm_usage_cnt:%d, power.usage:%d\n",
- atomic_read(&us->pusb_intf->pm_usage_cnt),
+ usb_stor_dbg(us, "working scsi, power.usage:%d\n",
atomic_read(&us->pusb_intf->dev.power.usage_count));
- if (atomic_read(&us->pusb_intf->pm_usage_cnt) <= 0) {
+ if (atomic_read(&us->pusb_intf->dev.power.usage_count) <= 0) {
ret = usb_autopm_get_interface(us->pusb_intf);
usb_stor_dbg(us, "working scsi, ret=%d\n", ret);
}
diff --git a/include/linux/usb.h b/include/linux/usb.h
index 5e49e82c4368..ff010d1fd1c7 100644
--- a/include/linux/usb.h
+++ b/include/linux/usb.h
@@ -200,7 +200,6 @@ usb_find_last_int_out_endpoint(struct usb_host_interface *alt,
* @dev: driver model's view of this device
* @usb_dev: if an interface is bound to the USB major, this will point
* to the sysfs representation for that device.
- * @pm_usage_cnt: PM usage counter for this interface
* @reset_ws: Used for scheduling resets from atomic context.
* @resetting_device: USB core reset the device, so use alt setting 0 as
* current; needs bandwidth alloc after reset.
@@ -257,7 +256,6 @@ struct usb_interface {
struct device dev; /* interface specific device info */
struct device *usb_dev;
- atomic_t pm_usage_cnt; /* usage counter for autosuspend */
struct work_struct reset_ws; /* for resets in atomic context */
};
#define to_usb_interface(d) container_of(d, struct usb_interface, dev)
--
2.21.0
The patch
ASoC: codec: hdac_hdmi add device_link to card device
has been applied to the asoc tree at
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git
All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.
You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.
If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.
Please add any relevant lists and maintainers to the CCs when replying
to this mail.
Thanks,
Mark
From 01c8327667c249818d3712c3e25c7ad2aca7f389 Mon Sep 17 00:00:00 2001
From: Libin Yang <libin.yang(a)intel.com>
Date: Sat, 13 Apr 2019 21:18:12 +0800
Subject: [PATCH] ASoC: codec: hdac_hdmi add device_link to card device
When resuming from S3, the HDAC HDMI codec driver's DAPM event callback
may run before the HDMI codec driver turns on the display audio power
domain, because the display driver and the HDMI codec driver race with
each other.
This patch adds the device_link between soc card device (consumer) and
hdmi codec device (supplier) to make sure the sequence is always correct.
Signed-off-by: Libin Yang <libin.yang(a)intel.com>
Reviewed-by: Takashi Iwai <tiwai(a)suse.de>
Acked-by: Pierre-Louis Bossart <pierre-louis.bossart(a)linux.intel.com>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Cc: stable(a)vger.kernel.org
---
sound/soc/codecs/hdac_hdmi.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/sound/soc/codecs/hdac_hdmi.c b/sound/soc/codecs/hdac_hdmi.c
index 5eeb0fe836a9..4de1fbfa8827 100644
--- a/sound/soc/codecs/hdac_hdmi.c
+++ b/sound/soc/codecs/hdac_hdmi.c
@@ -1854,6 +1854,17 @@ static int hdmi_codec_probe(struct snd_soc_component *component)
/* Imp: Store the card pointer in hda_codec */
hdmi->card = dapm->card->snd_card;
+ /*
+ * Setup a device_link between card device and HDMI codec device.
+ * The card device is the consumer and the HDMI codec device is
+ * the supplier. With this setting, we can make sure that the audio
+ * domain in display power will be always turned on before operating
+ * on the HDMI audio codec registers.
+ * Let's use the flag DL_FLAG_AUTOREMOVE_CONSUMER. This can make
+ * sure the device link is freed when the machine driver is removed.
+ */
+ device_link_add(component->card->dev, &hdev->dev, DL_FLAG_RPM_ACTIVE |
+ DL_FLAG_AUTOREMOVE_CONSUMER);
/*
* hdac_device core already sets the state to active and calls
* get_noresume. So enable runtime and set the device to suspend.
--
2.20.1
This is a note to let you know that I've just added the patch titled
USB: dummy-hcd: Fix failure to give back unlinked URBs
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From fc834e607ae3d18e1a20bca3f9a2d7f52ea7a2be Mon Sep 17 00:00:00 2001
From: Alan Stern <stern(a)rowland.harvard.edu>
Date: Thu, 18 Apr 2019 13:12:07 -0400
Subject: USB: dummy-hcd: Fix failure to give back unlinked URBs
The syzkaller USB fuzzer identified a failure mode in which dummy-hcd
would never give back an unlinked URB. This causes usb_kill_urb() to
hang, leading to WARNINGs and unkillable threads.
In dummy-hcd, all URBs are given back by the dummy_timer() routine as
it scans through the list of pending URBs. Failure to give back URBs
can be caused by failure to start or early exit from the scanning
loop. The code currently has two such pathways: One is triggered when
an unsupported bus transfer speed is encountered, and the other by
exhausting the simulated bandwidth for USB transfers during a frame.
This patch removes those two paths, thereby allowing all unlinked URBs
to be given back in a timely manner. It adds a check for the bus
speed when the gadget first starts running, so that dummy_timer() will
never thereafter encounter an unsupported speed. And it prevents the
loop from exiting as soon as the total bandwidth has been used up (the
scanning loop continues, giving back unlinked URBs as they are found,
but not transferring any more data).
Thanks to Andrey Konovalov for manually running the syzkaller fuzzer
to help track down the source of the bug.
Signed-off-by: Alan Stern <stern(a)rowland.harvard.edu>
Reported-and-tested-by: syzbot+d919b0f29d7b5a4994b9(a)syzkaller.appspotmail.com
CC: <stable(a)vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/gadget/udc/dummy_hcd.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/drivers/usb/gadget/udc/dummy_hcd.c b/drivers/usb/gadget/udc/dummy_hcd.c
index baf72f95f0f1..213b52508621 100644
--- a/drivers/usb/gadget/udc/dummy_hcd.c
+++ b/drivers/usb/gadget/udc/dummy_hcd.c
@@ -979,8 +979,18 @@ static int dummy_udc_start(struct usb_gadget *g,
struct dummy_hcd *dum_hcd = gadget_to_dummy_hcd(g);
struct dummy *dum = dum_hcd->dum;
- if (driver->max_speed == USB_SPEED_UNKNOWN)
+ switch (g->speed) {
+ /* All the speeds we support */
+ case USB_SPEED_LOW:
+ case USB_SPEED_FULL:
+ case USB_SPEED_HIGH:
+ case USB_SPEED_SUPER:
+ break;
+ default:
+ dev_err(dummy_dev(dum_hcd), "Unsupported driver max speed %d\n",
+ driver->max_speed);
return -EINVAL;
+ }
/*
* SLAVE side init ... the layer above hardware, which
@@ -1784,9 +1794,10 @@ static void dummy_timer(struct timer_list *t)
/* Bus speed is 500000 bytes/ms, so use a little less */
total = 490000;
break;
- default:
+ default: /* Can't happen */
dev_err(dummy_dev(dum_hcd), "bogus device speed\n");
- return;
+ total = 0;
+ break;
}
/* FIXME if HZ != 1000 this will probably misbehave ... */
@@ -1828,7 +1839,7 @@ static void dummy_timer(struct timer_list *t)
/* Used up this frame's bandwidth? */
if (total <= 0)
- break;
+ continue;
/* find the gadget's ep for this request (if configured) */
address = usb_pipeendpoint (urb->pipe);
--
2.21.0
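The effect of replacing "break" with "continue" in the scanning loop can be sketched in userspace. This is only a toy model, not the dummy_timer() code: struct sim_urb, sim_scan() and the stop_when_exhausted switch are hypothetical names made up for the illustration, and the bandwidth accounting is reduced to a single counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued URB; not the kernel's struct urb. */
struct sim_urb {
	bool unlinked;   /* set when the URB has been unlinked */
	int  len;        /* bytes this URB wants to transfer */
	bool given_back; /* set once the URB has been handed back */
};

/*
 * Scan every pending URB once. With stop_when_exhausted set, the loop
 * exits as soon as the bandwidth budget is used up (the old behaviour);
 * otherwise it keeps scanning so that unlinked URBs are still given
 * back. Returns the number of URBs given back in this pass.
 */
static int sim_scan(struct sim_urb *urbs, size_t n, int total,
		    bool stop_when_exhausted)
{
	int given = 0;

	assert(urbs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		struct sim_urb *u = &urbs[i];

		if (u->given_back)
			continue;

		/* Unlinked URBs are returned without transferring data. */
		if (u->unlinked) {
			u->given_back = true;
			given++;
			continue;
		}

		/* Used up this frame's bandwidth? */
		if (total <= 0) {
			if (stop_when_exhausted)
				break;    /* old: later URBs are never seen */
			continue;         /* new: keep looking for unlinked URBs */
		}

		total -= u->len;
		u->given_back = true;
		given++;
	}
	return given;
}
```

With two 100-byte URBs followed by one unlinked URB and a budget of 100, the old variant stops at the second URB and never reaches the unlinked one, while the new variant skips the second URB and still gives back the third.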
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 442601e87a4769a8daba4976ec3afa5222ca211d Mon Sep 17 00:00:00 2001
From: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
Date: Fri, 8 Feb 2019 18:30:59 +0200
Subject: [PATCH] tpm/tpm_i2c_atmel: Return -E2BIG when the transfer is
incomplete
Return -E2BIG when the transfer is incomplete. The upper layer does
not retry, so silently accepting a partial send is incorrect behaviour.
Cc: stable(a)vger.kernel.org
Fixes: a2871c62e186 ("tpm: Add support for Atmel I2C TPMs")
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
Reviewed-by: Stefan Berger <stefanb(a)linux.ibm.com>
Reviewed-by: Jerry Snitselaar <jsnitsel(a)redhat.com>
diff --git a/drivers/char/tpm/tpm_i2c_atmel.c b/drivers/char/tpm/tpm_i2c_atmel.c
index 32a8e27c5382..cc4e642d3180 100644
--- a/drivers/char/tpm/tpm_i2c_atmel.c
+++ b/drivers/char/tpm/tpm_i2c_atmel.c
@@ -69,6 +69,10 @@ static int i2c_atmel_send(struct tpm_chip *chip, u8 *buf, size_t len)
if (status < 0)
return status;
+ /* The upper layer does not support incomplete sends. */
+ if (status != len)
+ return -E2BIG;
+
return 0;
}
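The return-value mapping added by this hunk can be tried out in userspace. The sketch below is an assumption-laden reduction: sim_send_result() is a hypothetical helper, and "status" stands in for whatever the I2C transfer returned (a negative errno or a byte count).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Map the result of a raw send to what the caller should see.
 * status < 0 is an error code from the transport; otherwise it is the
 * number of bytes that were actually sent.
 */
static int sim_send_result(int status, size_t len)
{
	if (status < 0)
		return status;

	/* A short send is reported as an error, since nobody retries. */
	if ((size_t)status != len)
		return -E2BIG;

	return 0;
}
```

A complete send yields 0, a short send yields -E2BIG, and a transport error is passed through unchanged.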
From: Andrea Arcangeli <aarcange(a)redhat.com>
Subject: coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
The core dumping code has always run without holding the mmap_sem for
writing, even though that is the only way to ensure that the entire vma layout
will not change from under it. Only using some signal serialization on
the processes belonging to the mm is not nearly enough. This was pointed
out earlier. For example in Hugh's post from Jul 2017:
https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
"Not strictly relevant here, but a related note: I was very surprised to
discover, only quite recently, how handle_mm_fault() may be called without
down_read(mmap_sem) - when core dumping. That seems a misguided
optimization to me, which would also be nice to correct"
In particular, because growsdown and growsup vmas can have their
vm_start/vm_end moved, the various loops the core dump does over the vmas
will not be consistent if page faults can happen concurrently.
Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side effects
in the core dumping code.
Adding mmap_sem for writing around the ->core_dump invocation is a viable
long term fix, but it requires removing all copy user and page faults and
to replace them with get_dump_page() for all binary formats which is not
suitable as a short term fix.
For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs. Once ->core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.
Allowing mmap_sem protected sections to run in parallel with the coredump
provides some minor parallelism advantage to the swapoff code (which seems
to be safe enough by never mangling any vma field and can keep doing
swapins in parallel to the core dumping) and to some other corner case.
In order to facilitate the backporting I added "Fixes: 86039bd3b4e6"
however the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2 so I couldn't add any other
"Fixes:" because there's no hash beyond the git genesis commit.
Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for reading,
by adding the mmget_still_valid() check to it, all other cases that take
the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm(). The expand_stack() in page fault context
also doesn't need the new check, because all tasks under core dumping are
frozen.
Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli <aarcange(a)redhat.com>
Reported-by: Jann Horn <jannh(a)google.com>
Suggested-by: Oleg Nesterov <oleg(a)redhat.com>
Acked-by: Peter Xu <peterx(a)redhat.com>
Reviewed-by: Mike Rapoport <rppt(a)linux.ibm.com>
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Reviewed-by: Jann Horn <jannh(a)google.com>
Acked-by: Jason Gunthorpe <jgg(a)mellanox.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/infiniband/core/uverbs_main.c | 3 +++
fs/proc/task_mmu.c | 18 ++++++++++++++++++
fs/userfaultfd.c | 9 +++++++++
include/linux/sched/mm.h | 21 +++++++++++++++++++++
mm/mmap.c | 7 ++++++-
5 files changed, 57 insertions(+), 1 deletion(-)
--- a/drivers/infiniband/core/uverbs_main.c~coredump-fix-race-condition-between-mmget_not_zero-get_task_mm-and-core-dumping
+++ a/drivers/infiniband/core/uverbs_main.c
@@ -993,6 +993,8 @@ void uverbs_user_mmap_disassociate(struc
* will only be one mm, so no big deal.
*/
down_write(&mm->mmap_sem);
+ if (!mmget_still_valid(mm))
+ goto skip_mm;
mutex_lock(&ufile->umap_lock);
list_for_each_entry_safe (priv, next_priv, &ufile->umaps,
list) {
@@ -1007,6 +1009,7 @@ void uverbs_user_mmap_disassociate(struc
vma->vm_flags &= ~(VM_SHARED | VM_MAYSHARE);
}
mutex_unlock(&ufile->umap_lock);
+ skip_mm:
up_write(&mm->mmap_sem);
mmput(mm);
}
--- a/fs/proc/task_mmu.c~coredump-fix-race-condition-between-mmget_not_zero-get_task_mm-and-core-dumping
+++ a/fs/proc/task_mmu.c
@@ -1143,6 +1143,24 @@ static ssize_t clear_refs_write(struct f
count = -EINTR;
goto out_mm;
}
+ /*
+ * Avoid to modify vma->vm_flags
+ * without locked ops while the
+ * coredump reads the vm_flags.
+ */
+ if (!mmget_still_valid(mm)) {
+ /*
+ * Silently return "count"
+ * like if get_task_mm()
+ * failed. FIXME: should this
+ * function have returned
+ * -ESRCH if get_task_mm()
+ * failed like if
+ * get_proc_task() fails?
+ */
+ up_write(&mm->mmap_sem);
+ goto out_mm;
+ }
for (vma = mm->mmap; vma; vma = vma->vm_next) {
vma->vm_flags &= ~VM_SOFTDIRTY;
vma_set_page_prot(vma);
--- a/fs/userfaultfd.c~coredump-fix-race-condition-between-mmget_not_zero-get_task_mm-and-core-dumping
+++ a/fs/userfaultfd.c
@@ -629,6 +629,8 @@ static void userfaultfd_event_wait_compl
/* the various vma->vm_userfaultfd_ctx still points to it */
down_write(&mm->mmap_sem);
+ /* no task can run (and in turn coredump) yet */
+ VM_WARN_ON(!mmget_still_valid(mm));
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
@@ -883,6 +885,8 @@ static int userfaultfd_release(struct in
* taking the mmap_sem for writing.
*/
down_write(&mm->mmap_sem);
+ if (!mmget_still_valid(mm))
+ goto skip_mm;
prev = NULL;
for (vma = mm->mmap; vma; vma = vma->vm_next) {
cond_resched();
@@ -905,6 +909,7 @@ static int userfaultfd_release(struct in
vma->vm_flags = new_flags;
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
}
+skip_mm:
up_write(&mm->mmap_sem);
mmput(mm);
wakeup:
@@ -1333,6 +1338,8 @@ static int userfaultfd_register(struct u
goto out;
down_write(&mm->mmap_sem);
+ if (!mmget_still_valid(mm))
+ goto out_unlock;
vma = find_vma_prev(mm, start, &prev);
if (!vma)
goto out_unlock;
@@ -1520,6 +1527,8 @@ static int userfaultfd_unregister(struct
goto out;
down_write(&mm->mmap_sem);
+ if (!mmget_still_valid(mm))
+ goto out_unlock;
vma = find_vma_prev(mm, start, &prev);
if (!vma)
goto out_unlock;
--- a/include/linux/sched/mm.h~coredump-fix-race-condition-between-mmget_not_zero-get_task_mm-and-core-dumping
+++ a/include/linux/sched/mm.h
@@ -49,6 +49,27 @@ static inline void mmdrop(struct mm_stru
__mmdrop(mm);
}
+/*
+ * This has to be called after a get_task_mm()/mmget_not_zero()
+ * followed by taking the mmap_sem for writing before modifying the
+ * vmas or anything the coredump pretends not to change from under it.
+ *
+ * NOTE: find_extend_vma() called from GUP context is the only place
+ * that can modify the "mm" (notably the vm_start/end) under mmap_sem
+ * for reading and outside the context of the process, so it is also
+ * the only case that holds the mmap_sem for reading that must call
+ * this function. Generally if the mmap_sem is hold for reading
+ * there's no need of this check after get_task_mm()/mmget_not_zero().
+ *
+ * This function can be obsoleted and the check can be removed, after
+ * the coredump code will hold the mmap_sem for writing before
+ * invoking the ->core_dump methods.
+ */
+static inline bool mmget_still_valid(struct mm_struct *mm)
+{
+ return likely(!mm->core_state);
+}
+
/**
* mmget() - Pin the address space associated with a &struct mm_struct.
* @mm: The address space to pin.
--- a/mm/mmap.c~coredump-fix-race-condition-between-mmget_not_zero-get_task_mm-and-core-dumping
+++ a/mm/mmap.c
@@ -45,6 +45,7 @@
#include <linux/moduleparam.h>
#include <linux/pkeys.h>
#include <linux/oom.h>
+#include <linux/sched/mm.h>
#include <linux/uaccess.h>
#include <asm/cacheflush.h>
@@ -2525,7 +2526,8 @@ find_extend_vma(struct mm_struct *mm, un
vma = find_vma_prev(mm, addr, &prev);
if (vma && (vma->vm_start <= addr))
return vma;
- if (!prev || expand_stack(prev, addr))
+ /* don't alter vm_end if the coredump is running */
+ if (!prev || !mmget_still_valid(mm) || expand_stack(prev, addr))
return NULL;
if (prev->vm_flags & VM_LOCKED)
populate_vma_page_range(prev, addr, prev->vm_end, NULL);
@@ -2551,6 +2553,9 @@ find_extend_vma(struct mm_struct *mm, un
return vma;
if (!(vma->vm_flags & VM_GROWSDOWN))
return NULL;
+ /* don't alter vm_start if the coredump is running */
+ if (!mmget_still_valid(mm))
+ return NULL;
start = vma->vm_start;
if (expand_stack(vma, addr))
return NULL;
_
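The pattern the patch applies at each call site (take the lock, check mmget_still_valid(), skip the modification if a core dump is running) can be modelled in userspace. Everything prefixed sim_ below is hypothetical: struct sim_mm is a drastically reduced stand-in for struct mm_struct, and the locking that surrounds the check in the kernel is only indicated by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much reduced model of struct mm_struct. */
struct sim_mm {
	void *core_state;    /* non-NULL while a core dump is in progress */
	unsigned long flags; /* stands in for the vma flags being edited */
};

/* Mirrors the shape of the new helper: valid unless a dump is running. */
static bool sim_mmget_still_valid(const struct sim_mm *mm)
{
	return mm->core_state == NULL;
}

/*
 * Clear bits in mm->flags unless a core dump is running. In the kernel
 * this would sit between down_write() and up_write() on the mmap_sem.
 * Returns true if the flags were modified.
 */
static bool sim_clear_flags(struct sim_mm *mm, unsigned long mask)
{
	assert(mm != NULL);

	if (!sim_mmget_still_valid(mm))
		return false; /* skip: the dump must see a stable layout */

	mm->flags &= ~mask;
	return true;
}
```

While core_state is NULL the flags are modified as usual; once it is set, the call leaves the flags untouched and reports that it skipped the update.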
From: zhong jiang <zhongjiang(a)huawei.com>
Subject: mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock
When adding memory by probing a memory block in the sysfs interface, there
is an obvious issue: we unlock the device_hotplug_lock even when we
failed to take it.
That issue was introduced in 8df1d0e4a265 ("mm/memory_hotplug: make
add_memory() take the device_hotplug_lock").
We should return immediately when failing to take the device_hotplug_lock.
Link: http://lkml.kernel.org/r/1554696437-9593-1-git-send-email-zhongjiang@huawei…
Fixes: 8df1d0e4a265 ("mm/memory_hotplug: make add_memory() take the device_hotplug_lock")
Signed-off-by: zhong jiang <zhongjiang(a)huawei.com>
Reported-by: Yang yingliang <yangyingliang(a)huawei.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/base/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/base/memory.c~mm-memory_hotplug-do-not-unlock-when-fails-to-take-the-device_hotplug_lock
+++ a/drivers/base/memory.c
@@ -506,7 +506,7 @@ static ssize_t probe_store(struct device
ret = lock_device_hotplug_sysfs();
if (ret)
- goto out;
+ return ret;
nid = memory_add_physaddr_to_nid(phys_addr);
ret = __add_memory(nid, phys_addr,
_
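The lock imbalance described above is easy to see with a counter. The following userspace sketch is not the probe_store() code: struct sim_lock, its "contended" flag and the -EBUSY return value are all assumptions chosen for the illustration (the real return value of lock_device_hotplug_sysfs() is not modelled), and the "buggy" switch selects between the old "goto out" path and the new early return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock with a hold counter so imbalances become visible. */
struct sim_lock {
	int  held;      /* +1 per successful lock, -1 per unlock */
	bool contended; /* when true, taking the lock fails */
};

static int sim_lock_sysfs(struct sim_lock *l)
{
	assert(l != NULL);
	if (l->contended)
		return -EBUSY; /* placeholder error; the lock is NOT taken */
	l->held++;
	return 0;
}

static void sim_unlock(struct sim_lock *l)
{
	l->held--;
}

static int sim_probe_store(struct sim_lock *l, bool buggy)
{
	int ret = sim_lock_sysfs(l);

	if (ret) {
		if (buggy)
			goto out; /* old path: falls through to the unlock */
		return ret;       /* fixed path: nothing to undo */
	}

	/* ... the actual memory probing work would go here ... */
	ret = 0;
out:
	sim_unlock(l);
	return ret;
}
```

When the lock cannot be taken, the old path leaves the counter at -1 (an unlock without a matching lock), whereas the fixed path leaves it at 0.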
The patch titled
Subject: fs/binfmt_elf.c: move brk out of mmap when doing direct loader exec
has been removed from the -mm tree. Its filename was
binfmt_elf-move-brk-out-of-mmap-when-doing-direct-loader-exec.patch
This patch was dropped because it had testing failures
------------------------------------------------------
From: Kees Cook <keescook(a)chromium.org>
Subject: fs/binfmt_elf.c: move brk out of mmap when doing direct loader exec
eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"), made
changes in the rare case when the ELF loader was directly invoked (e.g. to
set a non-inheritable LD_LIBRARY_PATH, testing new versions of the
loader), by moving into the mmap region to avoid both ET_EXEC and PIE
binaries. This had the effect of also moving the brk region into mmap,
which could lead to the stack and brk being arbitrarily close to each
other. An unlucky process wouldn't get its requested stack size and stack
allocations could end up scribbling on the heap.
This is illustrated here. In the case of using the loader directly, brk
(so helpfully identified as "[heap]") is allocated with the _loader_ not
the binary. For example, with ASLR entirely disabled, you can see this
more clearly:
$ /bin/cat /proc/self/maps
555555554000-55555555c000 r-xp 00000000 ... /bin/cat
55555575b000-55555575c000 r--p 00007000 ... /bin/cat
55555575c000-55555575d000 rw-p 00008000 ... /bin/cat
55555575d000-55555577e000 rw-p 00000000 ... [heap]
...
7ffff7ff7000-7ffff7ffa000 r--p 00000000 ... [vvar]
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 ... [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffd000-7ffff7ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 ...
7ffffffde000-7ffffffff000 rw-p 00000000 ... [stack]
$ /lib/x86_64-linux-gnu/ld-2.27.so /bin/cat /proc/self/maps
...
7ffff7bcc000-7ffff7bd4000 r-xp 00000000 ... /bin/cat
7ffff7bd4000-7ffff7dd3000 ---p 00008000 ... /bin/cat
7ffff7dd3000-7ffff7dd4000 r--p 00007000 ... /bin/cat
7ffff7dd4000-7ffff7dd5000 rw-p 00008000 ... /bin/cat
7ffff7dd5000-7ffff7dfc000 r-xp 00000000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7fb2000-7ffff7fd6000 rw-p 00000000 ...
7ffff7ff7000-7ffff7ffa000 r--p 00000000 ... [vvar]
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 ... [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffd000-7ffff7ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffe000-7ffff8020000 rw-p 00000000 ... [heap]
7ffffffde000-7ffffffff000 rw-p 00000000 ... [stack]
The solution is to move brk out of mmap and into ELF_ET_DYN_BASE since
nothing is there in this direct loader case (and ET_EXEC still far away at
0x400000). Anything that ran before should still work (i.e. the
ultimately-launched binary already had the brk very far from its text, so
this should be no different from a COMPAT_BRK standpoint). The only risk
I see here is if someone suddenly started to depend on the entire
memory space above the mmap region being available when launching binaries
via direct loader execs, which seems highly unlikely, I'd hope: this
would mean a binary would _not_ work when execed normally.
Link: http://lkml.kernel.org/r/20190416042320.GA36924@beast
Link: https://lkml.kernel.org/r/CAGXu5jJ5sj3emOT2QPxQkNQk0qbU6zEfu9=Omfhx_p0nCKPS…
Fixes: eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
Signed-off-by: Kees Cook <keescook(a)chromium.org>
Reported-by: Ali Saidi <alisaidi(a)amazon.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Jann Horn <jannh(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/binfmt_elf.c | 9 +++++++++
1 file changed, 9 insertions(+)
--- a/fs/binfmt_elf.c~binfmt_elf-move-brk-out-of-mmap-when-doing-direct-loader-exec
+++ a/fs/binfmt_elf.c
@@ -1131,6 +1131,15 @@ out_free_interp:
current->mm->end_data = end_data;
current->mm->start_stack = bprm->p;
+ /*
+ * When executing a loader directly (ET_DYN without Interp), move
+ * the brk area out of the mmap region (since it grows up, and may
+ * collide early with the stack growing down), and into the unused
+ * ELF_ET_DYN_BASE region.
+ */
+ if (!elf_interpreter)
+ current->mm->brk = current->mm->start_brk = ELF_ET_DYN_BASE;
+
if ((current->flags & PF_RANDOMIZE) && (randomize_va_space > 1)) {
current->mm->brk = current->mm->start_brk =
arch_randomize_brk(current->mm);
_
Patches currently in -mm which might be from keescook(a)chromium.org are
On Thu, Apr 18, 2019 at 11:01:42PM +0300, Amit Klein wrote:
> Patch 355b98553789b646ed97ad801a619ff898471b92 makes net_hash_mix() return
> true 32 bits of entropy. When used in the IP ID generation algorithm, this
> has the effect of extending the IP ID generation key from 32 bits to 64
> bits.
> However, net_hash_mix() is only used for IP ID generation starting with
> kernel version 4.1. Therefore, earlier kernels remain with 32-bit key.
> The patch addresses this issue by explicitly extending the key to 64 bits
> for kernels v<4.1.
Very nice, thanks!
One nit, it's easier to reference commits by a shorter sha1 and the text
of the commit, than just one long number. So I would rewrite the
subject and paragraphs to be something like the following:
------------
Subject: [PATCH] inet: update the IP ID generation algorithm to higher standards.
Commit 355b98553789 ("netns: provide pure entropy for net_hash_mix()")
makes net_hash_mix() return a true 32 bits of entropy. When used in the
IP ID generation algorithm, this has the effect of extending the IP ID
generation key from 32 bits to 64 bits.
However, net_hash_mix() is only used for IP ID generation starting with
kernel version 4.1. Therefore, earlier kernels remain with 32-bit key
no matter what the net_hash_mix() return value is.
This change addresses the issue by explicitly extending the key to 64
bits for kernels older than 4.1.
------------
Does that look good to you as an accurate representation? If so, I can
edit the text of your patch when I queue it up.
thanks,
greg k-h
On 4/18/19 11:29 AM, Sasha Levin wrote:
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: 1de4fa14ee25 x86, mpx: Cleanup unused bound tables.
>
> The bot has tested the following trees: v5.0.8, v4.19.35, v4.14.112, v4.9.169, v4.4.178.
>
> v5.0.8: Build OK!
> v4.19.35: Failed to apply! Possible dependencies:
> dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
I probably should have looked more closely at the state of the code
before dd2283f2605e. A more correct Fixes: would probably have referred
to dd2283f2605e. *It* appears to be the root cause rather than the
original MPX code that I called out.
The pre-dd2283f2605e code does this:
> /*
> * Remove the vma's, and unmap the actual pages
> */
> detach_vmas_to_be_unmapped(mm, vma, prev, end);
> unmap_region(mm, vma, prev, start, end);
>
> arch_unmap(mm, vma, start, end);
>
> /* Fix up all other VM information */
> remove_vma_list(mm, vma);
But, this is actually safe. arch_unmap() can't see 'vma' in the rbtree
because it's been detached, so it can't do anything to 'vma' that might
be unsafe for remove_vma_list()'s use of 'vma'.
The bug in dd2283f2605e was moving unmap_region() to after arch_unmap().
I confirmed this by running the reproducer on v4.19.35. It did not
trigger anything there, even with a bunch of debugging enabled which
detected the issue in 5.0.
Upstream commit 3fe6e52f0626 ("ovl: override creds with the ones from
the superblock mounter") is present in v4.4.156 as 121b09d30d48. But the
patch has a few follow up fixes in upstream that also have to be applied
to 4.4:
d0e13f5bbe4b ("ovl: fix uid/gid when creating over whiteout") in v4.7-rc4
8fc646b44385 ("ovl: fix random return value on mount") in v4.13-rc2
The first patch should fix v4.4.156 regression reported here:
https://bugzilla.kernel.org/show_bug.cgi?id=201633.
The rest of the stable kernels (4.9+) already have these follow-up fixes.
From: Roman Gushchin <guro(a)fb.com>
[ commit c29f9010a35604047f96a7e9d6cbabfa36d996d1 from 4.14.y ]
Yongqin reported that /proc/zoneinfo format is broken in 4.14
due to commit 7aaf77272358 ("mm: don't show nr_indirectly_reclaimable
in /proc/vmstat")
Node 0, zone DMA
per-node stats
nr_inactive_anon 403
nr_active_anon 89123
nr_inactive_file 128887
nr_active_file 47377
nr_unevictable 2053
nr_slab_reclaimable 7510
nr_slab_unreclaimable 10775
nr_isolated_anon 0
nr_isolated_file 0
<...>
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_dirtied 6022
nr_written 5985
74240
^^^^^^^^^^
pages free 131656
The problem is caused by the nr_indirectly_reclaimable counter,
which is hidden from the /proc/vmstat, but not from the
/proc/zoneinfo. Let's fix this inconsistency and hide the
counter from /proc/zoneinfo exactly as from /proc/vmstat.
BTW, in 4.19+ the counter has been renamed and exported by
the commit b29940c1abd7 ("mm: rename and change semantics of
nr_indirectly_reclaimable_bytes"), so there is no such problem
anymore.
Cc: <stable(a)vger.kernel.org> # 4.19.y
Fixes: 7aaf77272358 ("mm: don't show nr_indirectly_reclaimable in /proc/vmstat")
Reported-by: Yongqin Liu <yongqin.liu(a)linaro.org>
Signed-off-by: Roman Gushchin <guro(a)fb.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
---
mm/vmstat.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 72ef3936d15d..7b8937cb2876 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1550,6 +1550,10 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
if (is_zone_first_populated(pgdat, zone)) {
seq_printf(m, "\n per-node stats");
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+ /* Skip hidden vmstat items. */
+ if (*vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
+ NR_VM_NUMA_STAT_ITEMS] == '\0')
+ continue;
seq_printf(m, "\n %-12s %lu",
vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
NR_VM_NUMA_STAT_ITEMS],
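The "skip items whose name is the empty string" check can be reproduced in a few lines of userspace C. This is a sketch only: sim_show_stats() and its output format are hypothetical, and the name table is a plain array rather than the kernel's vmstat_text[] with its index offsets.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Print "name value" lines into buf, skipping entries whose name is the
 * empty string (the convention used here for hidden counters).
 * Returns the number of entries actually printed.
 */
static int sim_show_stats(char *buf, size_t buflen,
			  const char *const *names,
			  const unsigned long *vals, size_t n)
{
	size_t off = 0;
	int shown = 0;

	assert(buf != NULL && buflen > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < n; i++) {
		int w;

		/* Skip hidden items. */
		if (*names[i] == '\0')
			continue;

		w = snprintf(buf + off, buflen - off, "%s %lu\n",
			     names[i], vals[i]);
		if (w < 0 || (size_t)w >= buflen - off) {
			buf[off] = '\0'; /* drop the partial line and stop */
			break;
		}
		off += (size_t)w;
		shown++;
	}
	return shown;
}
```

Given the names "nr_dirtied", "" and "nr_written", only the first and third entries are printed, so the value belonging to the unnamed counter no longer appears as a bare number in the output.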
commit 662d66466637862ef955f7f6e78a286d8cf0ebef upstream.
When a QP is put into error state, all pending requests in the send work
queue should be drained. The following sequence of events could lead to a
failure, causing a request to hang:
(1) The QP builds a packet and tries to send through SDMA engine.
However, PIO engine is still busy. Consequently, this packet is put on
the QP's tx list and the QP is put on the PIO waiting list. The field
qp->s_flags is set with HFI1_S_WAIT_PIO_DRAIN;
(2) The QP is put into error state by the user application and
notify_error_qp() is called, which removes the QP from the PIO waiting
list and the packet from the QP's tx list. In addition, qp->s_flags is
cleared of RVT_S_ANY_WAIT_IO bits, which does not include
HFI1_S_WAIT_PIO_DRAIN bit;
(3) The hfi1_schedule_send() function is called to drain the QP's send
queue. Subsequently, hfi1_do_send() is called. Since the flag bit
HFI1_S_WAIT_PIO_DRAIN is set in qp->s_flags, hfi1_send_ok() fails. As
a result, hfi1_do_send() bails out without draining any request from
the send queue;
(4) The PIO engine completes the sending and tries to wake up any QP on
its waiting list. But the QP has been removed from the PIO waiting
list and therefore is left sleeping forever.
The fix is to clear qp->s_flags of HFI1_S_ANY_WAIT_IO bits in step (2).
HFI1_S_ANY_WAIT_IO includes RVT_S_ANY_WAIT_IO and HFI1_S_WAIT_PIO_DRAIN.
[ Corrected commit message to match upstream ]
Fixes: 2e2ba09e48b7 ("IB/rdmavt, IB/hfi1: Create device dependent s_flags")
Cc: <stable(a)vger.kernel.org> # 4.19.x+
Reviewed-by: Mike Marciniszyn <mike.marciniszyn(a)intel.com>
Reviewed-by: Alex Estrin <alex.estrin(a)intel.com>
Signed-off-by: Kaike Wan <kaike.wan(a)intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro(a)intel.com>
Signed-off-by: Jason Gunthorpe <jgg(a)mellanox.com>
---
drivers/infiniband/hw/hfi1/qp.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/infiniband/hw/hfi1/qp.c b/drivers/infiniband/hw/hfi1/qp.c
index 9b1e84a..63c5ba6 100644
--- a/drivers/infiniband/hw/hfi1/qp.c
+++ b/drivers/infiniband/hw/hfi1/qp.c
@@ -784,7 +784,7 @@ void notify_error_qp(struct rvt_qp *qp)
write_seqlock(lock);
if (!list_empty(&priv->s_iowait.list) &&
!(qp->s_flags & RVT_S_BUSY)) {
- qp->s_flags &= ~RVT_S_ANY_WAIT_IO;
+ qp->s_flags &= ~HFI1_S_ANY_WAIT_IO;
list_del_init(&priv->s_iowait.list);
priv->s_iowait.lock = NULL;
rvt_put_qp(qp);
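Why the narrower mask leaves the QP stuck can be shown with plain bit arithmetic. The bit values below are invented for the illustration (the real definitions live in the rdmavt and hfi1 headers), and sim_send_ok() is a simplified stand-in that only looks at the wait bits, which is an assumption about the relevant part of hfi1_send_ok().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SIM_RVT_S_ANY_WAIT_IO     0x0000000fu
#define SIM_HFI1_S_WAIT_PIO_DRAIN 0x00000010u
#define SIM_HFI1_S_ANY_WAIT_IO \
	(SIM_RVT_S_ANY_WAIT_IO | SIM_HFI1_S_WAIT_PIO_DRAIN)

/* The device-specific mask must cover the drain bit. */
static_assert((SIM_HFI1_S_ANY_WAIT_IO & SIM_HFI1_S_WAIT_PIO_DRAIN) != 0,
	      "HFI1 mask must include the PIO drain bit");

/* Clear the given wait bits, as notify_error_qp() does in step (2). */
static uint32_t sim_clear_wait(uint32_t s_flags, uint32_t mask)
{
	return s_flags & ~mask;
}

/* Simplified send check: sending is allowed only with no wait bit set. */
static int sim_send_ok(uint32_t s_flags)
{
	return !(s_flags & SIM_HFI1_S_ANY_WAIT_IO);
}
```

Starting from flags that contain the drain bit, clearing with the generic mask leaves that bit set and the send check keeps failing; clearing with the device-specific mask removes it and the send check passes.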