This is the start of the stable review cycle for the 4.4.210 release.
There are 28 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 16 Jan 2020 09:41:58 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.210-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.4.210-rc1
Florian Westphal <fw(a)strlen.de>
netfilter: ipset: avoid null deref when IPSET_ATTR_LINENO is present
Florian Westphal <fw(a)strlen.de>
netfilter: arp_tables: init netns pointer in xt_tgchk_param struct
Alan Stern <stern(a)rowland.harvard.edu>
USB: Fix: Don't skip endpoint descriptors with maxpacket=0
Navid Emamdoost <navid.emamdoost(a)gmail.com>
rtl8xxxu: prevent leaking urb
Navid Emamdoost <navid.emamdoost(a)gmail.com>
scsi: bfa: release allocated memory in case of error
Navid Emamdoost <navid.emamdoost(a)gmail.com>
mwifiex: pcie: Fix memory leak in mwifiex_pcie_alloc_cmdrsp_buf
Ganapathi Bhat <gbhat(a)marvell.com>
mwifiex: fix possible heap overflow in mwifiex_process_country_ie()
Sudip Mukherjee <sudipm.mukherjee(a)gmail.com>
tty: always relink the port
Sudip Mukherjee <sudipm.mukherjee(a)gmail.com>
tty: link tty and port before configuring it as console
Michael Straube <straube.linux(a)gmail.com>
staging: rtl8188eu: Add device code for TP-Link TL-WN727N v5.21
Paul Cercueil <paul(a)crapouillou.net>
usb: musb: dma: Correct parameter passed to IRQ handler
Paul Cercueil <paul(a)crapouillou.net>
usb: musb: Disable pullup at init
Daniele Palmas <dnlplm(a)gmail.com>
USB: serial: option: add ZLP support for 0x1bc7/0x9010
Malcolm Priestley <tvboxspy(a)gmail.com>
staging: vt6656: set usb_set_intfdata on driver fail.
Oliver Hartkopp <socketcan(a)hartkopp.net>
can: can_dropped_invalid_skb(): ensure an initialized headroom in outgoing CAN sk_buffs
Florian Faber <faber(a)faberman.de>
can: mscan: mscan_rx_poll(): fix rx path lockup when returning from polling to irq mode
Johan Hovold <johan(a)kernel.org>
can: gs_usb: gs_usb_probe(): use descriptors of current altsetting
Wayne Lin <Wayne.Lin(a)amd.com>
drm/dp_mst: correct the shifting in DP_REMOTE_I2C_READ
Dmitry Torokhov <dmitry.torokhov(a)gmail.com>
Input: add safety guards to input_set_keycode()
Dmitry Torokhov <dmitry.torokhov(a)gmail.com>
HID: hid-input: clear unmapped usages
Marcel Holtmann <marcel(a)holtmann.org>
HID: uhid: Fix returning EPOLLOUT from uhid_char_poll
Alan Stern <stern(a)rowland.harvard.edu>
HID: Fix slab-out-of-bounds read in hid_field_extract
Steven Rostedt (VMware) <rostedt(a)goodmis.org>
tracing: Have stack tracer compile when MCOUNT_INSN_SIZE is not defined
Kaitao Cheng <pilgrimtao(a)gmail.com>
kernel/trace: Fix do not unregister tracepoints when register sched_migrate_task fail
Takashi Iwai <tiwai(a)suse.de>
ALSA: usb-audio: Apply the sample rate quirk for Bose Companion 5
Guenter Roeck <linux(a)roeck-us.net>
usb: chipidea: host: Disable port power only if previously enabled
Will Deacon <will(a)kernel.org>
chardev: Avoid potential use-after-free in 'chrdev_open()'
Jan Kara <jack(a)suse.cz>
kobject: Export kobject_get_unless_zero()
-------------
Diffstat:
Makefile | 4 +--
drivers/gpu/drm/drm_dp_mst_topology.c | 2 +-
drivers/hid/hid-core.c | 6 +++++
drivers/hid/hid-input.c | 16 ++++++++---
drivers/hid/uhid.c | 3 ++-
drivers/input/input.c | 26 +++++++++++-------
drivers/net/can/mscan/mscan.c | 21 +++++++--------
drivers/net/can/usb/gs_usb.c | 4 +--
drivers/net/wireless/mwifiex/pcie.c | 4 ++-
drivers/net/wireless/mwifiex/sta_ioctl.c | 11 +++++++-
drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c | 1 +
drivers/scsi/bfa/bfad_attr.c | 4 ++-
drivers/staging/rtl8188eu/os_dep/usb_intf.c | 1 +
drivers/staging/vt6656/device.h | 1 +
drivers/staging/vt6656/main_usb.c | 1 +
drivers/staging/vt6656/wcmd.c | 1 +
drivers/tty/serial/serial_core.c | 1 +
drivers/usb/chipidea/host.c | 4 ++-
drivers/usb/core/config.c | 12 ++++++---
drivers/usb/musb/musb_core.c | 3 +++
drivers/usb/musb/musbhsdma.c | 2 +-
drivers/usb/serial/option.c | 8 ++++++
drivers/usb/serial/usb-wwan.h | 1 +
drivers/usb/serial/usb_wwan.c | 4 +++
fs/char_dev.c | 2 +-
include/linux/can/dev.h | 34 ++++++++++++++++++++++++
include/linux/kobject.h | 2 ++
kernel/trace/trace_sched_wakeup.c | 4 ++-
kernel/trace/trace_stack.c | 5 ++++
lib/kobject.c | 5 +++-
net/ipv4/netfilter/arp_tables.c | 27 +++++++++++--------
net/netfilter/ipset/ip_set_core.c | 3 ++-
sound/usb/quirks.c | 1 +
33 files changed, 169 insertions(+), 55 deletions(-)
The patch titled
Subject: mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid
has been removed from the -mm tree. Its filename was
mm-memcg-slab-call-flush_memcg_workqueue-only-if-memcg-workqueue-is-valid.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Adrian Huang <ahuang12(a)lenovo.com>
Subject: mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid
When booting with amd_iommu=off, the following WARNING message
appears:
AMD-Vi: AMD IOMMU disabled on kernel command-line
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
RIP: 0010:flush_workqueue+0x42e/0x450
Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff <0f> 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
RSP: 0000:ffffffff96203d80 EFLAGS: 00010246
RAX: ffffffff96203dc8 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffffffff96a63120 RSI: ffffffff95efcba2 RDI: ffffffff96203dc0
RBP: ffffffff96203e08 R08: 0000000000000000 R09: ffffffff962a1828
R10: 00000000f0000080 R11: dead000000000100 R12: ffff8d8a87c0a770
R13: dead000000000100 R14: 0000000000000456 R15: ffffffff96203da0
FS: 0000000000000000(0000) GS:ffff8d8dbd000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8d91cfbff000 CR3: 000000078920a000 CR4: 00000000000406b0
Call Trace:
? wait_for_completion+0x51/0x180
kmem_cache_destroy+0x69/0x260
iommu_go_to_state+0x40c/0x5ab
amd_iommu_prepare+0x16/0x2a
irq_remapping_prepare+0x36/0x5f
enable_IR_x2apic+0x21/0x172
default_setup_apic_routing+0x12/0x6f
apic_intr_mode_init+0x1a1/0x1f1
x86_late_time_init+0x17/0x1c
start_kernel+0x480/0x53f
secondary_startup_64+0xb6/0xc0
---[ end trace 30894107c3749449 ]---
x2apic: IRQ remapping doesn't support X2APIC mode
x2apic disabled
The warning is caused by the calling of 'kmem_cache_destroy()'
in free_iommu_resources(). Here is the call path:
free_iommu_resources
kmem_cache_destroy
flush_memcg_workqueue
flush_workqueue
The root cause is that the IOMMU subsystem runs before the workqueue
subsystem, which the variable 'wq_online' is still 'false'. This leads to
the statement 'if (WARN_ON(!wq_online))' in flush_workqueue() is 'true'.
Since the variable 'memcg_kmem_cache_wq' is not allocated during the time,
it is unnecessary to call flush_memcg_workqueue(). This prevents the
WARNING message triggered by flush_workqueue().
Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
Fixes: 92ee383f6daab ("mm: fix race between kmem_cache destroy, create and deactivate")
Signed-off-by: Adrian Huang <ahuang12(a)lenovo.com>
Reported-by: Xiaochun Lee <lixc17(a)lenovo.com>
Reviewed-by: Shakeel Butt <shakeelb(a)google.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slab_common.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/slab_common.c~mm-memcg-slab-call-flush_memcg_workqueue-only-if-memcg-workqueue-is-valid
+++ a/mm/slab_common.c
@@ -903,7 +903,8 @@ static void flush_memcg_workqueue(struct
* deactivates the memcg kmem_caches through workqueue. Make sure all
* previous workitems on workqueue are processed.
*/
- flush_workqueue(memcg_kmem_cache_wq);
+ if (likely(memcg_kmem_cache_wq))
+ flush_workqueue(memcg_kmem_cache_wq);
/*
* If we're racing with children kmem_cache deactivation, it might
_
Patches currently in -mm which might be from ahuang12(a)lenovo.com are
The patch titled
Subject: mm, debug_pagealloc: don't rely on static keys too early
has been removed from the -mm tree. Its filename was
mm-debug_pagealloc-dont-rely-on-static-keys-too-early.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Vlastimil Babka <vbabka(a)suse.cz>
Subject: mm, debug_pagealloc: don't rely on static keys too early
Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
debugging") has introduced a static key to reduce overhead when
debug_pagealloc is compiled in but not enabled. It relied on the
assumption that jump_label_init() is called before parse_early_param() as
in start_kernel(), so when the "debug_pagealloc=on" option is parsed, it
is safe to enable the static key.
However, it turns out multiple architectures call parse_early_param()
earlier from their setup_arch(). x86 also calls jump_label_init() even
earlier, so no issue was found while testing the commit, but same is not
true for e.g. ppc64 and s390 where the kernel would not boot with
debug_pagealloc=on as found by our QA.
To fix this without tricky changes to init code of multiple architectures,
this patch partially reverts the static key conversion from 96a2b03f281d.
Init-time and non-fastpath calls (such as in arch code) of
debug_pagealloc_enabled() will again test a simple bool variable.
Fastpath mm code is converted to a new debug_pagealloc_enabled_static()
variant that relies on the static key, which is enabled in a well-defined
point in mm_init() where it's guaranteed that jump_label_init() has been
called, regardless of architecture.
[sfr(a)canb.auug.org.au: export _debug_pagealloc_enabled_early]
Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
Signed-off-by: Stephen Rothwell <sfr(a)canb.auug.org.au>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Qian Cai <cai(a)lca.pw>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 18 +++++++++++++++---
init/main.c | 1 +
mm/page_alloc.c | 37 +++++++++++++------------------------
mm/slab.c | 4 ++--
mm/slub.c | 2 +-
mm/vmalloc.c | 4 ++--
6 files changed, 34 insertions(+), 32 deletions(-)
--- a/include/linux/mm.h~mm-debug_pagealloc-dont-rely-on-static-keys-too-early
+++ a/include/linux/mm.h
@@ -2658,14 +2658,26 @@ static inline bool want_init_on_free(voi
!page_poisoning_enabled();
}
-#ifdef CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
-DECLARE_STATIC_KEY_TRUE(_debug_pagealloc_enabled);
+#ifdef CONFIG_DEBUG_PAGEALLOC
+extern void init_debug_pagealloc(void);
#else
-DECLARE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);
+static inline void init_debug_pagealloc(void) {}
#endif
+extern bool _debug_pagealloc_enabled_early;
+DECLARE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);
static inline bool debug_pagealloc_enabled(void)
{
+ return IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) &&
+ _debug_pagealloc_enabled_early;
+}
+
+/*
+ * For use in fast paths after init_debug_pagealloc() has run, or when a
+ * false negative result is not harmful when called too early.
+ */
+static inline bool debug_pagealloc_enabled_static(void)
+{
if (!IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
return false;
--- a/init/main.c~mm-debug_pagealloc-dont-rely-on-static-keys-too-early
+++ a/init/main.c
@@ -553,6 +553,7 @@ static void __init mm_init(void)
* bigger than MAX_ORDER unless SPARSEMEM.
*/
page_ext_init_flatmem();
+ init_debug_pagealloc();
report_meminit();
mem_init();
kmem_cache_init();
--- a/mm/page_alloc.c~mm-debug_pagealloc-dont-rely-on-static-keys-too-early
+++ a/mm/page_alloc.c
@@ -694,34 +694,27 @@ void prep_compound_page(struct page *pag
#ifdef CONFIG_DEBUG_PAGEALLOC
unsigned int _debug_guardpage_minorder;
-#ifdef CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
-DEFINE_STATIC_KEY_TRUE(_debug_pagealloc_enabled);
-#else
+bool _debug_pagealloc_enabled_early __read_mostly
+ = IS_ENABLED(CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT);
+EXPORT_SYMBOL(_debug_pagealloc_enabled_early);
DEFINE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);
-#endif
EXPORT_SYMBOL(_debug_pagealloc_enabled);
DEFINE_STATIC_KEY_FALSE(_debug_guardpage_enabled);
static int __init early_debug_pagealloc(char *buf)
{
- bool enable = false;
-
- if (kstrtobool(buf, &enable))
- return -EINVAL;
-
- if (enable)
- static_branch_enable(&_debug_pagealloc_enabled);
-
- return 0;
+ return kstrtobool(buf, &_debug_pagealloc_enabled_early);
}
early_param("debug_pagealloc", early_debug_pagealloc);
-static void init_debug_guardpage(void)
+void init_debug_pagealloc(void)
{
if (!debug_pagealloc_enabled())
return;
+ static_branch_enable(&_debug_pagealloc_enabled);
+
if (!debug_guardpage_minorder())
return;
@@ -1186,7 +1179,7 @@ static __always_inline bool free_pages_p
*/
arch_free_page(page, order);
- if (debug_pagealloc_enabled())
+ if (debug_pagealloc_enabled_static())
kernel_map_pages(page, 1 << order, 0);
kasan_free_nondeferred_pages(page, order);
@@ -1207,7 +1200,7 @@ static bool free_pcp_prepare(struct page
static bool bulkfree_pcp_prepare(struct page *page)
{
- if (debug_pagealloc_enabled())
+ if (debug_pagealloc_enabled_static())
return free_pages_check(page);
else
return false;
@@ -1221,7 +1214,7 @@ static bool bulkfree_pcp_prepare(struct
*/
static bool free_pcp_prepare(struct page *page)
{
- if (debug_pagealloc_enabled())
+ if (debug_pagealloc_enabled_static())
return free_pages_prepare(page, 0, true);
else
return free_pages_prepare(page, 0, false);
@@ -1973,10 +1966,6 @@ void __init page_alloc_init_late(void)
for_each_populated_zone(zone)
set_zone_contiguous(zone);
-
-#ifdef CONFIG_DEBUG_PAGEALLOC
- init_debug_guardpage();
-#endif
}
#ifdef CONFIG_CMA
@@ -2106,7 +2095,7 @@ static inline bool free_pages_prezeroed(
*/
static inline bool check_pcp_refill(struct page *page)
{
- if (debug_pagealloc_enabled())
+ if (debug_pagealloc_enabled_static())
return check_new_page(page);
else
return false;
@@ -2128,7 +2117,7 @@ static inline bool check_pcp_refill(stru
}
static inline bool check_new_pcp(struct page *page)
{
- if (debug_pagealloc_enabled())
+ if (debug_pagealloc_enabled_static())
return check_new_page(page);
else
return false;
@@ -2155,7 +2144,7 @@ inline void post_alloc_hook(struct page
set_page_refcounted(page);
arch_alloc_page(page, order);
- if (debug_pagealloc_enabled())
+ if (debug_pagealloc_enabled_static())
kernel_map_pages(page, 1 << order, 1);
kasan_alloc_pages(page, order);
kernel_poison_pages(page, 1 << order, 1);
--- a/mm/slab.c~mm-debug_pagealloc-dont-rely-on-static-keys-too-early
+++ a/mm/slab.c
@@ -1416,7 +1416,7 @@ static void kmem_rcu_free(struct rcu_hea
#if DEBUG
static bool is_debug_pagealloc_cache(struct kmem_cache *cachep)
{
- if (debug_pagealloc_enabled() && OFF_SLAB(cachep) &&
+ if (debug_pagealloc_enabled_static() && OFF_SLAB(cachep) &&
(cachep->size % PAGE_SIZE) == 0)
return true;
@@ -2008,7 +2008,7 @@ int __kmem_cache_create(struct kmem_cach
* to check size >= 256. It guarantees that all necessary small
* sized slab is initialized in current slab initialization sequence.
*/
- if (debug_pagealloc_enabled() && (flags & SLAB_POISON) &&
+ if (debug_pagealloc_enabled_static() && (flags & SLAB_POISON) &&
size >= 256 && cachep->object_size > cache_line_size()) {
if (size < PAGE_SIZE || size % PAGE_SIZE == 0) {
size_t tmp_size = ALIGN(size, PAGE_SIZE);
--- a/mm/slub.c~mm-debug_pagealloc-dont-rely-on-static-keys-too-early
+++ a/mm/slub.c
@@ -288,7 +288,7 @@ static inline void *get_freepointer_safe
unsigned long freepointer_addr;
void *p;
- if (!debug_pagealloc_enabled())
+ if (!debug_pagealloc_enabled_static())
return get_freepointer(s, object);
freepointer_addr = (unsigned long)object + s->offset;
--- a/mm/vmalloc.c~mm-debug_pagealloc-dont-rely-on-static-keys-too-early
+++ a/mm/vmalloc.c
@@ -1383,7 +1383,7 @@ static void free_unmap_vmap_area(struct
{
flush_cache_vunmap(va->va_start, va->va_end);
unmap_vmap_area(va);
- if (debug_pagealloc_enabled())
+ if (debug_pagealloc_enabled_static())
flush_tlb_kernel_range(va->va_start, va->va_end);
free_vmap_area_noflush(va);
@@ -1681,7 +1681,7 @@ static void vb_free(const void *addr, un
vunmap_page_range((unsigned long)addr, (unsigned long)addr + size);
- if (debug_pagealloc_enabled())
+ if (debug_pagealloc_enabled_static())
flush_tlb_kernel_range((unsigned long)addr,
(unsigned long)addr + size);
_
Patches currently in -mm which might be from vbabka(a)suse.cz are
mm-debug-always-print-flags-in-dump_page.patch
The patch titled
Subject: mm: memcg/slab: fix percpu slab vmstats flushing
has been removed from the -mm tree. Its filename was
mm-memcg-slab-fix-percpu-slab-vmstats-flushing.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Roman Gushchin <guro(a)fb.com>
Subject: mm: memcg/slab: fix percpu slab vmstats flushing
Currently slab percpu vmstats are flushed twice: during the memcg
offlining and just before freeing the memcg structure. Each time percpu
counters are summed, added to the atomic counterparts and propagated up by
the cgroup tree.
The second flushing is required due to how recursive vmstats are
implemented: counters are batched in percpu variables on a local level,
and once a percpu value is crossing some predefined threshold, it spills
over to atomic values on the local and each ascendant levels. It means
that without flushing some numbers cached in percpu variables will be
dropped on floor each time a cgroup is destroyed. And with uptime the
error on upper levels might become noticeable.
The first flushing aims to make counters on ancestor levels more precise.
Dying cgroups may resume in the dying state for a long time. After
kmem_cache reparenting which is performed during the offlining slab
counters of the dying cgroup don't have any chances to be updated, because
any slab operations will be performed on the parent level. It means that
the inaccuracy caused by percpu batching will not decrease up to the final
destruction of the cgroup. By the original idea flushing slab counters
during the offlining should minimize the visible inaccuracy of slab
counters on the parent level.
The problem is that percpu counters are not zeroed after the first
flushing. So every cached percpu value is summed twice. It creates a
small error (up to 32 pages per cpu, but usually less) which accumulates
on parent cgroup level. After creating and destroying of thousands of
child cgroups, slab counter on parent level can be way off the real value.
For now, let's just stop flushing slab counters on memcg offlining. It
can't be done correctly without scheduling a work on each cpu: reading and
zeroing it during css offlining can race with an asynchronous update,
which doesn't expect values to be changed underneath.
With this change, slab counters on parent level will become eventually
consistent. Once all dying children are gone, values are correct. And if
not, the error is capped by 32 * NR_CPUS pages per dying cgroup.
It's not perfect, as slab are reparented, so any updates after the
reparenting will happen on the parent level. It means that if a slab page
was allocated, a counter on child level was bumped, then the page was
reparented and freed, the annihilation of positive and negative counter
values will not happen until the child cgroup is released. It makes slab
counters different from others, and it might want us to implement flushing
in a correct form again. But it's also a question of performance:
scheduling a work on each cpu isn't free, and it's an open question if the
benefit of having more accurate counters is worth it.
We might also consider flushing all counters on offlining, not only slab
counters.
So let's fix the main problem now: make the slab counters eventually
consistent, so at least the error won't grow with uptime (or more
precisely the number of created and destroyed cgroups). And think about
the accuracy of counters separately.
Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
Fixes: bee07b33db78 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
Signed-off-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mmzone.h | 5 ++---
mm/memcontrol.c | 37 +++++++++----------------------------
2 files changed, 11 insertions(+), 31 deletions(-)
--- a/include/linux/mmzone.h~mm-memcg-slab-fix-percpu-slab-vmstats-flushing
+++ a/include/linux/mmzone.h
@@ -215,9 +215,8 @@ enum node_stat_item {
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
NR_UNEVICTABLE, /* " " " " " */
- NR_SLAB_RECLAIMABLE, /* Please do not reorder this item */
- NR_SLAB_UNRECLAIMABLE, /* and this one without looking at
- * memcg_flush_percpu_vmstats() first. */
+ NR_SLAB_RECLAIMABLE,
+ NR_SLAB_UNRECLAIMABLE,
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
WORKINGSET_NODES,
--- a/mm/memcontrol.c~mm-memcg-slab-fix-percpu-slab-vmstats-flushing
+++ a/mm/memcontrol.c
@@ -3287,49 +3287,34 @@ static u64 mem_cgroup_read_u64(struct cg
}
}
-static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg, bool slab_only)
+static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg)
{
- unsigned long stat[MEMCG_NR_STAT];
+ unsigned long stat[MEMCG_NR_STAT] = {0};
struct mem_cgroup *mi;
int node, cpu, i;
- int min_idx, max_idx;
-
- if (slab_only) {
- min_idx = NR_SLAB_RECLAIMABLE;
- max_idx = NR_SLAB_UNRECLAIMABLE;
- } else {
- min_idx = 0;
- max_idx = MEMCG_NR_STAT;
- }
-
- for (i = min_idx; i < max_idx; i++)
- stat[i] = 0;
for_each_online_cpu(cpu)
- for (i = min_idx; i < max_idx; i++)
+ for (i = 0; i < MEMCG_NR_STAT; i++)
stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu);
for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
- for (i = min_idx; i < max_idx; i++)
+ for (i = 0; i < MEMCG_NR_STAT; i++)
atomic_long_add(stat[i], &mi->vmstats[i]);
- if (!slab_only)
- max_idx = NR_VM_NODE_STAT_ITEMS;
-
for_each_node(node) {
struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
struct mem_cgroup_per_node *pi;
- for (i = min_idx; i < max_idx; i++)
+ for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
stat[i] = 0;
for_each_online_cpu(cpu)
- for (i = min_idx; i < max_idx; i++)
+ for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
stat[i] += per_cpu(
pn->lruvec_stat_cpu->count[i], cpu);
for (pi = pn; pi; pi = parent_nodeinfo(pi, node))
- for (i = min_idx; i < max_idx; i++)
+ for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
atomic_long_add(stat[i], &pi->lruvec_stat[i]);
}
}
@@ -3403,13 +3388,9 @@ static void memcg_offline_kmem(struct me
parent = root_mem_cgroup;
/*
- * Deactivate and reparent kmem_caches. Then flush percpu
- * slab statistics to have precise values at the parent and
- * all ancestor levels. It's required to keep slab stats
- * accurate after the reparenting of kmem_caches.
+ * Deactivate and reparent kmem_caches.
*/
memcg_deactivate_kmem_caches(memcg, parent);
- memcg_flush_percpu_vmstats(memcg, true);
kmemcg_id = memcg->kmemcg_id;
BUG_ON(kmemcg_id < 0);
@@ -4913,7 +4894,7 @@ static void mem_cgroup_free(struct mem_c
* Flush percpu vmstats and vmevents to guarantee the value correctness
* on parent's and all ancestor levels.
*/
- memcg_flush_percpu_vmstats(memcg, false);
+ memcg_flush_percpu_vmstats(memcg);
memcg_flush_percpu_vmevents(memcg);
__mem_cgroup_free(memcg);
}
_
Patches currently in -mm which might be from guro(a)fb.com are
The patch titled
Subject: mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment
has been removed from the -mm tree. Its filename was
thp-shmem-fix-conflict-of-above-47bit-hint-address-and-pmd-alignment.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: "Kirill A. Shutemov" <kirill(a)shutemov.name>
Subject: mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment
Shmem/tmpfs tries to provide THP-friendly mappings if huge pages are
enabled. But it doesn't work well with above-47bit hint address.
Normally, the kernel doesn't create userspace mappings above 47-bit, even
if the machine allows this (such as with 5-level paging on x86-64). Not
all user space is ready to handle wide addresses. It's known that at
least some JIT compilers use higher bits in pointers to encode their
information.
Userspace can ask for allocation from full address space by specifying
hint address (with or without MAP_FIXED) above 47-bits. If the
application doesn't need a particular address, but wants to allocate from
whole address space it can specify -1 as a hint address.
Unfortunately, this trick breaks THP alignment in shmem/tmp:
shmem_get_unmapped_area() would not try to allocate PMD-aligned area if
*any* hint address specified.
This can be fixed by requesting the aligned area if the we failed to
allocated at user-specified hint address. The request with inflated
length will also take the user-specified hint address. This way we will
not lose an allocation request from the full address space.
[kirill(a)shutemov.name: fold in a fixup]
Link: http://lkml.kernel.org/r/20191223231309.t6bh5hkbmokihpfu@box
Link: http://lkml.kernel.org/r/20191220142548.7118-3-kirill.shutemov@linux.intel.…
Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: "Willhalm, Thomas" <thomas.willhalm(a)intel.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: "Bruggeman, Otto G" <otto.g.bruggeman(a)intel.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/shmem.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
--- a/mm/shmem.c~thp-shmem-fix-conflict-of-above-47bit-hint-address-and-pmd-alignment
+++ a/mm/shmem.c
@@ -2107,9 +2107,10 @@ unsigned long shmem_get_unmapped_area(st
/*
* Our priority is to support MAP_SHARED mapped hugely;
* and support MAP_PRIVATE mapped hugely too, until it is COWed.
- * But if caller specified an address hint, respect that as before.
+ * But if caller specified an address hint and we allocated area there
+ * successfully, respect that as before.
*/
- if (uaddr)
+ if (uaddr == addr)
return addr;
if (shmem_huge != SHMEM_HUGE_FORCE) {
@@ -2143,7 +2144,7 @@ unsigned long shmem_get_unmapped_area(st
if (inflated_len < len)
return addr;
- inflated_addr = get_area(NULL, 0, inflated_len, 0, flags);
+ inflated_addr = get_area(NULL, uaddr, inflated_len, 0, flags);
if (IS_ERR_VALUE(inflated_addr))
return addr;
if (inflated_addr & ~PAGE_MASK)
_
Patches currently in -mm which might be from kirill(a)shutemov.name are
mm-page_alloc-skip-non-present-sections-on-zone-initialization.patch
The patch titled
Subject: mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment
has been removed from the -mm tree. Its filename was
thp-fix-conflict-of-above-47bit-hint-address-and-pmd-alignment.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: "Kirill A. Shutemov" <kirill(a)shutemov.name>
Subject: mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment
Patch series "Fix two above-47bit hint address vs. THP bugs".
The two get_unmapped_area() implementations have to be fixed to provide
THP-friendly mappings if above-47bit hint address is specified.
This patch (of 2):
Filesystems use thp_get_unmapped_area() to provide THP-friendly mappings.
For DAX in particular.
Normally, the kernel doesn't create userspace mappings above 47-bit, even
if the machine allows this (such as with 5-level paging on x86-64). Not
all user space is ready to handle wide addresses. It's known that at
least some JIT compilers use higher bits in pointers to encode their
information.
Userspace can ask for allocation from full address space by specifying
hint address (with or without MAP_FIXED) above 47-bits. If the
application doesn't need a particular address, but wants to allocate from
whole address space it can specify -1 as a hint address.
Unfortunately, this trick breaks thp_get_unmapped_area(): the function
would not try to allocate PMD-aligned area if *any* hint address
specified.
Modify the routine to handle it correctly:
- Try to allocate the space at the specified hint address with length
padding required for PMD alignment.
- If failed, retry without length padding (but with the same hint
address);
- If the returned address matches the hint address return it.
- Otherwise, align the address as required for THP and return.
The user specified hint address is passed down to get_unmapped_area() so
above-47bit hint address will be taken into account without breaking
alignment requirements.
Link: http://lkml.kernel.org/r/20191220142548.7118-2-kirill.shutemov@linux.intel.…
Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reported-by: Thomas Willhalm <thomas.willhalm(a)intel.com>
Tested-by: Dan Williams <dan.j.williams(a)intel.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: "Bruggeman, Otto G" <otto.g.bruggeman(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 38 ++++++++++++++++++++++++--------------
1 file changed, 24 insertions(+), 14 deletions(-)
--- a/mm/huge_memory.c~thp-fix-conflict-of-above-47bit-hint-address-and-pmd-alignment
+++ a/mm/huge_memory.c
@@ -527,13 +527,13 @@ void prep_transhuge_page(struct page *pa
set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
}
-static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
+static unsigned long __thp_get_unmapped_area(struct file *filp,
+ unsigned long addr, unsigned long len,
loff_t off, unsigned long flags, unsigned long size)
{
- unsigned long addr;
loff_t off_end = off + len;
loff_t off_align = round_up(off, size);
- unsigned long len_pad;
+ unsigned long len_pad, ret;
if (off_end <= off_align || (off_end - off_align) < size)
return 0;
@@ -542,30 +542,40 @@ static unsigned long __thp_get_unmapped_
if (len_pad < len || (off + len_pad) < off)
return 0;
- addr = current->mm->get_unmapped_area(filp, 0, len_pad,
+ ret = current->mm->get_unmapped_area(filp, addr, len_pad,
off >> PAGE_SHIFT, flags);
- if (IS_ERR_VALUE(addr))
+
+ /*
+ * The failure might be due to length padding. The caller will retry
+ * without the padding.
+ */
+ if (IS_ERR_VALUE(ret))
return 0;
- addr += (off - addr) & (size - 1);
- return addr;
+ /*
+ * Do not try to align to THP boundary if allocation at the address
+ * hint succeeds.
+ */
+ if (ret == addr)
+ return addr;
+
+ ret += (off - ret) & (size - 1);
+ return ret;
}
unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags)
{
+ unsigned long ret;
loff_t off = (loff_t)pgoff << PAGE_SHIFT;
- if (addr)
- goto out;
if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
goto out;
- addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
- if (addr)
- return addr;
-
- out:
+ ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE);
+ if (ret)
+ return ret;
+out:
return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
}
EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
_
Patches currently in -mm which might be from kirill(a)shutemov.name are
mm-page_alloc-skip-non-present-sections-on-zone-initialization.patch