LLVM implemented a recent "libcall optimization" that lowers calls to
`sprintf(dest, "%s", str)` where the return value is used to
`stpcpy(dest, str) - dest`. This generally avoids the machinery involved
in parsing format strings. `stpcpy` is just like `strcpy` except it
returns the pointer to the new tail of `dest`. This optimization was
introduced into clang-12.
Implement this so that we don't observe linkage failures due to missing
symbol definitions for `stpcpy`.
Similar to last year's fire drill with:
commit 5f074f3e192f ("lib/string.c: implement a basic bcmp")
The kernel is somewhere between a "freestanding" environment (no full libc)
and "hosted" environment (many symbols from libc exist with the same
type, function signature, and semantics).
As H. Peter Anvin notes, there's not really a great way to inform the
compiler that you're targeting a freestanding environment but would like
to opt-in to some libcall optimizations (see pr/47280 below), rather than
opt-out.
Arvind notes, -fno-builtin-* behaves slightly differently between GCC
and Clang, and Clang is missing many __builtin_* definitions, which I
consider a bug in Clang and am working on fixing.
Masahiro summarizes the subtle distinction between compilers justly:
To prevent transformation from foo() into bar(), there are two ways in
Clang to do that; -fno-builtin-foo, and -fno-builtin-bar. There is
only one in GCC; -fno-buitin-foo.
(Any difference in that behavior in Clang is likely a bug from a missing
__builtin_* definition.)
Masahiro also notes:
We want to disable optimization from foo() to bar(),
but we may still benefit from the optimization from
foo() into something else. If GCC implements the same transform, we
would run into a problem because it is not -fno-builtin-bar, but
-fno-builtin-foo that disables that optimization.
In this regard, -fno-builtin-foo would be more future-proof than
-fno-built-bar, but -fno-builtin-foo is still potentially overkill. We
may want to prevent calls from foo() being optimized into calls to
bar(), but we still may want other optimization on calls to foo().
It seems that compilers today don't quite provide the fine grain control
over which libcall optimizations pseudo-freestanding environments would
prefer.
Finally, Kees notes that this interface is unsafe, so we should not
encourage its use. As such, I've removed the declaration from any
header, but it still needs to be exported to avoid linkage errors in
modules.
Cc: stable(a)vger.kernel.org
Link: https://bugs.llvm.org/show_bug.cgi?id=47162
Link: https://bugs.llvm.org/show_bug.cgi?id=47280
Link: https://github.com/ClangBuiltLinux/linux/issues/1126
Link: https://man7.org/linux/man-pages/man3/stpcpy.3.html
Link: https://pubs.opengroup.org/onlinepubs/9699919799/functions/stpcpy.html
Link: https://reviews.llvm.org/D85963
Suggested-by: Andy Lavr <andy.lavr(a)gmail.com>
Suggested-by: Arvind Sankar <nivedita(a)alum.mit.edu>
Suggested-by: Joe Perches <joe(a)perches.com>
Suggested-by: Masahiro Yamada <masahiroy(a)kernel.org>
Suggested-by: Rasmus Villemoes <linux(a)rasmusvillemoes.dk>
Reported-by: Sami Tolvanen <samitolvanen(a)google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers(a)google.com>
---
Changes V3:
* Drop Sami's Tested by tag; newer patch.
* Add EXPORT_SYMBOL as per Andy.
* Rewrite commit message, rewrote part of what Masahiro said to be
generic in terms of foo() and bar().
* Prefer %NUL-terminated to NULL terminated. NUL is the ASCII character
'\0', as per Arvind and Rasmus.
Changes V2:
* Added Sami's Tested by; though the patch changed implementation, the
missing symbol at link time was the problem Sami was observing.
* Fix __restrict -> __restrict__ typo as per Joe.
* Drop note about restrict from commit message as per Arvind.
* Fix NULL -> NUL as per Arvind; NUL is ASCII '\0'. TIL
* Fix off by one error as per Arvind; I had another off by one error in
my test program that was masking this.
lib/string.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/lib/string.c b/lib/string.c
index 6012c385fb31..6bd0cf0fb009 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -272,6 +272,30 @@ ssize_t strscpy_pad(char *dest, const char *src, size_t count)
}
EXPORT_SYMBOL(strscpy_pad);
+/**
+ * stpcpy - copy a string from src to dest returning a pointer to the new end
+ * of dest, including src's %NUL-terminator. May overrun dest.
+ * @dest: pointer to end of string being copied into. Must be large enough
+ * to receive copy.
+ * @src: pointer to the beginning of string being copied from. Must not overlap
+ * dest.
+ *
+ * stpcpy differs from strcpy in a key way: the return value is the new
+ * %NUL-terminated character. (for strcpy, the return value is a pointer to
+ * src. This interface is considered unsafe as it doesn't perform bounds
+ * checking of the inputs. As such it's not recommended for usage. Instead,
+ * its definition is provided in case the compiler lowers other libcalls to
+ * stpcpy.
+ */
+char *stpcpy(char *__restrict__ dest, const char *__restrict__ src);
+char *stpcpy(char *__restrict__ dest, const char *__restrict__ src)
+{
+ while ((*dest++ = *src++) != '\0')
+ /* nothing */;
+ return --dest;
+}
+EXPORT_SYMBOL(stpcpy);
+
#ifndef __HAVE_ARCH_STRCAT
/**
* strcat - Append one %NUL-terminated string to another
--
2.28.0.297.g1956fa8f8d-goog
From: SeongJae Park <sjpark(a)amazon.de>
Hello,
We found below 27 commits in the 'v5.5..linus/master (upstream)' seems fixing or mentioning
commits in the 'v5.4..stable/linux-5.4.y (downstream)' but are not merged in the 'downstream' yet.
Could you please review if those need to be merged in?
A commit is considered as fix of another if the complete 'Fixed:' tag is in the
commit message. If the tag is not found but the commit message contains the
title or the hash id of the other commit, it is considered mentioning it. So,
the 'mentions' might have many false positives, but it could cover the typos (I
found such cases before).
The commits are grouped as 'fixes cleanly applicable', 'fixes not cleanly
applicable (need manual backporting to be applied)', 'mentions cleanly
applicable', and 'mentions not cleanly applicable'. Also, the commits in each
group are sorted by the commit dates (oldest first).
Both the finding of the commits and the writeup of this report is automatically
done by a little script[1]. I'm going to run the tool and post this kind of
report every couple of weeks or every month. Any comment (e.g., regarding
posting period, new features request, bug report, ...) is welcome.
Especially, if you find some commits that don't need to be merged in the
downstream, please let me know so that I can mark those as unnecessary and
don't bother you again.
[1] https://github.com/sjp38/stream-track
Thanks,
SeongJae
# v5.5: 4e3112a240ba9986cc3f67a6880da6529a955006
# linus/master: 15bc20c6af4ceee97a1f90b43c0e386643c071b4
# v5.4: 6e815efe19a99a33b16cc720c3d3a727565a4fa1
# stable/linux-5.4.y: 6576d69aac94cd8409636dfa86e0df39facdf0d2
Fixes cleanly applicable
------------------------
2fb75ceaf71a ("remoteproc: Add missing '\n' in log messages")
# commit date: 2020-04-22, author: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
# fixes 'remoteproc: Fix NULL pointer dereference in rproc_virtio_notify'
1b9ae0c92925 ("wireless: Use linux/stddef.h instead of stddef.h")
# commit date: 2020-05-27, author: Hauke Mehrtens <hauke(a)hauke-m.de>
# fixes 'wireless: Use offsetof instead of custom macro.'
e4b0e41fee94 ("ALSA: usb-audio: Use the new macro for HP Dock rename quirks")
# commit date: 2020-06-08, author: Takashi Iwai <tiwai(a)suse.de>
# fixes 'ALSA: usb-audio: Add vendor, product and profile name for HP Thunderbolt Dock'
efb94790852a ("drm/panel-simple: fix connector type for LogicPD Type28 Display")
# commit date: 2020-06-21, author: Adam Ford <aford173(a)gmail.com>
# fixes 'drm/panel: simple: Add Logic PD Type 28 display support'
2f57b8d57673 ("dmaengine: imx-sdma: Fix: Remove 'always true' comparison")
# commit date: 2020-06-24, author: Fabio Estevam <festevam(a)gmail.com>
# fixes 'dmaengine: imx-sdma: Fix the event id check to include RX event for UART6'
10de795a5add ("kprobes: Fix compiler warning for !CONFIG_KPROBES_ON_FTRACE")
# commit date: 2020-08-06, author: Muchun Song <songmuchun(a)bytedance.com>
# fixes 'kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler'
Fixes not cleanly applicable
----------------------------
3907ccfaec5d ("crypto: atmel-aes - Fix CTR counter overflow when multiple fragments")
# commit date: 2019-12-20, author: Tudor Ambarus <tudor.ambarus(a)microchip.com>
# fixes 'crypto: atmel-aes - Fix counter overflow in CTR mode'
9210c075cef2 ("nvme-pci: avoid race between nvme_reap_pending_cqes() and nvme_poll()")
# commit date: 2020-05-27, author: Dongli Zhang <dongli.zhang(a)oracle.com>
# fixes 'nvme/pci: move cqe check after device shutdown'
6e2f83884c09 ("bnxt_en: Fix AER reset logic on 57500 chips.")
# commit date: 2020-06-15, author: Michael Chan <michael.chan(a)broadcom.com>
# fixes 'bnxt_en: Improve AER slot reset.'
695cf5ab401c ("ALSA: usb-audio: Fix packet size calculation")
# commit date: 2020-06-30, author: Alexander Tsoy <alexander(a)tsoy.me>
# fixes 'ALSA: usb-audio: Improve frames size computation'
2fb2799a2abb ("net: rmnet: do not allow to add multiple bridge interfaces")
# commit date: 2020-07-04, author: Taehee Yoo <ap420073(a)gmail.com>
# fixes 'net: rmnet: use upper/lower device infrastructure'
Mentions cleanly applicable
---------------------------
32ada3b9e04c ("x86/resctrl: Clean up unused function parameter in mkdir path")
# commit date: 2020-01-20, author: Xiaochen Shen <xiaochen.shen(a)intel.com>
# mentions 'x86/resctrl: Fix a deadlock due to inaccurate reference'
20f513091caf ("crypto: ccree - remove set but not used variable 'du_size'")
# commit date: 2020-02-13, author: YueHaibing <yuehaibing(a)huawei.com>
# mentions 'crypto: ccree - fix FDE descriptor sequence'
b182a66792fe ("net: ena: remove set but not used variable 'hash_key'")
# commit date: 2020-02-17, author: YueHaibing <yuehaibing(a)huawei.com>
# mentions 'net: ena: rss: do not allocate key when not supported'
cbb5494ebce5 ("Revert "thunderbolt: Prevent crash if non-active NVMem file is read"")
# commit date: 2020-04-16, author: Nicholas Johnson <nicholas.johnson-opensource(a)outlook.com.au>
# mentions 'thunderbolt: Prevent crash if non-active NVMem file is read'
1a33e10e4a95 ("net: partially revert dynamic lockdep key changes")
# commit date: 2020-05-04, author: Cong Wang <xiyou.wangcong(a)gmail.com>
# mentions 'bonding: add missing netdev_update_lockdep_key()'
7e579f3a074c ("tools arch x86 uapi: Synch asm/unistd.h with the kernel sources")
# commit date: 2020-06-09, author: Arnaldo Carvalho de Melo <acme(a)redhat.com>
# mentions 'x86/syscalls: Revert "x86/syscalls: Make __X32_SYSCALL_BIT be unsigned long"'
c3dbe541ef77 ("blktrace: Avoid sparse warnings when assigning q->blk_trace")
# commit date: 2020-06-17, author: Jan Kara <jack(a)suse.cz>
# mentions 'blktrace: Protect q->blk_trace with RCU'
6d548b9e5d56 ("btrfs: fix reclaim_size counter leak after stealing from global reserve")
# commit date: 2020-07-02, author: Filipe Manana <fdmanana(a)suse.com>
# mentions 'btrfs: improve global reserve stealing logic'
Mentions not cleanly applicable
-------------------------------
6dbd54e4154d ("Revert "tty/serial: atmel: fix out of range clock divider handling"")
# commit date: 2019-12-18, author: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
# mentions 'tty/serial: atmel: fix out of range clock divider handling'
48d7fb181a91 ("drm/i915: Remove lite restore defines")
# commit date: 2020-02-08, author: Mika Kuoppala <mika.kuoppala(a)linux.intel.com>
# mentions 'drm/i915/gt: Detect if we miss WaIdleLiteRestore'
ff7e06a55676 ("ALSA: pcm: oss: Fix regression by buffer overflow fix (again)")
# commit date: 2020-04-03, author: Takashi Iwai <tiwai(a)suse.de>
# mentions 'ALSA: pcm: oss: Avoid plugin buffer overflow'
ac957e8c5411 ("ALSA: pcm: oss: Place the plugin buffer overflow checks correctly (for 5.7)")
# commit date: 2020-04-24, author: Takashi Iwai <tiwai(a)suse.de>
# mentions 'ALSA: pcm: oss: Avoid plugin buffer overflow'
e7511f560f54 ("bonding: remove useless stats_lock_key")
# commit date: 2020-05-04, author: Cong Wang <xiyou.wangcong(a)gmail.com>
# mentions 'bonding: fix lockdep warning in bond_get_stats()'
39f23ce07b93 ("sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list")
# commit date: 2020-05-19, author: Vincent Guittot <vincent.guittot(a)linaro.org>
# mentions 'sched/fair: Fix enqueue_task_fair warning'
8ab3a3812aa9 ("drm/i915/gt: Incrementally check for rewinding")
# commit date: 2020-06-16, author: Chris Wilson <chris(a)chris-wilson.co.uk>
# mentions 'drm/i915/execlists: Always force a context reload when rewinding RING_TAIL'
5e548b32018d ("btrfs: do not set the full sync flag on the inode during page release")
# commit date: 2020-07-27, author: Filipe Manana <fdmanana(a)suse.com>
# mentions 'btrfs: fix race between page release and a fast fsync'
Trying to clear DR7 around a #DB from usermode malfunctions if we
schedule when delivering SIGTRAP. Rather than trying to define a
special no-recursion region, just allow a single level of recursion.
We do the same thing for NMI, and it hasn't caused any problems yet.
Reported-by: Kyle Huey <me(a)kylehuey.com>
Debugged-by: Josh Poimboeuf <jpoimboe(a)redhat.com>
Fixes: 9f58fdde95c9 ("x86/db: Split out dr6/7 handling")
Cc: stable(a)vger.kernel.org
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
---
arch/x86/kernel/traps.c | 69 +++++++++++++++++++++++------------------
1 file changed, 38 insertions(+), 31 deletions(-)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 1f66d2d1e998..bf852b72f15c 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -729,20 +729,9 @@ static bool is_sysenter_singlestep(struct pt_regs *regs)
#endif
}
-static __always_inline void debug_enter(unsigned long *dr6, unsigned long *dr7)
+static __always_inline unsigned long debug_read_clear_dr6(void)
{
- /*
- * Disable breakpoints during exception handling; recursive exceptions
- * are exceedingly 'fun'.
- *
- * Since this function is NOKPROBE, and that also applies to
- * HW_BREAKPOINT_X, we can't hit a breakpoint before this (XXX except a
- * HW_BREAKPOINT_W on our stack)
- *
- * Entry text is excluded for HW_BP_X and cpu_entry_area, which
- * includes the entry stack is excluded for everything.
- */
- *dr7 = local_db_save();
+ unsigned long dr6;
/*
* The Intel SDM says:
@@ -755,15 +744,21 @@ static __always_inline void debug_enter(unsigned long *dr6, unsigned long *dr7)
*
* Keep it simple: clear DR6 immediately.
*/
- get_debugreg(*dr6, 6);
+ get_debugreg(dr6, 6);
set_debugreg(0, 6);
/* Filter out all the reserved bits which are preset to 1 */
- *dr6 &= ~DR6_RESERVED;
+ dr6 &= ~DR6_RESERVED;
+
+ return dr6;
+}
+
+static __always_inline void debug_enter(unsigned long *dr6, unsigned long *dr7)
+{
+ *dr6 = debug_read_clear_dr6();
}
static __always_inline void debug_exit(unsigned long dr7)
{
- local_db_restore(dr7);
}
/*
@@ -863,6 +858,19 @@ static void handle_debug(struct pt_regs *regs, unsigned long dr6, bool user)
static __always_inline void exc_debug_kernel(struct pt_regs *regs,
unsigned long dr6)
{
+ /*
+ * Disable breakpoints during exception handling; recursive exceptions
+ * are exceedingly 'fun'.
+ *
+ * Since this function is NOKPROBE, and that also applies to
+ * HW_BREAKPOINT_X, we can't hit a breakpoint before this (XXX except a
+ * HW_BREAKPOINT_W on our stack)
+ *
+ * Entry text is excluded for HW_BP_X and cpu_entry_area, which
+ * includes the entry stack is excluded for everything.
+ */
+ unsigned long dr7 = local_db_save();
+
bool irq_state = idtentry_enter_nmi(regs);
instrumentation_begin();
@@ -883,6 +891,8 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
instrumentation_end();
idtentry_exit_nmi(regs, irq_state);
+
+ local_db_restore(dr7);
}
static __always_inline void exc_debug_user(struct pt_regs *regs,
@@ -894,6 +904,15 @@ static __always_inline void exc_debug_user(struct pt_regs *regs,
*/
WARN_ON_ONCE(!user_mode(regs));
+ /*
+ * NB: We can't easily clear DR7 here because
+ * idtentry_exit_to_usermode() can invoke ptrace, schedule, access
+ * user memory, etc. This means that a recursive #DB is possible. If
+ * this happens, that #DB will hit exc_debug_kernel() and clear DR7.
+ * Since we're not on the IST stack right now, everything will be
+ * fine.
+ */
+
irqentry_enter_from_user_mode(regs);
instrumentation_begin();
@@ -907,36 +926,24 @@ static __always_inline void exc_debug_user(struct pt_regs *regs,
/* IST stack entry */
DEFINE_IDTENTRY_DEBUG(exc_debug)
{
- unsigned long dr6, dr7;
-
- debug_enter(&dr6, &dr7);
- exc_debug_kernel(regs, dr6);
- debug_exit(dr7);
+ exc_debug_kernel(regs, debug_read_clear_dr6());
}
/* User entry, runs on regular task stack */
DEFINE_IDTENTRY_DEBUG_USER(exc_debug)
{
- unsigned long dr6, dr7;
-
- debug_enter(&dr6, &dr7);
- exc_debug_user(regs, dr6);
- debug_exit(dr7);
+ exc_debug_user(regs, debug_read_clear_dr6());
}
#else
/* 32 bit does not have separate entry points. */
DEFINE_IDTENTRY_RAW(exc_debug)
{
- unsigned long dr6, dr7;
-
- debug_enter(&dr6, &dr7);
+ unsigned long dr6 = debug_read_clear_dr6();
if (user_mode(regs))
exc_debug_user(regs, dr6);
else
exc_debug_kernel(regs, dr6);
-
- debug_exit(dr7);
}
#endif
--
2.25.4
Hi,
I'm adding stable(a)vger.kernel.org
Once again, this time really adding stable.
On 2020-06-03 04:31, Martin K. Petersen wrote:
> On Thu, 28 May 2020 21:31:08 +0200, Bodo Stroesser wrote:
>
>> 1) If remaining ring space before the end of the ring is
>> smaller then the next cmd to write, tcmu writes a padding
>> entry which fills the remaining space at the end of the
>> ring.
>> Then tcmu calls tcmu_flush_dcache_range() with the size
>> of struct tcmu_cmd_entry as data length to flush.
>> If the space filled by the padding was smaller then
>> tcmu_cmd_entry, tcmu_flush_dcache_range() is called for
>> an address range reaching behind the end of the vmalloc'ed
>> ring.
>> tcmu_flush_dcache_range() in a loop calls
>> flush_dcache_page(virt_to_page(start));
>> for every page being part of the range. On x86 the line is
>> optimized out by the compiler, as flush_dcache_page() is
>> empty on x86.
>> But I assume the above can cause trouble on other
>> architectures that really have a flush_dcache_page().
>> For paddings only the header part of an entry is relevant
>> Due to alignment rules the header always fits in the
>> remaining space, if padding is needed.
>> So tcmu_flush_dcache_range() can safely be called with
>> sizeof(entry->hdr) as the length here.
>>
>> [...]
>
> Applied to 5.8/scsi-queue, thanks!
>
> [1/1] scsi: target: tcmu: Fix size in calls to tcmu_flush_dcache_range
> https://git.kernel.org/mkp/scsi/c/8c4e0f212398
>
The full commit of this patch is:
8c4e0f212398cdd1eb4310a5981d06a723cdd24f
This patch is the first of four patches that are necessary to run tcmu
on ARM without crash. For details please see
https://bugzilla.kernel.org/show_bug.cgi?id=208045
Upsteam commits of patches 2,3, and 4 are:
2: 3c58f737231e "scsi: target: tcmu: Optimize use of flush_dcache_page"
3: 3145550a7f8b "scsi: target: tcmu: Fix crash in
tcmu_flush_dcache_range on ARM"
4: 5a0c256d96f0 "scsi: target: tcmu: Fix crash on ARM during cmd
completion"
Since patches 3 and 4 already were accepted for 5.8, 5.4, and 4.19, and
I sent a request to add patch 2 about 1 hour ago, please consider adding
this patch to 5.4 and 4.19, because without it tcmu on ARM will still
crash.
Thank you,
Bodo
The below error is seen in dmesg, while formatting the disks discovered on host.
dmesg:
[ 636.733374] blk_update_request: I/O error, dev nvme4n1, sector 0 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Patch 6 fixes it and there are 5 other dependent patches that also need to be
pulled from upstream to stable, 5.4 and 4.19 branches.
Patch 1 dependent patch
Patch 2 dependent patch
Patch 3 dependent patch
Patch 4 dependent patch
Patch 5 dependent patch
Patch 6 fix patch
Thanks,
Dakshaja
Christoph Hellwig (5):
nvmet: Cleanup discovery execute handlers
nvmet: Introduce common execute function for get_log_page and identify
nvmet: Introduce nvmet_dsm_len() helper
nvmet: Remove the data_len field from the nvmet_req struct
nvmet: Open code nvmet_req_execute()
Sagi Grimberg (1):
nvmet: fix dsm failure when payload does not match sgl descriptor
drivers/nvme/target/admin-cmd.c | 128 +++++++++++++++++-------------
drivers/nvme/target/core.c | 23 ++++--
drivers/nvme/target/discovery.c | 62 +++++++--------
drivers/nvme/target/fabrics-cmd.c | 15 +++-
drivers/nvme/target/fc.c | 4 +-
drivers/nvme/target/io-cmd-bdev.c | 19 +++--
drivers/nvme/target/io-cmd-file.c | 20 +++--
drivers/nvme/target/loop.c | 2 +-
drivers/nvme/target/nvmet.h | 11 ++-
drivers/nvme/target/rdma.c | 4 +-
drivers/nvme/target/tcp.c | 6 +-
11 files changed, 176 insertions(+), 118 deletions(-)
--
2.18.0.232.gb7bd9486b.dirty
Guys,
There is a convoluted deadlock that I just root caused, and that is
fixed by this work (at least based on my code inspection it appears to
be fixed); but the deadlock exists in older and stable kernels, and I
am not sure whether to create a separate patch for it, or backport
this whole thing.
Thread #1: Hot-removes memory
device_offline
memory_subsys_offline
offline_pages
__offline_pages
mem_hotplug_lock <- write access
waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
migrate it.
Thread #2: ccs killer kthread
css_killed_work_fn
cgroup_mutex <- Grab this Mutex
mem_cgroup_css_offline
memcg_offline_kmem.part
memcg_deactivate_kmem_caches
get_online_mems
mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: crashing userland program
do_coredump
elf_core_dump
get_dump_page() -> get page with pfn#9e5113, and increment refcnt
dump_emit
__kernel_write
__vfs_write
new_sync_write
pipe_write
pipe_wait -> waits for Thread #4 systemd-coredump to
read the pipe
Thread #4: systemd-coredump
ksys_read
vfs_read
__vfs_read
seq_read
proc_single_show
proc_cgroup_show
cgroup_mutex -> waits from Thread #2 for this lock.
In Summary:
Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
waits for Thread#1 for mem_hotplug_lock rwlock.
This work appears to fix this deadlock because cgroup_mutex is not
called anymore before mem_hotplug_lock (unless I am missing it), as it
removes memcg_deactivate_kmem_caches.
Thank you,
Pasha
On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro(a)fb.com> wrote:
>
> On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > The existing cgroup slab memory controller is based on the idea of
> > > replicating slab allocator internals for each memory cgroup.
> > > This approach promises a low memory overhead (one pointer per page),
> > > and isn't adding too much code on hot allocation and release paths.
> > > But is has a very serious flaw: it leads to a low slab utilization.
> > >
> > > Using a drgn* script I've got an estimation of slab utilization on
> > > a number of machines running different production workloads. In most
> > > cases it was between 45% and 65%, and the best number I've seen was
> > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > it brings back 30-50% of slab memory. It means that the real price
> > > of the existing slab memory controller is way bigger than a pointer
> > > per page.
> > >
> > > The real reason why the existing design leads to a low slab utilization
> > > is simple: slab pages are used exclusively by one memory cgroup.
> > > If there are only few allocations of certain size made by a cgroup,
> > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > deleted, or the cgroup contains a single-threaded application which is
> > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > in all these cases the resulting slab utilization is very low.
> > > If kmem accounting is off, the kernel is able to use free space
> > > on slab pages for other allocations.
> > >
> > > Arguably it wasn't an issue back to days when the kmem controller was
> > > introduced and was an opt-in feature, which had to be turned on
> > > individually for each memory cgroup. But now it's turned on by default
> > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > create a large number of cgroups.
> > >
> > > This patchset provides a new implementation of the slab memory controller,
> > > which aims to reach a much better slab utilization by sharing slab pages
> > > between multiple memory cgroups. Below is the short description of the new
> > > design (more details in commit messages).
> > >
> > > Accounting is performed per-object instead of per-page. Slab-related
> > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > with rounding up and remembering leftovers.
> > >
> > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > working, instead of saving a pointer to the memory cgroup directly an
> > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > easily changed to the parent) with a built-in reference counter. This scheme
> > > allows to reparent all allocated objects without walking them over and
> > > changing memcg pointer to the parent.
> > >
> > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > two global sets are used: the root set for non-accounted and root-cgroup
> > > allocations and the second set for all other allocations. This allows to
> > > simplify the lifetime management of individual kmem_caches: they are
> > > destroyed with root counterparts. It allows to remove a good amount of code
> > > and make things generally simpler.
> > >
> > > The patchset* has been tested on a number of different workloads in our
> > > production. In all cases it saved significant amount of memory, measured
> > > from high hundreds of MBs to single GBs per host. On average, the size
> > > of slab memory has been reduced by 35-45%.
> >
> > Here are some numbers from multiple runs of sysbench and kernel compilation
> > with this patchset on a 10 core POWER8 host:
> >
> > ==========================================================================
> > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > of a mem cgroup (Sampling every 5s)
> > --------------------------------------------------------------------------
> > 5.5.0-rc7-mm1 +slab patch %reduction
> > --------------------------------------------------------------------------
> > memory.kmem.usage_in_bytes 15859712 4456448 72
> > memory.usage_in_bytes 337510400 335806464 .5
> > Slab: (kB) 814336 607296 25
> >
> > memory.kmem.usage_in_bytes 16187392 4653056 71
> > memory.usage_in_bytes 318832640 300154880 5
> > Slab: (kB) 789888 559744 29
> > --------------------------------------------------------------------------
> >
> >
> > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > done from bash that is in a memory cgroup. (Sampling every 5s)
> > --------------------------------------------------------------------------
> > 5.5.0-rc7-mm1 +slab patch %reduction
> > --------------------------------------------------------------------------
> > memory.kmem.usage_in_bytes 338493440 231931904 31
> > memory.usage_in_bytes 7368015872 6275923968 15
> > Slab: (kB) 1139072 785408 31
> >
> > memory.kmem.usage_in_bytes 341835776 236453888 30
> > memory.usage_in_bytes 6540427264 6072893440 7
> > Slab: (kB) 1074304 761280 29
> >
> > memory.kmem.usage_in_bytes 340525056 233570304 31
> > memory.usage_in_bytes 6406209536 6177357824 3
> > Slab: (kB) 1244288 739712 40
> > --------------------------------------------------------------------------
> >
> > Slab consumption right after boot
> > --------------------------------------------------------------------------
> > 5.5.0-rc7-mm1 +slab patch %reduction
> > --------------------------------------------------------------------------
> > Slab: (kB) 821888 583424 29
> > ==========================================================================
> >
> > Summary:
> >
> > With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows
> > around 70% and 30% reduction consistently.
> >
> > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > kernel compilation.
> >
> > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > same is seen right after boot too.
>
> That's just perfect!
>
> memory.usage_in_bytes was most likely the same because the freed space
> was taken by pagecache.
>
> Thank you very much for testing!
>
> Roman