In one of my rcutorture tests the TSC clocksource got marked unstable
due to a large difference in the TSC value. I'm not sure whether the guest
ran for a long time with interrupts disabled or whether the host was very
busy and didn't schedule the guest for some time.
I took a look at the qemu/KVM options and decided to update them:
- Use kvm{32|64} as the CPU model. We could probably use `host' (like ARM
does) for the maximum set of available features, but since we don't run
any userland I'm not sure whether it makes any difference.
- Drop the "noapic" option and enable the TSC deadline timer. There is no
record of why the APIC was disabled, and I see no reason for it. The
deadline timer is probably "nicer".
- Add config options. They ensure that the kernel knows that it runs as a
KVM guest and can use paravirtualized devices, such as kvm-clock as the
clocksource. kvm-clock was the main motivation here.
- I didn't add a random HW device. It would make the random device become
ready earlier (right now it doesn't complete its initialisation at all),
but I doubt that there is any need for this.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
---
tools/testing/selftests/rcutorture/bin/functions.sh | 13 ++++++++++++-
.../selftests/rcutorture/configs/rcu/CFcommon | 4 ++++
2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/rcutorture/bin/functions.sh b/tools/testing/selftests/rcutorture/bin/functions.sh
index 6bcb8b5b2ff22..be3c5c73d7e79 100644
--- a/tools/testing/selftests/rcutorture/bin/functions.sh
+++ b/tools/testing/selftests/rcutorture/bin/functions.sh
@@ -172,7 +172,7 @@ identify_qemu_append () {
local console=ttyS0
case "$1" in
qemu-system-x86_64|qemu-system-i386)
- echo noapic selinux=0 initcall_debug debug
+ echo selinux=0 initcall_debug debug
;;
qemu-system-aarch64)
console=ttyAMA0
@@ -191,8 +191,19 @@ identify_qemu_append () {
# Output arguments for qemu arguments based on the TORTURE_QEMU_MAC
# and TORTURE_QEMU_INTERACTIVE environment variables.
identify_qemu_args () {
+ local KVM_CPU=""
+ case "$1" in
+ qemu-system-x86_64)
+ KVM_CPU=kvm64
+ ;;
+ qemu-system-i386)
+ KVM_CPU=kvm32
+ ;;
+ esac
case "$1" in
qemu-system-x86_64|qemu-system-i386)
+ echo -machine q35,accel=kvm
+ echo -cpu ${KVM_CPU},x2apic=on,tsc-deadline=on,hypervisor=on,tsc_adjust=on
;;
qemu-system-aarch64)
echo -machine virt,gic-version=host -cpu host
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/CFcommon b/tools/testing/selftests/rcutorture/configs/rcu/CFcommon
index d2d2a86139db1..322d5d40443cd 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/CFcommon
+++ b/tools/testing/selftests/rcutorture/configs/rcu/CFcommon
@@ -1,2 +1,6 @@
CONFIG_RCU_TORTURE_TEST=y
CONFIG_PRINTK_TIME=y
+CONFIG_HYPERVISOR_GUEST=y
+CONFIG_PARAVIRT=y
+CONFIG_PARAVIRT_SPINLOCKS=y
+CONFIG_KVM_GUEST=y
--
2.20.1
From: Peter Zijlstra <peterz(a)infradead.org>
Nicolai Stange discovered[1] that if live kernel patching is enabled, and
the function tracer starts tracing the same function that was patched, the
conversion of the fentry call site from calling the live kernel patch
trampoline to calling the iterator trampoline would have a slight window
where it didn't call anything.
Live kernel patching depends on ftrace always calling its code (to
prevent the function being traced from being called directly, as ftrace
redirects it), so this small window would allow the old buggy function to
be called, and that can cause undesirable results.
Nicolai submitted new patches[2], but these were controversial, as the
problem is similar to the static call emulation issues that came up a
while ago[3]. After some debate[4][5], the approach taken here is to add
a gap in the stack when entering the breakpoint handler, which allows
pushing a return address onto the stack to easily emulate a call.
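As a rough illustration of the mechanism (a minimal sketch modeled on the
int3_emulate_*() helpers this approach introduces; the exact upstream code
may differ):

/*
 * The #BP entry code reserves a gap below the iret frame, so the
 * handler may grow the interrupted context's stack by one word in
 * order to fake a CALL instruction.
 */
static __always_inline
void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
{
	/* Resume execution at the emulated branch target */
	regs->ip = ip;
}

static __always_inline
void int3_emulate_push(struct pt_regs *regs, unsigned long val)
{
	/* Use the gap created when entering the #BP handler */
	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = val;
}

static __always_inline
void int3_emulate_call(struct pt_regs *regs, unsigned long func)
{
	/* Return address: the byte after the original 5-byte call */
	int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
	int3_emulate_jmp(regs, func);
}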
[1] http://lkml.kernel.org/r/20180726104029.7736-1-nstange@suse.de
[2] http://lkml.kernel.org/r/20190427100639.15074-1-nstange@suse.de
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.154171145…
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=…
[5] http://lkml.kernel.org/r/CAHk-=wjvQxY4DvPrJ6haPgAa6b906h=MwZXO6G8OtiTGe=N7_…
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Nicolai Stange <nstange(a)suse.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: the arch/x86 maintainers <x86(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Jiri Kosina <jikos(a)kernel.org>
Cc: Miroslav Benes <mbenes(a)suse.cz>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: Joe Lawrence <joe.lawrence(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nayna Jain <nayna(a)linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest(a)vger.kernel.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Signed-off-by: *** Need SoB From Peter Zijlstra ***
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
---
arch/x86/kernel/ftrace.c | 25 ++++++++++++++++++++-----
1 file changed, 20 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..fd152f5a937b 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
#include <asm/kprobes.h>
#include <asm/ftrace.h>
#include <asm/nops.h>
+#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
}
static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new)
{
@@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -294,13 +298,21 @@ int ftrace_int3_handler(struct pt_regs *regs)
if (WARN_ON_ONCE(!regs))
return 0;
- ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
- return 0;
+ ip = regs->ip - INT3_INSN_SIZE;
- regs->ip += MCOUNT_INSN_SIZE - 1;
+ if (ftrace_location(ip)) {
+ int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+ return 1;
+ } else if (is_ftrace_caller(ip)) {
+ if (!ftrace_update_func_call) {
+ int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+ return 1;
+ }
+ int3_emulate_call(regs, ftrace_update_func_call);
+ return 1;
+ }
- return 1;
+ return 0;
}
NOKPROBE_SYMBOL(ftrace_int3_handler);
@@ -859,6 +871,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -960,6 +974,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
{
unsigned char *new;
+ ftrace_update_func_call = 0UL;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
--
2.20.1
On Mon, Apr 29, 2019 at 12:13 PM Linus Torvalds
<torvalds(a)linux-foundation.org> wrote:
>
>
>
> On Mon, Apr 29, 2019, 12:02 Linus Torvalds <torvalds(a)linux-foundation.org> wrote:
>>
>>
>>
>> If nmi were to break it, it would be a cpu bug.
>
>
> Side note: we *already* depend on sti shadow working in other parts of the kernel, namely sti->iret.
>
Where? STI; IRET would be nuts.
Before:
commit 4214a16b02971c60960afd675d03544e109e0d75
Author: Andy Lutomirski <luto(a)kernel.org>
Date: Thu Apr 2 17:12:12 2015 -0700
x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER
we did sti; sysexit, but, when we discussed this, I don't recall anyone
speaking up in favor of the safety of the old code.
Not to mention that the crash we'll get if we get an NMI and a
rescheduling interrupt in this path will be very, very hard to debug.
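For context, the kernel does already rely on the STI shadow elsewhere: the
safe-halt idiom only works because interrupt delivery is deferred for the
one instruction after sti (a sketch of native_safe_halt(), shown here for
illustration):

/*
 * The STI shadow defers interrupt delivery until after the *next*
 * instruction, so no interrupt can be taken between "sti" and "hlt".
 * Without the shadow, an interrupt could fire in between and the CPU
 * would then halt with nothing left to wake it up.
 */
static inline void native_safe_halt(void)
{
	asm volatile("sti; hlt" : : : "memory");
}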
From: "Steven Rostedt (VMware)" <rostedt(a)goodmis.org>
Nicolai Stange discovered[1] that if live kernel patching is enabled, and
the function tracer starts tracing the same function that was patched, the
conversion of the fentry call site from calling the live kernel patch
trampoline to calling the iterator trampoline would have a slight window
where it didn't call anything.
Live kernel patching depends on ftrace always calling its code (to
prevent the function being traced from being called directly, as ftrace
redirects it), so this small window would allow the old buggy function to
be called, and that can cause undesirable results.
Nicolai submitted new patches[2], but these were controversial. As this
is similar to the static call emulation issues that came up a while
ago[3], Linus suggested using per-CPU data along with special
trampolines[4] to emulate the calls.
Linus's solution was for text poking (which is mostly what the
static_call code did), but as ftrace has its own mechanism, it required
doing its own thing.
Having ftrace use its own per-CPU data and its own set of specialized
trampolines solves the issue of missed calls that live kernel patching
suffers from.
[1] http://lkml.kernel.org/r/20180726104029.7736-1-nstange@suse.de
[2] http://lkml.kernel.org/r/20190427100639.15074-1-nstange@suse.de
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.154171145…
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=…
Inspired-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: stable(a)vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
---
Changes since v2:
- Moved inline asm to ftrace_64.S
- Used PER_CPU_VAR() and TRACE_IRQS_ON macros in assembly
- Renamed trampolines to be a little more coherent
- Created assembly version of STACK_FRAME_NON_STANDARD() in asm/frame.h
- Call ftrace_regs_caller instead of ftrace_caller
- No longer support 32 bit (it crashed badly on tests, and I couldn't figure it out)
arch/x86/include/asm/frame.h | 15 ++++++
arch/x86/kernel/ftrace.c | 102 ++++++++++++++++++++++++++++++++++-
arch/x86/kernel/ftrace_64.S | 56 +++++++++++++++++++
3 files changed, 172 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/frame.h b/arch/x86/include/asm/frame.h
index 5cbce6fbb534..04892b374b93 100644
--- a/arch/x86/include/asm/frame.h
+++ b/arch/x86/include/asm/frame.h
@@ -42,4 +42,19 @@
#endif /* CONFIG_FRAME_POINTER */
+#ifdef __ASSEMBLY__
+#ifdef CONFIG_STACK_VALIDATION
+#define STACK_FRAME_NON_STANDARD(func) \
+.section .discard.func_stack_frame_non_standard,"aw"; \
+.align 8; \
+.type __func_stack_frame_non_standard_##func, @object; \
+.size __func_stack_frame_non_standard_##func, 8; \
+__func_stack_frame_non_standard_##func: \
+.quad func; \
+.previous
+#else /* !CONFIG_STACK_VALIDATION */
+#define STACK_FRAME_NON_STANDARD(func)
+#endif /* CONFIG_STACK_VALIDATION */
+#endif /* __ASSEMBLY__ */
+
#endif /* _ASM_X86_FRAME_H */
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..634fc0d4fe97 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -232,6 +232,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
static unsigned long ftrace_update_func;
+/* Used within inline asm below */
+unsigned long ftrace_update_func_call;
+
static int update_ftrace_func(unsigned long ip, void *new)
{
unsigned char old[MCOUNT_INSN_SIZE];
@@ -259,6 +262,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -280,6 +285,46 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip)
return 0;
}
+#ifdef CONFIG_X86_64
+/*
+ * We need to handle the "call func1" -> "call func2" case.
+ * Just skipping the call is not sufficient as it will be like
+ * changing to "nop" first and then updating the call. But some
+ * users of ftrace require calls never to be missed.
+ *
+ * To emulate the call while converting the call site with a breakpoint,
+ * some trampolines are used along with per CPU buffers.
+ * There are three trampolines for the call sites and three trampolines
+ * for the updating of the call in ftrace trampoline. The three
+ * trampolines are:
+ *
+ * 1) Interrupts are enabled when the breakpoint is hit
+ * 2) Interrupts are disabled when the breakpoint is hit
+ * 3) The breakpoint was hit in an NMI
+ *
+ * As per CPU data is used, interrupts must be disabled to prevent them
+ * from corrupting the data. A separate NMI trampoline is used for the
+ * NMI case. If interrupts are already disabled, then the return
+ * address of where the breakpoint was hit (saved in the per CPU data)
+ * is pushed on the stack and then a jump is made to either
+ * ftrace_caller (which will loop through all registered ftrace_ops
+ * handlers depending on the ip address) or, if it is an ftrace
+ * trampoline call update, to the function held in
+ * ftrace_update_func_call.
+ */
+extern asmlinkage void ftrace_emulate_call_sti(void);
+extern asmlinkage void ftrace_emulate_call(void);
+extern asmlinkage void ftrace_emulate_call_nmi(void);
+extern asmlinkage void ftrace_emulate_call_update_sti(void);
+extern asmlinkage void ftrace_emulate_call_update(void);
+extern asmlinkage void ftrace_emulate_call_update_nmi(void);
+
+DEFINE_PER_CPU(void *, ftrace_bp_call_return);
+DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return);
+
+/* To hold the ftrace_regs_caller address to push on the stack */
+void *ftrace_caller_func = (void *)ftrace_regs_caller;
+
/*
* A breakpoint was added to the code address we are about to
* modify, and this is the handle that will just skip over it.
@@ -291,6 +336,58 @@ int ftrace_int3_handler(struct pt_regs *regs)
{
unsigned long ip;
+ if (WARN_ON_ONCE(!regs))
+ return 0;
+
+ ip = regs->ip - 1;
+ if (ftrace_location(ip)) {
+ /* A breakpoint at the beginning of the function was hit */
+ if (in_nmi()) {
+ /* NMIs have their own trampoline */
+ this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE);
+ regs->ip = (unsigned long) ftrace_emulate_call_nmi;
+ return 1;
+ }
+ this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE);
+ if (regs->flags & X86_EFLAGS_IF) {
+ regs->flags &= ~X86_EFLAGS_IF;
+ regs->ip = (unsigned long) ftrace_emulate_call_sti;
+ /* Tell lockdep here we are enabling interrupts */
+ lockdep_hardirqs_on(_THIS_IP_);
+ } else {
+ regs->ip = (unsigned long) ftrace_emulate_call;
+ }
+ return 1;
+ } else if (is_ftrace_caller(ip)) {
+ /* An ftrace trampoline is being updated */
+ if (!ftrace_update_func_call) {
+ /* If it's a jump, just need to skip it */
+ regs->ip += MCOUNT_INSN_SIZE - 1;
+ return 1;
+ }
+ if (in_nmi()) {
+ /* NMIs have their own trampoline */
+ this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE);
+ regs->ip = (unsigned long) ftrace_emulate_call_update_nmi;
+ return 1;
+ }
+ this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE);
+ if (regs->flags & X86_EFLAGS_IF) {
+ regs->flags &= ~X86_EFLAGS_IF;
+ regs->ip = (unsigned long) ftrace_emulate_call_update_sti;
+ } else {
+ regs->ip = (unsigned long) ftrace_emulate_call_update;
+ }
+ return 1;
+ }
+
+ return 0;
+}
+#else /* !X86_64 */
+int ftrace_int3_handler(struct pt_regs *regs)
+{
+ unsigned long ip;
+
if (WARN_ON_ONCE(!regs))
return 0;
@@ -299,9 +396,9 @@ int ftrace_int3_handler(struct pt_regs *regs)
return 0;
regs->ip += MCOUNT_INSN_SIZE - 1;
-
return 1;
}
+#endif
NOKPROBE_SYMBOL(ftrace_int3_handler);
static int ftrace_write(unsigned long ip, const char *val, int size)
@@ -859,6 +956,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -960,6 +1059,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
{
unsigned char *new;
+ ftrace_update_func_call = 0;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index 75f2b36b41a6..8642e1719370 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -9,6 +9,9 @@
#include <asm/export.h>
#include <asm/nospec-branch.h>
#include <asm/unwind_hints.h>
+#include <asm/irqflags.h>
+#include <asm/percpu.h>
+#include <asm/frame.h>
.code64
.section .entry.text, "ax"
@@ -262,6 +265,59 @@ GLOBAL(ftrace_regs_caller_end)
ENDPROC(ftrace_regs_caller)
+/* Trampoline for function update with interrupts enabled */
+GLOBAL(ftrace_emulate_call_sti)
+ push PER_CPU_VAR(ftrace_bp_call_return)
+ push ftrace_caller_func
+ TRACE_IRQS_ON
+ sti
+ ret
+ENDPROC(ftrace_emulate_call_sti)
+
+/* Trampoline for function update with interrupts disabled*/
+GLOBAL(ftrace_emulate_call)
+ push PER_CPU_VAR(ftrace_bp_call_return)
+ push ftrace_caller_func
+ ret
+ENDPROC(ftrace_emulate_call)
+
+/* Trampoline for function update in an NMI */
+GLOBAL(ftrace_emulate_call_nmi)
+ push PER_CPU_VAR(ftrace_bp_call_nmi_return)
+ push ftrace_caller_func
+ ret
+ENDPROC(ftrace_emulate_call_nmi)
+
+/* Trampoline for ftrace trampoline call update with interrupts enabled */
+GLOBAL(ftrace_emulate_call_update_sti)
+ push PER_CPU_VAR(ftrace_bp_call_return)
+ push ftrace_update_func_call
+ TRACE_IRQS_ON
+ sti
+ ret
+ENDPROC(ftrace_emulate_call_update_sti)
+
+/* Trampoline for ftrace trampoline call update with interrupts disabled */
+GLOBAL(ftrace_emulate_call_update)
+ push PER_CPU_VAR(ftrace_bp_call_return)
+ push ftrace_update_func_call
+ ret
+ENDPROC(ftrace_emulate_call_update)
+
+/* Trampoline for ftrace trampoline call update in an NMI */
+GLOBAL(ftrace_emulate_call_update_nmi)
+ push PER_CPU_VAR(ftrace_bp_call_nmi_return)
+ push ftrace_update_func_call
+ ret
+ENDPROC(ftrace_emulate_call_update_nmi)
+
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_sti)
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call)
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_nmi)
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_sti)
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update)
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_nmi)
+
#else /* ! CONFIG_DYNAMIC_FTRACE */
ENTRY(function_hook)
--
2.20.1
pidfds are file descriptors referring to a process created with the
CLONE_PIDFD clone(2) flag. The Android low memory killer (LMK) needs pidfd
polling support to replace code that currently checks for the existence of
/proc/pid to know that a process it has signalled to be killed has died,
which is both racy and slow. The pidfd poll approach is race-free, and
also allows the LMK to do other things (such as polling on other fds)
while waiting for the process being killed to die.
It also prevents the situation where a PID is reused between the LMK
sending a kill signal and checking for the existence of the PID, in which
case the wrong PID would possibly be checked.
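For illustration, a minimal userspace sketch of the intended usage (raw
clone(2) with CLONE_PIDFD from the referenced series; the syscall argument
order shown assumes x86_64, where the pidfd is returned through the
parent_tid argument):

#define _GNU_SOURCE
#include <poll.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000	/* value used by the CLONE_PIDFD series */
#endif

int main(void)
{
	int pidfd = -1;
	/* x86_64 raw clone: flags, stack, parent_tid, child_tid, tls */
	pid_t pid = syscall(SYS_clone, CLONE_PIDFD | SIGCHLD, NULL,
			    &pidfd, NULL, 0UL);

	if (pid < 0) {
		perror("clone");
		exit(EXIT_FAILURE);
	}
	if (pid == 0) {		/* child: exit after a moment */
		sleep(1);
		_exit(0);
	}

	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };

	/* Blocks until the whole thread group of the child has exited */
	if (poll(&pfd, 1, -1) == 1 && (pfd.revents & POLLIN))
		printf("child %d exited\n", (int)pid);
	return 0;
}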
In this patch, we follow the existing mechanism in the kernel used when
the parent of the task group is to be notified (do_notify_parent): that is
also the point at which tasks waiting on a poll of the pidfd are awakened.
We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
it in task_struct would not be an option in this case because the task can
be reaped and destroyed before the poll returns.
2. Including the waitqueue in struct pid means that during de_thread(),
the new thread group leader automatically gets the waitqueue/pid even
though its task_struct is different.
Appropriate test cases are added in the second patch to provide coverage
of all the cases the patch is handling.
Andy had a similar patch [1] in the past, which was a good reference;
however, this patch tries to properly handle different situations related
to thread group existence and how/where it notifies, and it also solves
other bugs (waitqueue lifetime). Daniel recently had a similar patch [2],
which this patch supersedes.
[1] https://lore.kernel.org/patchwork/patch/345098/
[2] https://lore.kernel.org/lkml/20181029175322.189042-1-dancol@google.com/
Cc: luto(a)amacapital.net
Cc: rostedt(a)goodmis.org
Cc: dancol(a)google.com
Cc: sspatil(a)google.com
Cc: christian(a)brauner.io
Cc: jannh(a)google.com
Cc: surenb(a)google.com
Cc: timmurray(a)google.com
Cc: Jonathan Kowalski <bl0pbl33p(a)gmail.com>
Cc: torvalds(a)linux-foundation.org
Cc: kernel-team(a)android.com
Co-developed-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
RFC -> v1:
* Based on CLONE_PIDFD patches: https://lwn.net/Articles/786244/
* Updated selftests.
* Renamed poll wake function to do_notify_pidfd.
* Removed depending on EXIT flags
* Removed POLLERR flag since semantics are controversial and
we don't have usecases for it right now (later we can add if there's
a need for it).
include/linux/pid.h | 3 +++
kernel/fork.c | 33 +++++++++++++++++++++++++++++++++
kernel/pid.c | 2 ++
kernel/signal.c | 14 ++++++++++++++
4 files changed, 52 insertions(+)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..1484db6ca8d1 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -3,6 +3,7 @@
#define _LINUX_PID_H
#include <linux/rculist.h>
+#include <linux/wait.h>
enum pid_type
{
@@ -60,6 +61,8 @@ struct pid
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
+ /* wait queue for pidfd notifications */
+ wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
struct upid numbers[1];
};
diff --git a/kernel/fork.c b/kernel/fork.c
index 5525837ed80e..fb3b614f6456 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1685,8 +1685,41 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
}
#endif
+static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
+{
+ struct task_struct *task;
+ struct pid *pid;
+ int poll_flags = 0;
+
+ /*
+ * tasklist_lock must be held to avoid racing with
+ * changes in exit_state and wake up. Basically to avoid:
+ *
+ * P0: read exit_state = 0
+ * P1: write exit_state = EXIT_DEAD
+ * P1: Do a wake up - wq is empty, so do nothing
+ * P0: Queue for polling - wait forever.
+ */
+ read_lock(&tasklist_lock);
+ pid = file->private_data;
+ task = pid_task(pid, PIDTYPE_PID);
+ WARN_ON_ONCE(task && !thread_group_leader(task));
+
+ if (!task || (task->exit_state && thread_group_empty(task)))
+ poll_flags = POLLIN | POLLRDNORM;
+
+ if (!poll_flags)
+ poll_wait(file, &pid->wait_pidfd, pts);
+
+ read_unlock(&tasklist_lock);
+
+ return poll_flags;
+}
+
+
const struct file_operations pidfd_fops = {
.release = pidfd_release,
+ .poll = pidfd_poll,
#ifdef CONFIG_PROC_FS
.show_fdinfo = pidfd_show_fdinfo,
#endif
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..5c90c239242f 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -214,6 +214,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
+ init_waitqueue_head(&pid->wait_pidfd);
+
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
if (!(ns->pid_allocated & PIDNS_ADDING))
diff --git a/kernel/signal.c b/kernel/signal.c
index 1581140f2d99..16e7718316e5 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1800,6 +1800,17 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
return ret;
}
+static void do_notify_pidfd(struct task_struct *task)
+{
+ struct pid *pid;
+
+ lockdep_assert_held(&tasklist_lock);
+
+ pid = get_task_pid(task, PIDTYPE_PID);
+ wake_up_all(&pid->wait_pidfd);
+ put_pid(pid);
+}
+
/*
* Let a parent know about the death of a child.
* For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1823,6 +1834,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
BUG_ON(!tsk->ptrace &&
(tsk->group_leader != tsk || !thread_group_empty(tsk)));
+ /* Wake up all pidfd waiters */
+ do_notify_pidfd(tsk);
+
if (sig != SIGCHLD) {
/*
* This is only possible if parent == real_parent.
--
2.21.0.593.g511ec345e18-goog
Hi,
this series is the result of the discussion to the RFC patch found at [1].
The goal is to make x86' ftrace_int3_handler() not to simply skip over
the trapping instruction as this is problematic in the context of
the live patching consistency model. For details, c.f. the commit message
of [3/4] ("x86/ftrace: make ftrace_int3_handler() not to skip fops
invocation").
Everything is based on v5.1-rc6; please let me know in case you want me to
rebase on something else.
For x86_64, the live patching selftest added in [4/4] succeeds with this
series applied and fails without it. On 32-bit x86 I have only compile-tested it.
checkpatch reports warnings about
- an overlong line in assembly -- I chose to ignore that
- MAINTAINERS perhaps needing updates due to the new files
arch/x86/kernel/ftrace_int3_stubs.S and
tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh.
As the existing arch/x86/kernel/ftrace_{32,64}.S haven't got an
explicit entry either, this one is probably Ok? The selftest
definitely is.
Changes to the RFC patch:
- s/trampoline/stub/ to avoid confusion with the ftrace_ops' trampolines,
- use a fixed size stack kept in struct thread_info for passing the
(adjusted) ->ip values from ftrace_int3_handler() to the stubs,
- provide one stub for each of the two possible jump targets and hardcode
those,
- add the live patching selftest.
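As a rough sketch of the second item above (the names, depth and layout
here are assumptions based on the description, not necessarily the
patch's exact code):

/*
 * Sketch only: a small per-task stack of deferred return addresses,
 * pushed by ftrace_int3_handler() and consumed by the stubs. It would
 * be embedded in struct thread_info.
 */
#define FTRACE_INT3_STACK_DEPTH 4	/* assumed bound on nesting */

struct ftrace_int3_stack {
	int depth;					/* slots in use */
	unsigned long slots[FTRACE_INT3_STACK_DEPTH];	/* adjusted ->ip values */
};

A small fixed depth should suffice because entries only live between the
#BP trap and the corresponding stub, so the required depth is bounded by
exception nesting (e.g. task, softirq, hardirq, NMI).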
Thanks,
Nicolai
Nicolai Stange (4):
x86/thread_info: introduce ->ftrace_int3_stack member
ftrace: drop 'static' qualifier from ftrace_ops_list_func()
x86/ftrace: make ftrace_int3_handler() not to skip fops invocation
selftests/livepatch: add "ftrace a live patched function" test
arch/x86/include/asm/thread_info.h | 11 +++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/asm-offsets.c | 8 +++
arch/x86/kernel/ftrace.c | 79 +++++++++++++++++++---
arch/x86/kernel/ftrace_int3_stubs.S | 61 +++++++++++++++++
kernel/trace/ftrace.c | 8 +--
tools/testing/selftests/livepatch/Makefile | 3 +-
.../livepatch/test-livepatch-vs-ftrace.sh | 44 ++++++++++++
8 files changed, 199 insertions(+), 16 deletions(-)
create mode 100644 arch/x86/kernel/ftrace_int3_stubs.S
create mode 100755 tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh
--
2.13.7
On Mon, Apr 29, 2019 at 11:53 AM Linus Torvalds
<torvalds(a)linux-foundation.org> wrote:
>
>
>
> On Mon, Apr 29, 2019, 11:42 Andy Lutomirski <luto(a)kernel.org> wrote:
>>
>>
>> I'm less than 100% convinced about this argument. Sure, an NMI right
>> there won't cause a problem. But an NMI followed by an interrupt will
>> kill us if preemption is on. I can think of three solutions:
>
>
> No, because either the sti shadow disables nmi too (that's the case on some CPUs at least) or the iret from nmi does.
>
> Otherwise you could never trust the whole sti shadow thing - and it very much is part of the architecture.
>
Is this documented somewhere? And do you actually believe that this
is true under KVM, Hyper-V, etc.? As I recall, Andrew Cooper dug into
the way that VMX dealt with this stuff and concluded that the SDM was
blatantly wrong in many cases, which leads me to believe that Xen
HVM/PVH is the *only* hypervisor that gets it right.
Steven's point about batched updates is quite valid, though. My
personal favorite solution to this whole mess is to rework the whole
thing so that the int3 handler simply returns and retries, and to
replace the sync_core() broadcast with an SMI broadcast. I don't know
whether this will actually work on real CPUs and on VMs and whether
it's going to crash various BIOSes out there.
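To make the "return and retry" idea concrete, here is a hypothetical
variant of the handler (sketch only; ftrace_location() and
is_ftrace_caller() as in the patches above):

/*
 * Hypothetical "return and retry" handler: instead of emulating the
 * call, back the ip up over the int3 so the CPU re-executes the
 * instruction. This only makes progress once the text poke cycle has
 * replaced the 0xcc byte, which is why the sync_core() serialization
 * step would have to be rethought (e.g. the SMI broadcast above).
 */
int ftrace_int3_handler_retry(struct pt_regs *regs)
{
	unsigned long ip = regs->ip - INT3_INSN_SIZE;

	if (!ftrace_location(ip) && !is_ftrace_caller(ip))
		return 0;

	regs->ip = ip;		/* retry until the site is patched */
	return 1;
}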