Hi,
In /proc/PID/stat, there is the kstkesp field which is the stack pointer of a thread. While the thread is active, this field reads zero. But during a coredump, it should have a valid value.
However, at the moment, kstkesp is zero even during coredump.
The first commit fixes this problem, and the second commit adds a selftest to detect if this problem appears again in the future.
Nam Cao (2): fs/proc: do_task_stat: Fix ESP not readable during coredump selftests: coredump: Add stackdump test
fs/proc/array.c | 36 ++-- tools/testing/selftests/coredump/Makefile | 7 + tools/testing/selftests/coredump/README.rst | 50 ++++++ tools/testing/selftests/coredump/stackdump | 14 ++ .../selftests/coredump/stackdump_test.c | 154 ++++++++++++++++++ 5 files changed, 243 insertions(+), 18 deletions(-) create mode 100644 tools/testing/selftests/coredump/Makefile create mode 100644 tools/testing/selftests/coredump/README.rst create mode 100755 tools/testing/selftests/coredump/stackdump create mode 100644 tools/testing/selftests/coredump/stackdump_test.c
Commit 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in /proc/PID/stat") disabled stack pointer reading, because it is generally dangerous to do so.
Commit fd7d56270b52 ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping") made an exception for coredumping thread, because for this case it is safe.
The exception was later extended to all threads in a coredumping process by commit cb8f381f1613 ("fs/proc/array.c: allow reporting eip/esp for all coredumping threads").
The above two commits determine if a task is core dumping by checking the PF_EXITING and PF_DUMPCORE flags.
However, commit 92307383082d ("coredump: Don't perform any cleanups before dumping core") moved coredump to happen earlier and before PF_EXITING is set. Thus, the check of the PF_EXITING flag no longer works.
Instead, use task->signal->core_state to determine if coredump is happening. This pointer is set at the beginning of coredump and is cleared once coredump is done. Thus, while this pointer is not NULL, it is safe to read ESP.
Fixes: 92307383082d ("coredump: Don't perform any cleanups before dumping core") Signed-off-by: Nam Cao namcao@linutronix.de Cc: stable@vger.kernel.org Cc: Eric W. Biederman ebiederm@xmission.com --- fs/proc/array.c | 36 ++++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/fs/proc/array.c b/fs/proc/array.c index 34a47fb0c57f..2f1dbfcf143d 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -489,25 +489,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns, vsize = eip = esp = 0; permitted = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS | PTRACE_MODE_NOAUDIT); mm = get_task_mm(task); - if (mm) { + if (mm) vsize = task_vsize(mm); - /* - * esp and eip are intentionally zeroed out. There is no - * non-racy way to read them without freezing the task. - * Programs that need reliable values can use ptrace(2). - * - * The only exception is if the task is core dumping because - * a program is not able to use ptrace(2) in that case. It is - * safe because the task has stopped executing permanently. - */ - if (permitted && (task->flags & (PF_EXITING|PF_DUMPCORE))) { - if (try_get_task_stack(task)) { - eip = KSTK_EIP(task); - esp = KSTK_ESP(task); - put_task_stack(task); - } - } - }
sigemptyset(&sigign); sigemptyset(&sigcatch); @@ -534,6 +517,23 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns, ppid = task_tgid_nr_ns(task->real_parent, ns); pgid = task_pgrp_nr_ns(task, ns);
+ /* + * esp and eip are intentionally zeroed out. There is no + * non-racy way to read them without freezing the task. + * Programs that need reliable values can use ptrace(2). + * + * The only exception is if the task is core dumping because + * a program is not able to use ptrace(2) in that case. It is + * safe because the task has stopped executing permanently. + */ + if (permitted && task->signal->core_state) { + if (try_get_task_stack(task)) { + eip = KSTK_EIP(task); + esp = KSTK_ESP(task); + put_task_stack(task); + } + } + unlock_task_sighand(task, &flags); }
On 2024-11-06, Nam Cao namcao@linutronix.de wrote:
Commit 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in /proc/PID/stat") disabled stack pointer reading, because it is generally dangerous to do so.
Commit fd7d56270b52 ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping") made an exception for coredumping thread, because for this case it is safe.
The exception was later extended to all threads in a coredumping process by commit cb8f381f1613 ("fs/proc/array.c: allow reporting eip/esp for all coredumping threads").
The above two commits determine if a task is core dumping by checking the PF_EXITING and PF_DUMPCORE flags.
However, commit 92307383082d ("coredump: Don't perform any cleanups before dumping core") moved coredump to happen earlier and before PF_EXITING is set. Thus, the check of the PF_EXITING flag no longer works.
Instead, use task->signal->core_state to determine if coredump is happening. This pointer is set at the beginning of coredump and is cleared once coredump is done. Thus, while this pointer is not NULL, it is safe to read ESP.
Fixes: 92307383082d ("coredump: Don't perform any cleanups before dumping core") Signed-off-by: Nam Cao namcao@linutronix.de
Reviewed-by: John Ogness john.ogness@linutronix.de
On Wed, Nov 06 2024 at 10:22, Nam Cao wrote:
Commit 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in /proc/PID/stat") disabled stack pointer reading, because it is generally dangerous to do so.
Commit fd7d56270b52 ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping") made an exception for coredumping thread, because for this case it is safe.
The exception was later extended to all threads in a coredumping process by commit cb8f381f1613 ("fs/proc/array.c: allow reporting eip/esp for all coredumping threads").
The above two commits determine if a task is core dumping by checking the PF_EXITING and PF_DUMPCORE flags.
However, commit 92307383082d ("coredump: Don't perform any cleanups before dumping core") moved coredump to happen earlier and before PF_EXITING is set. Thus, the check of the PF_EXITING flag no longer works.
Instead, use task->signal->core_state to determine if coredump is happening. This pointer is set at the beginning of coredump and is cleared once coredump is done. Thus, while this pointer is not NULL, it is safe to read ESP.
Fixes: 92307383082d ("coredump: Don't perform any cleanups before dumping core")
Can we please make progress with that? It's a user space visible change which causes a regression in core dumper tools.
Thanks,
tglx
On 11/06, Nam Cao wrote:
@@ -534,6 +517,23 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns, ppid = task_tgid_nr_ns(task->real_parent, ns); pgid = task_pgrp_nr_ns(task, ns);
/*
* esp and eip are intentionally zeroed out. There is no
* non-racy way to read them without freezing the task.
* Programs that need reliable values can use ptrace(2).
OK,
but then:
* The only exception is if the task is core dumping because
* a program is not able to use ptrace(2) in that case. It is
* safe because the task has stopped executing permanently.
*/
if (permitted && task->signal->core_state) {
if (try_get_task_stack(task)) {
eip = KSTK_EIP(task);
esp = KSTK_ESP(task);
put_task_stack(task);
How can the task->signal->core_state check help ?
Suppose we have a task T1 with T1-pid == 100 and you read /proc/100/stat. It is possible that the T1's sub-thread T2 starts the coredumping and sets signal->core_state != NULL.
But read(/proc/100/stat) can run before T1 gets SIGKILL from T2 and enters the kernel mode?
Oleg.
On 12/17, Oleg Nesterov wrote:
On 11/06, Nam Cao wrote:
@@ -534,6 +517,23 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns, ppid = task_tgid_nr_ns(task->real_parent, ns); pgid = task_pgrp_nr_ns(task, ns);
/*
* esp and eip are intentionally zeroed out. There is no
* non-racy way to read them without freezing the task.
* Programs that need reliable values can use ptrace(2).
OK,
but then:
* The only exception is if the task is core dumping because
* a program is not able to use ptrace(2) in that case. It is
* safe because the task has stopped executing permanently.
*/
if (permitted && task->signal->core_state) {
if (try_get_task_stack(task)) {
eip = KSTK_EIP(task);
esp = KSTK_ESP(task);
put_task_stack(task);
How can the task->signal->core_state check help ?
Suppose we have a task T1 with T1-pid == 100 and you read /proc/100/stat. It is possible that the T1's sub-thread T2 starts the coredumping and sets signal->core_state != NULL.
But read(/proc/100/stat) can run before T1 gets SIGKILL from T2 and enters the kernel mode?
Can't the trivial patch below fix the problem?
Oleg.
--- xfs/proc/array.c +++ x/fs/proc/array.c @@ -500,7 +500,7 @@ * a program is not able to use ptrace(2) in that case. It is * safe because the task has stopped executing permanently. */ - if (permitted && (task->flags & (PF_EXITING|PF_DUMPCORE))) { + if (permitted && (task->flags & (PF_EXITING|PF_DUMPCORE|PF_POSTCOREDUMP))) { if (try_get_task_stack(task)) { eip = KSTK_EIP(task); esp = KSTK_ESP(task);
Hi Oleg,
On Tue, Dec 17, 2024 at 04:09:14PM +0100, Oleg Nesterov wrote:
On 12/17, Oleg Nesterov wrote:
On 11/06, Nam Cao wrote:
@@ -534,6 +517,23 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns, ppid = task_tgid_nr_ns(task->real_parent, ns); pgid = task_pgrp_nr_ns(task, ns);
/*
* esp and eip are intentionally zeroed out. There is no
* non-racy way to read them without freezing the task.
* Programs that need reliable values can use ptrace(2).
OK,
but then:
* The only exception is if the task is core dumping because
* a program is not able to use ptrace(2) in that case. It is
* safe because the task has stopped executing permanently.
*/
if (permitted && task->signal->core_state) {
if (try_get_task_stack(task)) {
eip = KSTK_EIP(task);
esp = KSTK_ESP(task);
put_task_stack(task);
How can the task->signal->core_state check help ?
Suppose we have a task T1 with T1-pid == 100 and you read /proc/100/stat. It is possible that the T1's sub-thread T2 starts the coredumping and sets signal->core_state != NULL.
But read(/proc/100/stat) can run before T1 gets SIGKILL from T2 and enters the kernel mode?
Right, I missed that race, thanks for pointing it out.
Can't the trivial patch below fix the problem?
It can. In fact this is the original fix we had. I thought that checking a single "core_state" is simpler than checking 3 flags, oh well..
Can you send a proper patch, or should I do it?
Best regards, Nam
Oleg.
--- xfs/proc/array.c +++ x/fs/proc/array.c @@ -500,7 +500,7 @@ * a program is not able to use ptrace(2) in that case. It is * safe because the task has stopped executing permanently. */
if (permitted && (task->flags & (PF_EXITING|PF_DUMPCORE))) {
if (permitted && (task->flags & (PF_EXITING|PF_DUMPCORE|PF_POSTCOREDUMP))) { if (try_get_task_stack(task)) { eip = KSTK_EIP(task); esp = KSTK_ESP(task);
Hi Nam,
On 12/20, Nam Cao wrote:
Can't the trivial patch below fix the problem?
It can. In fact this is the original fix we had. I thought that checking a single "core_state" is simpler than checking 3 flags, oh well..
Can you send a proper patch, or should I do it?
Can you send V2 please? It was you who found/investigated the problem, and the patch is trivial.
Feel free to include my acked-by.
Thanks,
Oleg.
Add a test which checks that the kstkesp field in /proc/pid/stat can be read for all threads of a coredumping process.
For full details including the motivation for this test and how it works, see the README file added by this commit.
Signed-off-by: Nam Cao namcao@linutronix.de --- tools/testing/selftests/coredump/Makefile | 7 + tools/testing/selftests/coredump/README.rst | 50 ++++++ tools/testing/selftests/coredump/stackdump | 14 ++ .../selftests/coredump/stackdump_test.c | 154 ++++++++++++++++++ 4 files changed, 225 insertions(+) create mode 100644 tools/testing/selftests/coredump/Makefile create mode 100644 tools/testing/selftests/coredump/README.rst create mode 100755 tools/testing/selftests/coredump/stackdump create mode 100644 tools/testing/selftests/coredump/stackdump_test.c
diff --git a/tools/testing/selftests/coredump/Makefile b/tools/testing/selftests/coredump/Makefile new file mode 100644 index 000000000000..ed210037b29d --- /dev/null +++ b/tools/testing/selftests/coredump/Makefile @@ -0,0 +1,7 @@ +# SPDX-License-Identifier: GPL-2.0-only +CFLAGS = $(KHDR_INCLUDES) + +TEST_GEN_PROGS := stackdump_test +TEST_FILES := stackdump + +include ../lib.mk diff --git a/tools/testing/selftests/coredump/README.rst b/tools/testing/selftests/coredump/README.rst new file mode 100644 index 000000000000..164a7aa181c8 --- /dev/null +++ b/tools/testing/selftests/coredump/README.rst @@ -0,0 +1,50 @@ +coredump selftest +================= + +Background context +------------------ + +`coredump` is a feature which dumps a process's memory space when the process terminates +unexpectedly (e.g. due to segmentation fault), which can be useful for debugging. By default, +`coredump` dumps the memory to the file named `core`, but this behavior can be changed by writing a +different file name to `/proc/sys/kernel/core_pattern`. Furthermore, `coredump` can be piped to a +user-space program by writing the pipe symbol (`|`) followed by the command to be executed to +`/proc/sys/kernel/core_pattern`. For the full description, see `man 5 core`. + +The piped user program may be interested in reading the stack pointers of the crashed process. The +crashed process's stack pointers can be read from `procfs`: it is the `kstkesp` field in +`/proc/$PID/stat`. See `man 5 proc` for all the details. + +The problem +----------- +While a thread is active, the stack pointer is unsafe to read and therefore the `kstkesp` field +reads zero. But when the thread is dead (e.g. during a coredump), this field should have valid +value. + +However, this was broken in the past and `kstkesp` was zero even during coredump: + +* commit 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in /proc/PID/stat") changed kstkesp to + always be zero + +* commit fd7d56270b52 ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping") fixed it for the + coredumping thread. However, other threads in a coredumping process still had the problem. + +* commit cb8f381f1613 ("fs/proc/array.c: allow reporting eip/esp for all coredumping threads") fixed + for all threads in a coredumping process. + +* commit 92307383082d ("coredump: Don't perform any cleanups before dumping core") broke it again + for the other threads in a coredumping process. + +The problem has been fixed now, but considering the history, it may appear again in the future. + +The goal of this test +--------------------- +This test detects problem with reading `kstkesp` during coredump by doing the following: + +#. Tell the kernel to execute the "stackdump" script when a coredump happens. This script + reads the stack pointers of all threads of crashed processes. + +#. Spawn a child process who creates some threads and then crashes. + +#. Read the output from the "stackdump" script, and make sure all stack pointer values are + non-zero. diff --git a/tools/testing/selftests/coredump/stackdump b/tools/testing/selftests/coredump/stackdump new file mode 100755 index 000000000000..96714ce42d12 --- /dev/null +++ b/tools/testing/selftests/coredump/stackdump @@ -0,0 +1,14 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 + +CRASH_PROGRAM_ID=$1 +STACKDUMP_FILE=$2 + +TMP=$(mktemp) + +for t in /proc/$CRASH_PROGRAM_ID/task/*; do + tid=$(basename $t) + cat /proc/$tid/stat | awk '{print $29}' >> $TMP +done + +mv $TMP $STACKDUMP_FILE diff --git a/tools/testing/selftests/coredump/stackdump_test.c b/tools/testing/selftests/coredump/stackdump_test.c new file mode 100644 index 000000000000..b86202aa2f04 --- /dev/null +++ b/tools/testing/selftests/coredump/stackdump_test.c @@ -0,0 +1,154 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <fcntl.h> +#include <libgen.h> +#include <linux/limits.h> +#include <pthread.h> +#include <string.h> +#include <sys/resource.h> +#include <unistd.h> + +#include "../kselftest_harness.h" + +#define STACKDUMP_FILE "/tmp/kselftest_stackdump" +#define STACKDUMP_SCRIPT "stackdump" +#define NUM_THREAD_SPAWN 128 + +static void *do_nothing(void *) +{ + while (1) + pause(); +} + +static void crashing_child(void) +{ + pthread_t thread; + int i; + + for (i = 0; i < NUM_THREAD_SPAWN; ++i) + pthread_create(&thread, NULL, do_nothing, NULL); + + /* crash on purpose */ + i = *(int *)NULL; +} + +static void empty_function(int) {} + +FIXTURE(coredump) +{ + char original_core_pattern[256]; +}; + +FIXTURE_SETUP(coredump) +{ + char buf[PATH_MAX]; + FILE *file; + char *dir; + int ret; + + file = fopen("/proc/sys/kernel/core_pattern", "r"); + ASSERT_NE(NULL, file); + + ret = fread(self->original_core_pattern, 1, sizeof(self->original_core_pattern), file); + ASSERT_TRUE(ret || feof(file)); + ASSERT_LT(ret, sizeof(self->original_core_pattern)); + + self->original_core_pattern[ret] = '\0'; + + ret = fclose(file); + ASSERT_EQ(0, ret); +} + +FIXTURE_TEARDOWN(coredump) +{ + const char *reason; + FILE *file; + int ret; + + file = fopen("/proc/sys/kernel/core_pattern", "w"); + if (!file) { + reason = "Unable to open core_pattern"; + goto fail; + } + + ret = fprintf(file, "%s", self->original_core_pattern); + if (ret < 0) { + reason = "Unable to write to core_pattern"; + goto fail; + } + + ret = fclose(file); + if (ret) { + reason = "Unable to close core_pattern"; + goto fail; + } + + return; +fail: + /* This should never happen */ + fprintf(stderr, "Failed to cleanup stackdump test: %s\n", reason); +} + +TEST_F(coredump, stackdump) +{ + struct sigaction action = {}; + unsigned long long stack; + char *test_dir, *line; + size_t line_length; + char buf[PATH_MAX]; + int ret, i; + FILE *file; + pid_t pid; + + /* This file may be left behind by a previous test run */ + unlink(STACKDUMP_FILE); + + /* + * Step 1: Setup core_pattern so that the stackdump script is executed when the child + * process crashes + */ + ret = readlink("/proc/self/exe", buf, sizeof(buf)); + ASSERT_NE(-1, ret); + ASSERT_LT(ret, sizeof(buf)); + buf[ret] = '\0'; + + test_dir = dirname(buf); + + file = fopen("/proc/sys/kernel/core_pattern", "w"); + ASSERT_NE(NULL, file); + + ret = fprintf(file, "|%s/%s %%P %s", test_dir, STACKDUMP_SCRIPT, STACKDUMP_FILE); + ASSERT_LT(0, ret); + + ret = fclose(file); + ASSERT_EQ(0, ret); + + /* Step 2: Create a process who spawns some threads then crashes */ + pid = fork(); + ASSERT_TRUE(pid >= 0); + if (pid == 0) + crashing_child(); + + /* + * Step 3: Wait for the stackdump script to write the stack pointers to the stackdump file + */ + for (i = 0; i < 10; ++i) { + file = fopen(STACKDUMP_FILE, "r"); + if (file) + break; + sleep(1); + } + ASSERT_NE(file, NULL); + + /* Step 4: Make sure all stack pointer values are non-zero */ + for (i = 0; -1 != getline(&line, &line_length, file); ++i) { + stack = strtoull(line, NULL, 10); + ASSERT_NE(stack, 0); + } + + ASSERT_EQ(i, 1 + NUM_THREAD_SPAWN); + + fclose(file); +} + +TEST_HARNESS_MAIN
On 2024-11-06, Nam Cao namcao@linutronix.de wrote:
Add a test which checks that the kstkesp field in /proc/pid/stat can be read for all threads of a coredumping process.
For full details including the motivation for this test and how it works, see the README file added by this commit.
Signed-off-by: Nam Cao namcao@linutronix.de
Reviewed-by: John Ogness john.ogness@linutronix.de
linux-kselftest-mirror@lists.linaro.org