This patch series adds support to freeze the task cgroup hierarchy that is on a default cgroup v2 without going through kernfs interface.
For some cases we want to freeze the cgroup of a task based on some signals, doing so from bpf is better than user space which could be too late.
Planned users of this feature are: tetragon and systemd when freezing a cgroup hierarchy that could be a K8s pod, container, system service or a user session.
Patch 1: cgroup: add cgroup_freeze_no_kn() to freeze a cgroup from bpf Patch 2: bpf: add bpf_task_freeze_cgroup() to freeze the cgroup of a task Patch 3: selftests/bpf: add selftest for bpf_task_freeze_cgroup
include/linux/cgroup.h | 2 ++ kernel/bpf/helpers.c | 31 ++++ kernel/cgroup/cgroup.c | 67 ++++++++ tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c | 165 +++++++++++++++++++++ tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c | 110 ++++++++++++++ 5 files changed, 375 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c create mode 100644 tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c
This patch adds a new cgroup helper cgroup_freeze_no_kn() to freeze a cgroup hierarchy that is on a default cgroup v2 without going through kernfs interface.
For some cases we want to freeze the cgroup of a task based on some signals, doing so from bpf is better than user space which could be too late.
The cgroup_freeze_no_kn() will acquire the cgroup_mutex and release it at the end.
It also checks if the cgroup is on the default hierarchy and it is not a root cgroup.
Signed-off-by: Djalal Harouni tixxdz@gmail.com --- include/linux/cgroup.h | 2 ++ kernel/cgroup/cgroup.c | 69 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 71 insertions(+)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 34aaf0e87def..5019b32ea933 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -137,6 +137,8 @@ int cgroup_init(void);
int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v);
+int cgroup_freeze_no_kn(struct cgroup *cgrp, int freeze); + /* * Iteration helpers and macros. */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index a66c088c851c..0aafcd9e39b5 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1595,6 +1595,26 @@ static u16 cgroup_calc_subtree_ss_mask(u16 subtree_control, u16 this_ss_mask) return cur_ss_mask; }
+/** + * cgroup_dfl_write_no_kn - check if direct writes to cgroup without going + * through kernfs is allowed. + * @cgrp: the target cgroup + * + * This helper ensures that the cgroup is on the default hierarchy and it + * is not a root cgroup. + * + * Return: %0 on success or a negative errno code on failure. + */ +static int cgroup_dfl_write_no_kn(struct cgroup *cgrp) +{ + lockdep_assert_held(&cgroup_mutex); + + if (!cgroup_on_dfl(cgrp) || !cgroup_parent(cgrp)) + return -EOPNOTSUPP; + + return 0; +} + /** * cgroup_kn_unlock - unlocking helper for cgroup kernfs methods * @kn: the kernfs_node being serviced @@ -1668,6 +1688,25 @@ struct cgroup *cgroup_kn_lock_live(struct kernfs_node *kn, bool drain_offline) return NULL; }
+/** + * cgroup_lock_live_no_kn - locking helper for direct writes to cgroup without + * going through kernfs interface. + * @cgrp: the target cgroup + * + * This helper performs cgroup locking and verifies that the associated cgroup + * is alive. Returns the cgroup if alive; otherwise, %NULL. + * A successful return should be undone by a matching cgroup_unlock() + * invocation. + */ +static struct cgroup *cgroup_lock_live_no_kn(struct cgroup *cgrp) +{ + cgroup_lock(); + if (!cgroup_is_dead(cgrp)) + return cgrp; + cgroup_unlock(); + return NULL; +} + static void cgroup_rm_file(struct cgroup *cgrp, const struct cftype *cft) { char name[CGROUP_FILE_NAME_MAX]; @@ -3930,6 +3969,36 @@ static int cgroup_freeze_show(struct seq_file *seq, void *v) return 0; }
+/** + * cgroup_freeze_no_kn - Freeze a cgroup that is on the default hierarchy + * without going through kernfs interface. + * + * @cgrp: the target cgroup + * @freeze: freeze state, passing value 1 causes the freezing of the cgroup + * and all descendant cgroups. Processes under this cgroup hierarchy will + * be stopped and will not run until the cgroup is explicitly unfrozen. + * Passing value 0 unthaws the cgroup hierarchy. + * + * Return: %0 on success or a negative errno code on failure. + */ +int cgroup_freeze_no_kn(struct cgroup *cgrp, int freeze) +{ + int ret = 0; + + if (freeze < 0 || freeze > 1) + return -ERANGE; + + if (!cgroup_lock_live_no_kn(cgrp)) + return -ENOENT; + + ret = cgroup_dfl_write_no_kn(cgrp); + if (!ret) + cgroup_freeze(cgrp, freeze); + + cgroup_unlock(); + return ret; +} + static ssize_t cgroup_freeze_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) {
This patch adds a new bpf helper bpf_task_freeze_cgroup() to freeze a cgroup of a task and all its descendant cgroups. It requires the task to be on the default cgroup v2 hierarchy.
For some cases we want to freeze the cgroup of a task based on some signals, doing so from bpf is better than user space which could be too late.
Planned users of this feature are: tetragon and systemd when freezing a cgroup hierarchy that could be a K8s pod, container, system service or a user session.
This helper will acquire the cgroup_mutex during its operation and release it before it returns.
Signed-off-by: Djalal Harouni tixxdz@gmail.com --- kernel/bpf/helpers.c | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 9234174ccb21..8d510a1b265c 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2270,6 +2270,36 @@ bpf_task_get_cgroup1(struct task_struct *task, int hierarchy_id) return NULL; return cgrp; } + +/** + * bpf_task_freeze_cgroup - Freeze the task cgroup and all its descendant cgroups. + * + * @task: The target task + * @freeze: freeze state, passing value 1 causes the freezing of the cgroup + * and all descendant cgroups. Processes under this cgroup hierarchy will + * be stopped and will not run until the cgroup is explicitly unfrozen. + * Passing value 0 unthaws the task cgroup and its descendant cgroups. + * + * Return: %0 on success or a negative errno code on failure. + */ +__bpf_kfunc int bpf_task_freeze_cgroup(struct task_struct *task, int freeze) +{ + int ret; + struct cgroup *cgrp; + + rcu_read_lock(); + cgrp = task_dfl_cgroup(task); + if (!cgrp || !cgroup_tryget(cgrp)) { + rcu_read_unlock(); + return -ENOENT; + } + rcu_read_unlock(); + + ret = cgroup_freeze_no_kn(cgrp, freeze); + cgroup_put(cgrp); + + return ret; +} #endif /* CONFIG_CGROUPS */
/** @@ -2577,6 +2607,7 @@ BTF_ID_FLAGS(func, bpf_cgroup_ancestor, KF_ACQUIRE | KF_RCU | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU) BTF_ID_FLAGS(func, bpf_task_get_cgroup1, KF_ACQUIRE | KF_RCU | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_task_freeze_cgroup, KF_TRUSTED_ARGS | KF_SLEEPABLE) #endif BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_throw)
This adds a selftest for `bpf_task_freeze_cgroup` kfunc. The test works by forking a child then:
1. Child: - Migrate to a new cgroup - Loads bpf programs - Trigger the 'lsm_freeze_cgroup' bpf program so it freeze itself. by calling "bpf_task_freeze_cgroup(child, 1)"
<- wait for parent to unthaw
- On unthaw it continues, forks another process and triggers the 'tp_newchild' bpf program to set some monitored pids of the new process, that are asserted at user space, to ensure that we resumed correctly.
2. Parent: - Keeps reading the 'cgroup.freeze' file of the child cgroup until it prints 1 which means the child cgroup is frozen - Attaches the sample 'lsm_task_free' so it triggers the bpf program and then calls "bpf_task_freeze_cgroup(task, 0);" on the child task to unthaw its cgroup. - Then waits for a clean exit of the child process.
The scenario allows to test both: freeze and unthaw a task cgroup.
Signed-off-by: Djalal Harouni tixxdz@gmail.com --- .../bpf/prog_tests/task_freeze_cgroup.c | 165 ++++++++++++++++++ .../bpf/progs/test_task_freeze_cgroup.c | 110 ++++++++++++ 2 files changed, 275 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c create mode 100644 tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c
diff --git a/tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c b/tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c new file mode 100644 index 000000000000..afb7d46194c5 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c @@ -0,0 +1,165 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Isovalent */ + +#include <sys/syscall.h> +#include <test_progs.h> +#include <cgroup_helpers.h> +#include <unistd.h> +#include "test_task_freeze_cgroup.skel.h" + +#define FOO "/test-task-freeze-cgroup" + +static int bpf_sleepable(struct test_task_freeze_cgroup *skel) +{ + int err, foo; + pid_t pid; + + foo = test__join_cgroup(FOO); + if (!ASSERT_OK(foo < 0, "cgroup_join_foo")) + return -errno; + + skel = test_task_freeze_cgroup__open(); + if (!ASSERT_OK_PTR(skel, "test_task_freeze_cgroup__open")) + return -errno; + + skel->rodata->parent_pid = getppid(); + skel->rodata->monitor_pid = getpid(); + skel->rodata->cgid = get_cgroup_id(FOO); + skel->bss->new_pid = getpid(); + skel->bss->freeze = 1; + + err = test_task_freeze_cgroup__load(skel); + if (!ASSERT_OK(err, "test_task_freeze_cgroup__load")) + goto cleanup; + + /* First, attach the LSM program, and then it will be triggered when the + * TP_BTF program is attached. + */ + skel->links.lsm_freeze_cgroup = + bpf_program__attach_lsm(skel->progs.lsm_freeze_cgroup); + if (!ASSERT_OK_PTR(skel->links.lsm_freeze_cgroup, "attach_lsm")) { + err = -errno; + goto cleanup; + } + + /* This will fail */ + skel->links.tp_newchild = + bpf_program__attach_trace(skel->progs.tp_newchild); + if (!ASSERT_EQ(errno, EPERM, "attach_trace")) { + err = -EINVAL; + goto cleanup; + } + + /* Try again now */ + skel->links.tp_newchild = + bpf_program__attach_trace(skel->progs.tp_newchild); + if (!ASSERT_OK_PTR(skel->links.tp_newchild, "attach_trace")) { + err = -EINVAL; + goto cleanup; + } + + /* Trigger a new child and assert unfrozen state */ + pid = fork(); + if (pid == 0) + exit(0); + + err = (pid == -1); + if (ASSERT_OK(err, "fork process")) + wait(NULL); + + /* Now we should continue, assert that new_pid reflects child */ + ASSERT_NEQ(skel->rodata->monitor_pid, skel->bss->new_pid, + "test task_freeze_cgroup failed at monitor_pid != new_pid"); + ASSERT_NEQ(0, skel->bss->new_pid, + "test task_freeze_cgroup failed at remote_pid != 0"); + + /* Assert that bpf set new_pid to new forked child pid */ + ASSERT_EQ(pid, skel->bss->new_pid, + "test task_freeze_cgroup failed at pid == new_pid"); + + test_task_freeze_cgroup__detach(skel); + +cleanup: + test_task_freeze_cgroup__destroy(skel); + close(foo); + return err; +} + +void test_task_freeze_cgroup(void) +{ + pid_t pid, result; + char buf[512] = {0}; + char path[PATH_MAX] = {0}; + int ret, status, attempts, frozen = 0; + struct test_task_freeze_cgroup *skel = NULL; + + pid = fork(); + ret = (pid == -1); + if (!ASSERT_OK(ret, "fork process")) + return; + + if (pid == 0) { + ret = bpf_sleepable(skel); + ASSERT_EQ(0, ret, "bpf_sleepable failed"); + exit(ret); + } + + skel = test_task_freeze_cgroup__open(); + if (!ASSERT_OK_PTR(skel, "test_task_freeze_cgroup__open")) + goto out; + + snprintf(path, sizeof(path), + "/sys/fs/cgroup/cgroup-test-work-dir%d%s/cgroup.freeze", + pid, FOO); + + for (attempts = 5; attempts >= 0; attempts--) { + ret = 0; + int fd = open(path, O_RDONLY); + if (fd > 0) + ret = read(fd, buf, sizeof(buf) - 1); + if (ret > 0) { + errno = 0; + frozen = strtol(buf, NULL, 10); + if (errno) + frozen = 0; + } + + close(fd); + if (frozen) + break; + sleep(1); + } + + /* Assert that child cgroup is frozen */ + if (!ASSERT_EQ(1, frozen, "child cgroup not frozen")) + goto out; + + ret = test_task_freeze_cgroup__load(skel); + if (!ASSERT_OK(ret, "test_task_freeze_cgroup__load")) + goto out; + + /* Unthaw child cgroup from parent */ + skel->links.lsm_task_free = + bpf_program__attach_lsm(skel->progs.lsm_task_free); + if (!ASSERT_OK_PTR(skel->links.lsm_task_free, "attach_lsm")) + goto out; + + result = waitpid(pid, &status, WUNTRACED); + if (!ASSERT_NEQ(result, -1, "waitpid")) + goto detach; + + result = WIFEXITED(status); + if (!ASSERT_EQ(result, 1, "forked process did not terminate normally")) + goto detach; + + result = WEXITSTATUS(status); + if (!ASSERT_EQ(result, 0, "forked process did not exit successfully")) + goto detach; + +detach: + test_task_freeze_cgroup__detach(skel); + +out: + if (skel) + test_task_freeze_cgroup__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c b/tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c new file mode 100644 index 000000000000..dbd2d60f464e --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c @@ -0,0 +1,110 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Isovalent */ + +#include <vmlinux.h> +#include <bpf/bpf_tracing.h> +#include <bpf/bpf_helpers.h> +#include <errno.h> +#include "bpf_misc.h" + +struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym; +long bpf_task_under_cgroup(struct task_struct *task, struct cgroup *ancestor) __ksym; +void bpf_cgroup_release(struct cgroup *p) __ksym; +struct task_struct *bpf_task_from_pid(s32 pid) __ksym; +struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym; +void bpf_task_release(struct task_struct *p) __ksym; + +int bpf_task_freeze_cgroup(struct task_struct *task, int freeze) __ksym; + +const volatile int parent_pid; +const volatile int monitor_pid; +const volatile __u64 cgid; +int new_pid; +int freeze; + +SEC("tp_btf/task_newtask") +int BPF_PROG(tp_newchild, struct task_struct *task, u64 clone_flags) +{ + struct cgroup *cgrp = NULL; + struct task_struct *acquired; + + if (monitor_pid != (bpf_get_current_pid_tgid() >> 32)) + return 0; + + acquired = bpf_task_acquire(task); + if (!acquired) + return 0; + + cgrp = bpf_cgroup_from_id(cgid); + if (!cgrp) + goto out; + + if (bpf_task_under_cgroup(acquired, cgrp)) + new_pid = acquired->tgid; + +out: + if (cgrp) + bpf_cgroup_release(cgrp); + bpf_task_release(acquired); + + return 0; +} + +/* This is attached from parent to trigger the bpf lsm hook, so parent + * can unthaw the child. + */ +SEC("lsm/task_free") +int BPF_PROG(lsm_task_free, struct task_struct *task) +{ + return 0; +} + +SEC("lsm.s/bpf") +int BPF_PROG(lsm_freeze_cgroup, int cmd, union bpf_attr *attr, unsigned int size) +{ + int ret = 0; + struct cgroup *cgrp = NULL; + struct task_struct *task; + + if (cmd != BPF_LINK_CREATE) + return ret; + + task = bpf_get_current_task_btf(); + if (parent_pid == task->pid) { + /* Unthaw child from parent */ + task = bpf_task_from_pid(monitor_pid); + if (!task) + return -ENOENT; + + ret = bpf_task_freeze_cgroup(task, 0); + bpf_task_release(task); + return ret; + } + + if (monitor_pid != task->pid) + return 0; + + /* Freeze the child cgroup from its context */ + cgrp = bpf_cgroup_from_id(cgid); + if (!cgrp) + goto out; + + if (!bpf_task_under_cgroup(task, cgrp)) + goto out; + + if (freeze) { + /* Schedule freeze task and return -EPERM */ + ret = bpf_task_freeze_cgroup(task, 1); + if (!ret) { + ret = -EPERM; + /* reset for next call */ + freeze = 0; + } + } +out: + if (cgrp) + bpf_cgroup_release(cgrp); + return ret; +} + +char _license[] SEC("license") = "GPL";
Hello, Djalal.
On Wed, Mar 27, 2024 at 11:53:22PM +0100, Djalal Harouni wrote:
This patch series adds support to freeze the task cgroup hierarchy that is on a default cgroup v2 without going through kernfs interface.
For some cases we want to freeze the cgroup of a task based on some signals, doing so from bpf is better than user space which could be too late.
Planned users of this feature are: tetragon and systemd when freezing a cgroup hierarchy that could be a K8s pod, container, system service or a user session.
Patch 1: cgroup: add cgroup_freeze_no_kn() to freeze a cgroup from bpf Patch 2: bpf: add bpf_task_freeze_cgroup() to freeze the cgroup of a task Patch 3: selftests/bpf: add selftest for bpf_task_freeze_cgroup
It bothers me a bit that it's adding a dedicated interface for something which already has a defined userspace interface. Would it be better to have kfunc wrappers for kernel_read() and kernel_write()?
Thanks.
On Thu, Mar 28, 2024 at 10:22 AM Tejun Heo tj@kernel.org wrote:
Hello, Djalal.
On Wed, Mar 27, 2024 at 11:53:22PM +0100, Djalal Harouni wrote:
This patch series adds support to freeze the task cgroup hierarchy that is on a default cgroup v2 without going through kernfs interface.
For some cases we want to freeze the cgroup of a task based on some signals, doing so from bpf is better than user space which could be too late.
Planned users of this feature are: tetragon and systemd when freezing a cgroup hierarchy that could be a K8s pod, container, system service or a user session.
Patch 1: cgroup: add cgroup_freeze_no_kn() to freeze a cgroup from bpf Patch 2: bpf: add bpf_task_freeze_cgroup() to freeze the cgroup of a task Patch 3: selftests/bpf: add selftest for bpf_task_freeze_cgroup
It bothers me a bit that it's adding a dedicated interface for something which already has a defined userspace interface. Would it be better to have kfunc wrappers for kernel_read() and kernel_write()?
How would that look ? prog cannot and shouldn't open a file. The seq_file would be passed/pinned by user space?
Hello, Alexei.
On Thu, Mar 28, 2024 at 10:32:24AM -0700, Alexei Starovoitov wrote:
It bothers me a bit that it's adding a dedicated interface for something which already has a defined userspace interface. Would it be better to have kfunc wrappers for kernel_read() and kernel_write()?
How would that look ? prog cannot and shouldn't open a file.
Oh, I didn't know. Why is that?
The seq_file would be passed/pinned by user space?
Would it work if it's just "open this file, write this and then close it"?
Thanks.
On Thu, Mar 28, 2024 at 10:58 AM Tejun Heo tj@kernel.org wrote:
Hello, Alexei.
On Thu, Mar 28, 2024 at 10:32:24AM -0700, Alexei Starovoitov wrote:
It bothers me a bit that it's adding a dedicated interface for something which already has a defined userspace interface. Would it be better to have kfunc wrappers for kernel_read() and kernel_write()?
How would that look ? prog cannot and shouldn't open a file.
Oh, I didn't know. Why is that?
The seq_file would be passed/pinned by user space?
Would it work if it's just "open this file, write this and then close it"?
Continuing discussion... To use kernel_file_open() it would need path, inode, cred. None of that is available now. Allocating all these structures just to wrap a cgroup pointer feels like overkill. Of course, it would solve the need to introduce other cgroup apis that are already available via text based cgroupfs read/write. So there are pros and cons in both approaches. Maybe the 3rd option would be to expose: cgroup_lock() as a special blend of acquire plus lock. Then there will be no need for bpf_task_freeze_cgroup() with task argument. Instead cgroup_freeze() will be such kfunc that takes cgroup argument and the verifier will check that cgroup was acquired/locked. Sort-of what we check to access bpf_rb_root.
Hello,
On Thu, Mar 28, 2024 at 12:46:03PM -0700, Alexei Starovoitov wrote:
To use kernel_file_open() it would need path, inode, cred.
Yeah, it's more involved and potentially more controversial. That said, just to push the argument a bit further, at least for path, it'd be nice to have a helper anyway which can return cgroup path. I don't know whether we'd need direct inode access. For cred, just share some root cred?
None of that is available now. Allocating all these structures just to wrap a cgroup pointer feels like overkill. Of course, it would solve the need to introduce other cgroup apis that are already available via text based cgroupfs read/write. So there are pros and cons in both approaches. Maybe the 3rd option would be to expose: cgroup_lock() as a special blend of acquire plus lock.
Oh, let's not expose that. That has been a source of problem for some use cases and we may have to change how cgroups are locked.
Then there will be no need for bpf_task_freeze_cgroup() with task argument. Instead cgroup_freeze() will be such kfunc that takes cgroup argument and the verifier will check that cgroup was acquired/locked. Sort-of what we check to access bpf_rb_root.
There's also cgroup.kill which would be useful for similar use cases. We can add interface for both but idk. Let's say we have something like the following (pardon the bad naming):
bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)
Would that work? I'm not necessarily in love with the idea or against adding separate helpers but the duplication still bothers me a bit.
Thanks.
On Thu, Mar 28, 2024 at 1:02 PM Tejun Heo tj@kernel.org wrote:
There's also cgroup.kill which would be useful for similar use cases. We can add interface for both but idk. Let's say we have something like the following (pardon the bad naming):
bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)
Would that work? I'm not necessarily in love with the idea or against adding separate helpers but the duplication still bothers me a bit.
I liked it. So filename will be one of cgroup_base_files[].name ? We probably don't want psi or cgroup1_base_files in there.
From the verifier pov 2nd arg can be "char *knob__str" and the verifier will make sure it's a constant NULL terminated string, so at runtime it will be easier to search cgroup_base_files array. And 'buf' can be: void *mem, int mem__sz with kfunc doing run-time validation that there a null there.
Hello,
On Thu, Mar 28, 2024 at 01:45:56PM -0700, Alexei Starovoitov wrote:
On Thu, Mar 28, 2024 at 1:02 PM Tejun Heo tj@kernel.org wrote:
There's also cgroup.kill which would be useful for similar use cases. We can add interface for both but idk. Let's say we have something like the following (pardon the bad naming):
bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)
Would that work? I'm not necessarily in love with the idea or against adding separate helpers but the duplication still bothers me a bit.
I liked it. So filename will be one of cgroup_base_files[].name ? We probably don't want psi or cgroup1_base_files in there.
Would it matter? If the user has root perm, they can do whatever with the files anyway, so I'm not sure why we'd restrict any specific knob. Maybe we wanna make sure @filename doesn't include '/'? Or is it that you don't want to go through the usual file name look up?
From the verifier pov 2nd arg can be "char *knob__str" and the verifier will make sure it's a constant NULL terminated string, so at runtime it will be easier to search cgroup_base_files array. And 'buf' can be: void *mem, int mem__sz with kfunc doing run-time validation that there a null there.
That all sound good.
Thanks.
On Thu, Mar 28, 2024 at 2:01 PM Tejun Heo tj@kernel.org wrote:
Hello,
On Thu, Mar 28, 2024 at 01:45:56PM -0700, Alexei Starovoitov wrote:
On Thu, Mar 28, 2024 at 1:02 PM Tejun Heo tj@kernel.org wrote:
There's also cgroup.kill which would be useful for similar use cases. We can add interface for both but idk. Let's say we have something like the following (pardon the bad naming):
bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)
Would that work? I'm not necessarily in love with the idea or against adding separate helpers but the duplication still bothers me a bit.
I liked it. So filename will be one of cgroup_base_files[].name ? We probably don't want psi or cgroup1_base_files in there.
Would it matter?
Few weak reasons: . cgroup_psi_files have show/write/poll/release which doesn't map to this bpf_cgroup_knob_write/read ? . cgroup1_base_files probably needs to a separate kfunc bpf_cgroup1_...
If the user has root perm, they can do whatever with the files anyway, so I'm not sure why we'd restrict any specific knob. Maybe we wanna make sure @filename doesn't include '/'? Or is it that you don't want to go through the usual file name look up?
yeah. why do a file lookup? The names are there in the array. cgroup pointer gives that "relative path" and knob name is the last part of such "path". Easy to search in that array(s).
From the verifier pov 2nd arg can be "char *knob__str" and the verifier will make sure it's a constant NULL terminated string, so at runtime it will be easier to search cgroup_base_files array. And 'buf' can be: void *mem, int mem__sz with kfunc doing run-time validation that there a null there.
That all sound good.
Thanks.
-- tejun
Hello,
On Thu, Mar 28, 2024 at 02:28:51PM -0700, Alexei Starovoitov wrote:
So filename will be one of cgroup_base_files[].name ? We probably don't want psi or cgroup1_base_files in there.
Would it matter?
Few weak reasons: . cgroup_psi_files have show/write/poll/release which doesn't map to this bpf_cgroup_knob_write/read ? . cgroup1_base_files probably needs to a separate kfunc bpf_cgroup1_...
If the user has root perm, they can do whatever with the files anyway, so I'm not sure why we'd restrict any specific knob. Maybe we wanna make sure @filename doesn't include '/'? Or is it that you don't want to go through the usual file name look up?
yeah. why do a file lookup? The names are there in the array. cgroup pointer gives that "relative path" and knob name is the last part of such "path". Easy to search in that array(s).
Difficult to tell without looking at the implementation but I don't have strong opinions. The interface makes sense to me and as long as we can hook it up in a reasonably way, it should be okay. We can always change internal implementation later if necessary.
Thanks.
Hello Tejun, Alexei,
On 3/28/24 22:01, Tejun Heo wrote:
Hello,
On Thu, Mar 28, 2024 at 01:45:56PM -0700, Alexei Starovoitov wrote:
On Thu, Mar 28, 2024 at 1:02 PM Tejun Heo tj@kernel.org wrote:
There's also cgroup.kill which would be useful for similar use cases. We can add interface for both but idk. Let's say we have something like the following (pardon the bad naming):
Yes having the cgroup.kill from bpf would be useful!
bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)
Would that work? I'm not necessarily in love with the idea or against adding separate helpers but the duplication still bothers me a bit.
I liked it. So filename will be one of cgroup_base_files[].name ? We probably don't want psi or cgroup1_base_files in there.
Would it matter? If the user has root perm, they can do whatever with the files anyway, so I'm not sure why we'd restrict any specific knob. Maybe we wanna make sure @filename doesn't include '/'? Or is it that you don't want to go through the usual file name look up?
It would be easy at least for me if I just start with cgroupv2 and ensure that it has same available filenames as if we go through kernfs. Not a root cgroup node and maybe only freeze and kill for now that are part of cgroup_base_files.
So if I get it right, somehow like what I did but we endup with:
In bpf, cgroup was already acquired.
bpf_cgroup_knob_write(cgroup, "freeze", buf) |_ parse params -> lock cgroup_mutex -> cgroup_freeze() -> unlock
cgroup_freeze_write(struct kernfs_open_file *of, char *buf,...) |_ parse params -> cgroup_ref++ -> krnfs_active_ref-- -> -> lock cgroup_mutex -> cgroup_freeze() -> unlock + krnfs++ ...
Please let me know if I missed something.
Thanks!
Hello,
On Fri, Mar 29, 2024 at 02:22:28PM +0100, Djalal Harouni wrote:
It would be easy at least for me if I just start with cgroupv2 and ensure that it has same available filenames as if we go through kernfs. Not a root cgroup node and maybe only freeze and kill for now that are part of cgroup_base_files.
So if I get it right, somehow like what I did but we endup with:
In bpf, cgroup was already acquired.
bpf_cgroup_knob_write(cgroup, "freeze", buf) |_ parse params -> lock cgroup_mutex -> cgroup_freeze() -> unlock
cgroup_freeze_write(struct kernfs_open_file *of, char *buf,...) |_ parse params -> cgroup_ref++ -> krnfs_active_ref-- -> -> lock cgroup_mutex -> cgroup_freeze() -> unlock + krnfs++ ...
Please let me know if I missed something.
I've thought about it a bit and I wonder whether a better way to do this is implementing this at the kernfs layer. Something like (hopefully with a better name):
s32 bpf_kernfs_knob_write(struct kernfs_node *dir, const char *knob, char *buf);
So, about the same, but takes kernfs_node directory instead of cgroup. This would make the interface useful for accessing sysfs knobs too which use similar conventions. For cgroup, @dir is just cgrp->kn and for sysfs it'd be kobj->sd. This way we can avoid the internal object -> path -> internal object ping-poinging while keeping the interface a lot more generic. What do you think?
Thanks.
On Fri, Mar 29, 2024 at 2:39 PM Tejun Heo tj@kernel.org wrote:
Hello,
On Fri, Mar 29, 2024 at 02:22:28PM +0100, Djalal Harouni wrote:
It would be easy at least for me if I just start with cgroupv2 and ensure that it has same available filenames as if we go through kernfs. Not a root cgroup node and maybe only freeze and kill for now that are part of cgroup_base_files.
So if I get it right, somehow like what I did but we endup with:
In bpf, cgroup was already acquired.
bpf_cgroup_knob_write(cgroup, "freeze", buf) |_ parse params -> lock cgroup_mutex -> cgroup_freeze() -> unlock
cgroup_freeze_write(struct kernfs_open_file *of, char *buf,...) |_ parse params -> cgroup_ref++ -> krnfs_active_ref-- -> -> lock cgroup_mutex -> cgroup_freeze() -> unlock + krnfs++ ...
Please let me know if I missed something.
I've thought about it a bit and I wonder whether a better way to do this is implementing this at the kernfs layer. Something like (hopefully with a better name):
s32 bpf_kernfs_knob_write(struct kernfs_node *dir, const char *knob, char *buf);
So, about the same, but takes kernfs_node directory instead of cgroup. This would make the interface useful for accessing sysfs knobs too which use similar conventions. For cgroup, @dir is just cgrp->kn and for sysfs it'd be kobj->sd. This way we can avoid the internal object -> path -> internal object ping-poinging while keeping the interface a lot more generic. What do you think?
And helpers like cgroup_freeze_write() will be refactored to take kernfs_node directly instead of kernfs_open_file? Makes sense to me. Sounds like a minimal amount of changes and flexible enough.
Hello,
On 3/30/24 00:04, Alexei Starovoitov wrote:
On Fri, Mar 29, 2024 at 2:39 PM Tejun Heo tj@kernel.org wrote:
Hello,
On Fri, Mar 29, 2024 at 02:22:28PM +0100, Djalal Harouni wrote:
It would be easy at least for me if I just start with cgroupv2 and ensure that it has same available filenames as if we go through kernfs. Not a root cgroup node and maybe only freeze and kill for now that are part of cgroup_base_files.
So if I get it right, somehow like what I did but we endup with:
In bpf, cgroup was already acquired.
bpf_cgroup_knob_write(cgroup, "freeze", buf) |_ parse params -> lock cgroup_mutex -> cgroup_freeze() -> unlock
cgroup_freeze_write(struct kernfs_open_file *of, char *buf,...) |_ parse params -> cgroup_ref++ -> krnfs_active_ref-- -> -> lock cgroup_mutex -> cgroup_freeze() -> unlock + krnfs++ ...
Please let me know if I missed something.
I've thought about it a bit and I wonder whether a better way to do this is implementing this at the kernfs layer. Something like (hopefully with a better name):
s32 bpf_kernfs_knob_write(struct kernfs_node *dir, const char *knob, char *buf);
So, about the same, but takes kernfs_node directory instead of cgroup. This would make the interface useful for accessing sysfs knobs too which use similar conventions. For cgroup, @dir is just cgrp->kn and for sysfs it'd be kobj->sd. This way we can avoid the internal object -> path -> internal object ping-poinging while keeping the interface a lot more generic. What do you think?
And helpers like cgroup_freeze_write() will be refactored to take kernfs_node directly instead of kernfs_open_file? Makes sense to me. Sounds like a minimal amount of changes and flexible enough.
Thank you Alexei, Tejun for the feedback. Will try to get back with a v2.
One particular thing is the kernfs_open_file->mutex nests outside of the refcounting of kernfs_node, let's see.
Thanks!
Hello.
On Wed, Mar 27, 2024 at 11:53:22PM +0100, Djalal Harouni tixxdz@gmail.com wrote:
... For some cases we want to freeze the cgroup of a task based on some signals, doing so from bpf is better than user space which could be too late.
Notice that freezer itself is not immediate -- tasks are frozen as if a signal (kill(2)) was delivered to them (i.e. returning to userspace).
What kind of signals (also kill?) are you talking about for illustration?
Planned users of this feature are: tetragon and systemd when freezing a cgroup hierarchy that could be a K8s pod, container, system service or a user session.
It sounds like the signals are related to a particular process. If so what is it good for to freeze unrelated processes in the same cgroup?
I think those answers better clarify why this is needed.
As for the generalization to any cgroup attribute (or kernfs). Can this be compared with sysctls -- I see there are helpers to intercept user writes but no helpers to affect sysctl values without an outer writer. What would justify different approaches between kernfs attributes and sysctls (direct writes vs modified writes)?
Thanks, Michal
Hello Michal,
On 4/2/24 18:16, Michal Koutný wrote:
Hello.
On Wed, Mar 27, 2024 at 11:53:22PM +0100, Djalal Harouni tixxdz@gmail.com wrote:
... For some cases we want to freeze the cgroup of a task based on some signals, doing so from bpf is better than user space which could be too late.
Notice that freezer itself is not immediate -- tasks are frozen as if a signal (kill(2)) was delivered to them (i.e. returning to userspace).
Thanks yes, I would expect freeze to behave like signal, and if one wants to block immediately there is the LSM override return. The selftest attached tries to do exactly that.
What kind of signals (also kill?) are you talking about for illustration?
Could be security signals, reading sensitive files or related to any operation management, for X reasons this user session should be freezed or killed.
The kill is an effective defense against fork-bombs as an example.
Planned users of this feature are: tetragon and systemd when freezing a cgroup hierarchy that could be a K8s pod, container, system service or a user session.
It sounds like the signals are related to a particular process. If so what is it good for to freeze unrelated processes in the same cgroup?
Today some container/pod operations are performed at bpf level, having the freeze and kill available is straightforward to perform this.
I think those answers better clarify why this is needed.
Alright will add those in v2.
As for the generalization to any cgroup attribute (or kernfs). Can this be compared with sysctls -- I see there are helpers to intercept user writes but no helpers to affect sysctl values without an outer writer. What would justify different approaches between kernfs attributes and sysctls (direct writes vs modified writes)?
For generalizing this, haven't thought about it that much. First use case is to try to get freeze and possibly kill support, and use a common interface as requested.
Thank you!
Thanks, Michal
Hi.
On Tue, Apr 02, 2024 at 07:20:45PM +0100, Djalal Harouni tixxdz@gmail.com wrote:
Thanks yes, I would expect freeze to behave like signal, and if one wants to block immediately there is the LSM override return. The selftest attached tries to do exactly that.
Are you refering to this part:
int BPF_PROG(lsm_freeze_cgroup, int cmd, union bpf_attr *attr, unsigned int size) ... ret = bpf_task_freeze_cgroup(task, 1); if (!ret) { ret = -EPERM; /* reset for next call */ ?
Could be security signals, reading sensitive files or related to any operation management, for X reasons this user session should be freezed or killed.
What can be done with a frozen cgroup after anything of that happens? Anything besides killing anyway?
Killing of an offending process could be caught by its supervisor (like container runtime or systemd) and propagated accordingly to the whole cgroup.
The kill is an effective defense against fork-bombs as an example.
There are several ways how to prevent fork-bombs in kernel already, it looks like a contrived example.
Today some container/pod operations are performed at bpf level, having the freeze and kill available is straightforward to perform this.
It seems to me like an extra step when the same operation can be done from (the managing) userspace already.
For generalizing this, haven't thought about it that much. First use case is to try to get freeze and possibly kill support, and use a common interface as requested.
BTW, I notice that there is bpf_sys_bpf() helper that allows calling an arbitrary syscall. Wouldn't that be sufficient for everything?
(Based on how I still understand the problem: either you must respond immediately and then the direct return from LSM is appropriate or timing is not sensitive but you want act on whole cgroup.)
Thanks, Michal
On 4/9/24 8:32 AM, Michal Koutný wrote:
Hi.
On Tue, Apr 02, 2024 at 07:20:45PM +0100, Djalal Harouni tixxdz@gmail.com wrote:
Thanks yes, I would expect freeze to behave like signal, and if one wants to block immediately there is the LSM override return. The selftest attached tries to do exactly that.
Are you refering to this part:
int BPF_PROG(lsm_freeze_cgroup, int cmd, union bpf_attr *attr, unsigned int size) ... ret = bpf_task_freeze_cgroup(task, 1); if (!ret) { ret = -EPERM; /* reset for next call */ ?
Could be security signals, reading sensitive files or related to any operation management, for X reasons this user session should be freezed or killed.
What can be done with a frozen cgroup after anything of that happens? Anything besides killing anyway?
Killing of an offending process could be caught by its supervisor (like container runtime or systemd) and propagated accordingly to the whole cgroup.
The kill is an effective defense against fork-bombs as an example.
There are several ways how to prevent fork-bombs in kernel already, it looks like a contrived example.
Today some container/pod operations are performed at bpf level, having the freeze and kill available is straightforward to perform this.
It seems to me like an extra step when the same operation can be done from (the managing) userspace already.
For generalizing this, haven't thought about it that much. First use case is to try to get freeze and possibly kill support, and use a common interface as requested.
BTW, I notice that there is bpf_sys_bpf() helper that allows calling an arbitrary syscall. Wouldn't that be sufficient for everything?
This is not true. Currently, only 'bpf' and 'close' syscalls are supported.
static const struct bpf_func_proto * syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { switch (func_id) { case BPF_FUNC_sys_bpf: return !bpf_token_capable(prog->aux->token, CAP_PERFMON) ? NULL : &bpf_sys_bpf_proto; case BPF_FUNC_btf_find_by_name_kind: return &bpf_btf_find_by_name_kind_proto; case BPF_FUNC_sys_close: return &bpf_sys_close_proto; case BPF_FUNC_kallsyms_lookup_name: return &bpf_kallsyms_lookup_name_proto; default: return tracing_prog_func_proto(func_id, prog); } }
More syscalls can be added (through kfunc) if there is a use case for that.
(Based on how I still understand the problem: either you must respond immediately and then the direct return from LSM is appropriate or timing is not sensitive but you want act on whole cgroup.)
Thanks, Michal
On Wed, Apr 10, 2024 at 05:26:18PM -0700, Yonghong Song yonghong.song@linux.dev wrote:
This is not true.
Oh, I misunderstood a manpage, I can see now it's not for any syscall.
More syscalls can be added (through kfunc) if there is a use case for that.
Thus, I don't want to open this up.
Michal
On Tue, Apr 9, 2024 at 5:32 PM Michal Koutný mkoutny@suse.com wrote:
Hi.
On Tue, Apr 02, 2024 at 07:20:45PM +0100, Djalal Harouni tixxdz@gmail.com wrote:
Thanks yes, I would expect freeze to behave like signal, and if one wants to block immediately there is the LSM override return. The selftest attached tries to do exactly that.
Are you refering to this part:
int BPF_PROG(lsm_freeze_cgroup, int cmd, union bpf_attr *attr, unsigned int size) ... ret = bpf_task_freeze_cgroup(task, 1); if (!ret) { ret = -EPERM; /* reset for next call */
?
Yes.
Could be security signals, reading sensitive files or related to any operation management, for X reasons this user session should be freezed or killed.
What can be done with a frozen cgroup after anything of that happens? Anything besides killing anyway?
Some users would like to inspect.
Killing of an offending process could be caught by its supervisor (like container runtime or systemd) and propagated accordingly to the whole cgroup.
Most bpf technologies do not run as a supervisor.
The kill is an effective defense against fork-bombs as an example.
There are several ways how to prevent fork-bombs in kernel already, it looks like a contrived example.
I doubt if they are as effective, flexible and reflect today's workflow as the cgroup way.
Thanks
linux-kselftest-mirror@lists.linaro.org