This patch series adds support for writing cgroup interfaces from BPF.

It is useful to freeze a cgroup hierarchy on suspicious activity for a more thorough analysis before killing it. Planned users of this feature are systemd and BPF tools, where the cgroup hierarchy could be a system service, user session, k8s pod or a container.

The writing happens via kernfs nodes, and the cgroup must be on the default hierarchy. This implements the requests and feedback from v1 [1]: we now use a unified path for cgroup user space and BPF writing.

So I want to validate that this is the right approach first.
Todo:
* Limit size of data to be written.
* Further tests.
* Add cgroup kill support.
# RFC v1 -> v2
* Implemented Alexei's and Tejun's requests [1].
* Unified path where user space and BPF writes end up taking a kernfs_node directly, with an example on the "cgroup.freeze" interface.
[1] https://lore.kernel.org/bpf/20240327225334.58474-1-tixxdz@gmail.com/
Djalal Harouni (3):
  kernfs: cgroup: support writing cgroup interfaces from a kernfs node
  bpf: cgroup: Add BPF Kfunc to write cgroup interfaces
  selftests/bpf: add selftest for bpf_cgroup_write_interface
 include/linux/cgroup.h                                      |   3 ++
 kernel/bpf/helpers.c                                        |  45 +++++
 kernel/cgroup/cgroup.c                                      | 102 +++++++
 tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c | 172 ++++++++++++
 tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c | 155 ++++++++++
 5 files changed, 471 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c
Freezing a task's cgroup from BPF is better than doing it from user space, which could be too late and is subject to races. To achieve this, allow writing to cgroup core interfaces from BPF by adding a new kfunc helper that takes a kernfs node directly.

Currently only writing to "cgroup.freeze" on the default hierarchy is allowed. The write goes directly via a kernfs_node, which allows sharing the same path as when a kernfs node is opened from userspace.
Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
---
 include/linux/cgroup.h |   3 ++
 kernel/cgroup/cgroup.c | 102 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 99 insertions(+), 6 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b18fb5fcb38e..03a0782c94bf 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -125,6 +125,9 @@ int cgroup_rm_cftypes(struct cftype *cfts);
 void cgroup_file_notify(struct cgroup_file *cfile);
 void cgroup_file_show(struct cgroup_file *cfile, bool show);
 
+ssize_t cgroup_kn_interface_write(struct kernfs_node *kn, const char *name__str,
+                                  const char *buf, size_t nbytes, loff_t off);
+
 int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry);
 int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
                      struct pid *pid, struct task_struct *tsk);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..cddd7c1d354d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -229,6 +229,24 @@ static struct file_system_type cgroup2_fs_type;
 static struct cftype cgroup_base_files[];
 static struct cftype cgroup_psi_files[];
 
+struct cgroup_kn_cftype {
+        char name[MAX_CFTYPE_NAME];
+        unsigned int namelen;
+
+        /*
+         * write() is the write operation on a kernfs node.
+         */
+        ssize_t (*write)(struct kernfs_node *kn, const char *buf, size_t nbytes,
+                         loff_t off, bool revalidate);
+};
+
+#define CGROUP_PREFIX "cgroup."
+#define CGROUP_CORE_INTERFACE_FREEZE_SUFFIX "freeze"
+#define CGROUP_CORE_INTERFACE_FREEZE (CGROUP_PREFIX CGROUP_CORE_INTERFACE_FREEZE_SUFFIX)
+#define CGROUP_CORE_INTERFACE_FREEZE_LEN (sizeof(CGROUP_CORE_INTERFACE_FREEZE) - 1)
+
+static struct cgroup_kn_cftype kn_cfts[];
+
 /* cgroup optional features */
 enum cgroup_opt_features {
 #ifdef CONFIG_PSI
@@ -4030,29 +4048,58 @@ static int cgroup_freeze_show(struct seq_file *seq, void *v)
         return 0;
 }
 
-static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
-                                   char *buf, size_t nbytes, loff_t off)
+static bool cgroup_kn_revalidate(struct cgroup *cgrp)
+{
+        if (!cgroup_on_dfl(cgrp) || !cgroup_parent(cgrp))
+                return false;
+
+        return true;
+}
+
+static ssize_t cgroup_kn_freeze(struct kernfs_node *kn,
+                                const char *buf, size_t nbytes, loff_t off,
+                                bool revalidate)
 {
         struct cgroup *cgrp;
         ssize_t ret;
         int freeze;
+        char b[4] = {0};
+
+        /* Handle userspace writes +(0|1)\n and fail otherwise */
+        ret = strscpy(b, buf, sizeof(b));
+        if (ret < 0)
+                return ret;
 
-        ret = kstrtoint(strstrip(buf), 0, &freeze);
+        nbytes = ret;
+        ret = kstrtoint(strstrip(b), 0, &freeze);
         if (ret)
                 return ret;
 
         if (freeze < 0 || freeze > 1)
                 return -ERANGE;
 
-        cgrp = cgroup_kn_lock_live(of->kn, false);
+        cgrp = cgroup_kn_lock_live(kn, false);
         if (!cgrp)
                 return -ENOENT;
 
+        if (revalidate && !cgroup_kn_revalidate(cgrp)) {
+                ret = -EOPNOTSUPP;
+                goto out;
+        }
+
         cgroup_freeze(cgrp, freeze);
 
-        cgroup_kn_unlock(of->kn);
+        ret = nbytes;
 
-        return nbytes;
+out:
+        cgroup_kn_unlock(kn);
+        return ret;
+}
+
+static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
+                                   char *buf, size_t nbytes, loff_t off)
+{
+        return cgroup_kn_freeze(of->kn, buf, nbytes, off, false);
 }
 
 static void __cgroup_kill(struct cgroup *cgrp)
@@ -4601,6 +4648,49 @@ void cgroup_file_show(struct cgroup_file *cfile, bool show)
         kernfs_put(kn);
 }
 
+static struct cgroup_kn_cftype kn_cfts[] = {
+        {
+                .name = CGROUP_CORE_INTERFACE_FREEZE,
+                .namelen = CGROUP_CORE_INTERFACE_FREEZE_LEN,
+                .write = cgroup_kn_freeze,
+        },
+        { },
+};
+
+static const struct cgroup_kn_cftype *cgroup_kn_cft(const char *name__str)
+{
+        struct cgroup_kn_cftype *kn_cft;
+
+        for (kn_cft = kn_cfts; kn_cft && kn_cft->name[0] != '\0'; kn_cft++) {
+                if (!strncmp(name__str, kn_cft->name, kn_cft->namelen))
+                        return kn_cft;
+        }
+
+        return ERR_PTR(-EOPNOTSUPP);
+}
+
+ssize_t cgroup_kn_interface_write(struct kernfs_node *kn, const char *name__str,
+                                  const char *buf, size_t nbytes, loff_t off)
+{
+        const struct cgroup_kn_cftype *kn_cft;
+
+        /* empty, do not remove */
+        if (!nbytes)
+                return 0;
+
+        if (kernfs_type(kn) != KERNFS_DIR)
+                return -ENOTDIR;
+
+        kn_cft = cgroup_kn_cft(name__str);
+        if (IS_ERR(kn_cft))
+                return PTR_ERR(kn_cft);
+
+        if (unlikely(!kn_cft->write))
+                return -EOPNOTSUPP;
+
+        return kn_cft->write(kn, buf, nbytes, off, true);
+}
+
 /**
  * css_next_child - find the next child of a given css
  * @pos: the current position (%NULL to initiate traversal)
Add a bpf_cgroup_write_interface() kfunc that writes to a cgroup interface. It takes a cgroup on the default hierarchy as argument and writes to the specified interface file of that cgroup.

Freezing a task's cgroup from BPF is better than doing it from user space, which could be too late and is subject to races. Hence, add support for writing to the "cgroup.freeze" interface using the mentioned BPF kfunc.

Planned users of this feature are systemd and BPF tools. Taking the freezing example, we could freeze a cgroup hierarchy on suspicious activity for a more thorough analysis. The cgroup hierarchies could be system services, user sessions, K8s pods or containers.
Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
---
 kernel/bpf/helpers.c | 45 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 6b4877e85a68..5efc1bc57db9 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2605,6 +2605,50 @@ bpf_task_get_cgroup1(struct task_struct *task, int hierarchy_id)
                 return NULL;
         return cgrp;
 }
+
+#define BPF_CGROUP_MAX_WRITE ((1UL << 24) - 1)
+
+/**
+ * bpf_cgroup_write_interface - Writes to a cgroup interface file.
+ * @cgrp: The target cgroup
+ * @name__str: name of the cgroup core interface file
+ * @value_p: value to write
+ * @off: offset
+ *
+ * Return: number of bytes written on success, a negative value on error.
+ */
+__bpf_kfunc int
+bpf_cgroup_write_interface(struct cgroup *cgrp, const char *name__str,
+                           const struct bpf_dynptr *value_p, loff_t off)
+{
+        struct bpf_dynptr_kern *value_ptr = (struct bpf_dynptr_kern *)value_p;
+        struct kernfs_node *kn;
+        const void *value;
+        u32 value_len;
+        int ret;
+
+        value_len = __bpf_dynptr_size(value_ptr);
+        if (!value_len)
+                return 0;
+
+        if (value_len > BPF_CGROUP_MAX_WRITE)
+                return -E2BIG;
+
+        value = __bpf_dynptr_data(value_ptr, value_len);
+        if (!value)
+                return -EINVAL;
+
+        rcu_read_lock();
+        kn = cgrp->kn;
+        rcu_read_unlock();
+
+        kernfs_get(kn);
+        ret = cgroup_kn_interface_write(kn, name__str, value, value_len, off);
+        kernfs_put(kn);
+
+        return ret;
+}
+
 #endif /* CONFIG_CGROUPS */
 
 /**
@@ -3736,6 +3780,7 @@ BTF_ID_FLAGS(func, bpf_cgroup_ancestor, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU)
 BTF_ID_FLAGS(func, bpf_task_get_cgroup1, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_cgroup_write_interface, KF_TRUSTED_ARGS | KF_SLEEPABLE)
 #endif
 BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_task_from_vpid, KF_ACQUIRE | KF_RET_NULL)
This adds a selftest for the bpf_cgroup_write_interface kfunc. The test works by forking a child, then:

1. Child:
   - Migrates to a new cgroup.
   - Loads the BPF programs.
   - Triggers the 'lsm_freeze_cgroup' BPF program so it freezes itself.

     <- waits for parent to unthaw

   - On unthaw it continues, forks another process and triggers the 'tp_newchild' BPF program to set some monitored pids of the new process, asserting that user space resumed correctly.

2. Parent:
   - Keeps reading the 'cgroup.freeze' file of the child cgroup until it reads 1, which means the child cgroup is frozen.
   - Attaches the sample 'lsm_task_free' program, triggering the BPF program that unthaws the child task's cgroup.
   - Then waits for a clean exit of the child process.

The scenario allows testing both sides: freezing and unthawing a cgroup.
Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
---
 .../bpf/prog_tests/task_freeze_cgroup.c       | 172 ++++++++++++++++++
 .../bpf/progs/test_task_freeze_cgroup.c       | 155 ++++++++++++++++
 2 files changed, 327 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c
diff --git a/tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c b/tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c
new file mode 100644
index 000000000000..d4e9c0f32196
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c
@@ -0,0 +1,172 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <sys/syscall.h>
+#include <test_progs.h>
+#include <cgroup_helpers.h>
+#include <unistd.h>
+#include "test_task_freeze_cgroup.skel.h"
+
+#define CGROUP_PATH "/test-task-freeze-cgroup"
+
+static int bpf_sleepable(struct test_task_freeze_cgroup *skel)
+{
+        int err, cgroup_fd;
+        pid_t new_pid2;
+
+        cgroup_fd = cgroup_setup_and_join(CGROUP_PATH);
+        if (!ASSERT_OK(cgroup_fd < 0, "cgroup_setup_and_join"))
+                return -errno;
+
+        skel = test_task_freeze_cgroup__open();
+        if (!ASSERT_OK_PTR(skel, "test_task_freeze_cgroup__open")) {
+                err = -errno;
+                goto cleanup_cgroup;
+        }
+
+        skel->rodata->parent_pid = getppid();
+        skel->rodata->monitor_pid = getpid();
+        skel->rodata->cgid = get_cgroup_id(CGROUP_PATH);
+        skel->bss->new_pid = getpid();
+        skel->bss->freeze = 1;
+
+        err = test_task_freeze_cgroup__load(skel);
+        if (!ASSERT_OK(err, "test_task_freeze_cgroup__load")) {
+                err = -errno;
+                goto cleanup_skel;
+        }
+
+        /* First attach the LSM Program that is triggered on bpf() calls
+         * especially on TP_BTF programs when attached.
+         */
+        skel->links.lsm_freeze_cgroup =
+                bpf_program__attach_lsm(skel->progs.lsm_freeze_cgroup);
+        if (!ASSERT_OK_PTR(skel->links.lsm_freeze_cgroup, "attach_lsm")) {
+                err = -errno;
+                goto cleanup_detach;
+        }
+
+        /* Attaching this must fail with -EPERM and freeze current task */
+        skel->links.tp_newchild =
+                bpf_program__attach_trace(skel->progs.tp_newchild);
+        if (!ASSERT_EQ(errno, EPERM, "attach_trace() must fail here")) {
+                err = -EINVAL;
+                goto cleanup_detach;
+        }
+
+        /* Continue */
+
+        /* Attach again now with success */
+        skel->links.tp_newchild =
+                bpf_program__attach_trace(skel->progs.tp_newchild);
+        if (!ASSERT_OK_PTR(skel->links.tp_newchild, "attach_trace")) {
+                err = -EINVAL;
+                goto cleanup_detach;
+        }
+
+        /* Fork, update vars from BPF and assert the unfrozen state */
+        new_pid2 = fork();
+        if (new_pid2 == 0)
+                exit(0);
+
+        err = (new_pid2 == -1);
+        if (ASSERT_OK(err, "fork process"))
+                wait(NULL);
+
+        /* Now assert that new_pid2 reflects this new child */
+        ASSERT_NEQ(0, skel->bss->new_pid,
+                   "test task_freeze_cgroup failed at new_pid != 0");
+        ASSERT_NEQ(skel->rodata->monitor_pid, skel->bss->new_pid,
+                   "test task_freeze_cgroup failed at old monitor_pid != new_pid");
+        /* Assert that bpf sets new_pid to new forked child new_pid2 */
+        ASSERT_EQ(skel->bss->new_pid, new_pid2,
+                  "test task_freeze_cgroup failed first child new_pid == new_pid2");
+
+cleanup_detach:
+        test_task_freeze_cgroup__detach(skel);
+cleanup_skel:
+        test_task_freeze_cgroup__destroy(skel);
+cleanup_cgroup:
+        close(cgroup_fd);
+        cleanup_cgroup_environment();
+        return err;
+}
+
+void test_task_freeze_cgroup(void)
+{
+        pid_t pid, result;
+        char buf[512] = {0};
+        char path[PATH_MAX] = {0};
+        int ret, status, attempts, frozen = 0, fd;
+        struct test_task_freeze_cgroup *skel = NULL;
+
+        pid = fork();
+        ret = (pid == -1);
+        if (!ASSERT_OK(ret, "fork process"))
+                return;
+
+        if (pid == 0) {
+                ret = bpf_sleepable(skel);
+                ASSERT_EQ(0, ret, "child bpf_sleepable failed");
+                exit(ret);
+        }
+
+        skel = test_task_freeze_cgroup__open();
+        if (!ASSERT_OK_PTR(skel, "test_task_freeze_cgroup__open"))
+                goto out;
+
+        snprintf(path, sizeof(path),
+                 "/sys/fs/cgroup/cgroup-test-work-dir%d%s/cgroup.freeze",
+                 pid, CGROUP_PATH);
+
+        for (attempts = 10; attempts >= 0; attempts--) {
+                ret = 0;
+
+                fd = open(path, O_RDONLY);
+                if (fd > 0)
+                        ret = read(fd, buf, sizeof(buf) - 1);
+                if (ret > 0) {
+                        errno = 0;
+                        frozen = strtol(buf, NULL, 10);
+                        if (errno)
+                                frozen = 0;
+                }
+
+                close(fd);
+                if (frozen)
+                        break;
+                sleep(1);
+        }
+
+        /* Assert that child cgroup is frozen */
+        if (!ASSERT_EQ(1, frozen, "child cgroup not frozen"))
+                goto out;
+
+        ret = test_task_freeze_cgroup__load(skel);
+        if (!ASSERT_OK(ret, "test_task_freeze_cgroup__load"))
+                goto out;
+
+        /* Trigger the unthaw child cgroup from parent */
+        skel->links.lsm_task_free =
+                bpf_program__attach_lsm(skel->progs.lsm_task_free);
+        if (!ASSERT_OK_PTR(skel->links.lsm_task_free, "attach_lsm"))
+                goto out;
+
+        result = waitpid(pid, &status, WUNTRACED);
+        if (!ASSERT_NEQ(result, -1, "waitpid"))
+                goto detach;
+
+        result = WIFEXITED(status);
+        if (!ASSERT_EQ(result, 1, "forked process did not terminate normally"))
+                goto detach;
+
+        result = WEXITSTATUS(status);
+        if (!ASSERT_EQ(result, 0, "forked process did not exit successfully"))
+                goto detach;
+
+detach:
+        test_task_freeze_cgroup__detach(skel);
+
+out:
+        if (skel)
+                test_task_freeze_cgroup__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c b/tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c
new file mode 100644
index 000000000000..07b4b65abc36
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <linux/types.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <errno.h>
+#include "bpf_kfuncs.h"
+#include "bpf_misc.h"
+
+struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
+long bpf_task_under_cgroup(struct task_struct *task, struct cgroup *ancestor) __ksym;
+void bpf_cgroup_release(struct cgroup *p) __ksym;
+struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
+struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+extern int bpf_cgroup_write_interface(struct cgroup *cgrp,
+                                      const char *name__str,
+                                      const struct bpf_dynptr *value_p,
+                                      loff_t off) __ksym __weak;
+
+char freeze_val[] = "1";
+char unthaw_val[] = "0";
+
+const volatile int parent_pid;
+const volatile int monitor_pid;
+const volatile __u64 cgid;
+int new_pid;
+int freeze;
+
+SEC("tp_btf/task_newtask")
+int BPF_PROG(tp_newchild, struct task_struct *task, u64 clone_flags)
+{
+        struct cgroup *cgrp = NULL;
+        struct task_struct *acquired;
+
+        if (monitor_pid != (bpf_get_current_pid_tgid() >> 32))
+                return 0;
+
+        acquired = bpf_task_acquire(task);
+        if (!acquired)
+                return 0;
+
+        cgrp = bpf_cgroup_from_id(cgid);
+        if (!cgrp)
+                goto out;
+
+        /* Update new_pid with current pid */
+        if (bpf_task_under_cgroup(acquired, cgrp))
+                new_pid = acquired->tgid;
+
+out:
+        if (cgrp)
+                bpf_cgroup_release(cgrp);
+        bpf_task_release(acquired);
+
+        return 0;
+}
+
+/* Try to attach from parent to trigger the bpf lsm hook, so from
+ * parent context we unthaw child cgroup.
+ */
+SEC("lsm/task_free")
+int BPF_PROG(lsm_task_free, struct task_struct *task)
+{
+        return 0;
+}
+
+static int process_freeze_cgroup(int pid, int freeze)
+{
+        int ret = 0;
+        struct task_struct *task;
+        struct bpf_dynptr dyn_ptr;
+        struct cgroup *cgrp = NULL;
+
+        task = bpf_task_from_pid(pid);
+        if (!task)
+                return -EINVAL;
+
+        cgrp = bpf_cgroup_from_id(cgid);
+        if (!cgrp) {
+                ret = -EINVAL;
+                goto out;
+        }
+
+        if (!bpf_task_under_cgroup(task, cgrp))
+                goto out;
+
+        if (freeze)
+                bpf_dynptr_from_mem(freeze_val, sizeof(freeze_val), 0, &dyn_ptr);
+        else
+                bpf_dynptr_from_mem(unthaw_val, sizeof(unthaw_val), 0, &dyn_ptr);
+
+        ret = bpf_cgroup_write_interface(cgrp, "cgroup.freeze", &dyn_ptr, 0);
+
+out:
+        if (cgrp)
+                bpf_cgroup_release(cgrp);
+        bpf_task_release(task);
+        return ret;
+}
+
+SEC("lsm.s/bpf")
+int BPF_PROG(lsm_freeze_cgroup, int cmd, union bpf_attr *attr, unsigned int size)
+{
+        int ret = 0;
+        struct task_struct *task;
+        struct cgroup *cgrp = NULL;
+
+        if (cmd != BPF_LINK_CREATE)
+                return 0;
+
+        task = bpf_get_current_task_btf();
+        if (parent_pid == task->pid) {
+                /* Parent context: unthaw child */
+                process_freeze_cgroup(monitor_pid, 0);
+                return 0;
+        }
+
+        /* Nothing todo */
+        if (!freeze)
+                return 0;
+
+        /* Child context */
+        if (monitor_pid != task->pid)
+                return 0;
+
+        /* Ensure we are under the corresponding cgroup so we freeze
+         * current child from its context
+         */
+        cgrp = bpf_cgroup_from_id(cgid);
+        if (!cgrp)
+                return 0;
+
+        if (!bpf_task_under_cgroup(task, cgrp))
+                goto out;
+
+        /* Schedule freeze task and return -EPERM */
+        ret = process_freeze_cgroup(monitor_pid, freeze);
+
+        /* On error or 0 we return zero and we catch at
+         * user space if the cgroup was not frozen.
+         */
+        ret = (ret > 0) ? -EPERM : 0;
+
+        /* Reset for next calls */
+        freeze = 0;
+out:
+        if (cgrp)
+                bpf_cgroup_release(cgrp);
+        return ret;
+}
+
+char _license[] SEC("license") = "GPL";
On Mon, Aug 18, 2025 at 10:04:21AM +0100, Djalal Harouni wrote:
> This patch series add support to write cgroup interfaces from BPF.
>
> It is useful to freeze a cgroup hierarchy on suspicious activity for a more thorough analysis before killing it. Planned users of this feature are: systemd and BPF tools where the cgroup hierarchy could be a system service, user session, k8s pod or a container.
>
> The writing happens via kernfs nodes and the cgroup must be on the default hierarchy. It implements the requests and feedback from v1 [1] where now we use a unified path for cgroup user space and BPF writing.
>
> So I want to validate that this is the right approach first.
I don't see any reason to object to the feature, but the way it's constructed seems rather odd to me. If it's going to need per-feature code, might as well bypass the write part and implement a simpler interface, i.e. bpf_cgroup_freeze(). Otherwise, can't it actually write to kernfs files so that we don't need to add code per enabled feature?
Thanks.
On 8/18/25 18:32, Tejun Heo wrote:
> On Mon, Aug 18, 2025 at 10:04:21AM +0100, Djalal Harouni wrote:
>> This patch series add support to write cgroup interfaces from BPF.
>>
>> It is useful to freeze a cgroup hierarchy on suspicious activity for a more thorough analysis before killing it. Planned users of this feature are: systemd and BPF tools where the cgroup hierarchy could be a system service, user session, k8s pod or a container.
>>
>> The writing happens via kernfs nodes and the cgroup must be on the default hierarchy. It implements the requests and feedback from v1 [1] where now we use a unified path for cgroup user space and BPF writing.
>>
>> So I want to validate that this is the right approach first.
>
> I don't see any reason to object to the feature but the way it's constructed seems rather odd to me. If it's going to need per-feature code, might as well bypass the write part and implement a simpler interface - ie. bpf_cgroup_freeze().
Approach 1: The first RFC months ago was something like that, a "bpf_task_freeze_cgroup" [1]. I can make it bpf_cgroup_freeze() as a proper kfunc, so resurrect approach 1?

Internally it used an ugly path to work around the kernfs active reference, since we don't hold a kernfs_open_file as we would when coming from the userspace kernfs ->write path.

I can improve it, but please let's first discuss approach (2) since you suggested it ;-)
Approach 2: Per the old suggestions from you and Alexei [2] [3] you wanted something like:
s32 bpf_kernfs_knob_write(struct kernfs_node *dir, const char *knob, char *buf);
I didn't make it generic for kernfs, since I don't know yet about sysfs use cases, and named it "bpf_cgroup_write_interface" to focus on cgroup base interfaces. Doing something that generic now, including sysfs, without proper valid use cases seems a bit too much. Also, we have some cgroup kfuncs to acquire and release that integrate well, so I kept it focused.

Alexei suggested to refactor cgroup_base_files[] [4][5] to take a "kernfs_node" as argument instead of a "kernfs_open_file", which will open other possibilities for BPF.

However, instead of going for a full change of cgroup_base_files[], I added a minimalist cgroup_kn_cftype kn_cfts[] that for now holds only "cgroup.freeze".
I see three possibilities here:
A. Minimal change with the approach presented here: add a dedicated array, cgroup_kn_cftype kn_cfts[], with only "cgroup.freeze", and later try to unify it inside cgroup_base_files[].

B. Add a "->bpf_write()" handler to cgroup_base_files[] and start only with "cgroup.freeze".

C. Refactor all of cgroup_base_files[] to take a kernfs_node directly instead of a kernfs_open_file, as suggested.
I took (C) the simple one since I wanted to do cgroup freeze first. You also suggested maybe "cgroup.kill" in the future; well, if we have it, we will definitely start using it. Not sure if we are allowed to BPF-sleep in that path; however, we can also start doing "cgroup.freeze" from BPF and kill from user space as a first step. But we definitely want more BPF operations on cgroup interfaces; I can think of a companion cgroup where we migrate tasks on specific signals...
So more or less current proposed approach (2) followed the suggestions, but focused only on writing cgroup kernfs knobs.
Thoughts, did I miss something?
> Otherwise, can't it actually write to kernfs files so that we don't need to add code per enabled feature?
I'm not sure how we would write to kernfs files? As pointed out by Alexei [6], it is more involved if we want to open files...

About "that we don't need to add code per enabled feature?": well, if we go the path of "bpf_cgroup_write_interface" or "bpf_cgroup_write_knob", adding a new write interface will theoretically involve:

1. Check if the program can sleep and/or the calling context.
2. Add the new "cgroup.x" either in cgroup_base_files[] or in the new array.
3. New handlers, or refactor old ones to take a "kernfs_node" instead of a "kernfs_open_file".

Compared to having multiple bpf_cgroup_(freeze|kill|...) kfuncs, this seems fair too, and not that much code on the BPF side.
BTW, after testing, the current patches contain a bug. In normal writes from user context we break the kernfs active protection to avoid nesting cgroup locking under it. Since the BPF path does not grab the kernfs_node active protection the way the userspace kernfs write path does, there is no need to break it. In the next series I will drop that part and add a parameter check, like the revalidate one, that verifies the cgroup is on the default hierarchy and is not the root; things that are automatically handled in the normal userspace ->write path.
Thank you!
[1] https://lore.kernel.org/bpf/20240327225334.58474-3-tixxdz@gmail.com/
[2] https://lore.kernel.org/bpf/ZgXMww9kJiKi4Vmd@slm.duckdns.org/
[3] https://lore.kernel.org/bpf/Zgc1BZnYCS9OSSTw@slm.duckdns.org/
[4] https://lore.kernel.org/bpf/CAADnVQK970_Nx3918V41ue031RkGs+WsteOAm6EJOY7oSwz...
[5] https://lore.kernel.org/bpf/CAADnVQ+WmaPG1WOaSDbjxNPVzVape_JfG_CNSRy188ni076...
[6] https://lore.kernel.org/bpf/CAADnVQLhWDcX-7XCdo-W=jthU=9iPqODwrE6c9fvU8sfAJ5...
Thanks.
On 8/20/25 00:31, Djalal Harouni wrote:
> On 8/18/25 18:32, Tejun Heo wrote:
>> On Mon, Aug 18, 2025 at 10:04:21AM +0100, Djalal Harouni wrote:
>>> This patch series add support to write cgroup interfaces from BPF.
>>>
>>> It is useful to freeze a cgroup hierarchy on suspicious activity for a more thorough analysis before killing it. Planned users of this feature are: systemd and BPF tools where the cgroup hierarchy could be a system service, user session, k8s pod or a container.
>>>
>>> The writing happens via kernfs nodes and the cgroup must be on the default hierarchy. It implements the requests and feedback from v1 [1] where now we use a unified path for cgroup user space and BPF writing.
>>>
>>> So I want to validate that this is the right approach first.
>>
>> I don't see any reason to object to the feature but the way it's constructed seems rather odd to me. If it's going to need per-feature code, might as well bypass the write part and implement a simpler interface - ie. bpf_cgroup_freeze().
> Approach 1: First RFC months ago was something like that "bpf_task_freeze_cgroup" [1], can make it bpf_cgroup_freeze() as a proper kfunc, so resurrect approach 1?
>
> Internally it used an ugly path to workaround kernfs active reference since we don't hold a kernfs_open_file coming from userspace kernfs->write path.
>
> I can improve it, but let's discuss please approach (2) since you suggested it ;-)
>
> Approach 2: Per the old suggestions from you and Alexei [2] [3] you wanted something like:
>
> s32 bpf_kernfs_knob_write(struct kernfs_node *dir, const char *knob, char *buf);
>
> I didn't make it generic for kernfs, since don't know yet about sysfs use cases and named it "bpf_cgroup_write_interface" to focus on cgroup base interfaces. Doing something that generic now including sysfs without a proper valid use cases seems a bit too much. Also we have some cgroup kfunc to acquire and release that integrate well, so I kept it focused.
>
> Alexei suggested to refactor the cgroup_base_file[] [4][5] to take "kernfs_node" as argument instead of "kernfs_open_file", which will open other possibilities for BPF.
>
> However, instead of going full change on cgroup_base_files[], I added a minimalist: cgroup_kn_cftype kn_cfts[] that for now hold only "cgroup.freeze".
>
> I see three possibilities here:
>
> A. Minimal change with approach presented here: add dedicated array cgroup_kn_cftype kn_cfts[] with only "cgroup.freeze" and later try to unify it inside cgroup_base_file[].
> B. Add "->bpf_write()" handler to cgroup_base_file[] and start only with "cgroup.freeze".
> C. Refactor all cgroup_base_file[] to take a kernfs_node directly instead of kernfs_open_file as suggested.
>
> I took (C) the simple one since I wanted to do cgroup freeze first. You
Sorry here I meant (A).
Thanks.
> also suggested maybe in future "cgroup.kill" well if we have it, we definitely will start using it. Not sure if we are allowed to BPF sleep that path, however we can also start doing "cgroup.freeze" from BPF and kill from user space as a first step. But we definitly want more BPF operations on cgroup interfaces, I can think of a companion cgroup where we migrate tasks there on specific signals...
>
> So more or less current proposed approach (2) followed the suggestions, but focused only on writing cgroup kernfs knobs.
>
> Thoughts, did I miss something?
>
>> Otherwise, can't it actually write to kernfs files so that we don't need to add code per enabled feature?
>
> I'm not sure how we would write to kernfs files? As pointed by Alexei [6] it is more involved if we want to open files...
>
> About "that we don't need to add code per enabled feature?" well if we go the path of "bpf_cgroup_write_interface" or "bpf_cgroup_write_knob" adding a new write interface will involve theoretically:
>
> 1. Check if program can sleep or/and the calling context.
> 2. Add the new "cgroup.x" either in cgroup_base_file[] or in the new array.
> 3. New handlers or refactor old ones to take a "kernfs_node" instead of "kernfs_open_file".
>
> Compared to having multiple bpf_cgroup_(freeze|kill|...) kfunc seems fair too, and not that much code from BPF part.
>
> BTW current patches contain a bug, after testing. In normal writes from user context we break kernfs active protection to avoid nesting cgroup locking under. Forget this part. In next series since we don't grab the kernfs_node active protection like userspace kernfs_write, then no need to break it... will add a parameter to check like the revalidate one that checks things is cgroup on dfl and not root, things that are automatically handled from normal userspace ->write.
>
> Thank you!
>
> [1] https://lore.kernel.org/bpf/20240327225334.58474-3-tixxdz@gmail.com/
> [2] https://lore.kernel.org/bpf/ZgXMww9kJiKi4Vmd@slm.duckdns.org/
> [3] https://lore.kernel.org/bpf/Zgc1BZnYCS9OSSTw@slm.duckdns.org/
> [4] https://lore.kernel.org/bpf/CAADnVQK970_Nx3918V41ue031RkGs+WsteOAm6EJOY7oSwzS1A@mail.gmail.com/
> [5] https://lore.kernel.org/bpf/CAADnVQ+WmaPG1WOaSDbjxNPVzVape_JfG_CNSRy188ni076Mog@mail.gmail.com/
> [6] https://lore.kernel.org/bpf/CAADnVQLhWDcX-7XCdo-W=jthU=9iPqODwrE6c9fvU8sfAJ5ARg@mail.gmail.com/
Thanks.
Hello,
On Wed, Aug 20, 2025 at 12:31:01AM +0100, Djalal Harouni wrote:
...
> Approach 1: First RFC months ago was something like that "bpf_task_freeze_cgroup" [1], can make it bpf_cgroup_freeze() as a proper kfunc, so resurrect approach 1?
Thanks for reminding me. I often feel like my memory is a ring buffer which lasts a few weeks at most.
> Internally it used an ugly path to workaround kernfs active reference since we don't hold a kernfs_open_file coming from userspace kernfs->write path.
>
> I can improve it, but let's discuss please approach (2) since you suggested it ;-)
>
> Approach 2: Per the old suggestions from you and Alexei [2] [3] you wanted something like:
>
> s32 bpf_kernfs_knob_write(struct kernfs_node *dir, const char *knob, char *buf);
>
> I didn't make it generic for kernfs, since don't know yet about sysfs use cases and named it "bpf_cgroup_write_interface" to focus on cgroup base interfaces. Doing something that generic now including sysfs without a proper valid use cases seems a bit too much. Also we have some cgroup kfunc to acquire and release that integrate well, so I kept it focused.
>
> Alexei suggested to refactor the cgroup_base_file[] [4][5] to take "kernfs_node" as argument instead of "kernfs_open_file", which will open other possibilities for BPF.
>
> However, instead of going full change on cgroup_base_files[], I added a minimalist: cgroup_kn_cftype kn_cfts[] that for now hold only "cgroup.freeze".
I think there's some misunderstanding here. IIUC, Alexei didn't want to expose direct file interface because none of the necessary abstractions are available and the effort becomes too big and wide. However, I don't think the suggestion to use @kn as the path designator (or @cgroup for that matter) means that we want to pipe that all the way down to each op that's to be written to. That'd be rather pointless - why add generic interface if each needs a dedicated backend anyway? Can't you make kernfs to create open_file from kn and follow the usual write path?
Thanks.
On 8/20/25 02:14, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 20, 2025 at 12:31:01AM +0100, Djalal Harouni wrote:
> ...
>> Approach 1: First RFC months ago was something like that "bpf_task_freeze_cgroup" [1], can make it bpf_cgroup_freeze() as a proper kfunc, so resurrect approach 1?
>
> Thanks for reminding me. I often feel like my memory is a ring buffer which lasts a few weeks at most.
yeah, been a while...
Internally it used an ugly path to work around the kernfs active reference, since we don't hold a kernfs_open_file coming from the userspace kernfs->write path.
I can improve it, but please let's discuss approach (2) first, since you suggested it ;-)
Approach 2: Per the old suggestions from you and Alexei [2] [3] you wanted something like:
s32 bpf_kernfs_knob_write(struct kernfs_node *dir, const char *knob, char *buf);
I didn't make it generic for kernfs, since I don't yet know of sysfs use cases, and I named it "bpf_cgroup_write_interface" to focus on the cgroup base interfaces. Doing something that generic now, including sysfs, without proper valid use cases seems a bit too much. Also, we have some cgroup kfuncs to acquire and release cgroups that integrate well, so I kept it focused.
Alexei suggested refactoring cgroup_base_files[] [4][5] to take a "kernfs_node" as argument instead of a "kernfs_open_file", which would open other possibilities for BPF.
However, instead of fully converting cgroup_base_files[], I added a minimal cgroup_kn_cftype kn_cfts[] table that for now holds only "cgroup.freeze".
I think there's some misunderstanding here. IIUC, Alexei didn't want to expose a direct file interface because none of the necessary abstractions are available and the effort becomes too big and wide. However, I don't think the suggestion to use @kn as the path designator (or @cgroup for that matter) means that we want to pipe that all the way down to each op that's to be written to. That'd be rather pointless - why add a generic interface if each needs a dedicated backend anyway?
I took time to check...
I wanted precisely to avoid creating a kernfs_open_file, since we don't have the corresponding context. Besides the information about the user who opened the file, kernfs_open_file assumes a corresponding open file, and a corresponding kernfs_open_node for each kernfs_node with one or more open files.
Also, it seems each kernfs_open_file is added back into kernfs_open_node->files...
Can't you make kernfs create an open_file from kn and follow the usual write path?
Seems tricky; it would also need to handle kernfs_global_locks->open_file_mutex, compared with writing directly to the dedicated cgroup backend. The only protection there, besides cgroup_lock(), is breaking the kernfs active protection so kernfs nodes can be drained later...
cgroup_kn_lock_live() grabs a ref on the cgroup, breaks the active protection, takes cgroup_lock() and ensures the cgroup is not dead; from there we don't dereference kn until we restore the protection. Plus we have the kernfs_get(cgrp->kn) in cgroup_mkdir() which ensures it is always accessible.
I do realize taking the same usual write path is the obvious thing, but we don't have the corresponding open context, and faking it seems more trouble than calling the cgroup backends directly...
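To make the locking pattern described above concrete, here is a simplified sketch of how the cgroup_kn_lock_live()/cgroup_kn_unlock() pairing is used in kernel/cgroup/cgroup.c; only the example_write function name is invented, the helpers are existing cgroup internals.

```c
/* Simplified sketch of the locking pattern; example_write is made up. */
static ssize_t example_write(struct kernfs_node *kn, char *buf)
{
	struct cgroup *cgrp;

	/* Pins the cgroup, breaks the kernfs active protection, takes
	 * cgroup_mutex, and returns NULL if the cgroup is dead. */
	cgrp = cgroup_kn_lock_live(kn, false);
	if (!cgrp)
		return -ENOENT;

	/* ... safe to operate on cgrp; kn must not be dereferenced
	 * here, since its active protection is broken ... */

	/* Drops cgroup_mutex and the ref, restores active protection. */
	cgroup_kn_unlock(kn);
	return 0;
}
```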
Please allow me to redo it directly on cgroup_base_files[], assuming that was Alexei's suggestion, and see how it looks.
Also Tejun, could you please point me to any extra cgroup or kernfs tests you run? Much appreciated!
Thanks!
Thanks.
Hello,
On Fri, Aug 22, 2025 at 07:16:15PM +0100, Djalal Harouni wrote: ...
I do realize taking the same usual write path is the obvious thing, but we don't have the corresponding open context, and faking it seems more trouble than calling the cgroup backends directly...
Please allow me to redo it directly on cgroup_base_files[], assuming that was Alexei's suggestion, and see how it looks.
I'm probably missing something, but what prevents you from getting a dentry from the kernfs_node, calling vfs_open() on it, and then doing vfs_write() on the returned file?
If there are some fundamental reasons that we can't do something like that, let's go back to the simple approach where we just have bpf helpers for freezing and unfreezing cgroups outside of fs interface.
Also Tejun, could you please point me to any extra cgroup or kernfs tests you run? Much appreciated!
I'm afraid there isn't much outside what's in the selftest directory.
Thanks.
On Mon, Aug 25, 2025 at 11:48 AM Tejun Heo tj@kernel.org wrote:
Hello,
On Fri, Aug 22, 2025 at 07:16:15PM +0100, Djalal Harouni wrote: ...
I do realize taking the same usual write path is the obvious thing, but we don't have the corresponding open context, and faking it seems more trouble than calling the cgroup backends directly...
Please allow me to redo it directly on cgroup_base_files[], assuming that was Alexei's suggestion, and see how it looks.
It's been 1.5 years since v1. It's safe to assume that all opinions have changed, including mine.
I'm probably missing something, but what prevents you from getting a dentry from the kernfs_node, calling vfs_open() on it, and then doing vfs_write() on the returned file?
Generic vfs ops from a kfunc feel like a can of worms. It would require a lot more thinking and coordination with the vfs folks. I'd rather keep things simple, especially since this thread might continue into 2026.
If there are some fundamental reasons that we can't do something like that, let's go back to the simple approach where we just have bpf helpers for freezing and unfreezing cgroups outside of fs interface.
I'd just do whatever version of cgroup_lock*() is necessary from the kfunc, followed by cgroup_freeze(), and limit the kfunc to sleepable progs, of course.
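A rough sketch of what Alexei describes could look like the following. The kfunc name and the exact checks are assumptions for illustration; cgroup_lock()/cgroup_unlock(), cgroup_is_dead() and cgroup_freeze() (kernel/cgroup/freezer.c) are existing cgroup internals.

```c
/* Sketch only: a dedicated sleepable kfunc, bypassing the fs
 * interface entirely. Name and error handling are illustrative. */
__bpf_kfunc int bpf_cgroup_freeze(struct cgroup *cgrp, bool freeze)
{
	int ret = 0;

	cgroup_lock();
	if (cgroup_is_dead(cgrp))
		ret = -ENODEV;
	else
		cgroup_freeze(cgrp, freeze);
	cgroup_unlock();

	return ret;
}
```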
Hello,
On 8/25/25 19:48, Tejun Heo wrote:
Hello,
On Fri, Aug 22, 2025 at 07:16:15PM +0100, Djalal Harouni wrote: ...
I do realize taking the same usual write path is the obvious thing, but we don't have the corresponding open context, and faking it seems more trouble than calling the cgroup backends directly...
Please allow me to redo it directly on cgroup_base_files[], assuming that was Alexei's suggestion, and see how it looks.
I'm probably missing something but what prevents you from getting a dentry from kernfs_node and then calling vfs_open() on it and then do vfs_write() on the returned file?
If we include the open path then we don't have the right context. As a first example, vfs_open() will use the current cred context to perform permission checks, which is the wrong one: current could have dropped privileges while the cgroup hierarchy is still root owned...
The thing here is that the bpf program will be called from arbitrary paths, not a single pre-defined path/function where we could control the context...
If there are some fundamental reasons that we can't do something like that, let's go back to the simple approach where we just have bpf helpers for freezing and unfreezing cgroups outside of fs interface.
Alright, it seems Alexei also agrees on this. Thanks, I will prepare another version.
Also Tejun, could you please point me to any extra cgroup or kernfs tests you run? Much appreciated!
I'm afraid there isn't much outside what's in the selftest directory.
Ok, thank you!
Thanks.
Hi Djalal.
On Mon, Aug 18, 2025 at 10:04:21AM +0100, Djalal Harouni tixxdz@gmail.com wrote:
This patch series adds support for writing cgroup interfaces from BPF.
It is useful to freeze a cgroup hierarchy on suspicious activity, for a more thorough analysis before killing it. Planned users of this feature are systemd and BPF tools, where the cgroup hierarchy could be a system service, user session, k8s pod or a container.
Could you please give a more specific example of the "suspicious activity"? The last time (v1) it was referring to LSM hooks, where such an asynchronous approach wasn't ideal. Also, why couldn't all these tools execute the cgroup actions themselves through the traditional userspace API?
One more point (for possible interference with lifecycles) -- what is the relation between the cgroup in which the BPF code "runs" and the cgroup that's the target of the operation? (I hope this isn't supposed to run from BPF without process context.)
Todo:
- Limit size of data to be written.
- Further tests.
- Add cgroup kill support.
I'm missing the retrieval of the freeze result in this plan :) cgroup kill would be simpler for a PoC (and maybe even sufficient for your use case?).
Regards, Michal
Hi Michal,
On 8/26/25 15:18, Michal Koutný wrote:
Hi Djalal.
On Mon, Aug 18, 2025 at 10:04:21AM +0100, Djalal Harouni tixxdz@gmail.com wrote:
This patch series adds support for writing cgroup interfaces from BPF.
It is useful to freeze a cgroup hierarchy on suspicious activity, for a more thorough analysis before killing it. Planned users of this feature are systemd and BPF tools, where the cgroup hierarchy could be a system service, user session, k8s pod or a container.
Could you please give a more specific example of the "suspicious activity"? The last time (v1) it was referring to LSM hooks, where such an asynchronous approach wasn't ideal.
It solves the case perfectly: you detect something, you fail the security hook with -EPERM, and optionally freeze the cgroup and snapshot the runtime state.
Oh, I thought the attached example was an obvious one: customers want to restrict bpf() usage per cgroup for a specific container/pod, so when we detect a bpf() call that's not from an allowed cgroup, we fail it and freeze the cgroup.
Take this and build on top: detect bash/shell exec or any other newly dropped binaries, and fail and freeze the exec early at the linux_binprm object checks.
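The bpf() restriction scenario described above could be sketched as a sleepable BPF LSM program roughly like this. The allowed_cgroups map, its policy, and the bpf_cgroup_write_interface() call are assumptions for illustration (the kfunc name is from this series, but its exact signature is not confirmed here); walking task->cgroups->dfl_cgrp is the usual BTF pointer walk.

```c
/* Illustrative BPF LSM sketch; map, policy, and kfunc usage are
 * assumptions, not code from the actual series. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Hypothetical allow-list of cgroup IDs permitted to call bpf(). */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 128);
	__type(key, u64);
	__type(value, u8);
} allowed_cgroups SEC(".maps");

SEC("lsm.s/bpf")
int BPF_PROG(deny_and_freeze, int cmd, union bpf_attr *attr,
	     unsigned int size)
{
	struct task_struct *task = bpf_get_current_task_btf();
	struct cgroup *cgrp = task->cgroups->dfl_cgrp;
	u64 cgid = cgrp->kn->id;

	if (bpf_map_lookup_elem(&allowed_cgroups, &cgid))
		return 0;

	/* Freeze the offending hierarchy, then fail the hook. */
	bpf_cgroup_write_interface(cgrp->kn, "cgroup.freeze", "1");
	return -EPERM;
}

char LICENSE[] SEC("license") = "GPL";
```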
Also, why couldn't all these tools execute the cgroup actions themselves through the traditional userspace API?
- Freezing from BPF is obviously better: less racy, since you don't need access to the corresponding cgroup fs and namespace. Not all tools run as the supervisor/container manager.
- bpf_send_signal() is in some cases not enough; what if you race with a task clone, as an example? However, freezing the cgroup hierarchy, or the one above it, is a catch-all...
One more point (for possible interference with lifecycles) -- what is the relation between the cgroup in which the BPF code "runs" and the cgroup that's the target of the operation? (I hope this isn't supposed to run from BPF without process context.)
The feature is supposed to be used by sleepable BPF programs; I don't think we need extra checks here?
It could be that this BPF code runs in a process that is under pod-x/container-y/cgroup-z/, and maybe you want to freeze "cgroup-z" or "container-y" and so on... or in the case of delegated hierarchies, freezing the parent is a catch-all.
Todo:
- Limit size of data to be written.
- Further tests.
- Add cgroup kill support.
I'm missing the retrieval of freeze result in this plan :) cgroup kill
Indeed you are right, a small kfunc to read it back, yes ;)!
would be simpler for PoC (and maybe even sufficient for your use case?).
I think both are useful cases.
Thank you!
Regards, Michal
On Wed, Aug 27, 2025 at 12:27:08AM +0100, Djalal Harouni tixxdz@gmail.com wrote:
It solves the case perfectly: you detect something, you fail the security hook with -EPERM, and optionally freeze the cgroup and snapshot the runtime state.
So -EPERM is the right way to cut off such tasks.
Oh, I thought the attached example was an obvious one: customers want to restrict bpf() usage per cgroup for a specific container/pod, so when we detect a bpf() call that's not from an allowed cgroup, we fail it and freeze the cgroup.
Take this and build on top: detect bash/shell exec or any other newly dropped binaries, and fail and freeze the exec early at the linux_binprm object checks.
Or if you want to do some followup analysis, the process can be killed and coredumped (at least seccomp allows this; it'd be good to have such a possibility with LSMs if there isn't one (I'm not that familiar)). Freezing the groups sounds like a way to DoS the system (not only because of hanging the faulty process itself, but possibly spreading via IPC dependencies to unrelated processes).
Also why couldn't all these tools execute the cgroup actions themselves through traditional userspace API?
- Freezing from BPF is obviously better: less racy, since you don't need access to the corresponding cgroup fs and namespace. Not all tools run as the supervisor/container manager.
Less race or more race -- I know the race window size may vary, but strictly speaking, there either is a race or there isn't (depending on having proper synchronization or not). (And when intentionally misbehaving processes are considered, even a tiny window is a potential risk.)
- bpf_send_signal() is in some cases not enough; what if you race with a task clone, as an example? However, freezing the cgroup hierarchy, or the one above it, is a catch-all...
Yeah, this might be the part that I don't internalize well. If you're running the hook in a particular task's process context, it cannot do clone at the same time. If they are independent tasks, there's no ordering, so there's always the possibility of the race (so why not embrace it and do whatever is possible with userspace monitoring the audit log or similar, and respond based on that).
The feature is supposed to be used by sleepable BPF programs; I don't think we need extra checks here?
Good.
It could be that this BPF code runs in a process that is under pod-x/container-y/cgroup-z/, and maybe you want to freeze "cgroup-z" or "container-y" and so on... or in the case of delegated hierarchies, freezing the parent is a catch-all.
OK, this would be good. Could it also be pod-x/container-y/cgroup-z2?
---
I acknowledge that sooner or later some kind of access to cgroups through BPF will be added; I'd prefer it be done in a generic way (so that it doesn't become cgroup's problem but someone else's, e.g. VFS's or kernfs's ;-)). I can even imagine some usefulness of helpers for selected specific cgroup (core) operations (which is the direction brought up in the other discussion), I just don't think it solves the problem as you present it.
HTH, Michal