I'd like to cut down the memory usage of parsing vmlinux BTF in ebpf-go. With some upcoming changes the library is sitting at 5MiB for a parse. Most of that memory is simply copying the BTF blob into user space. By allowing vmlinux BTF to be mmapped read-only into user space I can cut memory usage by about 75%.
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
---
Changes in v4:
- Go back to remap_pfn_range for aarch64 compat
- Dropped btf_new_no_copy (Andrii)
- Fixed nits in selftests (Andrii)
- Clearer error handling in the mmap handler (Andrii)
- Fixed build on s390
- Link to v3: https://lore.kernel.org/r/20250505-vmlinux-mmap-v3-0-5d53afa060e8@isovalent....

Changes in v3:
- Remove slightly confusing calculation of trailing (Alexei)
- Use vm_insert_page (Alexei)
- Simplified libbpf code
- Link to v2: https://lore.kernel.org/r/20250502-vmlinux-mmap-v2-0-95c271434519@isovalent....

Changes in v2:
- Use btf__new in selftest
- Avoid vm_iomap_memory in btf_vmlinux_mmap
- Add VM_DONTDUMP
- Add support to libbpf
- Link to v1: https://lore.kernel.org/r/20250501-vmlinux-mmap-v1-0-aa2724572598@isovalent....
---
Lorenz Bauer (3):
      btf: allow mmap of vmlinux btf
      selftests: bpf: add a test for mmapable vmlinux BTF
      libbpf: Use mmap to parse vmlinux BTF from sysfs

 include/asm-generic/vmlinux.lds.h                  |  3 +-
 kernel/bpf/sysfs_btf.c                             | 32 ++++++++
 tools/lib/bpf/btf.c                                | 85 ++++++++++++++++++----
 tools/testing/selftests/bpf/prog_tests/btf_sysfs.c | 81 +++++++++++++++++++++
 4 files changed, 184 insertions(+), 17 deletions(-)
---
base-commit: 7220eabff8cb4af3b93cd021aa853b9f5df2923f
change-id: 20250501-vmlinux-mmap-2ec5563c3ef1
Best regards,
User space needs access to kernel BTF for many modern features of BPF. Right now each process needs to read the BTF blob either in pieces or as a whole. Allow mmapping the sysfs file so that processes can directly access the memory allocated for it in the kernel.
remap_pfn_range is used instead of vm_insert_page due to aarch64 compatibility issues.
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
---
 include/asm-generic/vmlinux.lds.h |  3 ++-
 kernel/bpf/sysfs_btf.c            | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 58a635a6d5bdf0c53c267c2a3d21a5ed8678ce73..1750390735fac7637cc4d2fa05f96cb2a36aa448 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -667,10 +667,11 @@ defined(CONFIG_AUTOFDO_CLANG) || defined(CONFIG_PROPELLER_CLANG)
  */
 #ifdef CONFIG_DEBUG_INFO_BTF
 #define BTF                                                     \
+       . = ALIGN(PAGE_SIZE);                                    \
        .BTF : AT(ADDR(.BTF) - LOAD_OFFSET) {                    \
                BOUNDED_SECTION_BY(.BTF, _BTF)                   \
        }                                                        \
-       . = ALIGN(4);                                            \
+       . = ALIGN(PAGE_SIZE);                                    \
        .BTF_ids : AT(ADDR(.BTF_ids) - LOAD_OFFSET) {            \
                *(.BTF_ids)                                      \
        }
diff --git a/kernel/bpf/sysfs_btf.c b/kernel/bpf/sysfs_btf.c
index 81d6cf90584a7157929c50f62a5c6862e7a3d081..941d0d2427e3a2d27e8f1cff7b6424d0d41817c1 100644
--- a/kernel/bpf/sysfs_btf.c
+++ b/kernel/bpf/sysfs_btf.c
@@ -7,14 +7,46 @@
 #include <linux/kobject.h>
 #include <linux/init.h>
 #include <linux/sysfs.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+#include <linux/btf.h>
 
 /* See scripts/link-vmlinux.sh, gen_btf() func for details */
 extern char __start_BTF[];
 extern char __stop_BTF[];
+static int btf_sysfs_vmlinux_mmap(struct file *filp, struct kobject *kobj,
+                                 const struct bin_attribute *attr,
+                                 struct vm_area_struct *vma)
+{
+       unsigned long pages = PAGE_ALIGN(attr->size) >> PAGE_SHIFT;
+       size_t vm_size = vma->vm_end - vma->vm_start;
+       phys_addr_t addr = virt_to_phys(__start_BTF);
+       unsigned long pfn = addr >> PAGE_SHIFT;
+
+       if (attr->private != __start_BTF || !PAGE_ALIGNED(addr))
+               return -EINVAL;
+
+       if (vma->vm_pgoff)
+               return -EINVAL;
+
+       if (vma->vm_flags & (VM_WRITE | VM_EXEC | VM_MAYSHARE))
+               return -EACCES;
+
+       if (pfn + pages < pfn)
+               return -EINVAL;
+
+       if ((vm_size >> PAGE_SHIFT) > pages)
+               return -EINVAL;
+
+       vm_flags_mod(vma, VM_DONTDUMP, VM_MAYEXEC | VM_MAYWRITE);
+       return remap_pfn_range(vma, vma->vm_start, pfn, vm_size, vma->vm_page_prot);
+}
+
 static struct bin_attribute bin_attr_btf_vmlinux __ro_after_init = {
        .attr = { .name = "vmlinux", .mode = 0444, },
        .read_new = sysfs_bin_attr_simple_read,
+       .mmap = btf_sysfs_vmlinux_mmap,
 };
struct kobject *btf_kobj;
Add a basic test for the ability to mmap /sys/kernel/btf/vmlinux. Ensure that the data is valid BTF and that it is padded with zero.
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
---
 tools/testing/selftests/bpf/prog_tests/btf_sysfs.c | 81 ++++++++++++++++++++++
 1 file changed, 81 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/btf_sysfs.c b/tools/testing/selftests/bpf/prog_tests/btf_sysfs.c
new file mode 100644
index 0000000000000000000000000000000000000000..3923e64c4c1d0f1dfeef2a39c7bbab7c9a19f0ca
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/btf_sysfs.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+/* Copyright (c) 2025 Isovalent */
+
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+static void test_btf_mmap_sysfs(const char *path, struct btf *base)
+{
+       struct stat st;
+       __u64 btf_size, end;
+       void *raw_data = NULL;
+       int fd = -1;
+       long page_size;
+       struct btf *btf = NULL;
+
+       page_size = sysconf(_SC_PAGESIZE);
+       if (!ASSERT_GE(page_size, 0, "get_page_size"))
+               goto cleanup;
+
+       if (!ASSERT_OK(stat(path, &st), "stat_btf"))
+               goto cleanup;
+
+       btf_size = st.st_size;
+       end = (btf_size + page_size - 1) / page_size * page_size;
+
+       fd = open(path, O_RDONLY);
+       if (!ASSERT_GE(fd, 0, "open_btf"))
+               goto cleanup;
+
+       raw_data = mmap(NULL, btf_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
+       if (!ASSERT_EQ(raw_data, MAP_FAILED, "mmap_btf_writable"))
+               goto cleanup;
+
+       raw_data = mmap(NULL, btf_size, PROT_READ, MAP_SHARED, fd, 0);
+       if (!ASSERT_EQ(raw_data, MAP_FAILED, "mmap_btf_shared"))
+               goto cleanup;
+
+       raw_data = mmap(NULL, end + 1, PROT_READ, MAP_PRIVATE, fd, 0);
+       if (!ASSERT_EQ(raw_data, MAP_FAILED, "mmap_btf_invalid_size"))
+               goto cleanup;
+
+       raw_data = mmap(NULL, end, PROT_READ, MAP_PRIVATE, fd, 0);
+       if (!ASSERT_OK_PTR(raw_data, "mmap_btf"))
+               goto cleanup;
+
+       if (!ASSERT_EQ(mprotect(raw_data, btf_size, PROT_READ | PROT_WRITE), -1,
+                      "mprotect_writable"))
+               goto cleanup;
+
+       if (!ASSERT_EQ(mprotect(raw_data, btf_size, PROT_READ | PROT_EXEC), -1,
+                      "mprotect_executable"))
+               goto cleanup;
+
+       /* Check padding is zeroed */
+       for (int i = btf_size; i < end; i++) {
+               if (((__u8 *)raw_data)[i] != 0) {
+                       PRINT_FAIL("tail of BTF is not zero at page offset %d\n", i);
+                       goto cleanup;
+               }
+       }
+
+       btf = btf__new_split(raw_data, btf_size, base);
+       if (!ASSERT_OK_PTR(btf, "parse_btf"))
+               goto cleanup;
+
+cleanup:
+       btf__free(btf);
+       if (raw_data && raw_data != MAP_FAILED)
+               munmap(raw_data, btf_size);
+       if (fd >= 0)
+               close(fd);
+}
+
+void test_btf_sysfs(void)
+{
+       test_btf_mmap_sysfs("/sys/kernel/btf/vmlinux", NULL);
+}
Teach libbpf to use mmap when parsing vmlinux BTF from /sys. We don't apply this to fall-back paths on the regular file system because there is no way to ensure that modifications to the file underlying a MAP_PRIVATE mapping are not visible to the process.
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
---
 tools/lib/bpf/btf.c | 85 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 69 insertions(+), 16 deletions(-)
diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index f18d7e6a453cd9e5c384487659df04f7efafdf5a..42815a29c0a52a1a7eed2c6b22b9b1754ae01c9a 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -12,6 +12,7 @@
 #include <sys/utsname.h>
 #include <sys/param.h>
 #include <sys/stat.h>
+#include <sys/mman.h>
 #include <linux/kernel.h>
 #include <linux/err.h>
 #include <linux/btf.h>
@@ -120,6 +121,9 @@ struct btf {
        /* whether base_btf should be freed in btf_free for this instance */
        bool owns_base;
 
+       /* whether raw_data is a (read-only) mmap */
+       bool raw_data_is_mmap;
+
        /* BTF object FD, if loaded into kernel */
        int fd;
 
@@ -951,6 +955,17 @@ static bool btf_is_modifiable(const struct btf *btf)
        return (void *)btf->hdr != btf->raw_data;
 }
 
+static void btf_free_raw_data(struct btf *btf)
+{
+       if (btf->raw_data_is_mmap) {
+               munmap(btf->raw_data, btf->raw_size);
+               btf->raw_data_is_mmap = false;
+       } else {
+               free(btf->raw_data);
+       }
+       btf->raw_data = NULL;
+}
+
 void btf__free(struct btf *btf)
 {
        if (IS_ERR_OR_NULL(btf))
@@ -970,7 +985,7 @@ void btf__free(struct btf *btf)
                free(btf->types_data);
                strset__free(btf->strs_set);
        }
-       free(btf->raw_data);
+       btf_free_raw_data(btf);
        free(btf->raw_data_swapped);
        free(btf->type_offs);
        if (btf->owns_base)
@@ -1030,7 +1045,7 @@ struct btf *btf__new_empty_split(struct btf *base_btf)
        return libbpf_ptr(btf_new_empty(base_btf));
 }
-static struct btf *btf_new(const void *data, __u32 size, struct btf *base_btf)
+static struct btf *btf_new(const void *data, __u32 size, struct btf *base_btf, bool is_mmap)
 {
        struct btf *btf;
        int err;
@@ -1050,12 +1065,18 @@ static struct btf *btf_new(const void *data, __u32 size, struct btf *base_btf)
                btf->start_str_off = base_btf->hdr->str_len;
        }
 
-       btf->raw_data = malloc(size);
-       if (!btf->raw_data) {
-               err = -ENOMEM;
-               goto done;
+       if (is_mmap) {
+               btf->raw_data = (void *)data;
+               btf->raw_data_is_mmap = true;
+       } else {
+               btf->raw_data = malloc(size);
+               if (!btf->raw_data) {
+                       err = -ENOMEM;
+                       goto done;
+               }
+               memcpy(btf->raw_data, data, size);
        }
-       memcpy(btf->raw_data, data, size);
+
        btf->raw_size = size;
 
        btf->hdr = btf->raw_data;
@@ -1083,12 +1104,12 @@ static struct btf *btf_new(const void *data, __u32 size, struct btf *base_btf)
 
 struct btf *btf__new(const void *data, __u32 size)
 {
-       return libbpf_ptr(btf_new(data, size, NULL));
+       return libbpf_ptr(btf_new(data, size, NULL, false));
 }
 
 struct btf *btf__new_split(const void *data, __u32 size, struct btf *base_btf)
 {
-       return libbpf_ptr(btf_new(data, size, base_btf));
+       return libbpf_ptr(btf_new(data, size, base_btf, false));
 }
 struct btf_elf_secs {
@@ -1209,7 +1230,7 @@ static struct btf *btf_parse_elf(const char *path, struct btf *base_btf,
 
        if (secs.btf_base_data) {
                dist_base_btf = btf_new(secs.btf_base_data->d_buf, secs.btf_base_data->d_size,
-                                       NULL);
+                                       NULL, false);
                if (IS_ERR(dist_base_btf)) {
                        err = PTR_ERR(dist_base_btf);
                        dist_base_btf = NULL;
@@ -1218,7 +1239,7 @@ static struct btf *btf_parse_elf(const char *path, struct btf *base_btf,
        }
 
        btf = btf_new(secs.btf_data->d_buf, secs.btf_data->d_size,
-                     dist_base_btf ?: base_btf);
+                     dist_base_btf ?: base_btf, false);
        if (IS_ERR(btf)) {
                err = PTR_ERR(btf);
                goto done;
@@ -1335,7 +1356,7 @@ static struct btf *btf_parse_raw(const char *path, struct btf *base_btf)
        }
 
        /* finally parse BTF data */
-       btf = btf_new(data, sz, base_btf);
+       btf = btf_new(data, sz, base_btf, false);
 
 err_out:
        free(data);
@@ -1354,6 +1375,36 @@ struct btf *btf__parse_raw_split(const char *path, struct btf *base_btf)
        return libbpf_ptr(btf_parse_raw(path, base_btf));
 }
+static struct btf *btf_parse_raw_mmap(const char *path, struct btf *base_btf)
+{
+       struct stat st;
+       void *data;
+       struct btf *btf;
+       int fd, err;
+
+       fd = open(path, O_RDONLY);
+       if (fd < 0)
+               return libbpf_err_ptr(-errno);
+
+       if (fstat(fd, &st) < 0) {
+               err = -errno;
+               close(fd);
+               return libbpf_err_ptr(err);
+       }
+
+       data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+       close(fd);
+
+       if (data == MAP_FAILED)
+               return NULL;
+
+       btf = btf_new(data, st.st_size, base_btf, true);
+       if (IS_ERR(btf))
+               munmap(data, st.st_size);
+
+       return btf;
+}
+
 static struct btf *btf_parse(const char *path, struct btf *base_btf, struct btf_ext **btf_ext)
 {
        struct btf *btf;
@@ -1618,7 +1669,7 @@ struct btf *btf_get_from_fd(int btf_fd, struct btf *base_btf)
                goto exit_free;
        }
 
-       btf = btf_new(ptr, btf_info.btf_size, base_btf);
+       btf = btf_new(ptr, btf_info.btf_size, base_btf, false);
 
 exit_free:
        free(ptr);
@@ -1659,8 +1710,7 @@ struct btf *btf__load_from_kernel_by_id(__u32 id)
 static void btf_invalidate_raw_data(struct btf *btf)
 {
        if (btf->raw_data) {
-               free(btf->raw_data);
-               btf->raw_data = NULL;
+               btf_free_raw_data(btf);
        }
        if (btf->raw_data_swapped) {
                free(btf->raw_data_swapped);
@@ -5331,7 +5381,10 @@ struct btf *btf__load_vmlinux_btf(void)
                pr_warn("kernel BTF is missing at '%s', was CONFIG_DEBUG_INFO_BTF enabled?\n",
                        sysfs_btf_path);
        } else {
-               btf = btf__parse(sysfs_btf_path, NULL);
+               btf = btf_parse_raw_mmap(sysfs_btf_path, NULL);
+               if (IS_ERR_OR_NULL(btf))
+                       btf = btf__parse(sysfs_btf_path, NULL);
+
                if (!btf) {
                        err = -errno;
                        pr_warn("failed to read kernel BTF from '%s': %s\n",
On Sat, May 10, 2025 at 5:35 AM Lorenz Bauer lmb@isovalent.com wrote:
> Teach libbpf to use mmap when parsing vmlinux BTF from /sys. We don't
> apply this to fall-back paths on the regular file system because there
> is no way to ensure that modifications underlying the MAP_PRIVATE
> mapping are not visible to the process.
>
> Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
> ---
>  tools/lib/bpf/btf.c | 85 +++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 69 insertions(+), 16 deletions(-)
Almost there, see below. With those changes feel free to add my ack
Acked-by: Andrii Nakryiko andrii@kernel.org
> +static struct btf *btf_parse_raw_mmap(const char *path, struct btf *base_btf)
> +{
> +	struct stat st;
> +	void *data;
> +	struct btf *btf;
> +	int fd, err;
> +
> +	fd = open(path, O_RDONLY);
> +	if (fd < 0)
> +		return libbpf_err_ptr(-errno);
> +
> +	if (fstat(fd, &st) < 0) {
> +		err = -errno;
> +		close(fd);
> +		return libbpf_err_ptr(err);
> +	}
> +
> +	data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

err = -errno;

> +	close(fd);
> +
> +	if (data == MAP_FAILED)
> +		return NULL;

s/return NULL;/return libbpf_err_ptr(err);/
> +	btf = btf_new(data, st.st_size, base_btf, true);
> +	if (IS_ERR(btf))
> +		munmap(data, st.st_size);
> +
> +	return btf;
> +}
> +
>  static struct btf *btf_parse(const char *path, struct btf *base_btf, struct btf_ext **btf_ext)
>  {
>  	struct btf *btf;
> @@ -1618,7 +1669,7 @@ struct btf *btf_get_from_fd(int btf_fd, struct btf *base_btf)
>  		goto exit_free;
>  	}
> -	btf = btf_new(ptr, btf_info.btf_size, base_btf);
> +	btf = btf_new(ptr, btf_info.btf_size, base_btf, false);
>  exit_free:
>  	free(ptr);
> @@ -1659,8 +1710,7 @@ struct btf *btf__load_from_kernel_by_id(__u32 id)
>  static void btf_invalidate_raw_data(struct btf *btf)
>  {
>  	if (btf->raw_data) {
> -		free(btf->raw_data);
> -		btf->raw_data = NULL;
> +		btf_free_raw_data(btf);
>  	}

drop now unnecessary {} ?

>  	if (btf->raw_data_swapped) {
>  		free(btf->raw_data_swapped);
> @@ -5331,7 +5381,10 @@ struct btf *btf__load_vmlinux_btf(void)
>  		pr_warn("kernel BTF is missing at '%s', was CONFIG_DEBUG_INFO_BTF enabled?\n",
>  			sysfs_btf_path);
>  	} else {
> -		btf = btf__parse(sysfs_btf_path, NULL);
> +		btf = btf_parse_raw_mmap(sysfs_btf_path, NULL);
> +		if (IS_ERR_OR_NULL(btf))

just IS_ERR() with the fixes I pointed out above

> +			btf = btf__parse(sysfs_btf_path, NULL);
> +
>  		if (!btf) {
>  			err = -errno;
>  			pr_warn("failed to read kernel BTF from '%s': %s\n",
>
> --
> 2.49.0
> I'd like to cut down the memory usage of parsing vmlinux BTF in ebpf-go. With some upcoming changes the library is sitting at 5MiB for a parse. Most of that memory is simply copying the BTF blob into user space. By allowing vmlinux BTF to be mmapped read-only into user space I can cut memory usage by about 75%.
>
> Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
For the series,
Tested-by: Alan Maguire alan.maguire@oracle.com
Tested with 4k and 64k page size on aarch64; all worked perfectly. Thanks!
On Thu, May 15, 2025 at 08:51:45AM +0100, Alan Maguire wrote:
> > I'd like to cut down the memory usage of parsing vmlinux BTF in ebpf-go. With some upcoming changes the library is sitting at 5MiB for a parse. Most of that memory is simply copying the BTF blob into user space. By allowing vmlinux BTF to be mmapped read-only into user space I can cut memory usage by about 75%.
> >
> > Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
>
> For the series,
>
> Tested-by: Alan Maguire <alan.maguire@oracle.com>
>
> Tested with 4k and 64k page size on aarch64; all worked perfectly. Thanks!
Hi Alan,
Thanks for taking a look at this. I've been following your related effort to allow /sys/kernel/btf/vmlinux as a module in support of small systems with kernel-size constraints, and wondered how this series might affect that work? Such support would be well-received in the embedded space when it happens, so am keen to understand.
Thanks, Tony
> Hi Alan,
>
> Thanks for taking a look at this. I've been following your related effort to allow /sys/kernel/btf/vmlinux as a module in support of small systems with kernel-size constraints, and wondered how this series might affect that work? Such support would be well-received in the embedded space when it happens, so am keen to understand.
>
> Thanks, Tony
hi Tony
I had something nearly working a few months back but there are a bunch of complications that made it a bit trickier than I'd first anticipated. One challenge for example is that we want /sys/kernel/btf to behave just as it would if vmlinux BTF was not a module. My original hope was to just have the vmlinux BTF module forceload early, but the request module approach won't work since the vmlinux_btf.ko module would have to be part of the initrd image. A question for you on this - I presume that's what you want to avoid, right? So I'm assuming that we need to extract the .BTF section out of the vmlinu[xz] binary and out of initrd into a later-loading vmlinux_btf.ko module for small-footprint systems. Is that correct?
The reason I ask is having a later-loading vmlinux_btf.ko is a bit of a pain since we need to walk the set of kernel modules and load their BTF, relocate it and do kfunc registration. If we can simplify things via a shared module dependency on vmlinux_btf.ko that would be great, but I'd like to better understand the constraints from the small system perspective first. Thanks!
Alan
On Wed, May 21, 2025 at 8:00 AM Alan Maguire alan.maguire@oracle.com wrote:
> > Hi Alan,
> >
> > Thanks for taking a look at this. I've been following your related effort to allow /sys/kernel/btf/vmlinux as a module in support of small systems with kernel-size constraints, and wondered how this series might affect that work? Such support would be well-received in the embedded space when it happens, so am keen to understand.
> >
> > Thanks, Tony
>
> hi Tony
>
> I had something nearly working a few months back but there are a bunch of complications that made it a bit trickier than I'd first anticipated. One challenge for example is that we want /sys/kernel/btf to behave just as it would if vmlinux BTF was not a module. My original hope was to just have the vmlinux BTF module forceload early, but the request module approach won't work since the vmlinux_btf.ko module would have to be part of the initrd image. A question for you on this - I presume that's what you want to avoid, right? So I'm assuming that we need to extract the .BTF section out of the vmlinu[xz] binary and out of initrd into a later-loading vmlinux_btf.ko module for small-footprint systems. Is that correct?
>
> The reason I ask is having a later-loading vmlinux_btf.ko is a bit of a pain since we need to walk the set of kernel modules and load their BTF, relocate it and do kfunc registration. If we can simplify things via a shared module dependency on vmlinux_btf.ko that would be great, but I'd like to better understand the constraints from the small system perspective first. Thanks!
We cannot require other modules to depend on vmlinux_btf.ko. Some of them might load during boot, so adding the dependency would defeat the point of vmlinux_btf.ko. The only option I see is to let modules load and ignore their BTFs while vmlinux BTF is not present. Later vmlinux_btf.ko can be loaded, and modules loaded after that time will succeed in loading their BTFs too. So some modules will have their BTF and some won't. I don't think it's an issue.

If an admin loads a module with kfuncs and vmlinux_btf.ko is not loaded yet, the kfunc registration will fail, of course. It's an issue, but I don't think we need to fix it right now by messing with depmod.

The bigger issue is how to split vmlinux_btf.ko itself. The kernel has a bunch of kfuncs and they need BTF ids for protos and for all types they reference, so vmlinux BTF cannot be empty. minimize_btf() can probably help. So before we proceed with vmlinux_btf.ko we need data on how big the mandatory part of vmlinux BTF will be vs the rest of the BTF in vmlinux_btf.ko.
On Thu, May 22, 2025 at 6:04 PM Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
> We cannot require other modules to depend on vmlinux_btf.ko. Some of them might load during the boot. So adding to the dependency will defeat the point of vmlinux_btf.ko. The only option I see is to let modules load and ignore their BTFs while vmlinux BTF is not present. Later vmlinux_btf.ko can be loaded and modules loaded after that time will succeed in loading their BTFs too. So some modules will have their BTF and some won't. I don't think it's an issue.
>
> If an admin loads a module with kfuncs and vmlinux_btf.ko is not loaded yet the kfunc registration will fail, of course. It's an issue, but I don't think we need to fix it right now by messing with depmod.
>
> The bigger issue is how to split vmlinux_btf.ko itself. The kernel has a bunch of kfuncs and they need BTF ids for protos and for all types they reference, so vmlinux BTF cannot be empty. minimize_btf() can probably help. So before we proceed with vmlinux_btf.ko we need to see the data how big the mandatory part of vmlinux BTF will be vs the rest of BTF in vmlinux_btf.ko.
I think there is a way to avoid all these problems by switching kfunc registration to a lazy validation model. I'll explain what I mean.
1) vmlinux_btf.ko isn't loaded by default, but the kernel is aware that there is vmlinux BTF available, if necessary.

2) when user space tries to access /sys/kernel/btf/vmlinux, we automatically try to load vmlinux_btf.ko; similarly, if the kernel internally needs vmlinux BTF information, we provide that transparently through automatic loading of vmlinux_btf.ko.

3a) if a kernel module is loaded and it needs to register kfuncs, we allow that, but instead of eagerly validating the kfunc's associated BTF information for correctness, we just record the fact that there is a kfunc registered, with name ABC and associated BTF ID XYZ.

3b) when a user tries to verify a BPF program that needs to use kfunc ABC from that module, that's the time when we load vmlinux_btf.ko and validate the kfunc's BTF information for correctness. If that information is broken, report an error, maybe log to dmesg. If not, we are golden (and that's the expected outcome) and we proceed with verification just like today.
The key observation here is that with BTF there is no direct pointer involved. It's all just stable integer IDs, so it doesn't really matter whether we have instantiated BTF information at the kernel module loading time or not. We can always (later) access this data through BTF ID.
The biggest change is the handling of kernel modules with broken kfuncs. Right now we'll reject the load, because registration will fail. In the new lazy model, this will be delayed until the very first use of that kfunc. And if no one ever uses that kfunc, it, technically, doesn't matter. It's basically the same approach as with BPF CO-RE and dead code elimination in the verifier: if there is unknown/unsupported code, but it's guaranteed to never execute, it's OK from the verifier's POV.

I think that's an acceptable tradeoff, because it's really not an expected typical situation to have such a broken module. On the other hand, we don't need to complicate and extend BTF itself to accommodate this; it will all work as is and will keep working in the future.
P.S. And of course all this can/should be cached, so we don't redo all this validation, but that's just an optimization.