This patch set introduces the BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps. The need for a BPF_F_ALL_CPUS flag for percpu_array maps was discussed in the thread "[PATCH bpf-next v3 0/4] bpf: Introduce global percpu data"[1].
The goal of the BPF_F_ALL_CPUS flag is to reduce data caching overhead in light skeletons by allowing a single value to be reused to update the values on all CPUs. This avoids the M:N problem, where M cached values are needed to update a map on a kernel with N CPUs.
The BPF_F_CPU flag is accompanied by *flags*-embedded cpu info, which specifies the target CPU for the operation:
* For lookup operations: the flag, together with the embedded cpu info, allows querying the value on the specified CPU.
* For update operations: the flag, together with the embedded cpu info, allows updating the value for the specified CPU.
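For illustration, a minimal user-space sketch of the flag encoding (not part of the series; it assumes the new BPF_F_CPU/BPF_F_ALL_CPUS definitions are visible via the UAPI header, and that map_fd refers to an existing percpu_array map with a u32 value):

#include <bpf/bpf.h>
#include <linux/bpf.h>

static int percpu_flags_example(int map_fd)
{
	__u32 key = 0, value = 0xDEADC0DE;
	__u64 flags;
	int err;

	/* Update the value on every CPU from a single 4-byte buffer. */
	flags = BPF_F_ALL_CPUS;
	err = bpf_map_update_elem(map_fd, &key, &value, flags);
	if (err)
		return err;

	/* Read the value back only from CPU 3; the cpu sits in the high 32 bits. */
	flags = (__u64)3 << 32 | BPF_F_CPU;
	return bpf_map_lookup_elem_flags(map_fd, &key, &value, flags);
}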
Links: [1] https://lore.kernel.org/bpf/20250526162146.24429-1-leon.hwang@linux.dev/
Changes:
v10 -> v11:
* Support the combination of BPF_EXIST and BPF_F_CPU/BPF_F_ALL_CPUS for update operations.
* Fix unstable lru_percpu_hash map test by using the combination of BPF_EXIST and BPF_F_CPU/BPF_F_ALL_CPUS to avoid LRU eviction (reported by Alexei).

v9 -> v10:
* Add tests to verify array and hash maps do not support BPF_F_CPU and BPF_F_ALL_CPUS flags.
* Address comment from Andrii:
  * Copy map value using copy_map_value_long for percpu_cgroup_storage maps in a separate patch.

v8 -> v9:
* Change value type from u64 to u32 in selftests.
* Address comments from Andrii:
  * Keep value_size unaligned and update everywhere for consistency when cpu flags are specified.
  * Update value by getting pointer for percpu hash and percpu cgroup_storage maps.

v7 -> v8:
* Address comments from Andrii:
  * Check BPF_F_LOCK when updating percpu_array, percpu_hash and lru_percpu_hash maps.
  * Refactor flags check in __htab_map_lookup_and_delete_batch().
  * Keep value_size unaligned and copy value using copy_map_value() in __htab_map_lookup_and_delete_batch() when BPF_F_CPU is specified.
  * Update warn message in libbpf's validate_map_op().
  * Update comment of libbpf's bpf_map__lookup_elem().

v6 -> v7:
* Get correct value size for percpu_hash and lru_percpu_hash in update_batch API.
* Set 'count' as 'max_entries' in test cases for lookup_batch API.
* Address comment from Alexei:
  * Move cpu flags check into bpf_map_check_op_flags().

v5 -> v6:
* Move bpf_map_check_op_flags() from 'bpf.h' to 'syscall.c'.
* Address comments from Alexei:
  * Drop the refactoring code of data copying logic for percpu maps.
  * Drop bpf_map_check_op_flags() wrappers.

v4 -> v5:
* Address comments from Andrii:
  * Refactor data copying logic for all percpu maps.
  * Drop this_cpu_ptr() micro-optimization.
  * Drop cpu check in libbpf's validate_map_op().
  * Enhance bpf_map_check_op_flags() using *allowed flags* instead of 'extra_flags_mask'.

v3 -> v4:
* Address comments from Andrii:
  * Remove unnecessary map_type check in bpf_map_value_size().
  * Reduce code churn.
  * Remove unnecessary do_delete check in __htab_map_lookup_and_delete_batch().
  * Introduce bpf_percpu_copy_to_user() and bpf_percpu_copy_from_user().
  * Rename check_map_flags() to bpf_map_check_op_flags() with extra_flags_mask.
  * Add human-readable pr_warn() explanations in validate_map_op().
  * Use flags in bpf_map__delete_elem() and bpf_map__lookup_and_delete_elem().
  * Drop "for alignment reasons".
v3 link: https://lore.kernel.org/bpf/20250821160817.70285-1-leon.hwang@linux.dev/

v2 -> v3:
* Address comments from Alexei:
  * Use BPF_F_ALL_CPUS instead of BPF_ALL_CPUS magic.
  * Introduce these two cpu flags for all percpu maps.
* Address comments from Jiri:
  * Reduce some unnecessary u32 casts.
  * Refactor a more generic map flags check function.
  * Fix a code style issue.
v2 link: https://lore.kernel.org/bpf/20250805163017.17015-1-leon.hwang@linux.dev/

v1 -> v2:
* Address comments from Andrii:
  * Embed cpu info entirely in the high 32 bits of *flags*.
  * Use ERANGE instead of E2BIG.
  * Fix a few format issues.
Leon Hwang (8):
  bpf: Introduce internal bpf_map_check_op_flags helper function
  bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags
  bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_array maps
  bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_hash and lru_percpu_hash maps
  bpf: Copy map value using copy_map_value_long for percpu_cgroup_storage maps
  bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_cgroup_storage maps
  libbpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu maps
  selftests/bpf: Add cases to test BPF_F_CPU and BPF_F_ALL_CPUS flags
 include/linux/bpf-cgroup.h                    |   4 +-
 include/linux/bpf.h                           |  44 ++-
 include/uapi/linux/bpf.h                      |   2 +
 kernel/bpf/arraymap.c                         |  32 +-
 kernel/bpf/hashtab.c                          |  96 +++--
 kernel/bpf/local_storage.c                    |  27 +-
 kernel/bpf/syscall.c                          |  68 ++--
 tools/include/uapi/linux/bpf.h                |   2 +
 tools/lib/bpf/bpf.h                           |   8 +
 tools/lib/bpf/libbpf.c                        |  26 +-
 tools/lib/bpf/libbpf.h                        |  21 +-
 .../selftests/bpf/prog_tests/percpu_alloc.c   | 335 ++++++++++++++++++
 .../selftests/bpf/progs/percpu_alloc_array.c  |  32 ++
 13 files changed, 590 insertions(+), 107 deletions(-)
-- 2.51.2
Introduce an internal bpf_map_check_op_flags() helper to unify map flags checking for the lookup_elem, update_elem, lookup_batch and update_batch APIs.
This makes it convenient to check the BPF_F_CPU and BPF_F_ALL_CPUS flags for these APIs in the next patch.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 include/linux/bpf.h  | 11 +++++++++++
 kernel/bpf/syscall.c | 34 +++++++++++-----------------------
 2 files changed, 22 insertions(+), 23 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index a9b788c7b4aa..6498be4c44f8 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -3829,4 +3829,15 @@ bpf_prog_update_insn_ptrs(struct bpf_prog *prog, u32 *offsets, void *image) } #endif
+static inline int bpf_map_check_op_flags(struct bpf_map *map, u64 flags, u64 allowed_flags) +{ + if (flags & ~allowed_flags) + return -EINVAL; + + if ((flags & BPF_F_LOCK) && !btf_record_has_field(map->record, BPF_SPIN_LOCK)) + return -EINVAL; + + return 0; +} + #endif /* _LINUX_BPF_H */ diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 792623a7c90b..cef8963d69f9 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1725,9 +1725,6 @@ static int map_lookup_elem(union bpf_attr *attr) if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM)) return -EINVAL;
- if (attr->flags & ~BPF_F_LOCK) - return -EINVAL; - CLASS(fd, f)(attr->map_fd); map = __bpf_map_get(f); if (IS_ERR(map)) @@ -1735,9 +1732,9 @@ static int map_lookup_elem(union bpf_attr *attr) if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) return -EPERM;
- if ((attr->flags & BPF_F_LOCK) && - !btf_record_has_field(map->record, BPF_SPIN_LOCK)) - return -EINVAL; + err = bpf_map_check_op_flags(map, attr->flags, BPF_F_LOCK); + if (err) + return err;
key = __bpf_copy_key(ukey, map->key_size); if (IS_ERR(key)) @@ -1800,11 +1797,9 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr) goto err_put; }
- if ((attr->flags & BPF_F_LOCK) && - !btf_record_has_field(map->record, BPF_SPIN_LOCK)) { - err = -EINVAL; + err = bpf_map_check_op_flags(map, attr->flags, ~0); + if (err) goto err_put; - }
key = ___bpf_copy_key(ukey, map->key_size); if (IS_ERR(key)) { @@ -2008,13 +2003,9 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file, void *key, *value; int err = 0;
- if (attr->batch.elem_flags & ~BPF_F_LOCK) - return -EINVAL; - - if ((attr->batch.elem_flags & BPF_F_LOCK) && - !btf_record_has_field(map->record, BPF_SPIN_LOCK)) { - return -EINVAL; - } + err = bpf_map_check_op_flags(map, attr->batch.elem_flags, BPF_F_LOCK); + if (err) + return err;
value_size = bpf_map_value_size(map);
@@ -2071,12 +2062,9 @@ int generic_map_lookup_batch(struct bpf_map *map, u32 value_size, cp, max_count; int err;
- if (attr->batch.elem_flags & ~BPF_F_LOCK) - return -EINVAL; - - if ((attr->batch.elem_flags & BPF_F_LOCK) && - !btf_record_has_field(map->record, BPF_SPIN_LOCK)) - return -EINVAL; + err = bpf_map_check_op_flags(map, attr->batch.elem_flags, BPF_F_LOCK); + if (err) + return err;
value_size = bpf_map_value_size(map);
Add libbpf support for the BPF_F_CPU flag for percpu maps by embedding the cpu info into the high 32 bits of:

1. **flags**: bpf_map_lookup_elem_flags(), bpf_map__lookup_elem(), bpf_map_update_elem() and bpf_map__update_elem()
2. **opts->elem_flags**: bpf_map_lookup_batch() and bpf_map_update_batch()

The flag may also be BPF_F_ALL_CPUS, but BPF_F_CPU and BPF_F_ALL_CPUS cannot be combined.

Behavior:

* If the flag is BPF_F_ALL_CPUS, the update is applied across all CPUs.
* If the flag is BPF_F_CPU, the update is applied only to the specified CPU.
* If the flag is BPF_F_CPU, the lookup reads the value only from the specified CPU.
* Lookup does not support BPF_F_ALL_CPUS.
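A possible usage sketch (assumptions: a loaded bpf_object with a percpu map named "percpu" holding a u32 value; elem_api_demo() is a hypothetical helper). With either cpu flag set, value_sz is the plain map value_size rather than round_up(value_size, 8) * nr_cpus:

#include <bpf/libbpf.h>
#include <linux/bpf.h>

static int elem_api_demo(struct bpf_object *obj)
{
	struct bpf_map *map = bpf_object__find_map_by_name(obj, "percpu");
	__u32 key = 0, val = 1;
	int err;

	/* Write the same value to all CPUs; the buffer is just sizeof(val). */
	err = bpf_map__update_elem(map, &key, sizeof(key), &val, sizeof(val),
				   BPF_F_ALL_CPUS);
	if (err)
		return err;

	/* Read the value of CPU 0 only; BPF_F_ALL_CPUS is rejected for lookup. */
	return bpf_map__lookup_elem(map, &key, sizeof(key), &val, sizeof(val),
				    (__u64)0 << 32 | BPF_F_CPU);
}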
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 tools/lib/bpf/bpf.h    |  8 ++++++++
 tools/lib/bpf/libbpf.c | 26 ++++++++++++++++++++------
 tools/lib/bpf/libbpf.h | 21 ++++++++-------------
 3 files changed, 36 insertions(+), 19 deletions(-)
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index e983a3e40d61..ffd93feffd71 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -289,6 +289,14 @@ LIBBPF_API int bpf_map_lookup_and_delete_batch(int fd, void *in_batch, * Update spin_lock-ed map elements. This must be * specified if the map value contains a spinlock. * + * **BPF_F_CPU** + * As for percpu maps, update value on the specified CPU. And the cpu + * info is embedded into the high 32 bits of **opts->elem_flags**. + * + * **BPF_F_ALL_CPUS** + * As for percpu maps, update value across all CPUs. This flag cannot + * be used with BPF_F_CPU at the same time. + * * @param fd BPF map file descriptor * @param keys pointer to an array of *count* keys * @param values pointer to an array of *count* values diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 706e7481bdf6..65b9b5e95544 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -10913,7 +10913,7 @@ bpf_object__find_map_fd_by_name(const struct bpf_object *obj, const char *name) }
static int validate_map_op(const struct bpf_map *map, size_t key_sz, - size_t value_sz, bool check_value_sz) + size_t value_sz, bool check_value_sz, __u64 flags) { if (!map_is_created(map)) /* map is not yet created */ return -ENOENT; @@ -10940,6 +10940,20 @@ static int validate_map_op(const struct bpf_map *map, size_t key_sz, int num_cpu = libbpf_num_possible_cpus(); size_t elem_sz = roundup(map->def.value_size, 8);
+ if (flags & (BPF_F_CPU | BPF_F_ALL_CPUS)) { + if ((flags & BPF_F_CPU) && (flags & BPF_F_ALL_CPUS)) { + pr_warn("map '%s': BPF_F_CPU and BPF_F_ALL_CPUS are mutually exclusive\n", + map->name); + return -EINVAL; + } + if (map->def.value_size != value_sz) { + pr_warn("map '%s': unexpected value size %zu provided for either BPF_F_CPU or BPF_F_ALL_CPUS, expected %u\n", + map->name, value_sz, map->def.value_size); + return -EINVAL; + } + break; + } + if (value_sz != num_cpu * elem_sz) { pr_warn("map '%s': unexpected value size %zu provided for per-CPU map, expected %d * %zu = %zd\n", map->name, value_sz, num_cpu, elem_sz, num_cpu * elem_sz); @@ -10964,7 +10978,7 @@ int bpf_map__lookup_elem(const struct bpf_map *map, { int err;
- err = validate_map_op(map, key_sz, value_sz, true); + err = validate_map_op(map, key_sz, value_sz, true, flags); if (err) return libbpf_err(err);
@@ -10977,7 +10991,7 @@ int bpf_map__update_elem(const struct bpf_map *map, { int err;
- err = validate_map_op(map, key_sz, value_sz, true); + err = validate_map_op(map, key_sz, value_sz, true, flags); if (err) return libbpf_err(err);
@@ -10989,7 +11003,7 @@ int bpf_map__delete_elem(const struct bpf_map *map, { int err;
- err = validate_map_op(map, key_sz, 0, false /* check_value_sz */); + err = validate_map_op(map, key_sz, 0, false /* check_value_sz */, flags); if (err) return libbpf_err(err);
@@ -11002,7 +11016,7 @@ int bpf_map__lookup_and_delete_elem(const struct bpf_map *map, { int err;
- err = validate_map_op(map, key_sz, value_sz, true); + err = validate_map_op(map, key_sz, value_sz, true, flags); if (err) return libbpf_err(err);
@@ -11014,7 +11028,7 @@ int bpf_map__get_next_key(const struct bpf_map *map, { int err;
- err = validate_map_op(map, key_sz, 0, false /* check_value_sz */); + err = validate_map_op(map, key_sz, 0, false /* check_value_sz */, 0); if (err) return libbpf_err(err);
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index 5118d0a90e24..7c38b2e54608 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -1196,12 +1196,13 @@ LIBBPF_API struct bpf_map *bpf_map__inner_map(struct bpf_map *map); * @param key_sz size in bytes of key data, needs to match BPF map definition's **key_size** * @param value pointer to memory in which looked up value will be stored * @param value_sz size in byte of value data memory; it has to match BPF map - * definition's **value_size**. For per-CPU BPF maps value size has to be - * a product of BPF map value size and number of possible CPUs in the system - * (could be fetched with **libbpf_num_possible_cpus()**). Note also that for - * per-CPU values value size has to be aligned up to closest 8 bytes for - * alignment reasons, so expected size is: `round_up(value_size, 8) - * * libbpf_num_possible_cpus()`. + * definition's **value_size**. For per-CPU BPF maps, value size can be + * `value_size` if either **BPF_F_CPU** or **BPF_F_ALL_CPUS** is specified + * in **flags**, otherwise a product of BPF map value size and number of + * possible CPUs in the system (could be fetched with + * **libbpf_num_possible_cpus()**). Note also that for per-CPU values value + * size has to be aligned up to closest 8 bytes, so expected size is: + * `round_up(value_size, 8) * libbpf_num_possible_cpus()`. * @flags extra flags passed to kernel for this operation * @return 0, on success; negative error, otherwise * @@ -1219,13 +1220,7 @@ LIBBPF_API int bpf_map__lookup_elem(const struct bpf_map *map, * @param key pointer to memory containing bytes of the key * @param key_sz size in bytes of key data, needs to match BPF map definition's **key_size** * @param value pointer to memory containing bytes of the value - * @param value_sz size in byte of value data memory; it has to match BPF map - * definition's **value_size**. For per-CPU BPF maps value size has to be - * a product of BPF map value size and number of possible CPUs in the system - * (could be fetched with **libbpf_num_possible_cpus()**). Note also that for - * per-CPU values value size has to be aligned up to closest 8 bytes for - * alignment reasons, so expected size is: `round_up(value_size, 8) - * * libbpf_num_possible_cpus()`. + * @param value_sz refer to **bpf_map__lookup_elem**'s description.' * @flags extra flags passed to kernel for this operation * @return 0, on success; negative error, otherwise *
Introduce BPF_F_ALL_CPUS flag support for percpu_hash and lru_percpu_hash maps to allow updating values for all CPUs with a single value for both update_elem and update_batch APIs.
Introduce BPF_F_CPU flag support for percpu_hash and lru_percpu_hash maps to allow:
* update value for specified CPU for both update_elem and update_batch APIs.
* lookup value for specified CPU for both lookup_elem and lookup_batch APIs.
The BPF_F_CPU flag is passed via:
* map_flags along with embedded cpu info.
* elem_flags along with embedded cpu info.
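For the batch APIs, the cpu info travels in opts->elem_flags. A sketch under assumptions (batch_on_cpu() is a hypothetical helper; keys/values are caller-provided arrays sized for 'count' entries of this percpu map, and target_cpu is below libbpf_num_possible_cpus()):

#include <bpf/bpf.h>
#include <linux/bpf.h>

/* Update and then re-read 'count' entries, touching only target_cpu. */
static int batch_on_cpu(int map_fd, void *keys, void *values, __u32 count,
			int target_cpu)
{
	LIBBPF_OPTS(bpf_map_batch_opts, opts,
		.elem_flags = (__u64)target_cpu << 32 | BPF_F_CPU,
	);
	__u64 out_batch = 0;
	__u32 n = count;
	int err;

	err = bpf_map_update_batch(map_fd, keys, values, &n, &opts);
	if (err)
		return err;

	n = count;
	return bpf_map_lookup_batch(map_fd, NULL, &out_batch, keys, values,
				    &n, &opts);
}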
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
v10 -> v11:
- Drop buggy '(u32)map_flags > BPF_F_ALL_CPUS' check in htab_map_check_update_flags().
- Update 'map_flags != BPF_EXIST' to '!(map_flags & BPF_EXIST)' in __htab_lru_percpu_map_update_elem().
---
 include/linux/bpf.h  |  4 +-
 kernel/bpf/hashtab.c | 96 ++++++++++++++++++++++++++++++--------------
 kernel/bpf/syscall.c |  2 +-
 3 files changed, 69 insertions(+), 33 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 01a99e3a3e51..f79d2ae27335 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -2761,7 +2761,7 @@ int map_set_for_each_callback_args(struct bpf_verifier_env *env, struct bpf_func_state *caller, struct bpf_func_state *callee);
-int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value); +int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value, u64 flags); int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value, u64 flags); int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value, u64 flags); @@ -3833,6 +3833,8 @@ static inline bool bpf_map_supports_cpu_flags(enum bpf_map_type map_type) { switch (map_type) { case BPF_MAP_TYPE_PERCPU_ARRAY: + case BPF_MAP_TYPE_PERCPU_HASH: + case BPF_MAP_TYPE_LRU_PERCPU_HASH: return true; default: return false; diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index c8a9b27f8663..c768bf71d60f 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -932,7 +932,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l) }
static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr, - void *value, bool onallcpus) + void *value, bool onallcpus, u64 map_flags) { void *ptr;
@@ -943,19 +943,28 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr, bpf_obj_free_fields(htab->map.record, ptr); } else { u32 size = round_up(htab->map.value_size, 8); - int off = 0, cpu; + void *val; + int cpu; + + if (map_flags & BPF_F_CPU) { + cpu = map_flags >> 32; + ptr = per_cpu_ptr(pptr, cpu); + copy_map_value(&htab->map, ptr, value); + bpf_obj_free_fields(htab->map.record, ptr); + return; + }
for_each_possible_cpu(cpu) { ptr = per_cpu_ptr(pptr, cpu); - copy_map_value_long(&htab->map, ptr, value + off); + val = (map_flags & BPF_F_ALL_CPUS) ? value : value + size * cpu; + copy_map_value(&htab->map, ptr, val); bpf_obj_free_fields(htab->map.record, ptr); - off += size; } } }
static void pcpu_init_value(struct bpf_htab *htab, void __percpu *pptr, - void *value, bool onallcpus) + void *value, bool onallcpus, u64 map_flags) { /* When not setting the initial value on all cpus, zero-fill element * values for other cpus. Otherwise, bpf program has no way to ensure @@ -973,7 +982,7 @@ static void pcpu_init_value(struct bpf_htab *htab, void __percpu *pptr, zero_map_value(&htab->map, per_cpu_ptr(pptr, cpu)); } } else { - pcpu_copy_value(htab, pptr, value, onallcpus); + pcpu_copy_value(htab, pptr, value, onallcpus, map_flags); } }
@@ -985,7 +994,7 @@ static bool fd_htab_map_needs_adjust(const struct bpf_htab *htab) static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key, void *value, u32 key_size, u32 hash, bool percpu, bool onallcpus, - struct htab_elem *old_elem) + struct htab_elem *old_elem, u64 map_flags) { u32 size = htab->map.value_size; bool prealloc = htab_is_prealloc(htab); @@ -1043,7 +1052,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key, pptr = *(void __percpu **)ptr; }
- pcpu_init_value(htab, pptr, value, onallcpus); + pcpu_init_value(htab, pptr, value, onallcpus, map_flags);
if (!prealloc) htab_elem_set_ptr(l_new, key_size, pptr); @@ -1147,7 +1156,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value, }
l_new = alloc_htab_elem(htab, key, value, key_size, hash, false, false, - l_old); + l_old, map_flags); if (IS_ERR(l_new)) { /* all pre-allocated elements are in use or memory exhausted */ ret = PTR_ERR(l_new); @@ -1249,6 +1258,15 @@ static long htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value return ret; }
+static int htab_map_check_update_flags(bool onallcpus, u64 map_flags) +{ + if (unlikely(!onallcpus && map_flags > BPF_EXIST)) + return -EINVAL; + if (unlikely(onallcpus && (map_flags & BPF_F_LOCK))) + return -EINVAL; + return 0; +} + static long htab_map_update_elem_in_place(struct bpf_map *map, void *key, void *value, u64 map_flags, bool percpu, bool onallcpus) @@ -1262,9 +1280,9 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key, u32 key_size, hash; int ret;
- if (unlikely(map_flags > BPF_EXIST)) - /* unknown flags */ - return -EINVAL; + ret = htab_map_check_update_flags(onallcpus, map_flags); + if (unlikely(ret)) + return ret;
WARN_ON_ONCE(!bpf_rcu_lock_held());
@@ -1289,7 +1307,7 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key, /* Update value in-place */ if (percpu) { pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size), - value, onallcpus); + value, onallcpus, map_flags); } else { void **inner_map_pptr = htab_elem_value(l_old, key_size);
@@ -1298,7 +1316,7 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key, } } else { l_new = alloc_htab_elem(htab, key, value, key_size, - hash, percpu, onallcpus, NULL); + hash, percpu, onallcpus, NULL, map_flags); if (IS_ERR(l_new)) { ret = PTR_ERR(l_new); goto err; @@ -1324,9 +1342,9 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key, u32 key_size, hash; int ret;
- if (unlikely(map_flags > BPF_EXIST)) - /* unknown flags */ - return -EINVAL; + ret = htab_map_check_update_flags(onallcpus, map_flags); + if (unlikely(ret)) + return ret;
WARN_ON_ONCE(!bpf_rcu_lock_held());
@@ -1342,7 +1360,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key, * to remove older elem from htab and this removal * operation will need a bucket lock. */ - if (map_flags != BPF_EXIST) { + if (!(map_flags & BPF_EXIST)) { l_new = prealloc_lru_pop(htab, key, hash); if (!l_new) return -ENOMEM; @@ -1363,10 +1381,10 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
/* per-cpu hash map can update value in-place */ pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size), - value, onallcpus); + value, onallcpus, map_flags); } else { pcpu_init_value(htab, htab_elem_get_ptr(l_new, key_size), - value, onallcpus); + value, onallcpus, map_flags); hlist_nulls_add_head_rcu(&l_new->hash_node, head); l_new = NULL; } @@ -1678,9 +1696,9 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map, void __user *ukeys = u64_to_user_ptr(attr->batch.keys); void __user *ubatch = u64_to_user_ptr(attr->batch.in_batch); u32 batch, max_count, size, bucket_size, map_id; + u64 elem_map_flags, map_flags, allowed_flags; u32 bucket_cnt, total, key_size, value_size; struct htab_elem *node_to_free = NULL; - u64 elem_map_flags, map_flags; struct hlist_nulls_head *head; struct hlist_nulls_node *n; unsigned long flags = 0; @@ -1690,9 +1708,12 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map, int ret = 0;
elem_map_flags = attr->batch.elem_flags; - if ((elem_map_flags & ~BPF_F_LOCK) || - ((elem_map_flags & BPF_F_LOCK) && !btf_record_has_field(map->record, BPF_SPIN_LOCK))) - return -EINVAL; + allowed_flags = BPF_F_LOCK; + if (!do_delete && is_percpu) + allowed_flags |= BPF_F_CPU; + ret = bpf_map_check_op_flags(map, elem_map_flags, allowed_flags); + if (ret) + return ret;
map_flags = attr->batch.flags; if (map_flags) @@ -1715,7 +1736,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map, key_size = htab->map.key_size; value_size = htab->map.value_size; size = round_up(value_size, 8); - if (is_percpu) + if (is_percpu && !(elem_map_flags & BPF_F_CPU)) value_size = size * num_possible_cpus(); total = 0; /* while experimenting with hash tables with sizes ranging from 10 to @@ -1798,10 +1819,17 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map, void __percpu *pptr;
pptr = htab_elem_get_ptr(l, map->key_size); - for_each_possible_cpu(cpu) { - copy_map_value_long(&htab->map, dst_val + off, per_cpu_ptr(pptr, cpu)); - check_and_init_map_value(&htab->map, dst_val + off); - off += size; + if (elem_map_flags & BPF_F_CPU) { + cpu = elem_map_flags >> 32; + copy_map_value(&htab->map, dst_val, per_cpu_ptr(pptr, cpu)); + check_and_init_map_value(&htab->map, dst_val); + } else { + for_each_possible_cpu(cpu) { + copy_map_value_long(&htab->map, dst_val + off, + per_cpu_ptr(pptr, cpu)); + check_and_init_map_value(&htab->map, dst_val + off); + off += size; + } } } else { value = htab_elem_value(l, key_size); @@ -2357,7 +2385,7 @@ static void *htab_lru_percpu_map_lookup_percpu_elem(struct bpf_map *map, void *k return NULL; }
-int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value) +int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value, u64 map_flags) { struct htab_elem *l; void __percpu *pptr; @@ -2374,16 +2402,22 @@ int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value) l = __htab_map_lookup_elem(map, key); if (!l) goto out; + ret = 0; /* We do not mark LRU map element here in order to not mess up * eviction heuristics when user space does a map walk. */ pptr = htab_elem_get_ptr(l, map->key_size); + if (map_flags & BPF_F_CPU) { + cpu = map_flags >> 32; + copy_map_value(map, value, per_cpu_ptr(pptr, cpu)); + check_and_init_map_value(map, value); + goto out; + } for_each_possible_cpu(cpu) { copy_map_value_long(map, value + off, per_cpu_ptr(pptr, cpu)); check_and_init_map_value(map, value + off); off += size; } - ret = 0; out: rcu_read_unlock(); return ret; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 238238086b08..7e6cca641026 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -316,7 +316,7 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value, bpf_disable_instrumentation(); if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH || map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) { - err = bpf_percpu_hash_copy(map, key, value); + err = bpf_percpu_hash_copy(map, key, value, flags); } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) { err = bpf_percpu_array_copy(map, key, value, flags); } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
Copy map values using copy_map_value_long() to keep the style consistent with the other percpu maps.
No functional change intended.
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 kernel/bpf/local_storage.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c index c93a756e035c..2ab4b60ffe61 100644 --- a/kernel/bpf/local_storage.c +++ b/kernel/bpf/local_storage.c @@ -200,8 +200,7 @@ int bpf_percpu_cgroup_storage_copy(struct bpf_map *_map, void *key, */ size = round_up(_map->value_size, 8); for_each_possible_cpu(cpu) { - bpf_long_memcpy(value + off, - per_cpu_ptr(storage->percpu_buf, cpu), size); + copy_map_value_long(_map, value + off, per_cpu_ptr(storage->percpu_buf, cpu)); off += size; } rcu_read_unlock(); @@ -234,8 +233,7 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *_map, void *key, */ size = round_up(_map->value_size, 8); for_each_possible_cpu(cpu) { - bpf_long_memcpy(per_cpu_ptr(storage->percpu_buf, cpu), - value + off, size); + copy_map_value_long(_map, per_cpu_ptr(storage->percpu_buf, cpu), value + off); off += size; } rcu_read_unlock();
Introduce BPF_F_ALL_CPUS flag support for percpu_cgroup_storage maps to allow updating values for all CPUs with a single value for update_elem API.
Introduce BPF_F_CPU flag support for percpu_cgroup_storage maps to allow:
* update value for specified CPU for update_elem API.
* lookup value for specified CPU for lookup_elem API.
The BPF_F_CPU flag is passed via map_flags along with embedded cpu info.
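A hedged sketch of reading one CPU's slot of a percpu_cgroup_storage map through the low-level API (read_cpu_slot() is a hypothetical helper; the key is fetched with bpf_map_get_next_key() as the selftests in this series do, and the map value is assumed to be a u32):

#include <bpf/bpf.h>
#include <linux/bpf.h>

static int read_cpu_slot(int map_fd, int cpu, __u32 *val)
{
	struct bpf_cgroup_storage_key key;
	__u64 flags = (__u64)cpu << 32 | BPF_F_CPU;
	int err;

	/* Grab the first (and only) key of the storage map. */
	err = bpf_map_get_next_key(map_fd, NULL, &key);
	if (err)
		return err;
	return bpf_map_lookup_elem_flags(map_fd, &key, val, flags);
}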
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 include/linux/bpf-cgroup.h |  4 ++--
 include/linux/bpf.h        |  1 +
 kernel/bpf/local_storage.c | 23 ++++++++++++++++++-----
 kernel/bpf/syscall.c       |  2 +-
 4 files changed, 22 insertions(+), 8 deletions(-)
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index aedf573bdb42..013f4db9903f 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -172,7 +172,7 @@ void bpf_cgroup_storage_link(struct bpf_cgroup_storage *storage, void bpf_cgroup_storage_unlink(struct bpf_cgroup_storage *storage); int bpf_cgroup_storage_assign(struct bpf_prog_aux *aux, struct bpf_map *map);
-int bpf_percpu_cgroup_storage_copy(struct bpf_map *map, void *key, void *value); +int bpf_percpu_cgroup_storage_copy(struct bpf_map *map, void *key, void *value, u64 flags); int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, void *value, u64 flags);
@@ -467,7 +467,7 @@ static inline struct bpf_cgroup_storage *bpf_cgroup_storage_alloc( static inline void bpf_cgroup_storage_free( struct bpf_cgroup_storage *storage) {} static inline int bpf_percpu_cgroup_storage_copy(struct bpf_map *map, void *key, - void *value) { + void *value, u64 flags) { return 0; } static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map, diff --git a/include/linux/bpf.h b/include/linux/bpf.h index f79d2ae27335..9e756db5e132 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -3835,6 +3835,7 @@ static inline bool bpf_map_supports_cpu_flags(enum bpf_map_type map_type) case BPF_MAP_TYPE_PERCPU_ARRAY: case BPF_MAP_TYPE_PERCPU_HASH: case BPF_MAP_TYPE_LRU_PERCPU_HASH: + case BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE: return true; default: return false; diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c index 2ab4b60ffe61..1ccbf28b2ad9 100644 --- a/kernel/bpf/local_storage.c +++ b/kernel/bpf/local_storage.c @@ -180,7 +180,7 @@ static long cgroup_storage_update_elem(struct bpf_map *map, void *key, }
int bpf_percpu_cgroup_storage_copy(struct bpf_map *_map, void *key, - void *value) + void *value, u64 map_flags) { struct bpf_cgroup_storage_map *map = map_to_storage(_map); struct bpf_cgroup_storage *storage; @@ -198,11 +198,17 @@ int bpf_percpu_cgroup_storage_copy(struct bpf_map *_map, void *key, * access 'value_size' of them, so copying rounded areas * will not leak any kernel data */ + if (map_flags & BPF_F_CPU) { + cpu = map_flags >> 32; + copy_map_value(_map, value, per_cpu_ptr(storage->percpu_buf, cpu)); + goto unlock; + } size = round_up(_map->value_size, 8); for_each_possible_cpu(cpu) { copy_map_value_long(_map, value + off, per_cpu_ptr(storage->percpu_buf, cpu)); off += size; } +unlock: rcu_read_unlock(); return 0; } @@ -212,10 +218,11 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *_map, void *key, { struct bpf_cgroup_storage_map *map = map_to_storage(_map); struct bpf_cgroup_storage *storage; - int cpu, off = 0; + void *val; u32 size; + int cpu;
- if (map_flags != BPF_ANY && map_flags != BPF_EXIST) + if ((u32)map_flags & ~(BPF_ANY | BPF_EXIST | BPF_F_CPU | BPF_F_ALL_CPUS)) return -EINVAL;
rcu_read_lock(); @@ -231,11 +238,17 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *_map, void *key, * returned or zeros which were zero-filled by percpu_alloc, * so no kernel data leaks possible */ + if (map_flags & BPF_F_CPU) { + cpu = map_flags >> 32; + copy_map_value(_map, per_cpu_ptr(storage->percpu_buf, cpu), value); + goto unlock; + } size = round_up(_map->value_size, 8); for_each_possible_cpu(cpu) { - copy_map_value_long(_map, per_cpu_ptr(storage->percpu_buf, cpu), value + off); - off += size; + val = (map_flags & BPF_F_ALL_CPUS) ? value : value + size * cpu; + copy_map_value(_map, per_cpu_ptr(storage->percpu_buf, cpu), val); } +unlock: rcu_read_unlock(); return 0; } diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 7e6cca641026..d64567c6ef50 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -320,7 +320,7 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value, } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) { err = bpf_percpu_array_copy(map, key, value, flags); } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) { - err = bpf_percpu_cgroup_storage_copy(map, key, value); + err = bpf_percpu_cgroup_storage_copy(map, key, value, flags); } else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) { err = bpf_stackmap_extract(map, key, value, false); } else if (IS_FD_ARRAY(map) || IS_FD_PROG_ARRAY(map)) {
Introduce support for the BPF_F_ALL_CPUS flag in percpu_array maps to allow updating values for all CPUs with a single value for both update_elem and update_batch APIs.
Introduce support for the BPF_F_CPU flag in percpu_array maps to allow:
* update value for specified CPU for both update_elem and update_batch APIs.
* lookup value for specified CPU for both lookup_elem and lookup_batch APIs.
The BPF_F_CPU flag is passed via:
* map_flags of lookup_elem and update_elem APIs along with embedded cpu info.
* elem_flags of lookup_batch and update_batch APIs along with embedded cpu info.
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
v10 -> v11:
- Drop buggy '(u32)map_flags > BPF_F_ALL_CPUS' check in bpf_percpu_array_update().
- Update 'map_flags == BPF_NOEXIST' to 'map_flags & BPF_NOEXIST' in bpf_percpu_array_update().
---
 include/linux/bpf.h   |  9 +++++++--
 kernel/bpf/arraymap.c | 32 ++++++++++++++++++++++++--------
 kernel/bpf/syscall.c  |  2 +-
 3 files changed, 32 insertions(+), 11 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index d84af3719b59..01a99e3a3e51 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -2762,7 +2762,7 @@ int map_set_for_each_callback_args(struct bpf_verifier_env *env, struct bpf_func_state *callee);
int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value); -int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value); +int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value, u64 flags); int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value, u64 flags); int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value, @@ -3831,7 +3831,12 @@ bpf_prog_update_insn_ptrs(struct bpf_prog *prog, u32 *offsets, void *image)
static inline bool bpf_map_supports_cpu_flags(enum bpf_map_type map_type) { - return false; + switch (map_type) { + case BPF_MAP_TYPE_PERCPU_ARRAY: + return true; + default: + return false; + } }
static inline int bpf_map_check_op_flags(struct bpf_map *map, u64 flags, u64 allowed_flags) diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 1eeb31c5b317..241f11d4d62a 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -307,7 +307,7 @@ static void *percpu_array_map_lookup_percpu_elem(struct bpf_map *map, void *key, return per_cpu_ptr(array->pptrs[index & array->index_mask], cpu); }
-int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value) +int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value, u64 map_flags) { struct bpf_array *array = container_of(map, struct bpf_array, map); u32 index = *(u32 *)key; @@ -325,11 +325,18 @@ int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value) size = array->elem_size; rcu_read_lock(); pptr = array->pptrs[index & array->index_mask]; + if (map_flags & BPF_F_CPU) { + cpu = map_flags >> 32; + copy_map_value(map, value, per_cpu_ptr(pptr, cpu)); + check_and_init_map_value(map, value); + goto unlock; + } for_each_possible_cpu(cpu) { copy_map_value_long(map, value + off, per_cpu_ptr(pptr, cpu)); check_and_init_map_value(map, value + off); off += size; } +unlock: rcu_read_unlock(); return 0; } @@ -398,18 +405,18 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value, struct bpf_array *array = container_of(map, struct bpf_array, map); u32 index = *(u32 *)key; void __percpu *pptr; - int cpu, off = 0; + void *ptr, *val; u32 size; + int cpu;
- if (unlikely(map_flags > BPF_EXIST)) - /* unknown flags */ + if (unlikely(map_flags & BPF_F_LOCK)) return -EINVAL;
if (unlikely(index >= array->map.max_entries)) /* all elements were pre-allocated, cannot insert a new one */ return -E2BIG;
- if (unlikely(map_flags == BPF_NOEXIST)) + if (unlikely(map_flags & BPF_NOEXIST)) /* all elements already exist */ return -EEXIST;
@@ -422,11 +429,20 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value, size = array->elem_size; rcu_read_lock(); pptr = array->pptrs[index & array->index_mask]; + if (map_flags & BPF_F_CPU) { + cpu = map_flags >> 32; + ptr = per_cpu_ptr(pptr, cpu); + copy_map_value(map, ptr, value); + bpf_obj_free_fields(array->map.record, ptr); + goto unlock; + } for_each_possible_cpu(cpu) { - copy_map_value_long(map, per_cpu_ptr(pptr, cpu), value + off); - bpf_obj_free_fields(array->map.record, per_cpu_ptr(pptr, cpu)); - off += size; + ptr = per_cpu_ptr(pptr, cpu); + val = (map_flags & BPF_F_ALL_CPUS) ? value : value + size * cpu; + copy_map_value(map, ptr, val); + bpf_obj_free_fields(array->map.record, ptr); } +unlock: rcu_read_unlock(); return 0; } diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 3c3e3b4095b9..238238086b08 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -318,7 +318,7 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value, map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) { err = bpf_percpu_hash_copy(map, key, value); } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) { - err = bpf_percpu_array_copy(map, key, value); + err = bpf_percpu_array_copy(map, key, value, flags); } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) { err = bpf_percpu_cgroup_storage_copy(map, key, value); } else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
Add test coverage for the new BPF_F_CPU and BPF_F_ALL_CPUS flags support in percpu maps. The following APIs are exercised:
* bpf_map_update_batch()
* bpf_map_lookup_batch()
* bpf_map_update_elem()
* bpf_map__update_elem()
* bpf_map_lookup_elem_flags()
* bpf_map__lookup_elem()
Add tests to verify that array and hash maps do not support BPF_F_CPU and BPF_F_ALL_CPUS flags.
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
v10 -> v11:
- Use libbpf_num_possible_cpus() as max_entries for lru_percpu_hash map.
- Add BPF_EXIST to flags when calling update_elem() and update_batch().
---
 .../selftests/bpf/prog_tests/percpu_alloc.c   | 335 ++++++++++++++++++
 .../selftests/bpf/progs/percpu_alloc_array.c  |  32 ++
 2 files changed, 367 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/percpu_alloc.c b/tools/testing/selftests/bpf/prog_tests/percpu_alloc.c index 343da65864d6..ff31107434d7 100644 --- a/tools/testing/selftests/bpf/prog_tests/percpu_alloc.c +++ b/tools/testing/selftests/bpf/prog_tests/percpu_alloc.c @@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 #include <test_progs.h> +#include "cgroup_helpers.h" #include "percpu_alloc_array.skel.h" #include "percpu_alloc_cgrp_local_storage.skel.h" #include "percpu_alloc_fail.skel.h" @@ -115,6 +116,328 @@ static void test_failure(void) { RUN_TESTS(percpu_alloc_fail); }
+static void test_percpu_map_op_cpu_flag(struct bpf_map *map, void *keys, size_t key_sz, + u32 max_entries, bool test_batch) +{ + size_t value_sz = sizeof(u32), value_sz_total; + u32 *values = NULL, *values_percpu = NULL; + int i, j, cpu, map_fd, nr_cpus, err; + const u32 value = 0xDEADC0DE; + u64 batch = 0, flags; + void *values_row; + u32 count, v; + LIBBPF_OPTS(bpf_map_batch_opts, batch_opts); + + nr_cpus = libbpf_num_possible_cpus(); + if (!ASSERT_GT(nr_cpus, 0, "libbpf_num_possible_cpus")) + return; + + values = calloc(max_entries, value_sz * nr_cpus); + if (!ASSERT_OK_PTR(values, "calloc values")) + return; + + values_percpu = calloc(max_entries, roundup(value_sz, 8) * nr_cpus); + if (!ASSERT_OK_PTR(values_percpu, "calloc values_percpu")) { + free(values); + return; + } + memset(values_percpu, 0, roundup(value_sz, 8) * nr_cpus * max_entries); + + value_sz_total = value_sz * nr_cpus * max_entries; + memset(values, 0, value_sz_total); + + map_fd = bpf_map__fd(map); + flags = BPF_F_CPU | BPF_F_ALL_CPUS; + err = bpf_map_lookup_elem_flags(map_fd, keys, values, flags); + if (!ASSERT_ERR(err, "bpf_map_lookup_elem_flags cpu|all_cpus")) + goto out; + + err = bpf_map_update_elem(map_fd, keys, values, flags); + if (!ASSERT_ERR(err, "bpf_map_update_elem cpu|all_cpus")) + goto out; + + flags = BPF_F_ALL_CPUS; + err = bpf_map_lookup_elem_flags(map_fd, keys, values, flags); + if (!ASSERT_ERR(err, "bpf_map_lookup_elem_flags all_cpus")) + goto out; + + flags = BPF_F_LOCK | BPF_F_CPU; + err = bpf_map_lookup_elem_flags(map_fd, keys, values, flags); + if (!ASSERT_ERR(err, "bpf_map_lookup_elem_flags BPF_F_LOCK")) + goto out; + + flags = BPF_F_LOCK | BPF_F_ALL_CPUS; + err = bpf_map_update_elem(map_fd, keys, values, flags); + if (!ASSERT_ERR(err, "bpf_map_update_elem BPF_F_LOCK")) + goto out; + + flags = (u64)nr_cpus << 32 | BPF_F_CPU; + err = bpf_map_update_elem(map_fd, keys, values, flags); + if (!ASSERT_EQ(err, -ERANGE, "bpf_map_update_elem -ERANGE")) + goto out; + + err = bpf_map__update_elem(map, keys, key_sz, values, value_sz, flags); + if (!ASSERT_EQ(err, -ERANGE, "bpf_map__update_elem -ERANGE")) + goto out; + + err = bpf_map_lookup_elem_flags(map_fd, keys, values, flags); + if (!ASSERT_EQ(err, -ERANGE, "bpf_map_lookup_elem_flags -ERANGE")) + goto out; + + err = bpf_map__lookup_elem(map, keys, key_sz, values, value_sz, flags); + if (!ASSERT_EQ(err, -ERANGE, "bpf_map__lookup_elem -ERANGE")) + goto out; + + + flags = BPF_ANY; + for (i = 0; i < max_entries; i++) { + err = bpf_map__update_elem(map, keys + i * key_sz, key_sz, values_percpu, + roundup(value_sz, 8) * nr_cpus, flags); + if (!ASSERT_OK(err, "bpf_map__update_elem init")) + goto out; + } + + for (cpu = 0; cpu < nr_cpus; cpu++) { + /* clear value on all cpus */ + values[0] = 0; + flags = BPF_F_ALL_CPUS | BPF_EXIST; + for (i = 0; i < max_entries; i++) { + err = bpf_map__update_elem(map, keys + i * key_sz, key_sz, values, + value_sz, flags); + if (!ASSERT_OK(err, "bpf_map__update_elem all_cpus")) + goto out; + } + + /* update value on specified cpu */ + for (i = 0; i < max_entries; i++) { + values[0] = value; + flags = (u64)cpu << 32 | BPF_F_CPU | BPF_EXIST; + err = bpf_map__update_elem(map, keys + i * key_sz, key_sz, values, + value_sz, flags); + if (!ASSERT_OK(err, "bpf_map__update_elem specified cpu")) + goto out; + + /* lookup then check value on CPUs */ + for (j = 0; j < nr_cpus; j++) { + flags = (u64)j << 32 | BPF_F_CPU; + err = bpf_map__lookup_elem(map, keys + i * key_sz, key_sz, values, + value_sz, flags); + if (!ASSERT_OK(err, 
"bpf_map__lookup_elem specified cpu")) + goto out; + if (!ASSERT_EQ(values[0], j != cpu ? 0 : value, + "bpf_map__lookup_elem value on specified cpu")) + goto out; + } + } + } + + if (!test_batch) + goto out; + + count = max_entries; + batch_opts.elem_flags = (u64)nr_cpus << 32 | BPF_F_CPU; + err = bpf_map_update_batch(map_fd, keys, values, &count, &batch_opts); + if (!ASSERT_EQ(err, -ERANGE, "bpf_map_update_batch -ERANGE")) + goto out; + + for (cpu = 0; cpu < nr_cpus; cpu++) { + memset(values, 0, value_sz_total); + + /* clear values across all CPUs */ + count = max_entries; + batch_opts.elem_flags = BPF_F_ALL_CPUS | BPF_EXIST; + err = bpf_map_update_batch(map_fd, keys, values, &count, &batch_opts); + if (!ASSERT_OK(err, "bpf_map_update_batch all_cpus")) + goto out; + + /* update values on specified CPU */ + for (i = 0; i < max_entries; i++) + values[i] = value; + + count = max_entries; + batch_opts.elem_flags = (u64)cpu << 32 | BPF_F_CPU | BPF_EXIST; + err = bpf_map_update_batch(map_fd, keys, values, &count, &batch_opts); + if (!ASSERT_OK(err, "bpf_map_update_batch specified cpu")) + goto out; + + /* lookup values on specified CPU */ + count = max_entries; + memset(values, 0, max_entries * value_sz); + batch_opts.elem_flags = (u64)cpu << 32 | BPF_F_CPU; + err = bpf_map_lookup_batch(map_fd, NULL, &batch, keys, values, &count, &batch_opts); + if (!ASSERT_TRUE(!err || err == -ENOENT, "bpf_map_lookup_batch specified cpu")) + goto out; + + for (i = 0; i < max_entries; i++) + if (!ASSERT_EQ(values[i], value, + "bpf_map_lookup_batch value on specified cpu")) + goto out; + + /* lookup values from all CPUs */ + batch = 0; + count = max_entries; + batch_opts.elem_flags = 0; + memset(values_percpu, 0, roundup(value_sz, 8) * nr_cpus * max_entries); + err = bpf_map_lookup_batch(map_fd, NULL, &batch, keys, values_percpu, &count, + &batch_opts); + if (!ASSERT_TRUE(!err || err == -ENOENT, "bpf_map_lookup_batch all_cpus")) + goto out; + + for (i = 0; i < max_entries; i++) { + values_row = (void *) values_percpu + + roundup(value_sz, 8) * i * nr_cpus; + for (j = 0; j < nr_cpus; j++) { + v = *(u32 *) (values_row + roundup(value_sz, 8) * j); + if (!ASSERT_EQ(v, j != cpu ? 
0 : value, + "bpf_map_lookup_batch value all_cpus")) + goto out; + } + } + } + +out: + free(values_percpu); + free(values); +} + +static void test_percpu_map_cpu_flag(enum bpf_map_type map_type, u32 max_entries) +{ + struct percpu_alloc_array *skel; + size_t key_sz = sizeof(int); + struct bpf_map *map; + int *keys, i, err; + + keys = calloc(max_entries, key_sz); + if (!ASSERT_OK_PTR(keys, "calloc keys")) + return; + + for (i = 0; i < max_entries; i++) + keys[i] = i; + + skel = percpu_alloc_array__open(); + if (!ASSERT_OK_PTR(skel, "percpu_alloc_array__open")) { + free(keys); + return; + } + + map = skel->maps.percpu; + bpf_map__set_type(map, map_type); + bpf_map__set_max_entries(map, max_entries); + + err = percpu_alloc_array__load(skel); + if (!ASSERT_OK(err, "test_percpu_alloc__load")) + goto out; + + test_percpu_map_op_cpu_flag(map, keys, key_sz, max_entries, true); +out: + percpu_alloc_array__destroy(skel); + free(keys); +} + +static void test_percpu_array_cpu_flag(void) +{ + test_percpu_map_cpu_flag(BPF_MAP_TYPE_PERCPU_ARRAY, 2); +} + +static void test_percpu_hash_cpu_flag(void) +{ + test_percpu_map_cpu_flag(BPF_MAP_TYPE_PERCPU_HASH, 2); +} + +static void test_lru_percpu_hash_cpu_flag(void) +{ + int nr_cpus = libbpf_num_possible_cpus(); + + if (!ASSERT_GT(nr_cpus, 0, "libbpf_num_possible_cpus")) + return; + + test_percpu_map_cpu_flag(BPF_MAP_TYPE_LRU_PERCPU_HASH, nr_cpus); +} + +static void test_percpu_cgroup_storage_cpu_flag(void) +{ + struct percpu_alloc_array *skel = NULL; + struct bpf_cgroup_storage_key key; + int cgroup, prog_fd, err; + struct bpf_map *map; + + err = setup_cgroup_environment(); + if (!ASSERT_OK(err, "setup_cgroup_environment")) + return; + + cgroup = create_and_get_cgroup("/cg_percpu"); + if (!ASSERT_GE(cgroup, 0, "create_and_get_cgroup")) { + cleanup_cgroup_environment(); + return; + } + + err = join_cgroup("/cg_percpu"); + if (!ASSERT_OK(err, "join_cgroup")) + goto out; + + skel = percpu_alloc_array__open_and_load(); + if (!ASSERT_OK_PTR(skel, "percpu_alloc_array__open_and_load")) + goto out; + + prog_fd = bpf_program__fd(skel->progs.cgroup_egress); + err = bpf_prog_attach(prog_fd, cgroup, BPF_CGROUP_INET_EGRESS, 0); + if (!ASSERT_OK(err, "bpf_prog_attach")) + goto out; + + map = skel->maps.percpu_cgroup_storage; + err = bpf_map_get_next_key(bpf_map__fd(map), NULL, &key); + if (!ASSERT_OK(err, "bpf_map_get_next_key")) + goto out; + + test_percpu_map_op_cpu_flag(map, &key, sizeof(key), 1, false); +out: + bpf_prog_detach2(-1, cgroup, BPF_CGROUP_INET_EGRESS); + close(cgroup); + cleanup_cgroup_environment(); + percpu_alloc_array__destroy(skel); +} + +static void test_map_op_cpu_flag(enum bpf_map_type map_type) +{ + u32 max_entries = 1, count = max_entries; + u64 flags, batch = 0, val = 0; + int err, map_fd, key = 0; + LIBBPF_OPTS(bpf_map_batch_opts, batch_opts); + + map_fd = bpf_map_create(map_type, "test_cpu_flag", sizeof(int), sizeof(u64), max_entries, + NULL); + if (!ASSERT_GE(map_fd, 0, "bpf_map_create")) + return; + + flags = BPF_F_ALL_CPUS; + err = bpf_map_update_elem(map_fd, &key, &val, flags); + ASSERT_ERR(err, "bpf_map_update_elem all_cpus"); + + batch_opts.elem_flags = BPF_F_ALL_CPUS; + err = bpf_map_update_batch(map_fd, &key, &val, &count, &batch_opts); + ASSERT_ERR(err, "bpf_map_update_batch all_cpus"); + + flags = BPF_F_CPU; + err = bpf_map_lookup_elem_flags(map_fd, &key, &val, flags); + ASSERT_ERR(err, "bpf_map_lookup_elem_flags cpu"); + + batch_opts.elem_flags = BPF_F_CPU; + err = bpf_map_lookup_batch(map_fd, NULL, &batch, &key, &val, &count, 
&batch_opts); + ASSERT_ERR(err, "bpf_map_lookup_batch cpu"); + + close(map_fd); +} + +static void test_array_cpu_flag(void) +{ + test_map_op_cpu_flag(BPF_MAP_TYPE_ARRAY); +} + +static void test_hash_cpu_flag(void) +{ + test_map_op_cpu_flag(BPF_MAP_TYPE_HASH); +} + void test_percpu_alloc(void) { if (test__start_subtest("array")) @@ -125,4 +448,16 @@ void test_percpu_alloc(void) test_cgrp_local_storage(); if (test__start_subtest("failure_tests")) test_failure(); + if (test__start_subtest("cpu_flag_percpu_array")) + test_percpu_array_cpu_flag(); + if (test__start_subtest("cpu_flag_percpu_hash")) + test_percpu_hash_cpu_flag(); + if (test__start_subtest("cpu_flag_lru_percpu_hash")) + test_lru_percpu_hash_cpu_flag(); + if (test__start_subtest("cpu_flag_percpu_cgroup_storage")) + test_percpu_cgroup_storage_cpu_flag(); + if (test__start_subtest("cpu_flag_array")) + test_array_cpu_flag(); + if (test__start_subtest("cpu_flag_hash")) + test_hash_cpu_flag(); } diff --git a/tools/testing/selftests/bpf/progs/percpu_alloc_array.c b/tools/testing/selftests/bpf/progs/percpu_alloc_array.c index 37c2d2608ec0..ed6a2a93d5a5 100644 --- a/tools/testing/selftests/bpf/progs/percpu_alloc_array.c +++ b/tools/testing/selftests/bpf/progs/percpu_alloc_array.c @@ -187,4 +187,36 @@ int BPF_PROG(test_array_map_10) return 0; }
+struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(max_entries, 2); + __type(key, int); + __type(value, u32); +} percpu SEC(".maps"); + +SEC("?fentry/bpf_fentry_test1") +int BPF_PROG(test_percpu_array, int x) +{ + u64 value = 0xDEADC0DE; + int key = 0; + + bpf_map_update_elem(&percpu, &key, &value, BPF_ANY); + return 0; +} + +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE); + __type(key, struct bpf_cgroup_storage_key); + __type(value, u32); +} percpu_cgroup_storage SEC(".maps"); + +SEC("cgroup_skb/egress") +int cgroup_egress(struct __sk_buff *skb) +{ + u32 *val = bpf_get_local_storage(&percpu_cgroup_storage, 0); + + *val = 1; + return 1; +} + char _license[] SEC("license") = "GPL";
Introduce the BPF_F_CPU and BPF_F_ALL_CPUS flags and check them for the following APIs:

* 'map_lookup_elem()'
* 'map_update_elem()'
* 'generic_map_lookup_batch()'
* 'generic_map_update_batch()'

Also, determine the correct value size for these APIs.
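For reference, the size arithmetic the caller sees (a sketch mirroring the bpf_map_value_size() change below; round_up() is the kernel helper and the numbers are only an illustration): on an 8-CPU system, a percpu map with value_size == 4 needs a round_up(4, 8) * 8 == 64-byte buffer without cpu flags, but only 4 bytes with BPF_F_CPU or BPF_F_ALL_CPUS.

/* Sketch of the expected user buffer size for a percpu map operation. */
static __u32 expected_user_buffer_size(__u32 value_size, __u32 nr_cpus, __u64 flags)
{
	if (flags & (BPF_F_CPU | BPF_F_ALL_CPUS))
		return value_size;			/* one value only */
	return round_up(value_size, 8) * nr_cpus;	/* classic percpu layout */
}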
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
v10 -> v11:
- Use '(BPF_F_ALL_CPUS << 1) - 1' as allowed_flags in map_update_elem().
- Add BPF_EXIST to allowed_flags in generic_map_update_batch().
---
 include/linux/bpf.h            | 23 +++++++++++++++++++++-
 include/uapi/linux/bpf.h       |  2 ++
 kernel/bpf/syscall.c           | 36 ++++++++++++++++++++--------------
 tools/include/uapi/linux/bpf.h |  2 ++
 4 files changed, 47 insertions(+), 16 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 6498be4c44f8..d84af3719b59 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -3829,14 +3829,35 @@ bpf_prog_update_insn_ptrs(struct bpf_prog *prog, u32 *offsets, void *image) } #endif
+static inline bool bpf_map_supports_cpu_flags(enum bpf_map_type map_type) +{ + return false; +} + static inline int bpf_map_check_op_flags(struct bpf_map *map, u64 flags, u64 allowed_flags) { - if (flags & ~allowed_flags) + u32 cpu; + + if ((u32)flags & ~allowed_flags) return -EINVAL;
if ((flags & BPF_F_LOCK) && !btf_record_has_field(map->record, BPF_SPIN_LOCK)) return -EINVAL;
+ if (!(flags & BPF_F_CPU) && flags >> 32) + return -EINVAL; + + if (flags & (BPF_F_CPU | BPF_F_ALL_CPUS)) { + if (!bpf_map_supports_cpu_flags(map->map_type)) + return -EINVAL; + if ((flags & BPF_F_CPU) && (flags & BPF_F_ALL_CPUS)) + return -EINVAL; + + cpu = flags >> 32; + if ((flags & BPF_F_CPU) && cpu >= num_possible_cpus()) + return -ERANGE; + } + return 0; }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index f5713f59ac10..8b6279ca6e66 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1373,6 +1373,8 @@ enum { BPF_NOEXIST = 1, /* create new element if it didn't exist */ BPF_EXIST = 2, /* update existing element */ BPF_F_LOCK = 4, /* spin_lock-ed map_lookup/map_update */ + BPF_F_CPU = 8, /* cpu flag for percpu maps, upper 32-bit of flags is a cpu number */ + BPF_F_ALL_CPUS = 16, /* update value across all CPUs for percpu maps */ };
/* flags for BPF_MAP_CREATE command */ diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index cef8963d69f9..3c3e3b4095b9 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -133,12 +133,14 @@ bool bpf_map_write_active(const struct bpf_map *map) return atomic64_read(&map->writecnt) != 0; }
-static u32 bpf_map_value_size(const struct bpf_map *map) -{ - if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH || - map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH || - map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY || - map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) +static u32 bpf_map_value_size(const struct bpf_map *map, u64 flags) +{ + if (flags & (BPF_F_CPU | BPF_F_ALL_CPUS)) + return map->value_size; + else if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH || + map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH || + map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY || + map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) return round_up(map->value_size, 8) * num_possible_cpus(); else if (IS_FD_MAP(map)) return sizeof(u32); @@ -1732,7 +1734,7 @@ static int map_lookup_elem(union bpf_attr *attr) if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) return -EPERM;
- err = bpf_map_check_op_flags(map, attr->flags, BPF_F_LOCK); + err = bpf_map_check_op_flags(map, attr->flags, BPF_F_LOCK | BPF_F_CPU); if (err) return err;
@@ -1740,7 +1742,7 @@ static int map_lookup_elem(union bpf_attr *attr) if (IS_ERR(key)) return PTR_ERR(key);
- value_size = bpf_map_value_size(map); + value_size = bpf_map_value_size(map, attr->flags);
err = -ENOMEM; value = kvmalloc(value_size, GFP_USER | __GFP_NOWARN); @@ -1781,6 +1783,7 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr) bpfptr_t uvalue = make_bpfptr(attr->value, uattr.is_kernel); struct bpf_map *map; void *key, *value; + u64 allowed_flags; u32 value_size; int err;
@@ -1797,7 +1800,8 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr) goto err_put; }
- err = bpf_map_check_op_flags(map, attr->flags, ~0); + allowed_flags = (BPF_F_ALL_CPUS << 1) - 1; + err = bpf_map_check_op_flags(map, attr->flags, allowed_flags); if (err) goto err_put;
@@ -1807,7 +1811,7 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr) goto err_put; }
- value_size = bpf_map_value_size(map); + value_size = bpf_map_value_size(map, attr->flags); value = kvmemdup_bpfptr(uvalue, value_size); if (IS_ERR(value)) { err = PTR_ERR(value); @@ -2001,13 +2005,15 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file, void __user *keys = u64_to_user_ptr(attr->batch.keys); u32 value_size, cp, max_count; void *key, *value; + u64 allowed_flags; int err = 0;
- err = bpf_map_check_op_flags(map, attr->batch.elem_flags, BPF_F_LOCK); + allowed_flags = BPF_EXIST | BPF_F_LOCK | BPF_F_CPU | BPF_F_ALL_CPUS; + err = bpf_map_check_op_flags(map, attr->batch.elem_flags, allowed_flags); if (err) return err;
- value_size = bpf_map_value_size(map); + value_size = bpf_map_value_size(map, attr->batch.elem_flags);
max_count = attr->batch.count; if (!max_count) @@ -2062,11 +2068,11 @@ int generic_map_lookup_batch(struct bpf_map *map, u32 value_size, cp, max_count; int err;
- err = bpf_map_check_op_flags(map, attr->batch.elem_flags, BPF_F_LOCK); + err = bpf_map_check_op_flags(map, attr->batch.elem_flags, BPF_F_LOCK | BPF_F_CPU); if (err) return err;
- value_size = bpf_map_value_size(map); + value_size = bpf_map_value_size(map, attr->batch.elem_flags);
max_count = attr->batch.count; if (!max_count) @@ -2188,7 +2194,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr) goto err_put; }
- value_size = bpf_map_value_size(map); + value_size = bpf_map_value_size(map, 0);
err = -ENOMEM; value = kvmalloc(value_size, GFP_USER | __GFP_NOWARN); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index f5713f59ac10..8b6279ca6e66 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1373,6 +1373,8 @@ enum { BPF_NOEXIST = 1, /* create new element if it didn't exist */ BPF_EXIST = 2, /* update existing element */ BPF_F_LOCK = 4, /* spin_lock-ed map_lookup/map_update */ + BPF_F_CPU = 8, /* cpu flag for percpu maps, upper 32-bit of flags is a cpu number */ + BPF_F_ALL_CPUS = 16, /* update value across all CPUs for percpu maps */ };
/* flags for BPF_MAP_CREATE command */
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 1eeb31c5b..241f11d4d 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c

[ ... ]

@@ -398,18 +405,18 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value,
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	u32 index = *(u32 *)key;
 	void __percpu *pptr;
-	int cpu, off = 0;
+	void *ptr, *val;
 	u32 size;
+	int cpu;

-	if (unlikely(map_flags > BPF_EXIST))
-		/* unknown flags */
+	if (unlikely(map_flags & BPF_F_LOCK))
 		return -EINVAL;

 	if (unlikely(index >= array->map.max_entries))
 		/* all elements were pre-allocated, cannot insert a new one */
 		return -E2BIG;

-	if (unlikely(map_flags == BPF_NOEXIST))
+	if (unlikely(map_flags & BPF_NOEXIST))
 		/* all elements already exist */
 		return -EEXIST;
This isn't a bug, but the change from equality check to bit test might allow confusing flag combinations. The old code rejected map_flags > 2, which prevented BPF_NOEXIST | BPF_EXIST (flags=3) entirely. With the new bit test, userspace can pass both flags and BPF_NOEXIST takes precedence.
Should bpf_map_check_op_flags() validate mutual exclusivity of BPF_NOEXIST and BPF_EXIST, similar to how it validates BPF_F_CPU and BPF_F_ALL_CPUS at include/linux/bpf.h:3858?
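If such a check were added, it could mirror the existing cpu-flag exclusivity test; a minimal sketch of the idea (illustrative only, hypothetical helper name, not part of the posted patch):

/* Illustrative sketch: reject the contradictory BPF_NOEXIST | BPF_EXIST
 * combination up front, the same way BPF_F_CPU and BPF_F_ALL_CPUS are
 * rejected when set together in bpf_map_check_op_flags().
 */
static inline int bpf_map_check_exist_flags(u64 flags)
{
	if ((flags & BPF_NOEXIST) && (flags & BPF_EXIST))
		return -EINVAL;
	return 0;
}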
--- AI reviewed your patch. Please fix the bug or email reply why it's not a bug. See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19674301388
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index c93a756e0..2ab4b60ff 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -200,8 +200,7 @@ int bpf_percpu_cgroup_storage_copy(struct bpf_map *_map, void *key,
 	 */
 	size = round_up(_map->value_size, 8);
 	for_each_possible_cpu(cpu) {
-		bpf_long_memcpy(value + off,
-				per_cpu_ptr(storage->percpu_buf, cpu), size);
+		copy_map_value_long(_map, value + off, per_cpu_ptr(storage->percpu_buf, cpu));
 		off += size;
 	}
Should this be calling check_and_init_map_value() after the copy, similar to how it's done in other percpu maps? For example, in bpf_percpu_array_copy():
	for_each_possible_cpu(cpu) {
		copy_map_value_long(map, value + off, per_cpu_ptr(pptr, cpu));
		check_and_init_map_value(map, value + off);
		off += size;
	}
And in percpu_hash lookup:
	for_each_possible_cpu(cpu) {
		copy_map_value_long(&htab->map, value + off, per_cpu_ptr(pptr, cpu));
		check_and_init_map_value(&htab->map, value + off);
		off += roundup_value_size;
	}
When copy_map_value_long() skips special fields like bpf_spin_lock or bpf_timer (which happens when map->record is non-NULL), those fields are left uninitialized in the destination buffer. The check_and_init_map_value() call initializes these special fields via bpf_obj_init().
While BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE is deprecated and existing users don't appear to use special fields, the inconsistency with other percpu map implementations could be confusing. Also, the commit message states "No functional change intended", but copy_map_value_long() behaves differently from bpf_long_memcpy() when map->record is non-NULL.
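For reference, a sketch of the cgroup-storage lookup loop if it followed the same pattern as the other percpu maps (illustrative only; the posted patch keeps just the copy):

	/* Sketch: copy each CPU's slice, then initialize special fields
	 * (e.g. bpf_spin_lock/bpf_timer) that copy_map_value_long() skips.
	 */
	size = round_up(_map->value_size, 8);
	for_each_possible_cpu(cpu) {
		copy_map_value_long(_map, value + off, per_cpu_ptr(storage->percpu_buf, cpu));
		check_and_init_map_value(_map, value + off);
		off += size;
	}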
@@ -234,8 +233,7 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *_map, void *key,
 	 */
 	size = round_up(_map->value_size, 8);
 	for_each_possible_cpu(cpu) {
-		bpf_long_memcpy(per_cpu_ptr(storage->percpu_buf, cpu),
-				value + off, size);
+		copy_map_value_long(_map, per_cpu_ptr(storage->percpu_buf, cpu), value + off);
 		off += size;
 	}
Same question here for the update path.
--- AI reviewed your patch. Please fix the bug or email reply why it's not a bug. See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19674301388
On Tue, Nov 25, 2025 at 7:00 AM Leon Hwang <leon.hwang@linux.dev> wrote:
Introduce BPF_F_ALL_CPUS flag support for percpu_hash and lru_percpu_hash maps to allow updating values for all CPUs with a single value for both update_elem and update_batch APIs.
Introduce BPF_F_CPU flag support for percpu_hash and lru_percpu_hash maps to allow:
- update value for specified CPU for both update_elem and update_batch
APIs.
- lookup value for specified CPU for both lookup_elem and lookup_batch
APIs.
The BPF_F_CPU flag is passed via:
- map_flags along with embedded cpu info.
- elem_flags along with embedded cpu info.
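The cpu number rides in the upper 32 bits of that 64-bit flags word. A minimal userspace sketch of the encoding (illustrative only, not taken from the selftests; assumes UAPI headers that already define the new flags and a percpu map fd):

#include <bpf/bpf.h>
#include <linux/bpf.h>

/* Update CPU 3's copy of the value, then read it back. With BPF_F_CPU,
 * the user buffer holds a single value rather than one per possible CPU.
 */
static int touch_cpu3(int map_fd, __u32 key)
{
	__u32 cpu = 3, val = 42, out = 0;
	__u64 flags = BPF_F_CPU | ((__u64)cpu << 32);
	int err;

	err = bpf_map_update_elem(map_fd, &key, &val, flags);
	if (err)
		return err;
	return bpf_map_lookup_elem_flags(map_fd, &key, &out, flags);
}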
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
v10 -> v11:
- Drop buggy '(u32)map_flags > BPF_F_ALL_CPUS' check in htab_map_check_update_flags().
why?
- Update 'map_flags != BPF_EXIST' to '!(map_flags & BPF_EXIST)' in __htab_lru_percpu_map_update_elem().
 include/linux/bpf.h  |  4 +-
 kernel/bpf/hashtab.c | 96 ++++++++++++++++++++++++++++++--------------
 kernel/bpf/syscall.c |  2 +-
 3 files changed, 69 insertions(+), 33 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 01a99e3a3e51..f79d2ae27335 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2761,7 +2761,7 @@ int map_set_for_each_callback_args(struct bpf_verifier_env *env,
 				   struct bpf_func_state *caller,
 				   struct bpf_func_state *callee);
-int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
+int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value, u64 flags);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value, u64 flags);
 int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value, u64 flags);
@@ -3833,6 +3833,8 @@ static inline bool bpf_map_supports_cpu_flags(enum bpf_map_type map_type)
 {
 	switch (map_type) {
 	case BPF_MAP_TYPE_PERCPU_ARRAY:
+	case BPF_MAP_TYPE_PERCPU_HASH:
+	case BPF_MAP_TYPE_LRU_PERCPU_HASH:
 		return true;
 	default:
 		return false;
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index c8a9b27f8663..c768bf71d60f 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -932,7 +932,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 }
 static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
-			    void *value, bool onallcpus)
+			    void *value, bool onallcpus, u64 map_flags)
 {
 	void *ptr;
@@ -943,19 +943,28 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
 		bpf_obj_free_fields(htab->map.record, ptr);
 	} else {
 		u32 size = round_up(htab->map.value_size, 8);
-		int off = 0, cpu;
+		void *val;
+		int cpu;
+
+		if (map_flags & BPF_F_CPU) {
+			cpu = map_flags >> 32;
+			ptr = per_cpu_ptr(pptr, cpu);
+			copy_map_value(&htab->map, ptr, value);
+			bpf_obj_free_fields(htab->map.record, ptr);
+			return;
+		}
 
 		for_each_possible_cpu(cpu) {
 			ptr = per_cpu_ptr(pptr, cpu);
-			copy_map_value_long(&htab->map, ptr, value + off);
+			val = (map_flags & BPF_F_ALL_CPUS) ? value : value + size * cpu;
+			copy_map_value(&htab->map, ptr, val);
 			bpf_obj_free_fields(htab->map.record, ptr);
-			off += size;
 		}
 	}
 }
 static void pcpu_init_value(struct bpf_htab *htab, void __percpu *pptr,
-			    void *value, bool onallcpus)
+			    void *value, bool onallcpus, u64 map_flags)
 {
 	/* When not setting the initial value on all cpus, zero-fill element
 	 * values for other cpus. Otherwise, bpf program has no way to ensure
@@ -973,7 +982,7 @@ static void pcpu_init_value(struct bpf_htab *htab, void __percpu *pptr,
 			zero_map_value(&htab->map, per_cpu_ptr(pptr, cpu));
 		}
 	} else {
-		pcpu_copy_value(htab, pptr, value, onallcpus);
+		pcpu_copy_value(htab, pptr, value, onallcpus, map_flags);
 	}
 }
@@ -985,7 +994,7 @@ static bool fd_htab_map_needs_adjust(const struct bpf_htab *htab)
 static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 					 void *value, u32 key_size, u32 hash,
 					 bool percpu, bool onallcpus,
-					 struct htab_elem *old_elem)
+					 struct htab_elem *old_elem, u64 map_flags)
 {
 	u32 size = htab->map.value_size;
 	bool prealloc = htab_is_prealloc(htab);
@@ -1043,7 +1052,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			pptr = *(void __percpu **)ptr;
 		}
 
-		pcpu_init_value(htab, pptr, value, onallcpus);
+		pcpu_init_value(htab, pptr, value, onallcpus, map_flags);
 
 		if (!prealloc)
 			htab_elem_set_ptr(l_new, key_size, pptr);
@@ -1147,7 +1156,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	}
 
 	l_new = alloc_htab_elem(htab, key, value, key_size, hash, false, false,
-				l_old);
+				l_old, map_flags);
 	if (IS_ERR(l_new)) {
 		/* all pre-allocated elements are in use or memory exhausted */
 		ret = PTR_ERR(l_new);
@@ -1249,6 +1258,15 @@ static long htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value
 	return ret;
 }
 
+static int htab_map_check_update_flags(bool onallcpus, u64 map_flags)
+{
+	if (unlikely(!onallcpus && map_flags > BPF_EXIST))
+		return -EINVAL;
+	if (unlikely(onallcpus && (map_flags & BPF_F_LOCK)))
+		return -EINVAL;
+	return 0;
+}
+
 static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
 					  void *value, u64 map_flags,
 					  bool percpu, bool onallcpus)
@@ -1262,9 +1280,9 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
 	u32 key_size, hash;
 	int ret;
 
-	if (unlikely(map_flags > BPF_EXIST))
-		/* unknown flags */
-		return -EINVAL;
+	ret = htab_map_check_update_flags(onallcpus, map_flags);
+	if (unlikely(ret))
+		return ret;
 
 	WARN_ON_ONCE(!bpf_rcu_lock_held());
@@ -1289,7 +1307,7 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
 		/* Update value in-place */
 		if (percpu) {
 			pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
-					value, onallcpus);
+					value, onallcpus, map_flags);
 		} else {
 			void **inner_map_pptr = htab_elem_value(l_old, key_size);
@@ -1298,7 +1316,7 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
 		}
 	} else {
 		l_new = alloc_htab_elem(htab, key, value, key_size,
-					hash, percpu, onallcpus, NULL);
+					hash, percpu, onallcpus, NULL, map_flags);
 		if (IS_ERR(l_new)) {
 			ret = PTR_ERR(l_new);
 			goto err;
@@ -1324,9 +1342,9 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 	u32 key_size, hash;
 	int ret;
 
-	if (unlikely(map_flags > BPF_EXIST))
-		/* unknown flags */
-		return -EINVAL;
+	ret = htab_map_check_update_flags(onallcpus, map_flags);
+	if (unlikely(ret))
+		return ret;
 
 	WARN_ON_ONCE(!bpf_rcu_lock_held());
@@ -1342,7 +1360,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 	 * to remove older elem from htab and this removal
 	 * operation will need a bucket lock.
 	 */
-	if (map_flags != BPF_EXIST) {
+	if (!(map_flags & BPF_EXIST)) {
 		l_new = prealloc_lru_pop(htab, key, hash);
 		if (!l_new)
 			return -ENOMEM;
It's not in the diff, but this is broken. You tried to allow the BPF_EXIST combination here, but didn't update check_flags(), so for a BPF_[NO]EXIST | BPF_F_CPU combination check_flags() will always return 0 and the BPF_[NO]EXIST flag will make no difference.
When you add features, always always add unit tests. Patch 8 is not it. It's testing F_CPU. It doesn't check that BPF_EXIST | BPF_F_CPU correctly errors when an element doesn't exist.
v10 was close, but then you decided to add this BPF_EXIST feature and did it in a sloppy way. Why ? Focus on one thing only. Land it and then do the next one. 11 revisions and still no go... it is not a good sign.
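To make that request concrete, a sketch of the kind of assertion being asked for (hypothetical selftest-style code, not part of patch 8; assumes a percpu_hash map fd with no element at key and libbpf's 1.0 error-return convention):

#include <bpf/bpf.h>
#include <linux/bpf.h>

/* BPF_EXIST | BPF_F_CPU on a missing element should fail with -ENOENT;
 * if check_flags() ignores BPF_EXIST here, the update would wrongly succeed.
 */
static int check_exist_with_cpu_flag(int map_fd, __u32 key, __u32 cpu)
{
	__u32 val = 1;
	__u64 flags = BPF_EXIST | BPF_F_CPU | ((__u64)cpu << 32);
	int err;

	err = bpf_map_update_elem(map_fd, &key, &val, flags);
	return err == -ENOENT ? 0 : -1;
}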
pw-bot: cr
On Tue, Nov 25, 2025 at 6:59 AM Leon Hwang <leon.hwang@linux.dev> wrote:
Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags and check them for following APIs:
- 'map_lookup_elem()'
- 'map_update_elem()'
- 'generic_map_lookup_batch()'
- 'generic_map_update_batch()'
And, get the correct value size for these APIs.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
v10 -> v11:
- Use '(BPF_F_ALL_CPUS << 1) - 1' as allowed_flags in map_update_elem().
- Add BPF_EXIST to allowed_flags in generic_map_update_batch().
It should be mentioned in the commit log. Lines after --- don't stay in the log.
 include/linux/bpf.h            | 23 +++++++++++++++++++++-
 include/uapi/linux/bpf.h       |  2 ++
 kernel/bpf/syscall.c           | 36 ++++++++++++++++++++--------------
 tools/include/uapi/linux/bpf.h |  2 ++
 4 files changed, 47 insertions(+), 16 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6498be4c44f8..d84af3719b59 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3829,14 +3829,35 @@ bpf_prog_update_insn_ptrs(struct bpf_prog *prog, u32 *offsets, void *image)
 }
 #endif
+static inline bool bpf_map_supports_cpu_flags(enum bpf_map_type map_type)
+{
+	return false;
+}
+
 static inline int bpf_map_check_op_flags(struct bpf_map *map, u64 flags,
 					 u64 allowed_flags)
 {
-	if (flags & ~allowed_flags)
+	u32 cpu;
+
+	if ((u32)flags & ~allowed_flags)
 		return -EINVAL;
 
 	if ((flags & BPF_F_LOCK) &&
 	    !btf_record_has_field(map->record, BPF_SPIN_LOCK))
 		return -EINVAL;
 
+	if (!(flags & BPF_F_CPU) && flags >> 32)
+		return -EINVAL;
+
+	if (flags & (BPF_F_CPU | BPF_F_ALL_CPUS)) {
+		if (!bpf_map_supports_cpu_flags(map->map_type))
+			return -EINVAL;
+		if ((flags & BPF_F_CPU) && (flags & BPF_F_ALL_CPUS))
+			return -EINVAL;
+
+		cpu = flags >> 32;
+		if ((flags & BPF_F_CPU) && cpu >= num_possible_cpus())
+			return -ERANGE;
+	}
+
 	return 0;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f5713f59ac10..8b6279ca6e66 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1373,6 +1373,8 @@ enum {
 	BPF_NOEXIST = 1, /* create new element if it didn't exist */
 	BPF_EXIST = 2, /* update existing element */
 	BPF_F_LOCK = 4, /* spin_lock-ed map_lookup/map_update */
+	BPF_F_CPU = 8, /* cpu flag for percpu maps, upper 32-bit of flags is a cpu number */
+	BPF_F_ALL_CPUS = 16, /* update value across all CPUs for percpu maps */
 };
 /* flags for BPF_MAP_CREATE command */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cef8963d69f9..3c3e3b4095b9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -133,12 +133,14 @@ bool bpf_map_write_active(const struct bpf_map *map)
 	return atomic64_read(&map->writecnt) != 0;
 }
-static u32 bpf_map_value_size(const struct bpf_map *map)
-{
-	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
-	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
-	    map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
-	    map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
+static u32 bpf_map_value_size(const struct bpf_map *map, u64 flags)
+{
+	if (flags & (BPF_F_CPU | BPF_F_ALL_CPUS))
+		return map->value_size;
+	else if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+		 map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
+		 map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
+		 map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
 		return round_up(map->value_size, 8) * num_possible_cpus();
 	else if (IS_FD_MAP(map))
 		return sizeof(u32);
@@ -1732,7 +1734,7 @@ static int map_lookup_elem(union bpf_attr *attr)
 	if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ))
 		return -EPERM;
-	err = bpf_map_check_op_flags(map, attr->flags, BPF_F_LOCK);
+	err = bpf_map_check_op_flags(map, attr->flags, BPF_F_LOCK | BPF_F_CPU);
 	if (err)
 		return err;
@@ -1740,7 +1742,7 @@ static int map_lookup_elem(union bpf_attr *attr)
 	if (IS_ERR(key))
 		return PTR_ERR(key);
-	value_size = bpf_map_value_size(map);
+	value_size = bpf_map_value_size(map, attr->flags);
 
 	err = -ENOMEM;
 	value = kvmalloc(value_size, GFP_USER | __GFP_NOWARN);
@@ -1781,6 +1783,7 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
 	bpfptr_t uvalue = make_bpfptr(attr->value, uattr.is_kernel);
 	struct bpf_map *map;
 	void *key, *value;
+	u64 allowed_flags;
 	u32 value_size;
 	int err;
@@ -1797,7 +1800,8 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
 		goto err_put;
 	}
-	err = bpf_map_check_op_flags(map, attr->flags, ~0);
+	allowed_flags = (BPF_F_ALL_CPUS << 1) - 1;
This is cryptic. Use allowed_flags = BPF_NOEXIST | BPF_EXIST | BPF_F_LOCK | BPF_F_CPU | BPF_F_ALL_CPUS;
On 26/11/25 07:11, Alexei Starovoitov wrote:
On Tue, Nov 25, 2025 at 7:00 AM Leon Hwang <leon.hwang@linux.dev> wrote:
[...]
@@ -1342,7 +1360,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 	 * to remove older elem from htab and this removal
 	 * operation will need a bucket lock.
 	 */
-	if (map_flags != BPF_EXIST) {
+	if (!(map_flags & BPF_EXIST)) {
 		l_new = prealloc_lru_pop(htab, key, hash);
 		if (!l_new)
 			return -ENOMEM;

It's not in the diff, but this is broken. You tried to allow the BPF_EXIST combination here, but didn't update check_flags(), so for a BPF_[NO]EXIST | BPF_F_CPU combination check_flags() will always return 0 and the BPF_[NO]EXIST flag will make no difference.
When you add features, always always add unit tests. Patch 8 is not it. It's testing F_CPU. It doesn't check that BPF_EXIST | BPF_F_CPU correctly errors when an element doesn't exist.
v10 was close, but then you decided to add this BPF_EXIST feature and did it in a sloppy way. Why ? Focus on one thing only. Land it and then do the next one. 11 revisions and still no go... it is not a good sign.
Yeah, you're right.
The intention of v11 was solely to address the unstable lru_percpu_hash map test — not to introduce support for the BPF_EXIST combination.
Given that, the approach in v11 was not the right way to fix the instability. I'll investigate the underlying cause first and then work on a better fix.
Thanks, Leon