After switching to memcg-based bpf memory accounting, bpf memory is
charged to the loader's memcg by default, which causes unexpected issues
for us. For instance, the container of the loader (which loads the bpf
programs and pins them on bpffs) may restart after pinning the progs and
maps. After the restart, the pinned progs and maps no longer belong to the
new container; they belong to an offline memcg left by the previous
generation. This inconsistency complicates memory resource management for
the container.
The reason why these progs and maps have to persist across multiple
generations is that they are also used by other processes which are not in
this container. IOW, they can't be removed when this container is
restarted. As a specific example, a bpf program for the clsact qdisc is
loaded by an agent running in a container; the agent not only loads the
bpf program but also processes the data generated by this program and does
some other maintenance work.
In order to keep the charging behavior consistent, we considered
recharging these pinned maps and progs after the container is restarted,
but after the discussion[1] with Roman, we decided to go in another
direction: don't charge them to the container in the first place. TL;DR of
that discussion: recharging is not a generic solution and it carries too
much risk.
This patchset implements the no-charge solution. Two flags are introduced
in union bpf_attr, one for bpf maps and another for bpf progs. A user who
doesn't want to charge the current memcg can set these flags. They are
only permitted for sys admin, as the memory will be accounted to the root
memcg only.
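As an illustration of the permission rule described above, here is a
minimal userspace sketch of how such a flag check could look. It is a
model under stated assumptions, not the code in this series: the name
HYP_BPF_F_NO_CHARGE, its bit position, the base mask, the error codes and
both helper functions are hypothetical and chosen only for illustration;
the actual definitions live in the patches themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit position; the real value comes from the series' UAPI header. */
#define HYP_BPF_F_NO_CHARGE       (1U << 14)
/* Hypothetical stand-in for the flags already accepted at map creation. */
#define HYP_MAP_CREATE_BASE_MASK  ((1U << 14) - 1)
/* The new flag has to be part of the accepted mask, otherwise it is rejected. */
#define HYP_MAP_CREATE_FLAG_MASK  (HYP_MAP_CREATE_BASE_MASK | HYP_BPF_F_NO_CHARGE)

static_assert((HYP_BPF_F_NO_CHARGE & HYP_MAP_CREATE_BASE_MASK) == 0,
	      "the no-charge bit must not overlap the existing flags");

/* Hypothetical model of where the memory ends up being accounted. */
enum hyp_charge_target {
	HYP_CHARGE_LOADER_MEMCG,	/* default: the loader's memcg */
	HYP_CHARGE_ROOT_MEMCG,		/* no-charge: accounted to root only */
};

/*
 * Model of the creation-time check: unknown bits are rejected, and the
 * no-charge bit is only accepted from a sys admin. The error codes are
 * an assumption of this sketch.
 */
int hyp_check_map_flags(uint32_t map_flags, bool is_sys_admin)
{
	if (map_flags & ~HYP_MAP_CREATE_FLAG_MASK)
		return -EINVAL;
	if ((map_flags & HYP_BPF_F_NO_CHARGE) && !is_sys_admin)
		return -EPERM;
	return 0;
}

/* Model of the accounting decision once the flags have been validated. */
enum hyp_charge_target hyp_pick_charge_target(uint32_t map_flags)
{
	if (map_flags & HYP_BPF_F_NO_CHARGE)
		return HYP_CHARGE_ROOT_MEMCG;
	return HYP_CHARGE_LOADER_MEMCG;
}
```

In this model, hyp_check_map_flags(HYP_BPF_F_NO_CHARGE, false) yields
-EPERM, while the same call with is_sys_admin set to true yields 0 and
hyp_pick_charge_target() then selects the root memcg.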
Patches #1~#8 are for bpf maps. Patches #9~#12 are for bpf progs. Patches
#13 and #14 are selftests, which also serve as examples of how to use the
flags.
[1]. https://lwn.net/Articles/887180/
Yafang Shao (14):
bpf: Introduce no charge flag for bpf map
bpf: Only sys admin can set no charge flag
bpf: Enable no charge in map _CREATE_FLAG_MASK
bpf: Introduce new parameter bpf_attr in bpf_map_area_alloc
bpf: Allow no charge in bpf_map_area_alloc
bpf: Allow no charge for allocation not at map creation time
bpf: Allow no charge in map specific allocation
bpf: Aggregate flags for BPF_PROG_LOAD command
bpf: Add no charge flag for bpf prog
bpf: Only sys admin can set no charge flag for bpf prog
bpf: Set __GFP_ACCOUNT at the callsite of bpf_prog_alloc
bpf: Allow no charge for bpf prog
bpf: selftests: Add test case for BPF_F_NO_CHARGE
bpf: selftests: Add test case for BPF_F_PROG_NO_CHARGE
include/linux/bpf.h | 27 ++++++-
include/uapi/linux/bpf.h | 21 +++--
kernel/bpf/arraymap.c | 9 +--
kernel/bpf/bloom_filter.c | 7 +-
kernel/bpf/bpf_local_storage.c | 8 +-
kernel/bpf/bpf_struct_ops.c | 13 +--
kernel/bpf/core.c | 20 +++--
kernel/bpf/cpumap.c | 10 ++-
kernel/bpf/devmap.c | 14 ++--
kernel/bpf/hashtab.c | 14 ++--
kernel/bpf/local_storage.c | 4 +-
kernel/bpf/lpm_trie.c | 4 +-
kernel/bpf/queue_stack_maps.c | 5 +-
kernel/bpf/reuseport_array.c | 3 +-
kernel/bpf/ringbuf.c | 19 ++---
kernel/bpf/stackmap.c | 13 +--
kernel/bpf/syscall.c | 40 +++++++---
kernel/bpf/verifier.c | 2 +-
net/core/filter.c | 6 +-
net/core/sock_map.c | 8 +-
net/xdp/xskmap.c | 9 ++-
tools/include/uapi/linux/bpf.h | 21 +++--
.../selftests/bpf/map_tests/no_charg.c | 79 +++++++++++++++++++
.../selftests/bpf/prog_tests/no_charge.c | 49 ++++++++++++
24 files changed, 297 insertions(+), 108 deletions(-)
create mode 100644 tools/testing/selftests/bpf/map_tests/no_charg.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/no_charge.c
--
2.17.1
Hi,
On linux-next
cd tools/testing/selftests/futex && make clean -j 32
gives warning
make[1]: warning: jobserver unavailable: using -j1. Add '+' to parent
make rule.
The full logs with different reproduction steps can be found here:
https://storage.staging.kernelci.org/next/master/next-20220310/x86_64/x86_6….
Usually this type of warning shouldn't appear when $(MAKE) is used
instead of make in the Makefile.
Maybe the `define CLEAN` inside the override construct defined in the
parent makefile is not getting jobserver information when the child make
process executes. I've enabled verbose mode and tried other make flags
(-p, -d etc.) as well. The documentation mentions that this warning
appears if make is unable to identify the child process correctly.
Please share if you have any thoughts on it.
--
Muhammad Usama Anjum
Simplify the test_encl_bootstrap.S flow by using rip-relative addressing.
The compiler does the right thing here, and this removes the dependency on
where the TCS entries are located in the binary, i.e. it allows the binary
layout to change freely in the future.
Cc: Reinette Chatre <reinette.chatre(a)intel.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko(a)kernel.org>
---
This has been on my mind for a while, and since the kselftest is
seemingly growing, I thought it better to get rid of such an
artificial limitation on the binary layout.
tools/testing/selftests/sgx/test_encl_bootstrap.S | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/tools/testing/selftests/sgx/test_encl_bootstrap.S b/tools/testing/selftests/sgx/test_encl_bootstrap.S
index 82fb0dfcbd23..1c1b5c6c4ffe 100644
--- a/tools/testing/selftests/sgx/test_encl_bootstrap.S
+++ b/tools/testing/selftests/sgx/test_encl_bootstrap.S
@@ -40,11 +40,7 @@
.text
encl_entry:
- # RBX contains the base address for TCS, which is the first address
- # inside the enclave for TCS #1 and one page into the enclave for
- # TCS #2. By adding the value of encl_stack to it, we get
- # the absolute address for the stack.
- lea (encl_stack)(%rbx), %rax
+ lea (encl_stack)(%rip), %rax
xchg %rsp, %rax
push %rax
--
2.35.1
From: Yosry Ahmed <yosryahmed(a)google.com>
[ Upstream commit 1c4debc443ef7037dcb7c4f08c33b9caebd21d2e ]
When building the vm selftests using clang, some errors are seen due to
having headers in the compilation command:
clang -Wall -I ../../../../usr/include -no-pie gup_test.c ../../../../mm/gup_test.h -lrt -lpthread -o .../tools/testing/selftests/vm/gup_test
clang: error: cannot specify -o when generating multiple output files
make[1]: *** [../lib.mk:146: .../tools/testing/selftests/vm/gup_test] Error 1
Rework to add the header files to LOCAL_HDRS before including ../lib.mk,
since the dependency is evaluated in '$(OUTPUT)/%:%.c $(LOCAL_HDRS)' in
file lib.mk.
Link: https://lkml.kernel.org/r/20220304000645.1888133-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed(a)google.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/vm/Makefile | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index d9605bd10f2d..acf5eaeef9ff 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -1,6 +1,8 @@
# SPDX-License-Identifier: GPL-2.0
# Makefile for vm selftests
+LOCAL_HDRS += $(selfdir)/vm/local_config.h $(top_srcdir)/mm/gup_test.h
+
include local_config.mk
uname_M := $(shell uname -m 2>/dev/null || echo not)
@@ -139,10 +141,6 @@ endif
$(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap
-$(OUTPUT)/gup_test: ../../../../mm/gup_test.h
-
-$(OUTPUT)/hmm-tests: local_config.h
-
# HMM_EXTRA_LIBS may get set in local_config.mk, or it may be left empty.
$(OUTPUT)/hmm-tests: LDLIBS += $(HMM_EXTRA_LIBS)
--
2.34.1