This patch series revisits the proposal for a GPU cgroup controller to
track and limit memory allocations by various device/allocator
subsystems. The patch series also contains a simple prototype to
illustrate how Android intends to implement DMA-BUF allocator
attribution using the GPU cgroup controller. The prototype does not
include resource limit enforcement.
Changelog:
v3:
 - Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz
 - Use more common dual author commit message format per John Stultz
 - Remove android from binder changes title per Todd Kjos
 - Add a kselftest for this new behavior per Greg Kroah-Hartman
 - Include details on behavior for all combinations of kernel/userspace
   versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman
 - Fix pid and uid types in binder UAPI header
v2:
 - See the previous revision of this change submitted by Hridya Valsaraju
   at: https://lore.kernel.org/all/20220115010622.3185921-1-hridya@google.com/
 - Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
   heap to a single dma-buf function for all heaps per Daniel Vetter and
   Christian König. Pointers to struct gpucg and struct gpucg_device
   tracking the current associations were added to the dma_buf struct to
   achieve this.
 - Fix incorrect Kconfig help section indentation per Randy Dunlap.
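
The single charge transfer function described in the v2 notes can be
pictured with a small user-space model. This is only a sketch under
assumptions: model_gpucg, model_buf, model_charge and
model_transfer_charge are hypothetical names invented for illustration,
not the kernel API added by this series. Like the prototype, it enforces
no limits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup's per-device usage counter. */
struct model_gpucg {
	size_t usage;			/* bytes currently charged */
};

/* Hypothetical stand-in for a dma-buf that remembers who is charged. */
struct model_buf {
	size_t size;			/* bytes backing the buffer */
	struct model_gpucg *owner;	/* cgroup currently charged, or NULL */
};

/* Charge a freshly exported buffer to the allocating cgroup. */
void model_charge(struct model_buf *buf, struct model_gpucg *cg)
{
	cg->usage += buf->size;
	buf->owner = cg;
}

/*
 * Move the charge to another cgroup. Because the buffer itself records
 * its current owner, one common helper is enough for every heap.
 * Returns 0 on success, -1 if the buffer was never charged.
 */
int model_transfer_charge(struct model_buf *buf, struct model_gpucg *to)
{
	if (!buf->owner)
		return -1;		/* nothing charged yet */
	if (buf->owner == to)
		return 0;		/* already charged there */
	buf->owner->usage -= buf->size;
	to->usage += buf->size;
	buf->owner = to;
	return 0;
}
```

In this model the owner pointer plays the role of the association
pointers added to the dma_buf struct: the transfer needs no per-heap
callback because the buffer already knows which counter to uncharge.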
History of the GPU cgroup controller
====================================
The GPU/DRM cgroup controller came into being when a consensus[1]
was reached that the resources it tracked were unsuitable to be integrated
into memcg. Originally, the proposed controller was specific to the DRM
subsystem and was intended to track GEM buffers and GPU-specific
resources[2]. To help establish a unified memory accounting model for
GPUs and all related subsystems, Daniel Vetter suggested moving it out
of the DRM subsystem so that it could be used by other DMA-BUF exporters
as well[3]. This RFC proposes an interface that does the same.
[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-…
[2]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.co…
[3]: https://lore.kernel.org/amd-gfx/YCVOl8%2F87bqRSQei@phenom.ffwll.local/
Hridya Valsaraju (5):
gpu: rfc: Proposal for a GPU cgroup controller
cgroup: gpu: Add a cgroup controller for allocator attribution of GPU
memory
dmabuf: heaps: export system_heap buffers with GPU cgroup charging
dmabuf: Add gpu cgroup charge transfer function
binder: Add a buffer flag to relinquish ownership of fds
T.J. Mercier (3):
dmabuf: Use the GPU cgroup charge/uncharge APIs
binder: use __kernel_pid_t and __kernel_uid_t for userspace
selftests: Add binder cgroup gpu memory transfer test
Documentation/gpu/rfc/gpu-cgroup.rst | 183 +++++++
Documentation/gpu/rfc/index.rst | 4 +
drivers/android/binder.c | 26 +
drivers/dma-buf/dma-buf.c | 100 ++++
drivers/dma-buf/dma-heap.c | 27 +
drivers/dma-buf/heaps/system_heap.c | 3 +
include/linux/cgroup_gpu.h | 127 +++++
include/linux/cgroup_subsys.h | 4 +
include/linux/dma-buf.h | 22 +-
include/linux/dma-heap.h | 11 +
include/uapi/linux/android/binder.h | 5 +-
init/Kconfig | 7 +
kernel/cgroup/Makefile | 1 +
kernel/cgroup/gpu.c | 304 +++++++++++
.../selftests/drivers/android/binder/Makefile | 8 +
.../drivers/android/binder/binder_util.c | 254 +++++++++
.../drivers/android/binder/binder_util.h | 32 ++
.../selftests/drivers/android/binder/config | 4 +
.../binder/test_dmabuf_cgroup_transfer.c | 480 ++++++++++++++++++
19 files changed, 1598 insertions(+), 4 deletions(-)
create mode 100644 Documentation/gpu/rfc/gpu-cgroup.rst
create mode 100644 include/linux/cgroup_gpu.h
create mode 100644 kernel/cgroup/gpu.c
create mode 100644 tools/testing/selftests/drivers/android/binder/Makefile
create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.c
create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.h
create mode 100644 tools/testing/selftests/drivers/android/binder/config
create mode 100644 tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
--
2.35.1.616.g0bdcbb4464-goog
After switching to memcg-based bpf memory accounting, bpf memory is
charged to the loader's memcg by default, which causes unexpected issues
for us. For instance, the container of the loader (which loads the bpf
programs and pins them on bpffs) may restart after pinning the progs and
maps. After the restart, the pinned progs and maps no longer belong to
the new container; they remain charged to an offline memcg left by the
previous generation. This inconsistency complicates memory resource
management for the container.
These progs and maps have to persist across multiple generations because
they are also used by other processes that are not in this container.
IOW, they can't be removed when the container is restarted. As a specific
example, a bpf program for the clsact qdisc is loaded by an agent running
in a container; the agent not only loads the bpf program but also
processes the data it generates and performs other maintenance tasks.
To keep the charging behavior consistent, we initially considered
recharging these pinned maps and progs after the container is restarted,
but after the discussion[1] with Roman, we decided to go in another
direction: don't charge them to the container in the first place. TL;DR
of that discussion: recharging is not a generic solution and it carries
too much risk.
This patchset implements the no-charge approach. Two flags are introduced
in union bpf_attr, one for bpf maps and another for bpf progs. A user who
doesn't want to charge the current memcg can set these flags. The flags
are only permitted for sys admin, as the memory will be accounted to the
root memcg only.
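
To make the intended behavior concrete, here is a small user-space model
of the decision described above. It is a sketch, not the kernel code from
this series: MODEL_F_NO_CHARGE, its bit value, the model_memcg enum and
model_pick_memcg are all made up for illustration, and returning -EPERM
to a non-admin caller is an assumption about how the restriction would
be reported.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Placeholder bit; the real value of the no-charge flag is not shown here. */
#define MODEL_F_NO_CHARGE (1U << 0)

enum model_memcg {
	MODEL_MEMCG_CURRENT,	/* memcg of the loading task (default) */
	MODEL_MEMCG_ROOT,	/* root memcg, i.e. not charged to a container */
};

/*
 * Decide where memory for a new map or prog is accounted.
 * Returns 0 and sets *target on success, or -EPERM (assumed error code)
 * when a non-admin caller asks for the no-charge behavior.
 */
int model_pick_memcg(unsigned int flags, bool is_sys_admin,
		     enum model_memcg *target)
{
	if (!(flags & MODEL_F_NO_CHARGE)) {
		*target = MODEL_MEMCG_CURRENT;
		return 0;
	}
	if (!is_sys_admin)
		return -EPERM;
	*target = MODEL_MEMCG_ROOT;
	return 0;
}
```

With the flag set by an admin, a restart of the loader's container leaves
nothing charged to an offline memcg, because the memory was accounted to
the root memcg from the start.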
Patches #1~#8 are for bpf maps. Patches #9~#12 are for bpf progs. Patches
#13 and #14 add selftests, which also serve as examples of how to use the
flags.
[1]. https://lwn.net/Articles/887180/
Yafang Shao (14):
bpf: Introduce no charge flag for bpf map
bpf: Only sys admin can set no charge flag
bpf: Enable no charge in map _CREATE_FLAG_MASK
bpf: Introduce new parameter bpf_attr in bpf_map_area_alloc
bpf: Allow no charge in bpf_map_area_alloc
bpf: Allow no charge for allocation not at map creation time
bpf: Allow no charge in map specific allocation
bpf: Aggregate flags for BPF_PROG_LOAD command
bpf: Add no charge flag for bpf prog
bpf: Only sys admin can set no charge flag for bpf prog
bpf: Set __GFP_ACCOUNT at the callsite of bpf_prog_alloc
bpf: Allow no charge for bpf prog
bpf: selftests: Add test case for BPF_F_NO_CHARGE
bpf: selftests: Add test case for BPF_F_PROG_NO_CHARGE
include/linux/bpf.h | 27 ++++++-
include/uapi/linux/bpf.h | 21 +++--
kernel/bpf/arraymap.c | 9 +--
kernel/bpf/bloom_filter.c | 7 +-
kernel/bpf/bpf_local_storage.c | 8 +-
kernel/bpf/bpf_struct_ops.c | 13 +--
kernel/bpf/core.c | 20 +++--
kernel/bpf/cpumap.c | 10 ++-
kernel/bpf/devmap.c | 14 ++--
kernel/bpf/hashtab.c | 14 ++--
kernel/bpf/local_storage.c | 4 +-
kernel/bpf/lpm_trie.c | 4 +-
kernel/bpf/queue_stack_maps.c | 5 +-
kernel/bpf/reuseport_array.c | 3 +-
kernel/bpf/ringbuf.c | 19 ++---
kernel/bpf/stackmap.c | 13 +--
kernel/bpf/syscall.c | 40 +++++++---
kernel/bpf/verifier.c | 2 +-
net/core/filter.c | 6 +-
net/core/sock_map.c | 8 +-
net/xdp/xskmap.c | 9 ++-
tools/include/uapi/linux/bpf.h | 21 +++--
.../selftests/bpf/map_tests/no_charg.c | 79 +++++++++++++++++++
.../selftests/bpf/prog_tests/no_charge.c | 49 ++++++++++++
24 files changed, 297 insertions(+), 108 deletions(-)
create mode 100644 tools/testing/selftests/bpf/map_tests/no_charg.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/no_charge.c
--
2.17.1
Hi,
On linux-next, running
cd tools/testing/selftests/futex && make clean -j 32
gives the warning
make[1]: warning: jobserver unavailable: using -j1. Add '+' to parent
make rule.
The full logs with different reproduction steps can be found here:
https://storage.staging.kernelci.org/next/master/next-20220310/x86_64/x86_6….
Usually this type of warning should not appear when $(MAKE) is used
instead of make in the Makefile.
Perhaps the `define CLEAN` inside the override construct defined in the
parent makefile does not get the jobserver information when the child
make process executes. I've enabled verbose mode and tried other make
flags (-p, -d etc) as well. The documentation mentions that this warning
appears when make is unable to identify the child process correctly.
Please share if you have any thoughts on it.
--
Muhammad Usama Anjum
Simplify the test_encl_bootstrap.S flow by using RIP-relative addressing.
The compiler does the right thing here, and this removes the dependency on
where TCS entries need to be located in the binary, i.e. it allows the
binary layout to change freely in the future.
Cc: Reinette Chatre <reinette.chatre(a)intel.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko(a)kernel.org>
---
This has been on my mind for a while, and since the kselftest is
seemingly growing, I thought it better to get rid of such an
artificial limitation on the binary layout.
tools/testing/selftests/sgx/test_encl_bootstrap.S | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/tools/testing/selftests/sgx/test_encl_bootstrap.S b/tools/testing/selftests/sgx/test_encl_bootstrap.S
index 82fb0dfcbd23..1c1b5c6c4ffe 100644
--- a/tools/testing/selftests/sgx/test_encl_bootstrap.S
+++ b/tools/testing/selftests/sgx/test_encl_bootstrap.S
@@ -40,11 +40,7 @@
.text
encl_entry:
- # RBX contains the base address for TCS, which is the first address
- # inside the enclave for TCS #1 and one page into the enclave for
- # TCS #2. By adding the value of encl_stack to it, we get
- # the absolute address for the stack.
- lea (encl_stack)(%rbx), %rax
+ lea (encl_stack)(%rip), %rax
xchg %rsp, %rax
push %rax
--
2.35.1