This patch series revisits the proposal for a GPU cgroup controller to track and limit memory allocations by various device/allocator subsystems. The patch series also contains a simple prototype to illustrate how Android intends to implement DMA-BUF allocator attribution using the GPU cgroup controller. The prototype does not include resource limit enforcement.
History of the GPU cgroup controller
====================================

The GPU/DRM cgroup controller came into being when a consensus[1] was reached that the resources it tracked were unsuitable for integration into memcg. Originally, the proposed controller was specific to the DRM subsystem and was intended to track GEM buffers and GPU-specific resources[2]. In order to help establish a unified memory accounting model for the GPU and all related subsystems, Daniel Vetter suggested moving it out of the DRM subsystem so that it can be used by other DMA-BUF exporters as well[3]. This RFC proposes an interface that does the same.
[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-b... [2]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com... [3]: https://lore.kernel.org/amd-gfx/YCVOl8%2F87bqRSQei@phenom.ffwll.local/
Hridya Valsaraju (6):
  gpu: rfc: Proposal for a GPU cgroup controller
  cgroup: gpu: Add a cgroup controller for allocator attribution of GPU memory
  dmabuf: heaps: Use the GPU cgroup charge/uncharge APIs
  dma-buf: Add DMA-BUF exporter op to charge a DMA-BUF to a cgroup.
  dmabuf: system_heap: implement dma-buf op for GPU cgroup charge transfer
  android: binder: Add a buffer flag to relinquish ownership of fds
 Documentation/gpu/rfc/gpu-cgroup.rst | 192 +++++++++++++++++
 Documentation/gpu/rfc/index.rst      |   4 +
 drivers/android/binder.c             |  32 +++
 drivers/dma-buf/dma-heap.c           |  27 +++
 drivers/dma-buf/heaps/system_heap.c  |  68 ++++++
 include/linux/cgroup_gpu.h           | 120 +++++++++++
 include/linux/cgroup_subsys.h        |   4 +
 include/linux/dma-buf.h              |  18 ++
 include/linux/dma-heap.h             |  11 +
 include/uapi/linux/android/binder.h  |   1 +
 init/Kconfig                         |   7 +
 kernel/cgroup/Makefile               |   1 +
 kernel/cgroup/gpu.c                  | 305 +++++++++++++++++++++++++++
 13 files changed, 790 insertions(+)
 create mode 100644 Documentation/gpu/rfc/gpu-cgroup.rst
 create mode 100644 include/linux/cgroup_gpu.h
 create mode 100644 kernel/cgroup/gpu.c
This patch adds a proposal for a new GPU cgroup controller for accounting/limiting GPU and GPU-related memory allocations. The proposed controller is based on the DRM cgroup controller[1] and follows the design of the RDMA cgroup controller.
The new cgroup controller would:

* Allow setting per-cgroup limits on the total size of buffers charged to it.
* Allow setting per-device limits on the total size of buffers allocated by a device/allocator within a cgroup.
* Expose a per-device/allocator breakdown of the buffers charged to a cgroup.
The prototype in the following patches is only for memory accounting using the GPU cgroup controller and does not implement limit setting.
[1]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com...
Signed-off-by: Hridya Valsaraju <hridya@google.com>
---
Hi all,
Here is the RFC documentation for the GPU cgroup controller that we talked about at LPC 2021, along with a prototype. I reached out to Tejun with the idea recently, and he mentioned that cgroup-aware BPF (by Kenny Ho) or the new misc cgroup controller could also be considered as alternatives for tracking GPU resources. I am sending the RFC to the list to give everyone else a chance to chime in with their thoughts as well, so that we can reach an agreement on how to proceed. Thanks in advance!
Regards, Hridya
 Documentation/gpu/rfc/gpu-cgroup.rst | 192 +++++++++++++++++++++
 Documentation/gpu/rfc/index.rst      |   4 +
 2 files changed, 196 insertions(+)
 create mode 100644 Documentation/gpu/rfc/gpu-cgroup.rst
diff --git a/Documentation/gpu/rfc/gpu-cgroup.rst b/Documentation/gpu/rfc/gpu-cgroup.rst new file mode 100644 index 000000000000..9bff23007b22 --- /dev/null +++ b/Documentation/gpu/rfc/gpu-cgroup.rst @@ -0,0 +1,192 @@ +=================================== +GPU cgroup controller +=================================== + +Goals +===== +This document intends to outline a plan to create a cgroup v2 controller subsystem +for the per-cgroup accounting of device and system memory allocated by the GPU +and related subsystems. + +The new cgroup controller would: + +* Allow setting per-cgroup limits on the total size of buffers charged to it. + +* Allow setting per-device limits on the total size of buffers allocated by a + device/allocator within a cgroup. + +* Expose a per-device/allocator breakdown of the buffers charged to a cgroup. + +Alternatives Considered +======================= + +The following alternatives were considered: + +The memory cgroup controller +____________________________ + +1. As was noted in [1], memory accounting provided by the GPU cgroup +controller is not a good fit for integration into memcg due to the +differences in how accounting is performed. It implements a mechanism +for the allocator attribution of GPU and GPU-related memory by +charging each buffer to the cgroup of the process on behalf of which +the memory was allocated. The buffer stays charged to the cgroup until +it is freed regardless of whether the process retains any references +to it. On the other hand, the memory cgroup controller offers a more +fine-grained charging and uncharging behavior depending on the kind of +page being accounted. + +2. Memcg performs accounting in units of pages. In the DMA-BUF buffer sharing model, +a process takes a reference to the entire buffer(hence keeping it alive) even if +it is only accessing parts of it. Therefore, per-page memory tracking for DMA-BUF +memory accounting would only introduce additional overhead without any benefits. + +[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-b... + +Userspace service to keep track of buffer allocations and releases +__________________________________________________________________ + +1. There is no way for a userspace service to intercept all allocations and releases. +2. In case the process gets killed or restarted, we lose all accounting so far. + +UAPI +==== +When enabled, the new cgroup controller would create the following files in every cgroup. + +:: + + gpu.memory.current (R) + gpu.memory.max (R/W) + +gpu.memory.current is a read-only file and would contain per-device memory allocations +in a key-value format where key is a string representing the device name +and the value is the size of memory charged to the device in the cgroup in bytes. + +For example: + +:: + + cat /sys/kernel/fs/cgroup1/gpu.memory.current + dev1 4194304 + dev2 4194304 + +The string key for each device is set by the device driver when the device registers +with the GPU cgroup controller to participate in resource accounting(see section +'Design and Implementation' for more details). + +gpu.memory.max is a read/write file. It would show the current total +size limits on memory usage for the cgroup and the limits on total memory usage +for each allocator/device. 
+ +Setting a total limit for a cgroup can be done as follows: + +:: + + echo “total 41943040” > /sys/kernel/fs/cgroup1/gpu.memory.max + +Setting a total limit for a particular device/allocator can be done as follows: + +:: + + echo “dev1 4194304” > /sys/kernel/fs/cgroup1/gpu.memory.max + +In this example, 'dev1' is the string key set by the device driver during +registration. + +Design and Implementation +========================= + +The cgroup controller would closely follow the design of the RDMA cgroup controller +subsystem where each cgroup maintains a list of resource pools. +Each resource pool contains a struct device and the counter to track current total, +and the maximum limit set for the device. + +The below code block is a preliminary estimation on how the core kernel data structures +and APIs would look like. + +.. code-block:: c + + /** + * The GPU cgroup controller data structure. + */ + struct gpucg { + struct cgroup_subsys_state css; + /* list of all resource pools that belong to this cgroup */ + struct list_head rpools; + }; + + struct gpucg_device { + /* + * list of various resource pools in various cgroups that the device is + * part of. + */ + struct list_head rpools; + /* list of all devices registered for GPU cgroup accounting */ + struct list_head dev_node; + /* name to be used as identifier for accounting and limit setting */ + const char *name; + }; + + struct gpucg_resource_pool { + /* The device whose resource usage is tracked by this resource pool */ + struct gpucg_device *device; + + /* list of all resource pools for the cgroup */ + struct list_head cg_node; + + /* + * list maintained by the gpucg_device to keep track of its + * resource pools + */ + struct list_head dev_node; + + /* tracks memory usage of the resource pool */ + struct page_counter total; + }; + + /** + * gpucg_register_device - Registers a device for memory accounting using the + * GPU cgroup controller. + * + * @device: The device to register for memory accounting. Must remain valid + * after registration. + * @name: Pointer to a string literal to denote the name of the device. + */ + void gpucg_register_device(struct gpucg_device *gpucg_dev, const char *name); + + /** + * gpucg_try_charge - charge memory to the specified gpucg and gpucg_device. + * + * @gpucg: The gpu cgroup to charge the memory to. + * @device: The device to charge the memory to. + * @usage: size of memory to charge in bytes. + * + * Return: returns 0 if the charging is successful and otherwise returns an + * error code. + */ + int gpucg_try_charge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage); + + /** + * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_device. + * + * @gpucg: The gpu cgroup to uncharge the memory from. + * @device: The device to charge the memory from. + * @usage: size of memory to uncharge in bytes. + */ + void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage); + +Future Work +=========== +Additional GPU resources can be supported by adding new controller files. + +Upstreaming Plan +================ +* Decide on a UAPI that accommodates all use-cases for the upstream GPU ecosystem + as well as for Android. + +* Prototype the GPU cgroup controller and integrate its usage into the DMA-BUF + system heap. + +* Demonstrate its usage from userspace in the Android Open Space Project. + +* Send out RFCs to LKML for the GPU cgroup controller and iterate. 
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst index 91e93a705230..0a9bcd94e95d 100644 --- a/Documentation/gpu/rfc/index.rst +++ b/Documentation/gpu/rfc/index.rst @@ -23,3 +23,7 @@ host such documentation: .. toctree::
i915_scheduler.rst + +.. toctree:: + + gpu-cgroup.rst
The cgroup controller provides accounting for GPU and GPU-related memory allocations. The memory being accounted can be device memory or memory allocated from pools dedicated to serving GPU-related tasks.
This patch adds APIs to:
- allow a device to register for memory accounting using the GPU cgroup controller.
- charge and uncharge allocated memory to a cgroup.
When the cgroup controller is enabled, it would expose, for each cgroup, information about the memory allocated by each device registered for GPU cgroup memory accounting.
The API/UAPI can be extended to set per-device/total allocation limits in the future.
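As a rough illustration, an allocator would use these APIs along the following lines. This is only a sketch based on the signatures added by this patch; the struct my_buffer type, the "my-heap" name string and the buffer lifetime handling are placeholders:

#include <linux/cgroup_gpu.h>

/* One gpucg_device per allocator, registered once at init time. */
static struct gpucg_device my_gpucg_dev;

static int my_allocator_init(void)
{
	/* "my-heap" becomes the key shown in gpu.memory.current. */
	gpucg_register_device(&my_gpucg_dev, "my-heap");
	return 0;
}

static int my_allocator_charge(struct my_buffer *buf, u64 len)
{
	int ret;

	/* Charge the cgroup of the process requesting the allocation. */
	buf->gpucg = gpucg_get(current);
	ret = gpucg_try_charge(buf->gpucg, &my_gpucg_dev, len);
	if (ret)
		gpucg_put(buf->gpucg);
	return ret;
}

static void my_allocator_uncharge(struct my_buffer *buf, u64 len)
{
	/* The charge persists until the buffer is freed, then is released. */
	gpucg_uncharge(buf->gpucg, &my_gpucg_dev, len);
	gpucg_put(buf->gpucg);
}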
The cgroup controller has been named following the discussion in [1].
[1]: https://lore.kernel.org/amd-gfx/YCJp%2F%2FkMC7YjVMXv@phenom.ffwll.local/
Signed-off-by: Hridya Valsaraju <hridya@google.com>
---
 include/linux/cgroup_gpu.h    | 120 +++++++++++
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |   7 +
 kernel/cgroup/Makefile        |   1 +
 kernel/cgroup/gpu.c           | 305 ++++++++++++++++++++++++++++
 5 files changed, 437 insertions(+)
 create mode 100644 include/linux/cgroup_gpu.h
 create mode 100644 kernel/cgroup/gpu.c
diff --git a/include/linux/cgroup_gpu.h b/include/linux/cgroup_gpu.h new file mode 100644 index 000000000000..0ac303ce6179 --- /dev/null +++ b/include/linux/cgroup_gpu.h @@ -0,0 +1,120 @@ +/* SPDX-License-Identifier: MIT + * Copyright 2019 Advanced Micro Devices, Inc. + * Copyright (C) 2022 Google LLC. + */ +#ifndef _CGROUP_GPU_H +#define _CGROUP_GPU_H + +#include <linux/cgroup.h> +#include <linux/page_counter.h> + +#ifdef CONFIG_CGROUP_GPU + /* The GPU cgroup controller data structure */ +struct gpucg { + struct cgroup_subsys_state css; + /* list of all resource pools that belong to this cgroup */ + struct list_head rpools; +}; + +struct gpucg_device { + /* + * list of various resource pool in various cgroups that the device is + * part of. + */ + struct list_head rpools; + /* list of all devices registered for GPU cgroup accounting */ + struct list_head dev_node; + /* + * pointer to string literal to be used as identifier for accounting and + * limit setting + */ + const char *name; +}; + +/** + * css_to_gpucg - get the corresponding gpucg ref from a cgroup_subsys_state + * @css: the target cgroup_subsys_state + * + * Returns: gpu cgroup that contains the @css + */ +static inline struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css) +{ + return css ? container_of(css, struct gpucg, css) : NULL; +} + +/** + * gpucg_get - get the gpucg reference that a task belongs to + * @task: the target task + * + * This increases the reference count of the css that the @task belongs to. + * + * Returns: reference to the gpu cgroup the task belongs to. + */ +static inline struct gpucg *gpucg_get(struct task_struct *task) +{ + if (!cgroup_subsys_enabled(gpu_cgrp_subsys)) + return NULL; + return css_to_gpucg(task_get_css(task, gpu_cgrp_id)); +} + +/** + * gpucg_put - put a gpucg reference + * @gpucg: the target gpucg + * + * Put a reference obtained via gpucg_get + */ +static inline void gpucg_put(struct gpucg *gpucg) +{ + if (gpucg) + css_put(&gpucg->css); +} + +/** + * gpucg_parent - find the parent of a gpu cgroup + * @cg: the target gpucg + * + * This does not increase the reference count of the parent cgroup + * + * Returns: parent gpu cgroup of @cg + */ +static inline struct gpucg *gpucg_parent(struct gpucg *cg) +{ + return css_to_gpucg(cg->css.parent); +} + +int gpucg_try_charge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage); +void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage); +void gpucg_register_device(struct gpucg_device *gpucg_dev, const char *name); +#else /* CONFIG_CGROUP_GPU */ + +struct gpucg; +struct gpucg_device; + +static inline struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css) +{ + return NULL; +} + +static inline struct gpucg *gpucg_get(struct task_struct *task) +{ + return NULL; +} + +static inline void gpucg_put(struct gpucg *gpucg) {} + +static inline struct gpucg *gpucg_parent(struct gpucg *cg) +{ + return NULL; +} +static inline int gpucg_try_charge(struct gpucg *gpucg, struct gpucg_device *device, + u64 usage) +{ + return 0; +} + +static inline void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_device *device, + u64 usage) {} +static inline void gpucg_register_device(struct gpucg_device *gpucg_dev, + const char *name) {} +#endif /* CONFIG_CGROUP_GPU */ +#endif /* _CGROUP_GPU_H */ diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h index 445235487230..46a2a7b93c41 100644 --- a/include/linux/cgroup_subsys.h +++ b/include/linux/cgroup_subsys.h @@ -65,6 +65,10 @@ SUBSYS(rdma) SUBSYS(misc) #endif
+#if IS_ENABLED(CONFIG_CGROUP_GPU) +SUBSYS(gpu) +#endif + /* * The following subsystems are not supported on the default hierarchy. */ diff --git a/init/Kconfig b/init/Kconfig index cd23faa163d1..408910b21387 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -990,6 +990,13 @@ config BLK_CGROUP
See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information.
+config CGROUP_GPU + bool "gpu cgroup controller (EXPERIMENTAL)" + select PAGE_COUNTER + help + Provides accounting and limit setting for memory allocations by the GPU + and GPU-related subsystems. + config CGROUP_WRITEBACK bool depends on MEMCG && BLK_CGROUP diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile index 12f8457ad1f9..be95a5a532fc 100644 --- a/kernel/cgroup/Makefile +++ b/kernel/cgroup/Makefile @@ -7,3 +7,4 @@ obj-$(CONFIG_CGROUP_RDMA) += rdma.o obj-$(CONFIG_CPUSETS) += cpuset.o obj-$(CONFIG_CGROUP_MISC) += misc.o obj-$(CONFIG_CGROUP_DEBUG) += debug.o +obj-$(CONFIG_CGROUP_GPU) += gpu.o diff --git a/kernel/cgroup/gpu.c b/kernel/cgroup/gpu.c new file mode 100644 index 000000000000..b171fae06b0d --- /dev/null +++ b/kernel/cgroup/gpu.c @@ -0,0 +1,305 @@ +// SPDX-License-Identifier: MIT +// Copyright 2019 Advanced Micro Devices, Inc. +// Copyright (C) 2022 Google LLC. + +#include <linux/cgroup.h> +#include <linux/cgroup_gpu.h> +#include <linux/page_counter.h> +#include <linux/seq_file.h> +#include <linux/slab.h> + +static struct gpucg *root_gpucg __read_mostly; + +/* + * Protects list of resource pools maintained on per cgroup basis + * and list of devices registered for memory accounting using the GPU cgroup + * controller. + */ +static DEFINE_MUTEX(gpucg_mutex); +static LIST_HEAD(gpucg_devices); + +struct gpucg_resource_pool { + /* The device whose resource usage is tracked by this resource pool */ + struct gpucg_device *device; + + /* list of all resource pools for the cgroup */ + struct list_head cg_node; + + /* + * list maintained by the gpucg_device to keep track of its + * resource pools + */ + struct list_head dev_node; + + /* tracks memory usage of the resource pool */ + struct page_counter total; +}; + +static void free_cg_rpool_locked(struct gpucg_resource_pool *rpool) +{ + lockdep_assert_held(&gpucg_mutex); + + list_del(&rpool->cg_node); + list_del(&rpool->dev_node); + kfree(rpool); +} + +static void gpucg_css_free(struct cgroup_subsys_state *css) +{ + struct gpucg_resource_pool *rpool, *tmp; + struct gpucg *gpucg = css_to_gpucg(css); + + // delete all resource pools + mutex_lock(&gpucg_mutex); + list_for_each_entry_safe(rpool, tmp, &gpucg->rpools, cg_node) + free_cg_rpool_locked(rpool); + mutex_unlock(&gpucg_mutex); + + kfree(gpucg); +} + +static struct cgroup_subsys_state * +gpucg_css_alloc(struct cgroup_subsys_state *parent_css) +{ + struct gpucg *gpucg, *parent; + + gpucg = kzalloc(sizeof(struct gpucg), GFP_KERNEL); + if (!gpucg) + return ERR_PTR(-ENOMEM); + + parent = css_to_gpucg(parent_css); + if (!parent) + root_gpucg = gpucg; + + INIT_LIST_HEAD(&gpucg->rpools); + + return &gpucg->css; +} + +static struct gpucg_resource_pool *find_cg_rpool_locked(struct gpucg *cg, + struct gpucg_device *device) + +{ + struct gpucg_resource_pool *pool; + + lockdep_assert_held(&gpucg_mutex); + + list_for_each_entry(pool, &cg->rpools, cg_node) + if (pool->device == device) + return pool; + + return NULL; +} + +static struct gpucg_resource_pool *init_cg_rpool(struct gpucg *cg, + struct gpucg_device *device) +{ + struct gpucg_resource_pool *rpool = kzalloc(sizeof(*rpool), + GFP_KERNEL); + if (!rpool) + return ERR_PTR(-ENOMEM); + + rpool->device = device; + + page_counter_init(&rpool->total, NULL); + INIT_LIST_HEAD(&rpool->cg_node); + INIT_LIST_HEAD(&rpool->dev_node); + list_add_tail(&rpool->cg_node, &cg->rpools); + list_add_tail(&rpool->dev_node, &device->rpools); + + return rpool; +} + +/** + * get_cg_rpool_locked - find the resource pool for the specified device and + * 
specified cgroup. If the resource pool does not exist for the cg, it is created + * in a hierarchical manner in the cgroup and its ancestor cgroups who do not + * already have a resource pool entry for the device. + * + * @cg: The cgroup to find the resource pool for. + * @device: The device associated with the returned resource pool. + * + * Return: return resource pool entry corresponding to the specified device in + * the specified cgroup (hierarchically creating them if not existing already). + * + */ +static struct gpucg_resource_pool * +get_cg_rpool_locked(struct gpucg *cg, struct gpucg_device *device) +{ + struct gpucg *parent_cg, *p, *stop_cg; + struct gpucg_resource_pool *rpool, *tmp_rpool; + struct gpucg_resource_pool *parent_rpool = NULL, *leaf_rpool = NULL; + + rpool = find_cg_rpool_locked(cg, device); + if (rpool) + return rpool; + + stop_cg = cg; + do { + rpool = init_cg_rpool(stop_cg, device); + if (IS_ERR(rpool)) + goto err; + + if (!leaf_rpool) + leaf_rpool = rpool; + + stop_cg = gpucg_parent(stop_cg); + if (!stop_cg) + break; + + rpool = find_cg_rpool_locked(stop_cg, device); + } while (!rpool); + + /* + * Re-initialize page counters of all rpools created in this invocation to + * enable hierarchical charging. + * stop_cg is the first ancestor cg who already had a resource pool for + * the device. It can also be NULL if no ancestors had a pre-existing + * resource pool for the device before this invocation. + */ + rpool = leaf_rpool; + for (p = cg; p != stop_cg; p = parent_cg) { + parent_cg = gpucg_parent(p); + if (!parent_cg) + break; + parent_rpool = find_cg_rpool_locked(parent_cg, device); + page_counter_init(&rpool->total, &parent_rpool->total); + + rpool = parent_rpool; + } + + return leaf_rpool; +err: + for (p = cg; p != stop_cg; p = gpucg_parent(p)) { + tmp_rpool = find_cg_rpool_locked(p, device); + free_cg_rpool_locked(tmp_rpool); + } + return rpool; +} + +/** + * gpucg_try_charge - charge memory to the specified gpucg and gpucg_device. + * Caller must hold a reference to @gpucg obtained through gpucg_get(). The size + * of the memory is rounded up to be a multiple of the page size. + * + * @gpucg: The gpu cgroup to charge the memory to. + * @device: The device to charge the memory to. + * @usage: size of memory to charge in bytes. + * + * Return: returns 0 if the charging is successful and otherwise returns an + * error code. + */ +int gpucg_try_charge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage) +{ + struct page_counter *counter; + u64 nr_pages; + struct gpucg_resource_pool *rp; + int ret = 0; + + mutex_lock(&gpucg_mutex); + rp = get_cg_rpool_locked(gpucg, device); + /* + * gpucg_mutex can be unlocked here, rp will stay valid until gpucg is + * freed and the caller is holding a reference to the gpucg. + */ + mutex_unlock(&gpucg_mutex); + + if (IS_ERR(rp)) + return PTR_ERR(rp); + + nr_pages = PAGE_ALIGN(usage) >> PAGE_SHIFT; + if (page_counter_try_charge(&rp->total, nr_pages, + &counter)) + css_get_many(&gpucg->css, nr_pages); + else + ret = -ENOMEM; + + return ret; +} + +/** + * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_device. + * The caller must hold a reference to @gpucg obtained through gpucg_get(). + * + * @gpucg: The gpu cgroup to uncharge the memory from. + * @device: The device to uncharge the memory from. + * @usage: size of memory to uncharge in bytes. 
+ */ +void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_device *device, + u64 usage) +{ + u64 nr_pages; + struct gpucg_resource_pool *rp; + + mutex_lock(&gpucg_mutex); + rp = find_cg_rpool_locked(gpucg, device); + /* + * gpucg_mutex can be unlocked here, rp will stay valid until gpucg is + * freed and there are active refs on gpucg. + */ + mutex_unlock(&gpucg_mutex); + + if (unlikely(!rp)) { + pr_err("Resource pool not found, incorrect charge/uncharge ordering?\n"); + return; + } + + nr_pages = PAGE_ALIGN(usage) >> PAGE_SHIFT; + page_counter_uncharge(&rp->total, nr_pages); + css_put_many(&gpucg->css, nr_pages); +} + +/** + * gpucg_register_device - Registers a device for memory accounting using the + * GPU cgroup controller. + * + * @device: The device to register for memory accounting. + * @name: Pointer to a string literal to denote the name of the device. + * + * Both @device andd @name must remain valid. + */ +void gpucg_register_device(struct gpucg_device *device, const char *name) +{ + if (!device) + return; + + INIT_LIST_HEAD(&device->dev_node); + INIT_LIST_HEAD(&device->rpools); + + mutex_lock(&gpucg_mutex); + list_add_tail(&device->dev_node, &gpucg_devices); + mutex_unlock(&gpucg_mutex); + + device->name = name; +} + +static int gpucg_resource_show(struct seq_file *sf, void *v) +{ + struct gpucg_resource_pool *rpool; + struct gpucg *cg = css_to_gpucg(seq_css(sf)); + + mutex_lock(&gpucg_mutex); + list_for_each_entry(rpool, &cg->rpools, cg_node) { + seq_printf(sf, "%s %lu\n", rpool->device->name, + page_counter_read(&rpool->total) * PAGE_SIZE); + } + mutex_unlock(&gpucg_mutex); + + return 0; +} + +struct cftype files[] = { + { + .name = "memory.current", + .seq_show = gpucg_resource_show, + }, + { } /* terminate */ +}; + +struct cgroup_subsys gpu_cgrp_subsys = { + .css_alloc = gpucg_css_alloc, + .css_free = gpucg_css_free, + .early_init = false, + .legacy_cftypes = files, + .dfl_cftypes = files, +};
Hi--
On 1/14/22 17:06, Hridya Valsaraju wrote:
diff --git a/init/Kconfig b/init/Kconfig index cd23faa163d1..408910b21387 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -990,6 +990,13 @@ config BLK_CGROUP See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information. +config CGROUP_GPU
bool "gpu cgroup controller (EXPERIMENTAL)"
select PAGE_COUNTER
help
Provides accounting and limit setting for memory allocations by the GPU
and GPU-related subsystems.
Please follow coding-style for Kconfig files:
(from Documentation/process/coding-style.rst, section 10):
For all of the Kconfig* configuration files throughout the source tree, the indentation is somewhat different. Lines under a ``config`` definition are indented with one tab, while help text is indented an additional two spaces.
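For illustration, with that convention the entry above would be indented like this (a leading tab for the config attributes, and a tab plus two spaces for the help text):

config CGROUP_GPU
	bool "gpu cgroup controller (EXPERIMENTAL)"
	select PAGE_COUNTER
	help
	  Provides accounting and limit setting for memory allocations by the
	  GPU and GPU-related subsystems.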
thanks.
On Wed, Jan 19, 2022 at 7:40 AM Randy Dunlap rdunlap@infradead.org wrote:
Hi--
On 1/14/22 17:06, Hridya Valsaraju wrote:
diff --git a/init/Kconfig b/init/Kconfig index cd23faa163d1..408910b21387 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -990,6 +990,13 @@ config BLK_CGROUP
See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information.
+config CGROUP_GPU
bool "gpu cgroup controller (EXPERIMENTAL)"
select PAGE_COUNTER
help
Provides accounting and limit setting for memory allocations by the GPU
and GPU-related subsystems.
Please follow coding-style for Kconfig files:
(from Documentation/process/coding-style.rst, section 10):
For all of the Kconfig* configuration files throughout the source tree, the indentation is somewhat different. Lines under a ``config`` definition are indented with one tab, while help text is indented an additional two spaces.
Thanks Randy, sounds good! Will fix it in the next version!
thanks.
-- ~Randy
This patch uses the GPU cgroup charge/uncharge APIs to charge buffers allocated by the DMA-BUF system heap to the processes who allocated them.
By doing so, it becomes possible to track who allocated/exported a DMA-BUF even after the allocating process drops all references to a buffer.
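The pattern, condensed from the system heap changes below (error handling and unrelated details omitted), is to charge the allocating process's cgroup when the buffer is created and to keep that charge until the buffer is released:

	/* at allocation time */
	buffer->gpucg = gpucg_get(current);
	ret = gpucg_try_charge(buffer->gpucg,
			       dma_heap_get_gpucg_dev(buffer->heap), len);
	if (ret) {
		gpucg_put(buffer->gpucg);
		kfree(buffer);
		return ERR_PTR(ret);
	}

	/* at DMA-BUF release time, after the backing pages are freed */
	gpucg_uncharge(buffer->gpucg,
		       dma_heap_get_gpucg_dev(buffer->heap), buffer->len);
	gpucg_put(buffer->gpucg);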
Signed-off-by: Hridya Valsaraju <hridya@google.com>
---
 drivers/dma-buf/dma-heap.c          | 27 +++++++++++++++++++++++++++
 drivers/dma-buf/heaps/system_heap.c | 25 +++++++++++++++++++++++++
 include/linux/dma-heap.h            | 11 +++++++++++
 3 files changed, 63 insertions(+)
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c index 56bf5ad01ad5..6e74690f4b83 100644 --- a/drivers/dma-buf/dma-heap.c +++ b/drivers/dma-buf/dma-heap.c @@ -6,6 +6,7 @@ * Copyright (C) 2019 Linaro Ltd. */
+#include <linux/cgroup_gpu.h> #include <linux/cdev.h> #include <linux/debugfs.h> #include <linux/device.h> @@ -30,6 +31,7 @@ * @heap_devt heap device node * @list list head connecting to list of heaps * @heap_cdev heap char device + * @gpucg_dev gpu cg device for memory accounting * * Represents a heap of memory from which buffers can be made. */ @@ -40,6 +42,9 @@ struct dma_heap { dev_t heap_devt; struct list_head list; struct cdev heap_cdev; +#ifdef CONFIG_CGROUP_GPU + struct gpucg_device gpucg_dev; +#endif };
static LIST_HEAD(heap_list); @@ -214,6 +219,26 @@ const char *dma_heap_get_name(struct dma_heap *heap) return heap->name; }
+#ifdef CONFIG_CGROUP_GPU +/** + * dma_heap_get_gpucg_dev() - get struct gpucg_device for the heap. + * @heap: DMA-Heap to get the gpucg_device struct for. + * + * Returns: + * The gpucg_device struct for the heap. NULL if the GPU cgroup controller is + * not enabled. + */ +struct gpucg_device *dma_heap_get_gpucg_dev(struct dma_heap *heap) +{ + return &heap->gpucg_dev; +} +#else +struct gpucg_device *dma_heap_get_gpucg_dev(struct dma_heap *heap) +{ + return NULL; +} +#endif + struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info) { struct dma_heap *heap, *h, *err_ret; @@ -286,6 +311,8 @@ struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info) list_add(&heap->list, &heap_list); mutex_unlock(&heap_list_lock);
+ gpucg_register_device(dma_heap_get_gpucg_dev(heap), exp_info->name); + return heap;
err2: diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c index ab7fd896d2c4..adfdc8c576f2 100644 --- a/drivers/dma-buf/heaps/system_heap.c +++ b/drivers/dma-buf/heaps/system_heap.c @@ -31,6 +31,7 @@ struct system_heap_buffer { struct sg_table sg_table; int vmap_cnt; void *vaddr; + struct gpucg *gpucg; };
struct dma_heap_attachment { @@ -296,6 +297,13 @@ static void system_heap_dma_buf_release(struct dma_buf *dmabuf) __free_pages(page, compound_order(page)); } sg_free_table(table); + + gpucg_uncharge(buffer->gpucg, + dma_heap_get_gpucg_dev(buffer->heap), + buffer->len); + + gpucg_put(buffer->gpucg); + kfree(buffer); }
@@ -356,6 +364,16 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap, mutex_init(&buffer->lock); buffer->heap = heap; buffer->len = len; + buffer->gpucg = gpucg_get(current); + + ret = gpucg_try_charge(buffer->gpucg, + dma_heap_get_gpucg_dev(buffer->heap), + len); + if (ret) { + gpucg_put(buffer->gpucg); + kfree(buffer); + return ERR_PTR(ret); + }
INIT_LIST_HEAD(&pages); i = 0; @@ -413,6 +431,13 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap, free_buffer: list_for_each_entry_safe(page, tmp_page, &pages, lru) __free_pages(page, compound_order(page)); + + gpucg_uncharge(buffer->gpucg, + dma_heap_get_gpucg_dev(buffer->heap), + buffer->len); + + gpucg_put(buffer->gpucg); + kfree(buffer);
return ERR_PTR(ret); diff --git a/include/linux/dma-heap.h b/include/linux/dma-heap.h index 0c05561cad6e..e447a61d054e 100644 --- a/include/linux/dma-heap.h +++ b/include/linux/dma-heap.h @@ -10,6 +10,7 @@ #define _DMA_HEAPS_H
#include <linux/cdev.h> +#include <linux/cgroup_gpu.h> #include <linux/types.h>
struct dma_heap; @@ -59,6 +60,16 @@ void *dma_heap_get_drvdata(struct dma_heap *heap); */ const char *dma_heap_get_name(struct dma_heap *heap);
+/** + * dma_heap_get_gpucg_dev() - get a pointer to the struct gpucg_device for the + * heap. + * @heap: DMA-Heap to retrieve gpucg_device for. + * + * Returns: + * The gpucg_device struct for the heap. + */ +struct gpucg_device *dma_heap_get_gpucg_dev(struct dma_heap *heap); + /** * dma_heap_add - adds a heap to dmabuf heaps * @exp_info: information needed to register this heap
The optional exporter op provides a way for processes to transfer charge of a buffer to a different process. This is essential for the cases where a central allocator process does allocations for various subsystems, hands over the fd to the client who requested the memory and drops all references to the allocated memory.
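For illustration, a kernel caller acting on behalf of the process the buffer is currently charged to would invoke the op roughly as follows. This is only a sketch; target_task stands for a task in the receiving process, and the exporter-side implementation and the binder caller appear in later patches of this series:

	/* Resolve the cgroup that should receive the charge... */
	struct gpucg *gpucg = gpucg_get(target_task);

	/* ...and ask the exporter to move the charge via the new op. */
	if (dmabuf->ops->charge_to_cgroup) {
		ret = dmabuf->ops->charge_to_cgroup(dmabuf, gpucg);
		if (ret)
			gpucg_put(gpucg);
	}

	/*
	 * Inside the op, the exporter charges the new cgroup, uncharges the
	 * cgroup the buffer was charged to so far, drops that cgroup's
	 * reference and records the new cgroup as the owner of the charge.
	 */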
Signed-off-by: Hridya Valsaraju <hridya@google.com>
---
 include/linux/dma-buf.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 7ab50076e7a6..d5e52f81cc6f 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -13,6 +13,7 @@ #ifndef __DMA_BUF_H__ #define __DMA_BUF_H__
+#include <linux/cgroup_gpu.h> #include <linux/dma-buf-map.h> #include <linux/file.h> #include <linux/err.h> @@ -285,6 +286,23 @@ struct dma_buf_ops {
int (*vmap)(struct dma_buf *dmabuf, struct dma_buf_map *map); void (*vunmap)(struct dma_buf *dmabuf, struct dma_buf_map *map); + + /** + * @charge_to_cgroup: + * + * This is called by an exporter to charge a buffer to the specified + * cgroup. The caller must hold a reference to @gpucg obtained via + * gpucg_get(). The DMA-BUF will be uncharged from the cgroup it is + * currently charged to before being charged to @gpucg. The caller must + * belong to the cgroup the buffer is currently charged to. + * + * This callback is optional. + * + * Returns: + * + * 0 on success or negative error code on failure. + */ + int (*charge_to_cgroup)(struct dma_buf *dmabuf, struct gpucg *gpucg); };
/**
Am 15.01.22 um 02:06 schrieb Hridya Valsaraju:
The optional exporter op provides a way for processes to transfer charge of a buffer to a different process. This is essential for the cases where a central allocator process does allocations for various subsystems, hands over the fd to the client who requested the memory and drops all references to the allocated memory.
Signed-off-by: Hridya Valsaraju hridya@google.com
include/linux/dma-buf.h | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 7ab50076e7a6..d5e52f81cc6f 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -13,6 +13,7 @@ #ifndef __DMA_BUF_H__ #define __DMA_BUF_H__ +#include <linux/cgroup_gpu.h> #include <linux/dma-buf-map.h> #include <linux/file.h> #include <linux/err.h> @@ -285,6 +286,23 @@ struct dma_buf_ops { int (*vmap)(struct dma_buf *dmabuf, struct dma_buf_map *map); void (*vunmap)(struct dma_buf *dmabuf, struct dma_buf_map *map);
/**
* @charge_to_cgroup:
*
* This is called by an exporter to charge a buffer to the specified
* cgroup.
Well that sentence makes absolutely no sense at all.
The dma_buf_ops are supposed to be called by the DMA-buf subsystem on behalf of the importer and never by the exporter itself.
I hope that this is just a documentation mixup.
Regards, Christian.
The caller must hold a reference to @gpucg obtained via
* gpucg_get(). The DMA-BUF will be uncharged from the cgroup it is
* currently charged to before being charged to @gpucg. The caller must
* belong to the cgroup the buffer is currently charged to.
*
* This callback is optional.
*
* Returns:
*
* 0 on success or negative error code on failure.
*/
int (*charge_to_cgroup)(struct dma_buf *dmabuf, struct gpucg *gpucg); };
/**
On Sun, Jan 16, 2022 at 11:46 PM Christian König christian.koenig@amd.com wrote:
Am 15.01.22 um 02:06 schrieb Hridya Valsaraju:
The optional exporter op provides a way for processes to transfer charge of a buffer to a different process. This is essential for the cases where a central allocator process does allocations for various subsystems, hands over the fd to the client who requested the memory and drops all references to the allocated memory.
Signed-off-by: Hridya Valsaraju hridya@google.com
include/linux/dma-buf.h | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 7ab50076e7a6..d5e52f81cc6f 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -13,6 +13,7 @@ #ifndef __DMA_BUF_H__ #define __DMA_BUF_H__
+#include <linux/cgroup_gpu.h> #include <linux/dma-buf-map.h> #include <linux/file.h> #include <linux/err.h> @@ -285,6 +286,23 @@ struct dma_buf_ops {
int (*vmap)(struct dma_buf *dmabuf, struct dma_buf_map *map); void (*vunmap)(struct dma_buf *dmabuf, struct dma_buf_map *map);
/**
* @charge_to_cgroup:
*
* This is called by an exporter to charge a buffer to the specified
* cgroup.
Well that sentence makes absolutely no sense at all.
The dma_buf_ops are supposed to be called by the DMA-buf subsystem on behalf of the importer and never by the exporter itself.
I hope that this is just a documentation mixup.
Thank you for taking a look Christian!
Yes, that was poor wording, sorry about that. It should instead say that the op would be called by the process the buffer is currently charged to in order to transfer the buffer's charge to a different cgroup. This is helpful in the case where a process acts as an allocator for multiple client processes and we would like the allocated buffers to be charged to the clients who requested their allocation (instead of the allocating process, as is the default behavior). In Android, the graphics allocator HAL process[1] does most of the graphics allocations on behalf of various clients. After allocation, the HAL process passes the fd to the client over binder IPC and the binder driver invokes the charge_to_cgroup() DMA-BUF op to uncharge the buffer from the HAL process and charge it to the client process instead.
[1]: https://source.android.com/devices/graphics/arch-bq-gralloc
Regards, Hridya
Regards, Christian.
The caller must hold a reference to @gpucg obtained via
* gpucg_get(). The DMA-BUF will be uncharged from the cgroup it is
* currently charged to before being charged to @gpucg. The caller must
* belong to the cgroup the buffer is currently charged to.
*
* This callback is optional.
*
* Returns:
*
* 0 on success or negative error code on failure.
*/
int (*charge_to_cgroup)(struct dma_buf *dmabuf, struct gpucg *gpucg);
};
/**
On Tue, Jan 18, 2022 at 10:54:16AM -0800, Hridya Valsaraju wrote:
On Sun, Jan 16, 2022 at 11:46 PM Christian König christian.koenig@amd.com wrote:
Am 15.01.22 um 02:06 schrieb Hridya Valsaraju:
The optional exporter op provides a way for processes to transfer charge of a buffer to a different process. This is essential for the cases where a central allocator process does allocations for various subsystems, hands over the fd to the client who requested the memory and drops all references to the allocated memory.
Signed-off-by: Hridya Valsaraju hridya@google.com
include/linux/dma-buf.h | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 7ab50076e7a6..d5e52f81cc6f 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -13,6 +13,7 @@ #ifndef __DMA_BUF_H__ #define __DMA_BUF_H__
+#include <linux/cgroup_gpu.h> #include <linux/dma-buf-map.h> #include <linux/file.h> #include <linux/err.h> @@ -285,6 +286,23 @@ struct dma_buf_ops {
int (*vmap)(struct dma_buf *dmabuf, struct dma_buf_map *map); void (*vunmap)(struct dma_buf *dmabuf, struct dma_buf_map *map);
/**
* @charge_to_cgroup:
*
* This is called by an exporter to charge a buffer to the specified
* cgroup.
Well that sentence makes absolutely no sense at all.
The dma_buf_ops are supposed to be called by the DMA-buf subsystem on behalf of the importer and never by the exporter itself.
I hope that this is just a documentation mixup.
Thank you for taking a look Christian!
Yes, that was poor wording, sorry about that. It should instead say that the op would be called by the process the buffer is currently charged to in order to transfer the buffer's charge to a different cgroup. This is helpful in the case where a process acts as an allocator for multiple client processes and we would like the allocated buffers to be charged to the clients who requested their allocation (instead of the allocating process, as is the default behavior). In Android, the graphics allocator HAL process[1] does most of the graphics allocations on behalf of various clients. After allocation, the HAL process passes the fd to the client over binder IPC and the binder driver invokes the charge_to_cgroup() DMA-BUF op to uncharge the buffer from the HAL process and charge it to the client process instead.
For that use-case, do we really need to have the vfunc abstraction and force all exporters to do something reasonable with it?
I think just storing the cgroup's gpu memory bucket this is charged against and doing this in a generic way would be a lot better.
That way we can also easily add other neat features in the future, like e.g. ttm could take care of charge-assignment automatically maybe, or we could print the current cgroups charge relationship in the sysfs info file. Or anything else really.
I do feel that in general for gpu memory cgroups to be useful, we should really have memory pools as a fairly strong concept. Otherwise every driver/allocator/thing is going to come up with their own, and very likely incompatible interpretation. And we end up with a supposed generic cgroups interface which cannot actually be used in a driver/vendor agnostic way at all. -Daniel
Regards, Hridya
Regards, Christian.
The caller must hold a reference to @gpucg obtained via
* gpucg_get(). The DMA-BUF will be uncharged from the cgroup it is
* currently charged to before being charged to @gpucg. The caller must
* belong to the cgroup the buffer is currently charged to.
*
* This callback is optional.
*
* Returns:
*
* 0 on success or negative error code on failure.
*/
int (*charge_to_cgroup)(struct dma_buf *dmabuf, struct gpucg *gpucg);
};
/**
Am 19.01.22 um 16:54 schrieb Daniel Vetter:
On Tue, Jan 18, 2022 at 10:54:16AM -0800, Hridya Valsaraju wrote:
On Sun, Jan 16, 2022 at 11:46 PM Christian König christian.koenig@amd.com wrote:
Am 15.01.22 um 02:06 schrieb Hridya Valsaraju:
The optional exporter op provides a way for processes to transfer charge of a buffer to a different process. This is essential for the cases where a central allocator process does allocations for various subsystems, hands over the fd to the client who requested the memory and drops all references to the allocated memory.
Signed-off-by: Hridya Valsaraju hridya@google.com
include/linux/dma-buf.h | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 7ab50076e7a6..d5e52f81cc6f 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -13,6 +13,7 @@ #ifndef __DMA_BUF_H__ #define __DMA_BUF_H__
+#include <linux/cgroup_gpu.h> #include <linux/dma-buf-map.h> #include <linux/file.h> #include <linux/err.h> @@ -285,6 +286,23 @@ struct dma_buf_ops {
int (*vmap)(struct dma_buf *dmabuf, struct dma_buf_map *map); void (*vunmap)(struct dma_buf *dmabuf, struct dma_buf_map *map);
/**
* @charge_to_cgroup:
*
* This is called by an exporter to charge a buffer to the specified
* cgroup.
Well that sentence makes absolutely no sense at all.
The dma_buf_ops are supposed to be called by the DMA-buf subsystem on behalf of the importer and never by the exporter itself.
I hope that this is just a documentation mixup.
Thank you for taking a look Christian!
Yes, that was poor wording, sorry about that. It should instead say that the op would be called by the process the buffer is currently charged to in order to transfer the buffer's charge to a different cgroup. This is helpful in the case where a process acts as an allocator for multiple client processes and we would like the allocated buffers to be charged to the clients who requested their allocation (instead of the allocating process, as is the default behavior). In Android, the graphics allocator HAL process[1] does most of the graphics allocations on behalf of various clients. After allocation, the HAL process passes the fd to the client over binder IPC and the binder driver invokes the charge_to_cgroup() DMA-BUF op to uncharge the buffer from the HAL process and charge it to the client process instead.
For that use-case, do we really need to have the vfunc abstraction and force all exporters to do something reasonable with it?
I was about to write up a similar answer, but more from the technical side.
Why in the world should that be done on the DMA-buf object as a communication function between importer and exporter?
That design makes absolutely no sense at all to me.
Regards, Christian.
I think just storing the cgroup's gpu memory bucket this is charged against and doing this in a generic way would be a lot better.
That way we can also easily add other neat features in the future, like e.g. ttm could take care of charge-assignment automatically maybe, or we could print the current cgroups charge relationship in the sysfs info file. Or anything else really.
I do feel that in general for gpu memory cgroups to be useful, we should really have memory pools as a fairly strong concept. Otherwise every driver/allocator/thing is going to come up with their own, and very likely incompatible interpretation. And we end up with a supposed generic cgroups interface which cannot actually be used in a driver/vendor agnostic way at all. -Daniel
Regards, Hridya
Regards, Christian.
The caller must hold a reference to @gpucg obtained via
* gpucg_get(). The DMA-BUF will be uncharged from the cgroup it is
* currently charged to before being charged to @gpucg. The caller must
* belong to the cgroup the buffer is currently charged to.
*
* This callback is optional.
*
* Returns:
*
* 0 on success or negative error code on failure.
*/
int (*charge_to_cgroup)(struct dma_buf *dmabuf, struct gpucg *gpucg);
};
/**
On Wed, Jan 19, 2022 at 7:58 AM Christian König christian.koenig@amd.com wrote:
Am 19.01.22 um 16:54 schrieb Daniel Vetter:
On Tue, Jan 18, 2022 at 10:54:16AM -0800, Hridya Valsaraju wrote:
On Sun, Jan 16, 2022 at 11:46 PM Christian König christian.koenig@amd.com wrote:
Am 15.01.22 um 02:06 schrieb Hridya Valsaraju:
The optional exporter op provides a way for processes to transfer charge of a buffer to a different process. This is essential for the cases where a central allocator process does allocations for various subsystems, hands over the fd to the client who requested the memory and drops all references to the allocated memory.
Signed-off-by: Hridya Valsaraju hridya@google.com
include/linux/dma-buf.h | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 7ab50076e7a6..d5e52f81cc6f 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -13,6 +13,7 @@ #ifndef __DMA_BUF_H__ #define __DMA_BUF_H__
+#include <linux/cgroup_gpu.h> #include <linux/dma-buf-map.h> #include <linux/file.h> #include <linux/err.h> @@ -285,6 +286,23 @@ struct dma_buf_ops {
int (*vmap)(struct dma_buf *dmabuf, struct dma_buf_map *map); void (*vunmap)(struct dma_buf *dmabuf, struct dma_buf_map *map);
/**
* @charge_to_cgroup:
*
* This is called by an exporter to charge a buffer to the specified
* cgroup.
Well that sentence makes absolutely no sense at all.
The dma_buf_ops are supposed to be called by the DMA-buf subsystem on behalf of the importer and never by the exporter itself.
I hope that this is just a documentation mixup.
Thank you for taking a look Christian!
Yes, that was poor wording, sorry about that. It should instead say that the op would be called by the process the buffer is currently charged to in order to transfer the buffer's charge to a different cgroup. This is helpful in the case where a process acts as an allocator for multiple client processes and we would like the allocated buffers to be charged to the clients who requested their allocation (instead of the allocating process, as is the default behavior). In Android, the graphics allocator HAL process[1] does most of the graphics allocations on behalf of various clients. After allocation, the HAL process passes the fd to the client over binder IPC and the binder driver invokes the charge_to_cgroup() DMA-BUF op to uncharge the buffer from the HAL process and charge it to the client process instead.
For that use-case, do we really need to have the vfunc abstraction and force all exporters to do something reasonable with it?
I was about to write up a similar answer, but more from the technical side.
Why in the world should that be done on the DMA-buf object as a communication function between importer and exporter?
That design makes absolutely no sense at all to me.
Regards, Christian.
I think just storing the cgroup's gpu memory bucket this is charged against and doing this in a generic way would be a lot better.
That way we can also easily add other neat features in the future, like e.g. ttm could take care of charge-assignment automatically maybe, or we could print the current cgroups charge relationship in the sysfs info file. Or anything else really.
Thank you for the comments Daniel and Christian! I made the charge/uncharge/transfer part of the exporter implementation since it provided exporters a choice on whether they wanted to enable cgroup memory accounting for the buffers they were exporting. I also see the benefits of making the charge/uncharge/transfer generic by moving it to the DMA-BUF framework like you are suggesting though. We will move to a more generic design in the next version of the RFC.
Regards, Hridya
I do feel that in general for gpu memory cgroups to be useful, we should really have memory pools as a fairly strong concept. Otherwise every driver/allocator/thing is going to come up with their own, and very likely incompatible interpretation. And we end up with a supposed generic cgroups interface which cannot actually be used in a driver/vendor agnostic way at all. -Daniel
Regards, Hridya
Regards, Christian.
The caller must hold a reference to @gpucg obtained via
* gpucg_get(). The DMA-BUF will be uncharged from the cgroup it is
* currently charged to before being charged to @gpucg. The caller must
* belong to the cgroup the buffer is currently charged to.
*
* This callback is optional.
*
* Returns:
*
* 0 on success or negative error code on failure.
*/
int (*charge_to_cgroup)(struct dma_buf *dmabuf, struct gpucg *gpucg);
};
/**
The DMA-BUF op can be invoked when a process that allocated a buffer relinquishes its ownership and passes it over to another process.
Signed-off-by: Hridya Valsaraju <hridya@google.com>
---
 drivers/dma-buf/heaps/system_heap.c | 43 +++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c index adfdc8c576f2..70f5b98f1157 100644 --- a/drivers/dma-buf/heaps/system_heap.c +++ b/drivers/dma-buf/heaps/system_heap.c @@ -307,6 +307,48 @@ static void system_heap_dma_buf_release(struct dma_buf *dmabuf) kfree(buffer); }
+#ifdef CONFIG_CGROUP_GPU +static int system_heap_dma_buf_charge(struct dma_buf *dmabuf, struct gpucg *gpucg) +{ + struct gpucg *current_gpucg; + struct gpucg_device *gpucg_dev; + struct system_heap_buffer *buffer = dmabuf->priv; + size_t len = buffer->len; + int ret = 0; + + /* + * Check that the process requesting the transfer is the same as the one + * to whom the buffer is currently charged to. + */ + current_gpucg = gpucg_get(current); + if (current_gpucg != buffer->gpucg) + ret = -EPERM; + + gpucg_put(current_gpucg); + if (ret) + return ret; + + gpucg_dev = dma_heap_get_gpucg_dev(buffer->heap); + + ret = gpucg_try_charge(gpucg, gpucg_dev, len); + if (ret) + return ret; + + /* uncharge the buffer from the cgroup its currently charged to. */ + gpucg_uncharge(buffer->gpucg, gpucg_dev, buffer->len); + gpucg_put(buffer->gpucg); + + buffer->gpucg = gpucg; + + return 0; +} +#else +static int system_heap_dma_buf_charge(struct dma_buf *dmabuf, struct gpucg *gpucg) +{ + return 0; +} +#endif + static const struct dma_buf_ops system_heap_buf_ops = { .attach = system_heap_attach, .detach = system_heap_detach, @@ -318,6 +360,7 @@ static const struct dma_buf_ops system_heap_buf_ops = { .vmap = system_heap_vmap, .vunmap = system_heap_vunmap, .release = system_heap_dma_buf_release, + .charge_to_cgroup = system_heap_dma_buf_charge, };
static struct page *alloc_largest_available(unsigned long size,
This patch introduces a buffer flag, BINDER_BUFFER_FLAG_SENDER_NO_NEED, that a process sending an fd array to another process over binder IPC can set to relinquish ownership of the fds being sent, for memory accounting purposes. If the flag is set during fd array translation and an fd refers to a DMA-BUF, the buffer is uncharged from the sender's cgroup and charged to the receiving process's cgroup instead.
It is up to the sending process to ensure that it closes the fds regardless of whether the transfer failed or succeeded.
Most graphics shared memory allocations in Android are done by the graphics allocator HAL process. On requests from clients, the HAL process allocates memory and sends the fds to the clients over binder IPC. The graphics allocator HAL will not retain any references to the buffers. When the HAL sets the BINDER_BUFFER_FLAG_SENDER_NO_NEED flag for fd arrays holding DMA-BUF fds, the GPU cgroup controller will be able to correctly charge the buffers to the client processes instead of the graphics allocator HAL.
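For illustration only (not part of this patch), a sender would mark the binder_buffer_object that parents an fd array roughly like this; the payload layout, object index and offsets are placeholders:

	struct binder_buffer_object bbo = {
		.hdr.type = BINDER_TYPE_PTR,
		.flags = BINDER_BUFFER_FLAG_SENDER_NO_NEED,
		.buffer = (binder_uintptr_t)payload,
		.length = sizeof(payload),
	};

	struct binder_fd_array_object fda = {
		.hdr.type = BINDER_TYPE_FDA,
		.num_fds = 1,
		.parent = 0,		/* index of bbo among the transaction objects */
		.parent_offset = 0,	/* offset of the fd within the parent buffer */
	};

During translation of the fd array, the kernel change below then uncharges each DMA-BUF from the sender's cgroup and charges it to the receiving process's cgroup.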
Signed-off-by: Hridya Valsaraju <hridya@google.com>
---
 drivers/android/binder.c            | 32 +++++++++++++++++++++++++++
 include/uapi/linux/android/binder.h |  1 +
 2 files changed, 33 insertions(+)
diff --git a/drivers/android/binder.c b/drivers/android/binder.c index 5497797ab258..83082fd1ab6a 100644 --- a/drivers/android/binder.c +++ b/drivers/android/binder.c @@ -42,6 +42,7 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/dma-buf.h> #include <linux/fdtable.h> #include <linux/file.h> #include <linux/freezer.h> @@ -2482,8 +2483,11 @@ static int binder_translate_fd_array(struct list_head *pf_head, { binder_size_t fdi, fd_buf_size; binder_size_t fda_offset; + bool transfer_gpu_charge = false; const void __user *sender_ufda_base; struct binder_proc *proc = thread->proc; + struct binder_proc *target_proc = t->to_proc; + int ret;
fd_buf_size = sizeof(u32) * fda->num_fds; @@ -2520,8 +2524,15 @@ static int binder_translate_fd_array(struct list_head *pf_head, if (ret) return ret;
+ if (IS_ENABLED(CONFIG_CGROUP_GPU) && + parent->flags & BINDER_BUFFER_FLAG_SENDER_NO_NEED) + transfer_gpu_charge = true; + for (fdi = 0; fdi < fda->num_fds; fdi++) { u32 fd; + struct dma_buf *dmabuf; + struct gpucg *gpucg; + binder_size_t offset = fda_offset + fdi * sizeof(fd); binder_size_t sender_uoffset = fdi * sizeof(fd);
@@ -2531,6 +2542,27 @@ static int binder_translate_fd_array(struct list_head *pf_head, in_reply_to); if (ret) return ret > 0 ? -EINVAL : ret; + + if (!transfer_gpu_charge) + continue; + + dmabuf = dma_buf_get(fd); + if (IS_ERR(dmabuf)) + continue; + + if (dmabuf->ops->charge_to_cgroup) { + gpucg = gpucg_get(target_proc->tsk); + ret = dmabuf->ops->charge_to_cgroup(dmabuf, gpucg); + if (ret) { + pr_warn("%d:%d Unable to transfer DMA-BUF fd charge to %d", + proc->pid, thread->pid, target_proc->pid); + gpucg_put(gpucg); + } + } else { + pr_warn("%d:%d DMA-BUF exporter %s is not configured correctly for GPU cgroup memory accounting", + proc->pid, thread->pid, dmabuf->exp_name); + } + dma_buf_put(dmabuf); } return 0; } diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h index ad619623571e..c85f0014c341 100644 --- a/include/uapi/linux/android/binder.h +++ b/include/uapi/linux/android/binder.h @@ -137,6 +137,7 @@ struct binder_buffer_object {
enum { BINDER_BUFFER_FLAG_HAS_PARENT = 0x01, + BINDER_BUFFER_FLAG_SENDER_NO_NEED = 0x02, };
/* struct binder_fd_array_object - object describing an array of fds in a buffer