Hi,
This series follows up on the discussion John and I had a few months
ago here:
https://lore.kernel.org/all/CANDhNCquJn6bH3KxKf65BWiTYLVqSd9892-xtFDHHqqyrr…
The initial problem we were discussing was that I'm currently working on
a platform which has a memory layout with ECC enabled. However, enabling
the ECC has a number of drawbacks on that platform: lower performance,
increased memory usage, etc. So for things like framebuffers, the
trade-off isn't great and thus there's a memory region with ECC disabled
to allocate from for such use cases.
After a suggestion from John, I chose to use heap allocation flags to
allow userspace to ask for a particular ECC setup. This is then backed
by a new heap type that allocates from reserved memory chunks flagged
as such, with the existing DT properties specifying the ECC
configuration.
We could also easily extend this mechanism to support more flags, or
add a new ioctl to discover which flags a given heap supports.
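For illustration, an allocation from userspace would look roughly like
this (the flag name below is illustrative; the actual uAPI name is
whatever the "Add ECC protection flags" patch ends up defining):

/* Illustrative flag name only, see the uAPI patch for the real one */
#define DMA_HEAP_FLAG_ECC_PROTECTED     (1 << 0)

struct dma_heap_allocation_data data = {
        .len = len,
        .fd_flags = O_RDWR | O_CLOEXEC,
        /* ask this heap for ECC-protected memory */
        .heap_flags = DMA_HEAP_FLAG_ECC_PROTECTED,
};

ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);

A heap that can't satisfy the requested flags would presumably just
fail the allocation, so userspace can fall back to another heap.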
I submitted a draft PR to the DT schema repository for the bindings
used in this series:
https://github.com/devicetree-org/dt-schema/pull/138
Let me know what you think,
Maxime
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
Maxime Ripard (8):
dma-buf: heaps: Introduce a new heap for reserved memory
of: Add helper to retrieve ECC memory bits
dma-buf: heaps: Import uAPI header
dma-buf: heaps: Add ECC protection flags
dma-buf: heaps: system: Remove global variable
dma-buf: heaps: system: Handle ECC flags
dma-buf: heaps: cma: Handle ECC flags
dma-buf: heaps: carveout: Handle ECC flags
drivers/dma-buf/dma-heap.c | 4 +
drivers/dma-buf/heaps/Kconfig | 8 +
drivers/dma-buf/heaps/Makefile | 1 +
drivers/dma-buf/heaps/carveout_heap.c | 330 ++++++++++++++++++++++++++++++++++
drivers/dma-buf/heaps/cma_heap.c | 10 ++
drivers/dma-buf/heaps/system_heap.c | 29 ++-
include/linux/dma-heap.h | 2 +
include/linux/of.h | 25 +++
include/uapi/linux/dma-heap.h | 5 +-
9 files changed, 407 insertions(+), 7 deletions(-)
---
base-commit: a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
change-id: 20240515-dma-buf-ecc-heap-28a311d2c94e
Best regards,
--
Maxime Ripard <mripard(a)kernel.org>
On Wed, Jul 10, 2024 at 8:08 AM Lei Liu <liulei.rjpt(a)vivo.com> wrote:
>
>
> on 2024/7/10 22:48, Christian König wrote:
> > Am 10.07.24 um 16:35 schrieb Lei Liu:
> >>
> >> on 2024/7/10 22:14, Christian König wrote:
> >>> Am 10.07.24 um 15:57 schrieb Lei Liu:
> >>>> Use vm_insert_page to establish a mapping for the memory allocated
> >>>> by dmabuf, thus supporting direct I/O read and write; and fix the
> >>>> issue of incorrect memory statistics after mapping dmabuf memory.
> >>>
> >>> Well big NAK to that! Direct I/O is intentionally disabled on DMA-bufs.
> >>
> >> Hello! Could you explain why direct_io is disabled on DMABUF? Is
> >> there any historical reason for this?
> >
> > It's basically one of the most fundamental design decisions of DMA-Buf.
> > The attachment/map/fence model DMA-buf uses is not really compatible
> > with direct I/O on the underlying pages.
>
> Thank you! Is there any related documentation on this? I would like to
> understand and learn more about the fundamental reasons for the lack of
> support.
Hi Lei and Christian,
This is now the third request for this I've seen from three different
companies, but the others are not motivated by the read performance
concerns you mention in the commit message of your first patch.
Someone else at Google ran a comparison between a normal read()
and a direct I/O read() into a preallocated user buffer and found that
with large readahead (16 MB) the throughput can actually be slightly
higher than direct I/O. If you have concerns about read performance,
have you tried increasing the readahead size?
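For what it's worth, the simplest experiment is a hint right before the
read(); note this only affects buffered I/O, so without O_DIRECT:

/* Hint sequential access: on Linux this roughly doubles this file's
 * readahead window (see posix_fadvise(2)). The per-device default can
 * also be raised via /sys/block/<dev>/queue/read_ahead_kb. */
posix_fadvise(file_fd, 0, 0, POSIX_FADV_SEQUENTIAL);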
The other motivation is to load a gajillion-byte file from disk into a
dmabuf without evicting the entire contents of the pagecache while doing
so. Something like this (which does not currently work, because read()
tries to GUP on the dmabuf memory as you mention):
#define _GNU_SOURCE /* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/dma-buf.h>
#include <linux/dma-heap.h>

static int dmabuf_heap_alloc(int heap_fd, size_t len)
{
        struct dma_heap_allocation_data data = {
                .len = len,
                .fd = 0,
                .fd_flags = O_RDWR | O_CLOEXEC,
                .heap_flags = 0,
        };
        int ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);

        if (ret < 0)
                return ret;
        return data.fd;
}

int main(int argc, char **argv)
{
        const char *file_path = argv[1];

        printf("File: %s\n", file_path);

        int file_fd = open(file_path, O_RDONLY | O_DIRECT);
        struct stat st;

        stat(file_path, &st);
        ssize_t file_size = st.st_size;
        /* O_DIRECT wants block-aligned lengths, so round up to 4 KiB */
        ssize_t aligned_size = (file_size + 4095) & ~4095;

        printf("File size: %zd Aligned size: %zd\n", file_size, aligned_size);

        int heap_fd = open("/dev/dma_heap/system", O_RDONLY);
        int dmabuf_fd = dmabuf_heap_alloc(heap_fd, aligned_size);
        void *vm = mmap(NULL, aligned_size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, dmabuf_fd, 0);

        printf("VM at 0x%lx\n", (unsigned long)vm);

        /* Bracket CPU access to the dmabuf with the sync ioctl */
        struct dma_buf_sync sync_flags = { .flags = DMA_BUF_SYNC_START |
                        DMA_BUF_SYNC_READ | DMA_BUF_SYNC_WRITE };
        ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync_flags);

        ssize_t rc = read(file_fd, vm, file_size);

        printf("Read: %zd %s\n", rc, rc < 0 ? strerror(errno) : "");

        sync_flags.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ |
                        DMA_BUF_SYNC_WRITE;
        ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync_flags);

        return 0;
}
Or replace the mmap() + read() with sendfile().
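The sendfile() variant is just the following, which likewise fails
today since the dmabuf fd doesn't accept writes through that path:

/* Stream the file straight into the dmabuf, with no userspace bounce
 * buffer. Needs #include <sys/sendfile.h>. */
off_t off = 0;
ssize_t copied = sendfile(dmabuf_fd, file_fd, &off, file_size);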
So I would also like to see the above code (or something similar) work,
and I understand some of the reasons why it currently does not, but I
don't understand why we should actively prevent this type of behavior
entirely.
Best,
T.J.
> >
> >>>
> >>> We already discussed enforcing that in the DMA-buf framework and
> >>> this patch probably means that we should really do that.
> >>>
> >>> Regards,
> >>> Christian.
> >>
> >> Thank you for your response. With large AI models being deployed at
> >> the edge, we urgently need direct I/O support on DMABUF to read some
> >> very large files. Do you have any new solutions or plans for this?
> >
> > We have seen similar projects over the years and all of those turned
> > out to be complete shipwrecks.
> >
> > There is currently a patch set under discussion to give the network
> > subsystem DMA-buf support. If you are interested in network direct I/O,
> > that could help.
>
> Is there a related introduction link for this patch?
>
> >
> > In addition to that, a lot of GPU drivers support userptr usage, e.g.
> > to import malloc'ed memory into the GPU driver. You can then also do
> > direct I/O on that malloc'ed memory and the kernel will enforce correct
> > handling with the GPU driver through MMU notifiers.
> >
> > But as far as I know a general DMA-buf based solution isn't possible.
>
> 1. The reason we need to use DMABUF memory here is that we need to share
> memory between the CPU and APU. Currently, only DMABUF memory is
> suitable for this purpose. Additionally, we need to read very large files.
>
> 2. Are there any other solutions for this? Also, do you have any plans
> to support direct_io for DMABUF memory in the future?
>
> >
> > Regards,
> > Christian.
> >
> >>
> >> Regards,
> >> Lei Liu.
> >>
> >>>
> >>>>
> >>>> Lei Liu (2):
> >>>> mm: dmabuf_direct_io: Support direct_io for memory allocated by
> >>>> dmabuf
> >>>> mm: dmabuf_direct_io: Fix memory statistics error for dmabuf
> >>>> allocated
> >>>> memory with direct_io support
> >>>>
> >>>> drivers/dma-buf/heaps/system_heap.c | 5 +++--
> >>>> fs/proc/task_mmu.c | 8 +++++++-
> >>>> include/linux/mm.h | 1 +
> >>>> mm/memory.c | 15 ++++++++++-----
> >>>> mm/rmap.c | 9 +++++----
> >>>> 5 files changed, 26 insertions(+), 12 deletions(-)
> >>>>
> >>>
> >
On Thu, Jun 20, 2024 at 3:52 PM Hans Verkuil <hverkuil-cisco(a)xs4all.nl> wrote:
>
> On 19/06/2024 06:19, Tomasz Figa wrote:
> > On Wed, Jun 19, 2024 at 1:24 AM Nicolas Dufresne <nicolas(a)ndufresne.ca> wrote:
> >>
> >> Le mardi 18 juin 2024 à 16:47 +0900, Tomasz Figa a écrit :
> >>> Hi TaoJiang,
> >>>
> >>> On Tue, Jun 18, 2024 at 4:30 PM TaoJiang <tao.jiang_2(a)nxp.com> wrote:
> >>>>
> >>>> From: Ming Qian <ming.qian(a)nxp.com>
> >>>>
> >>>> When the memory type is VB2_MEMORY_DMABUF, the v4l2 device can't know
> >>>> whether the dma buffer is coherent or synchronized.
> >>>>
> >>>> videobuf2-core will skip cache syncs, as it thinks the DMA-buf
> >>>> exporter should take care of them.
> >>>>
> >>>> But in fact it's likely that the client doesn't synchronize the
> >>>> dma-buf before qbuf() or after dqbuf(), and it's difficult to find
> >>>> this type of error directly.
> >>>>
> >>>> I think it's helpful for videobuf2-core to call
> >>>> dma_buf_end_cpu_access() and dma_buf_begin_cpu_access() to handle
> >>>> the cache syncs.
> >>>>
> >>>> Signed-off-by: Ming Qian <ming.qian(a)nxp.com>
> >>>> Signed-off-by: TaoJiang <tao.jiang_2(a)nxp.com>
> >>>> ---
> >>>> .../media/common/videobuf2/videobuf2-core.c | 22 +++++++++++++++++++
> >>>> 1 file changed, 22 insertions(+)
> >>>>
> >>>
> >>> Sorry, that patch is incorrect. I believe you're misunderstanding the
> >>> way DMA-buf buffers should be managed in userspace. It's userspace's
> >>> responsibility to call the DMA_BUF_IOCTL_SYNC ioctl [1] to signal the
> >>> start and end of CPU access to the kernel and trigger the necessary
> >>> cache synchronization.
> >>>
> >>> [1] https://docs.kernel.org/driver-api/dma-buf.html#dma-buffer-ioctls
> >>>
> >>> So, really sorry, but it's a NAK.
> >>
> >>
> >>
> >> This patch *could* make sense if it were inside the UVC driver, as an
> >> example, since that driver can import a dmabuf, do a CPU memcpy, and
> >> omits the required sync calls (unless that got added recently; I can
> >> easily have missed it).
> >
> > Yeah, currently V4L2 drivers don't call the in-kernel
> > dma_buf_{begin,end}_cpu_access() when they need to access the buffers
> > from the CPU, while my quick grep [1] reveals that we have 68 files
> > retrieving plane vaddr by calling vb2_plane_vaddr() (not necessarily a
> > 100% guarantee of CPU access being done, but rather likely so).
> >
> > I also repeated the same thing with VB2_DMABUF [2] and tried to
> > attribute both lists to specific drivers (by retaining the path until
> > the first - or _ [3], which seemed to be relatively accurate), leading
> > to the following drivers that claim support for DMABUF while also
> > retrieving plane vaddr (without proper synchronization - no drivers
> > currently call any begin/end CPU access):
> >
> > i2c/video
> > pci/bt8xx/bttv
> > pci/cobalt/cobalt
> > pci/cx18/cx18
> > pci/tw5864/tw5864
> > pci/tw686x/tw686x
> > platform/allegro
> > platform/amphion/vpu
> > platform/chips
> > platform/intel/pxa
> > platform/marvell/mcam
> > platform/mediatek/jpeg/mtk
> > platform/mediatek/vcodec/decoder/mtk
> > platform/mediatek/vcodec/encoder/mtk
> > platform/nuvoton/npcm
> > platform/nvidia/tegra
> > platform/nxp/imx
> > platform/renesas/rcar
> > platform/renesas/vsp1/vsp1
> > platform/rockchip/rkisp1/rkisp1
> > platform/samsung/exynos4
> > platform/samsung/s5p
> > platform/st/sti/delta/delta
> > platform/st/sti/hva/hva
> > platform/verisilicon/hantro
> > usb/au0828/au0828
> > usb/cx231xx/cx231xx
> > usb/dvb
> > usb/em28xx/em28xx
> > usb/gspca/gspca.c
> > usb/hackrf/hackrf.c
> > usb/stk1160/stk1160
> > usb/uvc/uvc
> >
> > which means we potentially have ~30 drivers which likely don't handle
> > imported DMABUFs correctly (there is still a chance that DMABUF is
> > advertised for one queue, while vaddr is used for another).
> >
> > I think we have two options:
> > 1) add vb2_{begin/end}_cpu_access() helpers, carefully audit each
> > driver and add calls to those
>
> I actually started on that 9 (!) years ago:
>
> https://git.linuxtv.org/hverkuil/media_tree.git/log/?h=vb2-cpu-access
>
> If memory serves, the main problem was that there were some drivers where
> it wasn't clear what should be done. In the end I never continued this
> work since nobody complained about it.
>
> This patch series adds vb2_plane_begin/end_cpu_access() functions,
> replaces all calls to vb2_plane_vaddr() in drivers to the new functions,
> and at the end removes vb2_plane_vaddr() altogether.
>
> > 2) take a heavy gun approach and just call vb2_begin_cpu_access()
> > whenever vb2_plane_vaddr() is called and then vb2_end_cpu_access()
> > whenever vb2_buffer_done() is called (if begin was called before).
> >
> > The latter has the disadvantage of drivers not having control over the
> > timing of the cache sync, so could end up with less than optimal
> > performance. Also there could be some more complex cases, where the
> > driver needs to mix DMA and CPU accesses to the buffer, so the fixed
> > sequence just wouldn't work for them. (But then they just wouldn't
> > work today either.)
> >
> > Hans, Marek, do you have any thoughts? (I'd personally just go with 2
> > and if any driver in the future needs something else, they could call
> > begin/end CPU access manually.)
>
> I prefer 1. If nothing else, that makes it easy to identify drivers
> that do such things.
>
> But perhaps a mix is possible: if a VB2 flag is set by the driver, then
> approach 2 is used. That might help with the drivers where it isn't clear
> what they should do. Although perhaps this can all be done in the driver
> itself: instead of vb2_plane_vaddr they call vb2_begin_cpu_access for the
> whole buffer, and at buffer_done time they call vb2_end_cpu_access. Should
> work just as well for the very few drivers that need this.
That's a good point. I guess we don't really need to dig so much into
those drivers in this case. Just mechanically do the same for all of
them (+/- maybe checking for some obvious corner cases which don't
need the extra calls). Let me see if I can give it a stab.
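Roughly, I'd imagine the helpers looking something like this (just a
sketch; names and exact placement to be decided):

static int vb2_plane_begin_cpu_access(struct vb2_buffer *vb,
                                      unsigned int plane)
{
        struct dma_buf *dbuf = vb->planes[plane].dbuf;

        /* Only DMABUF-imported planes need explicit bracketing here;
         * other memory types are handled by the prepare/finish memops. */
        if (vb->memory != VB2_MEMORY_DMABUF || !dbuf)
                return 0;

        return dma_buf_begin_cpu_access(dbuf, vb->vb2_queue->dma_dir);
}

static void vb2_plane_end_cpu_access(struct vb2_buffer *vb,
                                     unsigned int plane)
{
        struct dma_buf *dbuf = vb->planes[plane].dbuf;

        if (vb->memory == VB2_MEMORY_DMABUF && dbuf)
                dma_buf_end_cpu_access(dbuf, vb->vb2_queue->dma_dir);
}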
Best,
Tomasz
>
> Regards,
>
> Hans
>
> >
> > [1] git grep vb2_plane_vaddr | cut -d":" -f 1 | sort | uniq
> > [2] git grep VB2_DMABUF | cut -d":" -f 1 | sort | uniq
> > [3] by running [1] and [2] through | cut -d"-" -f 1 | cut -d"_" -f 1 | uniq
> >
> > Best,
> > Tomasz
> >
> >>
> >> But generally speaking, bracketing all driver with CPU access synchronization
> >> does not make sense indeed, so I second the rejection.
> >>
> >> Nicolas
> >>
> >>>
> >>> Best regards,
> >>> Tomasz
> >>>
> >>>> diff --git a/drivers/media/common/videobuf2/videobuf2-core.c b/drivers/media/common/videobuf2/videobuf2-core.c
> >>>> index 358f1fe42975..4734ff9cf3ce 100644
> >>>> --- a/drivers/media/common/videobuf2/videobuf2-core.c
> >>>> +++ b/drivers/media/common/videobuf2/videobuf2-core.c
> >>>> @@ -340,6 +340,17 @@ static void __vb2_buf_mem_prepare(struct vb2_buffer *vb)
> >>>> vb->synced = 1;
> >>>> for (plane = 0; plane < vb->num_planes; ++plane)
> >>>> call_void_memop(vb, prepare, vb->planes[plane].mem_priv);
> >>>> +
> >>>> + if (vb->memory != VB2_MEMORY_DMABUF)
> >>>> + return;
> >>>> + for (plane = 0; plane < vb->num_planes; ++plane) {
> >>>> + struct dma_buf *dbuf = vb->planes[plane].dbuf;
> >>>> +
> >>>> + if (!dbuf)
> >>>> + continue;
> >>>> +
> >>>> + dma_buf_end_cpu_access(dbuf, vb->vb2_queue->dma_dir);
> >>>> + }
> >>>> }
> >>>>
> >>>> /*
> >>>> @@ -356,6 +367,17 @@ static void __vb2_buf_mem_finish(struct vb2_buffer *vb)
> >>>> vb->synced = 0;
> >>>> for (plane = 0; plane < vb->num_planes; ++plane)
> >>>> call_void_memop(vb, finish, vb->planes[plane].mem_priv);
> >>>> +
> >>>> + if (vb->memory != VB2_MEMORY_DMABUF)
> >>>> + return;
> >>>> + for (plane = 0; plane < vb->num_planes; ++plane) {
> >>>> + struct dma_buf *dbuf = vb->planes[plane].dbuf;
> >>>> +
> >>>> + if (!dbuf)
> >>>> + continue;
> >>>> +
> >>>> + dma_buf_begin_cpu_access(dbuf, vb->vb2_queue->dma_dir);
> >>>> + }
> >>>> }
> >>>>
> >>>> /*
> >>>> --
> >>>> 2.43.0-rc1
> >>>>
> >>
> >
>
On Mon, Jul 8, 2024 at 6:47 AM Zenghui Yu <yuzenghui(a)huawei.com> wrote:
>
> Even if a vgem device is configured in, we will skip the import_vgem_fd()
> test almost every time.
>
> TAP version 13
> 1..11
> # Testing heap: system
> # =======================================
> # Testing allocation and importing:
> ok 1 # SKIP Could not open vgem -1
>
> The problem is that we use the DRM_IOCTL_VERSION ioctl to query the driver
> version information but leave the name field a non-null-terminated string.
> Terminate it properly to actually test against the vgem device.
Hm yeah. Looks like drm_copy_field resets version.name_len to the actual
size of the name in the case of truncation, so maybe worth checking
that too, in case there is a name like "vgemfoo" that gets converted to
"vgem\0" by this?
>
> Signed-off-by: Zenghui Yu <yuzenghui(a)huawei.com>
> ---
> tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
> index 5f541522364f..2fcc74998fa9 100644
> --- a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
> +++ b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
> @@ -32,6 +32,8 @@ static int check_vgem(int fd)
> if (ret)
> return 0;
>
> + name[4] = '\0';
> +
> return !strcmp(name, "vgem");
> }
>
> --
> 2.33.0
>
On Thu, 4 Jul 2024 at 00:40, Amirreza Zarrabi <quic_azarrabi(a)quicinc.com> wrote:
>
>
>
> On 7/3/2024 10:13 PM, Dmitry Baryshkov wrote:
> > On Tue, Jul 02, 2024 at 10:57:36PM GMT, Amirreza Zarrabi wrote:
> >> Qualcomm TEE hosts Trusted Applications and Services that run in the
> >> secure world. Access to these resources is provided using object
> >> capabilities. A TEE client with access to the capability can invoke
> >> the object and request a service. Similarly, the TEE can request a
> >> service from the nonsecure world with object capabilities that are
> >> exported to the secure world.
> >>
> >> We provide qcom_tee_object, which represents an object in both the
> >> secure and nonsecure worlds. TEE clients can invoke an instance of
> >> qcom_tee_object to access the TEE. The TEE can issue a callback request
> >> to the nonsecure world by invoking an instance of qcom_tee_object there.
> >
> > Please see Documentation/process/submitting-patches.rst on how to write
> > commit messages.
>
> Ack.
>
> >
> >>
> >> Any driver in the nonsecure world that wants to export a struct (or a
> >> service object) to the TEE needs to embed an instance of qcom_tee_object
> >> in the relevant struct and implement the dispatcher function, which is
> >> called when the TEE invokes the service object.
> >>
> >> We also provide a simplified API which implements the Qualcomm TEE
> >> transport protocol. The implementation is independent of any services
> >> that may reside in the nonsecure world.
> >
> > "also" usually means that it should go to a separate commit.
>
> I will split this patch to multiple smaller ones.
>
[...]
> >
> >> + } in, out;
> >> +};
> >> +
> >> +int qcom_tee_object_do_invoke(struct qcom_tee_object_invoke_ctx *oic,
> >> + struct qcom_tee_object *object, unsigned long op, struct qcom_tee_arg u[], int *result);
> >
> > What's the difference between a result that gets returned by the
> > function and the result that gets retuned via the pointer?
>
> The function result is local to the kernel, for instance a memory
> allocation failure or a failure to issue the SMC call. The result in the
> pointer is the remote result, for instance the return value from the TA
> or the TEE itself.
>
> I'll use a better name, e.g. 'remote_result'?
See how this is handled by other parties. For example, PSCI. If you
have a standard set of return codes, translate them to -ESOMETHING in
your framework and let everybody else see only the standard errors.
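Something along these lines (a sketch; the QCOM_TEE_* status names are
made up for illustration, not taken from your series):

/* Translate QTEE status codes to standard errnos at the framework
 * boundary so that callers only ever see -ESOMETHING. */
static int qcom_tee_status_to_errno(u32 status)
{
        switch (status) {
        case QCOM_TEE_OK:
                return 0;
        case QCOM_TEE_ERR_INVALID_ARGS:
                return -EINVAL;
        case QCOM_TEE_ERR_NO_MEMORY:
                return -ENOMEM;
        case QCOM_TEE_ERR_NOT_SUPPORTED:
                return -EOPNOTSUPP;
        default:
                return -EIO;
        }
}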
--
With best wishes
Dmitry
On Mon, Jul 01, 2024 at 11:26:34PM -0700, Andrew Morton wrote:
> No, I do think the cast is useful:
>
> struct page *page = dma_fence_chain_alloc();
>
> will presently generate a warning. We want this. Your change will
> remove that useful warning.
>
>
> Unrelatedly: there is no earthly reason why this is implemented as a
> macro. A static inline function would be so much better. Why do we
> keep doing this.
Agreed with all of the above. Adding the dmabuf maintainers.
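For reference, assuming the current macro is just a casted kmalloc() of
the chain struct, the static inline version would be something like:

/* The typed return value keeps the useful warning on
 *     struct page *page = dma_fence_chain_alloc();
 * while avoiding the macro. */
static inline struct dma_fence_chain *dma_fence_chain_alloc(void)
{
        return kmalloc(sizeof(struct dma_fence_chain), GFP_KERNEL);
}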