Adding the TEE mailing list and maintainers to the CC list.
Amirreza, please include them in the future, even if you are not going to use
the framework.
On Wed, Jul 10, 2024 at 09:16:48AM GMT, Amirreza Zarrabi wrote:
>
>
> On 7/3/2024 9:36 PM, Dmitry Baryshkov wrote:
> > On Tue, Jul 02, 2024 at 10:57:35PM GMT, Amirreza Zarrabi wrote:
> >> Qualcomm TEE hosts Trusted Applications (TAs) and services that run in
> >> the secure world. Access to these resources is provided using MinkIPC.
> >> MinkIPC is a capability-based synchronous message passing facility. It
> >> allows code executing in one domain to invoke objects running in other
> >> domains. When a process holds a reference to an object that lives in
> >> another domain, that object reference is a capability. Capabilities
> >> allow us to separate implementation of policies from implementation of
> >> the transport.
> >>
> >> As part of the upstreaming of the object invoke driver (called SMC-Invoke
> >> driver), we need to provide a reasonable kernel API and UAPI. The clear
> >> option is to use TEE subsystem and write a back-end driver, however the
> >> TEE subsystem doesn't fit with the design of Qualcomm TEE.
> >>
>
> To answer your "general comment", maybe a bit of background :).
>
> Traditionally, policy enforcement is based on access-control models,
> either (1) access-control list or (2) capability [0]. A capability is an
> opaque ("non-forge-able") object reference that grants the holder the
> right to perform certain operations on the object (e.g. Read, Write,
> Execute, or Grant). Capabilities are the preferred mechanism for representing
> a policy, due to their fine-grained representation of access rights, in line
> with
> (P1) the principle of least privilege [1], and
> (P2) the ability to avoid the confused deputy problem [2].
>
> [0] Jack B. Dennis and Earl C. Van Horn. 1966. Programming Semantics for
> Multiprogrammed Computations. Commun. ACM 9 (1966), 143–155.
>
> [1] Jerome H. Saltzer and Michael D. Schroeder. 1975. The Protection of
> Information in Computer Systems. Proc. IEEE 63 (1975), 1278–1308.
>
> [2] Norm Hardy. 1988. The Confused Deputy (or Why Capabilities Might Have
> Been Invented). ACM Operating Systems Review 22, 4 (1988), 36–38.
>
> For MinkIPC, an object represents a TEE or TA service. The reference to
> the object is the "handle" that is returned from TEE (let's call it
> TEE-Handle). The supported operations are "service invocation" (similar
> to Execute), and "sharing access to a service" (similar to Grant).
> Anyone with access to the TEE-Handle can invoke the service or pass the
> TEE-Handle to someone else to access the same service.
>
> The responsibility of the MinkIPC framework is to hide the TEE-Handle,
> so that the client can not forge it, and allow the owner of the handle
> to transfer it to other clients as it wishes. Using a file descriptor
> table we can achieve that. We wrap the TEE-Handle as a FD and let the
> client invoke FD (e.g. using IOCTL), or transfer the FD (e.g. using
> UNIX socket).
>
> As a side note, for the sake of completeness, capabilities are fundamentally
> a "discretionary mechanism", as the holder of the object reference has the
> ability to share it with others. A secure system requires "mandatory
> enforcement" (i.e. ability to revoke authority and ability to control
> the authority propagation). This is out of scope for the MinkIPC.
> MinkIPC is only interested in P1 and P2 (mentioned above).
>
>
> >> Does TEE subsystem fit requirements of a capability based system?
> >> -----------------------------------------------------------------
> >> In TEE subsystem, to invoke a function:
> >> - client should open a device file "/dev/teeX",
> >> - create a session with a TA, and
> >> - invoke the functions in that session.
> >>
> >> 1. The privilege to invoke a function is determined by a session. If a
> >> client has a session, it cannot share it with other clients. Even if
> >> it does, it is not fine-grained enough, i.e. either all accessible
> >> functions/resources in a session or none. Assume a scenario where a client
> >> wants to grant another client permission to invoke just one function that
> >> it has the rights to.
> >>
> >> The "all or nothing" for sharing sessions is not in line with our
> >> capability system: "if you own a capability, you should be able to grant
> >> or share it".
> >
> > Can you please be more specific here? What kind of sharing is expected
> > on the user side of it?
>
> In MinkIPC, after authenticating a client credential, a TA (or TEE) may
> return multiple TEE-Handles, each representing a service that the client
> has privilege to access. The client should be able to "individually"
> reference each TEE-Handle, e.g. to invoke and share it (as per capability-
> based system requirements).
>
> If we use TEE subsystem, which has a session based design, all TEE-Handles
> are meaningful with respect to the session in which they are allocated,
> hence the use of "__u32 session" in "struct tee_ioctl_invoke_arg".
>
> Here, we have a contradiction with MinkIPC. We may ignore the session
> and say "even though a TEE-Handle is allocated in a session, it is also
> valid outside a session", i.e. the session-id in the TEE uapi becomes redundant
> (a case of divergence from definition).
>
> >
> >> 2. In TEE subsystem, resources are managed in a context. Every time a
> >> client opens "/dev/teeX", a new context is created to keep track of
> >> the allocated resources, including opened sessions and remote objects. Any
> >> effort for sharing resources between two independent clients requires
> >> involvement of context manager, i.e. the back-end driver. This requires
> >> implementing some form of policy in the back-end driver.
> >
> > What kind of resource sharing?
>
> TEE subsystem "rightfully" allocates a context each time a client opens
> a device file. This context is passed around to the backend driver to identify
> independent clients that opened the device file.
>
> The context is used by the backend driver to keep track of the resources. The
> types of resources are TEE driver dependent. As an example of a resource in the
> TEE subsystem, you can look into 'shm' register and unregister (especially,
> see the comment in function 'shm_alloc_helper').
>
> For MinkIPC, all clients are treated the same and the TEE-Handles are
> representative of the resources, accessible "globally" if a client has the
> capability for them. In kernel, clients access an object if they have
> access to "qcom_tee_object", in userspace, clients access an object if
> they have the FD wrapper for the TEE-Handle.
>
> If we use context, instead of the file descriptor table, any form of object
> transfer requires involvement of the backend driver. If we use the file
> descriptor table, contexts become useless for MinkIPC (i.e.
> 'ctx->data' will "always" be null).
>
> >
> >> 3. The TEE subsystem supports two types of memory sharing:
> >> - per-device memory pools, and
> >> - user defined memory references.
> >> User defined memory references are private to the application and cannot
> >> be shared. Memory allocated from per-device "shared" pools is accessible
> >> using a file descriptor. It can be mapped by any process if it has
> >> access to it. This means we cannot provide resource isolation
> >> between two clients. Assume a scenario where a client wants to allocate
> >> memory (which is shared with TEE) from an "isolated" pool and share it
> >> with another client, without the right to access the contents of memory.
> >
> > This doesn't explain why it would want to share such memory with
> > another client.
>
> Ok, I believe there is a misunderstanding here. I did not try to justify
> a specific use case. We want to separate the memory allocation from the
> framework. This way, how the memory is obtained, e.g. it is allocated
> (1) from an isolated pool, (2) a shared pool, (3) a secure heap,
> (4) a system dma-heap, (5) process address space, or (6) other memory
> with "different constraints", becomes independent.
>
> We introduced a "memory object" type. The user implements a kernel service
> using "qcom_tee_object" to represent the memory object. We have an
> implementation of memory objects based on dma-buf.
>
> >
> >> 4. The kernel API provided by TEE subsystem does not support a kernel
> >> supplicant. Adding support requires an execution context (e.g. a
> >> kernel thread) due to the TEE subsystem design. tee_driver_ops supports
> >> only "send" and "receive" callbacks and to deliver a request, someone
> >> should wait on "receive".
> >
> > There is nothing wrong here, but maybe I'm misunderstanding something.
>
> I agree. But, I am trying to re-emphasize how useful TEE subsystem is
> for MinkIPC. For kernel services, we solely rely on the backend driver.
> For instance, to expose RPMB service we will use "qcom_tee_object".
> So there is nothing provided by the framework to simplify the service
> development.
>
> >
> >> We need a callback to "dispatch" or "handle" a request in the context of
> >> the client thread. It should redirect a request to a kernel service or
> >> a user supplicant. In TEE subsystem such requirement should be implemented
> >> in TEE back-end driver, independent from the TEE subsystem.
> >>
> >> 5. The UAPI provided by TEE subsystem is similar to the GPTEE Client
> >> interface. This interface is not suitable for a capability system.
> >> For instance, there is no session in a capability system which means
> >> either it should not be used, or we should overload its definition.
> >
> > General comment: maybe adding a more detailed explanation of how the
> > capabilities are acquired and how they can be used might make sense.
> >
> > BTW. It might be my imperfect English, but each time I see the word
> > 'capability' I'm thinking that someone is capable of doing something. I
> > find it hard to use 'capability' for the reference to another object.
> >
>
> Explained at the top :).
>
> >>
> >> Can we use TEE subsystem?
> >> -------------------------
> >> There are workarounds for some of the issues above. The question is if we
> >> should define our own UAPI or try to use a hack-y way of fitting into
> >> the TEE subsystem. I am using the word hack-y, as most of the workarounds
> >> involve:
> >>
> >> - "diverging from the definition". For instance, ignoring the session
> >> open and close ioctl calls or using file descriptors for all remote
> >> resources (as fd is the closest to a capability), which undermines the
> >> isolation provided by the contexts,
> >>
> >> - "overloading the variables". For instance, passing object ID as file
> >> descriptors in a place of session ID, or
> >>
> >> - "bypassing the TEE subsystem". For instance, extensively relying on meta
> >> parameters or pushing everything (e.g. kernel services) to the back-end
> >> driver, which means leaving almost all of the TEE subsystem unused.
> >>
> >> We cannot take the full benefits of TEE subsystem and may need to
> >> implement most of the requirements in the back-end driver. Also, as
> >> discussed above, the UAPI is not suitable for capability-based use cases.
> >> We proposed a new set of ioctl calls for SMC-Invoke driver.
> >>
> >> In this series we posted three patches. We implemented a transport
> >> driver that provides qcom_tee_object. Any object on secure side is
> >> represented with an instance of qcom_tee_object and any struct exposed
> >> to TEE should embed an instance of qcom_tee_object. Any support for new
> >> services, e.g. memory object, RPMB, userspace clients or supplicants, is
> >> implemented independently from the driver.
> >>
> >> We have a simple memory object and a user driver that uses
> >> qcom_tee_object.
> >
> > Could you please point out any user for the uAPI? I'd like to understand
> > how it looks from the userspace point of view.
>
> Sure :), I'll write up a test patch and send it in the next series.
>
> Summary.
>
> TEE framework provides some nice facilities, including:
> - uapi and ioctl interface,
> - marshaling parameters and context management,
> - memory mapping and sharing, and
> - TEE bus and TA drivers.
>
> For MinkIPC, we will not use any of them. The only usable piece is the uapi
> interface, which is not suitable for MinkIPC, as discussed above.
>
> >
> >>
> >> Signed-off-by: Amirreza Zarrabi <quic_azarrabi(a)quicinc.com>
> >> ---
> >> Amirreza Zarrabi (3):
> >> firmware: qcom: implement object invoke support
> >> firmware: qcom: implement memory object support for TEE
> >> firmware: qcom: implement ioctl for TEE object invocation
> >>
> >> drivers/firmware/qcom/Kconfig | 36 +
> >> drivers/firmware/qcom/Makefile | 2 +
> >> drivers/firmware/qcom/qcom_object_invoke/Makefile | 12 +
> >> drivers/firmware/qcom/qcom_object_invoke/async.c | 142 +++
> >> drivers/firmware/qcom/qcom_object_invoke/core.c | 1139 ++++++++++++++++++
> >> drivers/firmware/qcom/qcom_object_invoke/core.h | 186 +++
> >> .../qcom/qcom_object_invoke/qcom_scm_invoke.c | 22 +
> >> .../firmware/qcom/qcom_object_invoke/release_wq.c | 90 ++
> >> .../qcom/qcom_object_invoke/xts/mem_object.c | 406 +++++++
> >> .../qcom_object_invoke/xts/object_invoke_uapi.c | 1231 ++++++++++++++++++++
> >> include/linux/firmware/qcom/qcom_object_invoke.h | 233 ++++
> >> include/uapi/misc/qcom_tee.h | 117 ++
> >> 12 files changed, 3616 insertions(+)
> >> ---
> >> base-commit: 74564adfd3521d9e322cfc345fdc132df80f3c79
> >> change-id: 20240702-qcom-tee-object-and-ioctls-6f52fde03485
> >>
> >> Best regards,
> >> --
> >> Amirreza Zarrabi <quic_azarrabi(a)quicinc.com>
> >>
> >
--
With best wishes
Dmitry
On Thu, Jul 18, 2024 at 09:51:39AM +0800, Huan Yang wrote:
> Yes, actually, if dma-buf wants to copy_file_range from a file, it needs
> to change something in vfs_copy_file_range:
No, it doesn't. copy_file_range is specifically designed to copy inside
a single file system as already mentioned. The generic offload for
copying between arbitrary FDs is splice and the sendfile convenience
wrapper around it.
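
As a rough illustration of that point, copying a whole file between two
unrelated FDs with sendfile(2) from userspace might look like the sketch
below. The helper name copy_fd_sendfile is made up for this example. It
assumes in_fd is a regular file (sendfile needs an input that supports
mmap-like operations) and a Linux kernel where out_fd may be any file.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy all of in_fd to out_fd inside the kernel.
 * Returns the number of bytes copied, or -1 on error.
 */
static ssize_t copy_fd_sendfile(int out_fd, int in_fd)
{
	struct stat st;
	off_t offset = 0;

	if (fstat(in_fd, &st) < 0)
		return -1;

	while (offset < st.st_size) {
		/* sendfile() advances 'offset' by the bytes it transferred. */
		ssize_t n = sendfile(out_fd, in_fd, &offset,
				     (size_t)(st.st_size - offset));
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)	/* input shrank while we were copying */
			break;
	}
	return (ssize_t)offset;
}
```

Because an explicit offset is passed, the file position of in_fd is left
untouched and the data lands at the current position of out_fd.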
On Tue, Jul 16, 2024 at 06:14:48PM +0800, Huan Yang wrote:
>
> On 2024/7/16 17:31, Daniel Vetter wrote:
> >
> > On Tue, Jul 16, 2024 at 10:48:40AM +0800, Huan Yang wrote:
> > > I just researched udmabuf. Please correct me if I'm wrong.
> > >
> > > On 2024/7/15 20:32, Christian König wrote:
> > > > On 15.07.24 at 11:11, Daniel Vetter wrote:
> > > > > On Thu, Jul 11, 2024 at 11:00:02AM +0200, Christian König wrote:
> > > > > > On 11.07.24 at 09:42, Huan Yang wrote:
> > > > > > > Some users may need to load a file into a dma-buf. The current
> > > > > > > way is:
> > > > > > > 1. allocate a dma-buf, get dma-buf fd
> > > > > > > 2. mmap dma-buf fd into vaddr
> > > > > > > 3. read(file_fd, vaddr, fsz)
> > > > > > > This is too heavy if fsz reaches gigabytes.
> > > > > > You need to describe a bit more why that is too heavy. I can only
> > > > > > assume you
> > > > > > need to save memory bandwidth and avoid the extra copy with the CPU.
> > > > > >
> > > > > > > This patch implements a feature called DMA_HEAP_IOCTL_ALLOC_READ_FILE.
> > > > > > > The user needs to supply a file_fd to load into the
> > > > > > > dma-buf; then,
> > > > > > > it promises that if you get a dma-buf fd, it will contain the file content.
> > > > > > Interesting idea, that has at least more potential than trying
> > > > > > to enable
> > > > > > direct I/O on mmap()ed DMA-bufs.
> > > > > >
> > > > > > The approach with the new IOCTL might not work because it is a very
> > > > > > specialized use case.
> > > > > >
> > > > > > But IIRC there was a copy_file_range callback in the file_operations
> > > > > > structure you could use for that. I'm just not sure when and how
> > > > > > that's used
> > > > > > with the copy_file_range() system call.
> > > > > I'm not sure any of those help, because internally they're all still
> > > > > based
> > > > > on struct page (or maybe in the future on folios). And that's the thing
> > > > > > dma-buf can't give you, at least without peeking behind the curtain.
> > > > > >
> > > > > > I think an entirely different option would be malloc+udmabuf. That
> > > > > > essentially handles the impedance mismatch between direct I/O and
> > > > > dma-buf
> > > > > on the dma-buf side. The downside is that it'll make the permanently
> > > > > pinned memory accounting and tracking issues even more apparent, but I
> > > > > guess eventually we do need to sort that one out.
> > > > Oh, very good idea!
> > > > Just one minor correction: it's not malloc+udmabuf, but rather
> > > > create_memfd()+udmabuf.
> > Hm right, it's create_memfd() + mmap(memfd) + udmabuf
> >
> > > > And you need to complete your direct I/O before creating the udmabuf
> > > > since that reference will prevent direct I/O from working.
> > > udmabuf pins all pages, so once the fd is returned, direct I/O can't be
> > > triggered (same as dma-buf). So the read must complete before pinning.
> > Why does pinning prevent direct I/O? I haven't tested, but I'd expect the
> > rdma folks would be really annoyed if that's the case ...
> >
> > > But the current code uses `memfd_pin_folios` to speed up allocation and
> > > pinning, so this may need to be adapted.
> > >
> > >
> > > I currently doubt that the udmabuf solution is suitable for our
> > > gigabyte-level read operations.
> > >
> > > 1. The current mmap operation uses faulting, so frequent page faults will be
> > > triggered during reads, resulting in a lot of context switching overhead.
> > >
> > > 2. The current udmabuf size limit is 64MB. Even if that can be changed, it
> > > may not be suitable for large sizes?
> > Yeah that's just a figleaf so we don't have to bother about the accounting
> > issue.
> >
> > > 3. The migration and adaptation of the driver is also a challenge, and
> > > currently, we are unable to control it.
> > Why does a udmabuf fd not work instead of any other dmabuf fd? That
> > shouldn't matter for the consuming driver ...
>
> Hmm, the drivers in our products are provided by other OEMs. I see many of
> them implement their own dma_buf_ops. These may not be generic and may need
> to be reimplemented.
Yeah, for exporting a buffer object allocated by that driver. But any
competent gles/vk stack also supports importing dma-buf, and that should
work with udmabuf exactly the same way as with a dma-buf allocated from
the system heap.
> > > Perhaps implementing `copy_file_range` would be more suitable for us.
> > See my other mail, fundamentally these all rely on struct page being
> > present, and dma-buf doesn't give you that. Which means you need to go
> > below the dma-buf abstraction. And udmabuf is pretty much the thing for
> > that, because it wraps normal struct page memory into a dmabuf.
> Yes, udmabuf gives this. I am very interested in whether the pages provided by
> udmabuf can trigger direct I/O.
>
> So, I'll test it and report soon.
> >
> > And copy_file_range on the underlying memfd might already work, I haven't
> > checked though.
>
> I have doubts.
>
> I recently tested and found that I need to modify many places in
> vfs_copy_file_range in order to run copy_file_range with a DMA_BUF fd. (I
> have managed to get it working,
I'm talking about memfd, not dma-buf here. I think copy_file_range to
dma-buf is as architecturally unsound as allowing O_DIRECT on the dma-buf
mmap.
Cheers, Sima
> but I don't think the implementation is good enough, so I can't provide the
> source code.)
>
> Maybe memfd will work, maybe not; let's test it. :)
>
> Anyway, it's a good idea too. I currently need to focus on whether it can be
> achieved, as well as the performance comparison.
>
> >
> > Cheers, Sima
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch/
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Tue, Jul 16, 2024 at 10:48:40AM +0800, Huan Yang wrote:
> I just researched udmabuf. Please correct me if I'm wrong.
>
> On 2024/7/15 20:32, Christian König wrote:
> > On 15.07.24 at 11:11, Daniel Vetter wrote:
> > > On Thu, Jul 11, 2024 at 11:00:02AM +0200, Christian König wrote:
> > > > On 11.07.24 at 09:42, Huan Yang wrote:
> > > > > Some users may need to load a file into a dma-buf. The current
> > > > > way is:
> > > > > 1. allocate a dma-buf, get dma-buf fd
> > > > > 2. mmap dma-buf fd into vaddr
> > > > > 3. read(file_fd, vaddr, fsz)
> > > > > This is too heavy if fsz reaches gigabytes.
> > > > You need to describe a bit more why that is too heavy. I can only
> > > > assume you
> > > > need to save memory bandwidth and avoid the extra copy with the CPU.
> > > >
> > > > > This patch implements a feature called DMA_HEAP_IOCTL_ALLOC_READ_FILE.
> > > > > The user needs to supply a file_fd to load into the
> > > > > dma-buf; then,
> > > > > it promises that if you get a dma-buf fd, it will contain the file content.
> > > > Interesting idea, that has at least more potential than trying
> > > > to enable
> > > > direct I/O on mmap()ed DMA-bufs.
> > > >
> > > > The approach with the new IOCTL might not work because it is a very
> > > > specialized use case.
> > > >
> > > > But IIRC there was a copy_file_range callback in the file_operations
> > > > structure you could use for that. I'm just not sure when and how
> > > > that's used
> > > > with the copy_file_range() system call.
> > > I'm not sure any of those help, because internally they're all still
> > > based
> > > on struct page (or maybe in the future on folios). And that's the thing
> > > dma-buf can't give you, at least without peeking behind the curtain.
> > >
> > > I think an entirely different option would be malloc+udmabuf. That
> > > essentially handles the impedance mismatch between direct I/O and
> > > dma-buf
> > > on the dma-buf side. The downside is that it'll make the permanently
> > > pinned memory accounting and tracking issues even more apparent, but I
> > > guess eventually we do need to sort that one out.
> >
> > Oh, very good idea!
> > Just one minor correction: it's not malloc+udmabuf, but rather
> > create_memfd()+udmabuf.
Hm right, it's create_memfd() + mmap(memfd) + udmabuf
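
A minimal userspace sketch of that sequence, under stated assumptions, could
look like the code below. The actual system call is memfd_create(2); the
helper names load_file_into_memfd and memfd_to_udmabuf are made up for this
example. It fills the memfd with a plain buffered read() before creating the
udmabuf, so it does not settle whether O_DIRECT works once the pages are
pinned.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/udmabuf.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the file at 'path' into a new memfd.
 * Returns the memfd (or -1); *size_out gets the page-rounded size.
 */
static int load_file_into_memfd(const char *path, size_t *size_out)
{
	struct stat st;
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t fsz, size, done = 0;
	char *map;
	int file_fd, memfd = -1;

	file_fd = open(path, O_RDONLY | O_CLOEXEC);
	if (file_fd < 0)
		return -1;
	if (fstat(file_fd, &st) < 0 || st.st_size == 0)
		goto err;
	fsz = (size_t)st.st_size;
	/* udmabuf expects a page-aligned offset and size. */
	size = (fsz + page - 1) / page * page;

	memfd = memfd_create("file-backing", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (memfd < 0 || ftruncate(memfd, (off_t)size) < 0)
		goto err;
	/* udmabuf requires F_SEAL_SHRINK so the backing cannot shrink. */
	if (fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK) < 0)
		goto err;

	map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
	if (map == MAP_FAILED)
		goto err;
	while (done < fsz) {
		ssize_t n = read(file_fd, map + done, fsz - done);
		if (n <= 0)
			break;
		done += (size_t)n;
	}
	munmap(map, size);
	if (done != fsz)
		goto err;

	close(file_fd);
	*size_out = size;
	return memfd;
err:
	if (memfd >= 0)
		close(memfd);
	close(file_fd);
	return -1;
}

/* Wrap the memfd pages into a dma-buf; returns the dma-buf fd or -1. */
static int memfd_to_udmabuf(int memfd, size_t size)
{
	struct udmabuf_create create;
	int dev, buf_fd;

	dev = open("/dev/udmabuf", O_RDWR | O_CLOEXEC);
	if (dev < 0)
		return -1;
	memset(&create, 0, sizeof(create));
	create.memfd = (__u32)memfd;
	create.flags = UDMABUF_FLAGS_CLOEXEC;
	create.offset = 0;
	create.size = size;
	buf_fd = ioctl(dev, UDMABUF_CREATE, &create);
	close(dev);
	return buf_fd;
}
```

The fd returned by memfd_to_udmabuf() is an ordinary dma-buf fd that an
importing driver can consume.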
> > And you need to complete your direct I/O before creating the udmabuf
> > since that reference will prevent direct I/O from working.
>
> udmabuf pins all pages, so once the fd is returned, direct I/O can't be
> triggered (same as dma-buf). So the read must complete before pinning.
Why does pinning prevent direct I/O? I haven't tested, but I'd expect the
rdma folks would be really annoyed if that's the case ...
> But the current code uses `memfd_pin_folios` to speed up allocation and
> pinning, so this may need to be adapted.
>
>
> I currently doubt that the udmabuf solution is suitable for our
> gigabyte-level read operations.
>
> 1. The current mmap operation uses faulting, so frequent page faults will be
> triggered during reads, resulting in a lot of context switching overhead.
>
> 2. The current udmabuf size limit is 64MB. Even if that can be changed, it
> may not be suitable for large sizes?
Yeah that's just a figleaf so we don't have to bother about the accounting
issue.
> 3. The migration and adaptation of the driver is also a challenge, and
> currently, we are unable to control it.
Why does a udmabuf fd not work instead of any other dmabuf fd? That
shouldn't matter for the consuming driver ...
> Perhaps implementing `copy_file_range` would be more suitable for us.
See my other mail, fundamentally these all rely on struct page being
present, and dma-buf doesn't give you that. Which means you need to go
below the dma-buf abstraction. And udmabuf is pretty much the thing for
that, because it wraps normal struct page memory into a dmabuf.
And copy_file_range on the underlying memfd might already work, I haven't
checked though.
Cheers, Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Hi,
On Wed, Jul 10, 2024 at 1:17 AM Amirreza Zarrabi
<quic_azarrabi(a)quicinc.com> wrote:
>
>
>
> On 7/3/2024 9:36 PM, Dmitry Baryshkov wrote:
> > On Tue, Jul 02, 2024 at 10:57:35PM GMT, Amirreza Zarrabi wrote:
> >> Qualcomm TEE hosts Trusted Applications (TAs) and services that run in
> >> the secure world. Access to these resources is provided using MinkIPC.
> >> MinkIPC is a capability-based synchronous message passing facility. It
> >> allows code executing in one domain to invoke objects running in other
> >> domains. When a process holds a reference to an object that lives in
> >> another domain, that object reference is a capability. Capabilities
> >> allow us to separate implementation of policies from implementation of
> >> the transport.
> >>
> >> As part of the upstreaming of the object invoke driver (called SMC-Invoke
> >> driver), we need to provide a reasonable kernel API and UAPI. The clear
> >> option is to use TEE subsystem and write a back-end driver, however the
> >> TEE subsystem doesn't fit with the design of Qualcomm TEE.
> >>
>
> To answer your "general comment", maybe a bit of background :).
>
> Traditionally, policy enforcement is based on access-control models,
> either (1) access-control list or (2) capability [0]. A capability is an
> opaque ("non-forge-able") object reference that grants the holder the
> right to perform certain operations on the object (e.g. Read, Write,
> Execute, or Grant). Capabilities are the preferred mechanism for representing
> a policy, due to their fine-grained representation of access rights, in line
> with
> (P1) the principle of least privilege [1], and
> (P2) the ability to avoid the confused deputy problem [2].
>
> [0] Jack B. Dennis and Earl C. Van Horn. 1966. Programming Semantics for
> Multiprogrammed Computations. Commun. ACM 9 (1966), 143–155.
>
> [1] Jerome H. Saltzer and Michael D. Schroeder. 1975. The Protection of
> Information in Computer Systems. Proc. IEEE 63 (1975), 1278–1308.
>
> [2] Norm Hardy. 1988. The Confused Deputy (or Why Capabilities Might Have
> Been Invented). ACM Operating Systems Review 22, 4 (1988), 36–38.
>
> For MinkIPC, an object represents a TEE or TA service. The reference to
> the object is the "handle" that is returned from TEE (let's call it
> TEE-Handle). The supported operations are "service invocation" (similar
> to Execute), and "sharing access to a service" (similar to Grant).
> Anyone with access to the TEE-Handle can invoke the service or pass the
> TEE-Handle to someone else to access the same service.
>
> The responsibility of the MinkIPC framework is to hide the TEE-Handle,
> so that the client can not forge it, and allow the owner of the handle
> to transfer it to other clients as it wishes. Using a file descriptor
> table we can achieve that. We wrap the TEE-Handle as a FD and let the
> client invoke FD (e.g. using IOCTL), or transfer the FD (e.g. using
> UNIX socket).
>
> As a side note, for the sake of completeness, capabilities are fundamentally
> a "discretionary mechanism", as the holder of the object reference has the
> ability to share it with others. A secure system requires "mandatory
> enforcement" (i.e. ability to revoke authority and ability to control
> the authority propagation). This is out of scope for the MinkIPC.
> MinkIPC is only interested in P1 and P2 (mentioned above).
This is still quite abstract. We have tried to avoid inventing yet
another IPC mechanism in the TEE subsystem. But that's not written in
stone if it turns out there's a use case that needs it.
>
>
> >> Does TEE subsystem fit requirements of a capability based system?
> >> -----------------------------------------------------------------
> >> In TEE subsystem, to invoke a function:
> >> - client should open a device file "/dev/teeX",
> >> - create a session with a TA, and
> >> - invoke the functions in that session.
> >>
> >> 1. The privilege to invoke a function is determined by a session. If a
> >> client has a session, it cannot share it with other clients. Even if
> >> it does, it is not fine-grained enough, i.e. either all accessible
> >> functions/resources in a session or none. Assume a scenario where a client
> >> wants to grant another client permission to invoke just one function that
> >> it has the rights to.
> >>
> >> The "all or nothing" for sharing sessions is not in line with our
> >> capability system: "if you own a capability, you should be able to grant
> >> or share it".
> >
> > Can you please be more specific here? What kind of sharing is expected
> > on the user side of it?
>
> In MinkIPC, after authenticating a client credential, a TA (or TEE) may
> return multiple TEE-Handles, each representing a service that the client
> has privilege to access. The client should be able to "individually"
> reference each TEE-Handle, e.g. to invoke and share it (as per capability-
> based system requirements).
>
> If we use TEE subsystem, which has a session based design, all TEE-Handles
> are meaningful with respect to the session in which they are allocated,
> hence the use of "__u32 session" in "struct tee_ioctl_invoke_arg".
>
> Here, we have a contradiction with MinkIPC. We may ignore the session
> and say "even though a TEE-Handle is allocated in a session, it is also
> valid outside a session", i.e. the session-id in the TEE uapi becomes redundant
> (a case of divergence from definition).
Only the backend drivers put a meaning to a session, the TEE subsystem
doesn't enforce anything. All fields but num_params and params in
struct tee_ioctl_invoke_arg are only interpreted by the backend driver
if I recall correctly. Using the fields for something completely
different would be confusing so if struct tee_ioctl_invoke_arg isn't
matching well enough we might need a new IOCTL for whatever you have
in mind.
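
To make the field discussion concrete: the generic invoke argument in
<linux/tee.h> carries func, session, cancel_id, ret, ret_origin, num_params
and a trailing params[] array. A minimal userspace sketch of filling it might
look like this. make_invoke_arg is a made-up helper name, and what the
session and func values mean is up to the backend driver.

```c
#include <linux/tee.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate an invoke argument with 'num_params'
 * empty parameter slots and describe it in 'buf' for the ioctl.
 * The caller frees the returned pointer.
 */
static struct tee_ioctl_invoke_arg *
make_invoke_arg(uint32_t session, uint32_t func, uint32_t num_params,
		struct tee_ioctl_buf_data *buf)
{
	size_t len = sizeof(struct tee_ioctl_invoke_arg) +
		     (size_t)num_params * sizeof(struct tee_ioctl_param);
	struct tee_ioctl_invoke_arg *arg = calloc(1, len);
	uint32_t i;

	if (!arg)
		return NULL;

	arg->func = func;		/* interpreted by the backend driver */
	arg->session = session;		/* interpreted by the backend driver */
	arg->num_params = num_params;	/* checked by the TEE subsystem */
	for (i = 0; i < num_params; i++)
		arg->params[i].attr = TEE_IOCTL_PARAM_ATTR_TYPE_NONE;

	buf->buf_ptr = (__u64)(uintptr_t)arg;
	buf->buf_len = len;
	return arg;
}
```

The filled tee_ioctl_buf_data would then be passed as
ioctl(tee_fd, TEE_IOC_INVOKE, &buf) on an open /dev/teeX descriptor.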
>
> >
> >> 2. In TEE subsystem, resources are managed in a context. Every time a
> >> client opens "/dev/teeX", a new context is created to keep track of
> >> the allocated resources, including opened sessions and remote objects. Any
> >> effort for sharing resources between two independent clients requires
> >> involvement of context manager, i.e. the back-end driver. This requires
> >> implementing some form of policy in the back-end driver.
> >
> > What kind of resource sharing?
>
> TEE subsystem "rightfully" allocates a context each time a client opens
> a device file. This context is passed around to the backend driver to identify
> independent clients that opened the device file.
>
> The context is used by the backend driver to keep track of the resources. The
> types of resources are TEE driver dependent. As an example of a resource in the
> TEE subsystem, you can look into 'shm' register and unregister (especially,
> see the comment in function 'shm_alloc_helper').
>
> For MinkIPC, all clients are treated the same and the TEE-Handles are
> representative of the resources, accessible "globally" if a client has the
> capability for them. In kernel, clients access an object if they have
> access to "qcom_tee_object", in userspace, clients access an object if
> they have the FD wrapper for the TEE-Handle.
So if a client has a file descriptor representing a TEE-Handle, then
it has the capability to access a TEE-object? Is the kernel
controlling anything more about these capabilities?
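
For context, handing a file descriptor to another process over a UNIX socket
uses the standard SCM_RIGHTS ancillary message. The sketch below shows that
mechanism only; send_fd and recv_fd are made-up helper names, and the
descriptor being passed stands in for an FD that wraps a TEE-Handle (the
proposed driver's own ioctl interface is not shown).

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send 'fd' over the connected UNIX socket 'sock'. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char dummy = 'x';	/* at least one byte of real data is needed */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	memset(&ctrl, 0, sizeof(ctrl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor from 'sock'. Returns the new fd or -1. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file,
so holding the FD is what grants access; the sender keeps its own copy.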
>
> If we use context, instead of the file descriptor table, any form of object
> transfer requires involvement of the backend driver. If we use the file
> descriptor table, contexts become useless for MinkIPC (i.e.
> 'ctx->data' will "always" be null).
You still need to open a device to be able to create TEE-handles.
>
> >
> >> 3. The TEE subsystem supports two types of memory sharing:
> >> - per-device memory pools, and
> >> - user defined memory references.
> >> User defined memory references are private to the application and cannot
> >> be shared. Memory allocated from per-device "shared" pools is accessible
> >> using a file descriptor. It can be mapped by any process if it has
> >> access to it. This means we cannot provide resource isolation
> >> between two clients. Assume a scenario where a client wants to allocate
> >> memory (which is shared with TEE) from an "isolated" pool and share it
> >> with another client, without the right to access the contents of memory.
> >
> > This doesn't explain why it would want to share such memory with
> > another client.
>
> Ok, I believe there is a misunderstanding here. I did not try to justify
> a specific use case. We want to separate the memory allocation from the
> framework. This way, how the memory is obtained, e.g. whether it is allocated
> (1) from an isolated pool, (2) a shared pool, (3) a secure heap,
> (4) a system dma-heap, (5) process address space, or (6) other memory
> with "different constraints", becomes independent of the framework.
Especially points 3 and 4 are of great interest for the TEE Subsystem.
>
> We introduced "memory object" type. User implements a kernel service
> using "qcom_tee_object" to represent the memory object. We have an
> implementation of memory objects based on dma-buf.
Do you have an idea of what it would take to extend the TEE subsystem
to cover this?
>
> >
> >> 4. The kernel API provided by TEE subsystem does not support a kernel
> >> supplicant. Adding support requires an execution context (e.g. a
> >> kernel thread) due to the TEE subsystem design. tee_driver_ops supports
> >> only "send" and "receive" callbacks and to deliver a request, someone
> >> should wait on "receive".
So far we haven't needed a kernel thread, but if you need one feel
free to propose something.
> >
> > There is nothing wrong here, but maybe I'm misunderstanding something.
>
> I agree. But, I am trying to re-emphasize how useful TEE subsystem is
> for MinkIPC. For kernel services, we solely rely on the backend driver.
> For instance, to expose RPMB service we will use "qcom_tee_object".
> So there is nothing provided by the framework to simplify the service
> development.
The same is true for all backend drivers.
>
> >
> >> We need a callback to "dispatch" or "handle" a request in the context of
> >> the client thread. It should redirect a request to a kernel service or
> >> a user supplicant. In TEE subsystem such requirement should be implemented
> >> in TEE back-end driver, independent from the TEE subsystem.
> >>
> >> 5. The UAPI provided by TEE subsystem is similar to the GPTEE Client
> >> interface. This interface is not suitable for a capability system.
> >> For instance, there is no session in a capability system which means
> >> either it should not be used, or we should overload its definition.
Not using the session field doesn't seem like such a big obstacle.
Overloading it for something different might be messy. We can add a
new IOCTL if needed as I mentioned above.
> >
> > General comment: maybe adding more detailed explanation of how the
> > capabilities are acquired and how they can be used might make sense.
> >
> > BTW. It might be my imperfect English, but each time I see the word
> > 'capability' I'm thinking that someone is capable of doing something. I
> > find it hard to use 'capability' for the reference to another object.
> >
>
> Explained at the top :).
>
> >>
> >> Can we use TEE subsystem?
> >> -------------------------
> >> There are workarounds for some of the issues above. The question is if we
> >> should define our own UAPI or try to use a hack-y way of fitting into
> >> the TEE subsystem. I am using the word hack-y, as most of the workarounds
> >> involve:
Instead of hack-y workarounds, we should consider extending the TEE
subsystem as needed.
> >>
> >> - "diverging from the definition". For instance, ignoring the session
> >> open and close ioctl calls or using file descriptors for all remote
> >> resources (as an fd is the closest thing to a capability), which undermines the
> >> isolation provided by the contexts,
> >>
> >> - "overloading the variables". For instance, passing object ID as file
> >> descriptors in a place of session ID, or
struct qcom_tee_object_invoke_arg and struct tee_ioctl_invoke_arg are
quite similar, there are only a few more fields in the latter and we
are missing a TEE_IOCTL_PARAM_ATTR_TYPE_OBJECT. Does it make sense to
have a direction on objects?
> >>
> >> - "bypass TEE subsystem". For instance, extensively rely on meta
> >> parameters or push everything (e.g. kernel services) to the back-end
> >> driver, which means leaving almost all TEE subsystem unused.
The TEE subsystem is largely "bypassed" by all backend drivers, with
the exception of some SHM handling.
I'm sure the TEE subsystem can be extended to handle the "common" part
of SHM handling needed by QTEE.
> >>
> >> We cannot take the full benefits of TEE subsystem and may need to
> >> implement most of the requirements in the back-end driver. Also, as
> >> discussed above, the UAPI is not suitable for capability-based use cases.
> >> We proposed a new set of ioctl calls for SMC-Invoke driver.
> >>
> >> In this series we posted three patches. We implemented a transport
> >> driver that provides qcom_tee_object. Any object on secure side is
> >> represented with an instance of qcom_tee_object and any struct exposed
> >> to TEE should embed an instance of qcom_tee_object. Support for new
> >> services, e.g. memory object, RPMB, userspace clients or supplicants, is
> >> implemented independently from the driver.
> >>
> >> We have a simple memory object and a user driver that uses
> >> qcom_tee_object.
> >
> > Could you please point out any user of the uAPI? I'd like to understand
> > how it looks from the userspace point of view.
>
> Sure :), I'll write up a test patch and send it in next series.
>
> Summary.
>
> TEE framework provides some nice facilities, including:
> - uapi and ioctl interface,
> - marshaling parameters and context management,
> - memory mapping and sharing, and
> - TEE bus and TA drivers.
>
> For MinkIPC, we will not use any of them. The only usable piece is the uapi
> interface, which is not suitable for MinkIPC, as discussed above.
I hope that we can change that. :-)
For instance, extending the TEE subsystem with the memory-sharing QTEE
needs could be useful for other TEE drivers.
Cheers,
Jens
We already teach lockdep that dma_resv nests within drm_modeset_lock,
but there's a lot more: All drm kms ioctl rely on being able to
put/get_user while holding modeset locks, so we really need a
might_fault in there too to complete the picture. Add it.
Motivated by a syzbot report that blew up on bcachefs doing an
unconditional console_lock way deep in the locking hierarchy, and
lockdep only noticing the dependency loop in a drm ioctl instead of much
earlier. This annotation will make sure such issues have a much harder
time escaping.
References: https://lore.kernel.org/dri-devel/00000000000073db8b061cd43496@google.com/
Signed-off-by: Daniel Vetter <daniel.vetter(a)intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Maxime Ripard <mripard(a)kernel.org>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: linux-media(a)vger.kernel.org
Cc: linaro-mm-sig(a)lists.linaro.org
---
drivers/gpu/drm/drm_mode_config.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/drm_mode_config.c b/drivers/gpu/drm/drm_mode_config.c
index 568972258222..37d2e0a4ef4b 100644
--- a/drivers/gpu/drm/drm_mode_config.c
+++ b/drivers/gpu/drm/drm_mode_config.c
@@ -456,6 +456,8 @@ int drmm_mode_config_init(struct drm_device *dev)
if (ret == -EDEADLK)
ret = drm_modeset_backoff(&modeset_ctx);
+ might_fault();
+
ww_acquire_init(&resv_ctx, &reservation_ww_class);
ret = dma_resv_lock(&resv, &resv_ctx);
if (ret == -EDEADLK)
--
2.45.2
On 10.07.24 at 15:57, Lei Liu wrote:
> Use vm_insert_page to establish a mapping for the memory allocated
> by dmabuf, thus supporting direct I/O read and write; and fix the
> issue of incorrect memory statistics after mapping dmabuf memory.
Well big NAK to that! Direct I/O is intentionally disabled on DMA-bufs.
We already discussed enforcing that in the DMA-buf framework and this
patch probably means that we should really do that.
Regards,
Christian.
>
> Lei Liu (2):
> mm: dmabuf_direct_io: Support direct_io for memory allocated by dmabuf
> mm: dmabuf_direct_io: Fix memory statistics error for dmabuf allocated
> memory with direct_io support
>
> drivers/dma-buf/heaps/system_heap.c | 5 +++--
> fs/proc/task_mmu.c | 8 +++++++-
> include/linux/mm.h | 1 +
> mm/memory.c | 15 ++++++++++-----
> mm/rmap.c | 9 +++++----
> 5 files changed, 26 insertions(+), 12 deletions(-)
>
On 12.07.24 at 09:52, Huan Yang wrote:
>
> On 2024/7/12 15:41, Christian König wrote:
>> On 12.07.24 at 09:29, Huan Yang wrote:
>>> Hi Christian,
>>>
>>> On 2024/7/12 15:10, Christian König wrote:
>>>> On 12.07.24 at 04:14, Huan Yang wrote:
>>>>> On 2024/7/12 9:59, Huan Yang wrote:
>>>>>> Hi Christian,
>>>>>>
>>>>>> On 2024/7/11 19:39, Christian König wrote:
>>>>>>> On 11.07.24 at 11:18, Huan Yang wrote:
>>>>>>>> Hi Christian,
>>>>>>>>
>>>>>>>> Thanks for your reply.
>>>>>>>>
>>>>>>>> On 2024/7/11 17:00, Christian König wrote:
>>>>>>>>> On 11.07.24 at 09:42, Huan Yang wrote:
>>>>>>>>>> Some users may need to load a file into a dma-buf. The current
>>>>>>>>>> way is:
>>>>>>>>>> 1. allocate a dma-buf, get dma-buf fd
>>>>>>>>>> 2. mmap dma-buf fd into vaddr
>>>>>>>>>> 3. read(file_fd, vaddr, fsz)
>>>>>>>>>> This is too heavy if fsz reaches the GB range.
>>>>>>>>>
>>>>>>>>> You need to describe a bit more why that is too heavy. I can
>>>>>>>>> only assume you need to save memory bandwidth and avoid the
>>>>>>>>> extra copy with the CPU.
>>>>>>>>
>>>>>>>> Sorry for the oversimplified explanation. But, yes, you're
>>>>>>>> right, we want to avoid this.
>>>>>>>>
>>>>>>>> As we are dealing with embedded devices, the available memory
>>>>>>>> and computing power for users are usually limited. (The maximum
>>>>>>>> available memory is currently 24GB, typically ranging from
>>>>>>>> 8-12GB.)
>>>>>>>>
>>>>>>>> CPU computing power is also usually in short supply,
>>>>>>>> due to limited battery capacity and limited heat dissipation
>>>>>>>> capabilities.
>>>>>>>>
>>>>>>>> So, we hope to avoid ineffective paths as much as possible.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> This patch implements a feature called
>>>>>>>>>> DMA_HEAP_IOCTL_ALLOC_READ_FILE.
>>>>>>>>>> The user passes the file_fd of the file to load into the
>>>>>>>>>> dma-buf, and the ioctl guarantees that if a dma-buf fd is
>>>>>>>>>> returned, the dma-buf contains the file content.
>>>>>>>>>
>>>>>>>>> Interesting idea, that has at least more potential than trying
>>>>>>>>> to enable direct I/O on mmap()ed DMA-bufs.
>>>>>>>>>
>>>>>>>>> The approach with the new IOCTL might not work because it is a
>>>>>>>>> very specialized use case.
>>>>>>>>
>>>>>>>> Thank you for your advice. Maybe the "read file" behavior can
>>>>>>>> be attached to an existing allocation?
>>>>>>>
>>>>>>> The point is there are already system calls to do something like
>>>>>>> that.
>>>>>>>
>>>>>>> See copy_file_range()
>>>>>>> (https://man7.org/linux/man-pages/man2/copy_file_range.2.html)
>>>>>>> and sendfile()
>>>>>>> (https://man7.org/linux/man-pages/man2/sendfile.2.html).
>>>>>>
>>>>>> That's helpful to learn, thanks.
>>>>>>
>>>>>> In terms of only DMA-BUF supporting direct I/O,
>>>>>> copy_file_range/sendfile may help to achieve this functionality.
>>>>>>
>>>>>> However, my patchset also aims to achieve parallel copying of
>>>>>> file contents while allocating the DMA-BUF, which is something
>>>>>> that the current set of calls may not be able to accomplish.
>>>>
>>>> And exactly that is a no-go. Use the existing IOCTLs and system
>>>> calls instead they should have similar performance when done right.
>>>
>>> Got it, but in my testing, even without memory pressure, it
>>> takes about 60ms to allocate a 3GB DMA-BUF. When there is
>>> significant memory pressure, the allocation time for a 3GB
>>
>> Well exactly that doesn't make sense. Even if you read the content of
>> the DMA-buf from a file you still need to allocate it first.
>
> Yes, it needs to be allocated first, but in kernel space there is no need
> to wait until all memory is allocated before triggering the file load.
That doesn't really make sense. Allocating a large bunch of memory is
more efficient than allocating less multiple times because of cache
locality for example.
You could of course hide latency caused by operations to reduce memory
pressure when you have a specific use case, but you don't need to use an
in kernel implementation for that.
Question is do you have clear on allocation or clear on free enabled?
> This patchset works in batches (128MB by default): every time 128MB has
> been allocated, it is vmapped to get a vaddr, and a read of the file
> content at the corresponding offset into that vaddr is triggered.
Again that sounds really not ideal to me. Creating the vmap alone is
completely unnecessary overhead.
>> So the question is why should reading and allocating it at the same
>> time be better in any way?
>
> Memory pressure triggers reclaim, which the allocation has to wait for
> (milliseconds). Assume I have already allocated 512MB (of the 3GB needed)
> without entering the slowpath.
>
> Even if I need to enter the slowpath to allocate the remaining memory, the
> already allocated memory is being filled with file content in the meantime.
> (This saves time compared to allocating everything and then reading.)
>
> The time difference between them can be expressed by the formula:
>
> 1. dma-buf allocation time + file load time -- for the original path
>
> 2. first batch prepare time + max(file load time, time to allocate the
> remaining dma-buf) + last batch prepare time -- for the new path
>
> When the file reaches the gigabyte level, the significant difference
> between the two can be clearly observed.
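Written out as code, the two expressions above are easy to compare. A small
sketch only; the struct, the function names and any numbers plugged in are
hypothetical and are not measurements from this thread:

```c
#include <assert.h>

/* All times in milliseconds. Every name here is hypothetical. */
struct load_times {
	double alloc_total;	/* allocate the whole dma-buf up front */
	double file_load;	/* read the whole file */
	double first_batch;	/* allocate and prepare the first batch */
	double alloc_remain;	/* allocate everything after the first batch */
	double last_batch;	/* prepare the final batch */
};

static double max_ms(double a, double b)
{
	return a > b ? a : b;
}

/* Expression 1: allocate everything, then load the file. */
static double sequential_ms(const struct load_times *t)
{
	assert(t->alloc_total >= 0 && t->file_load >= 0);
	return t->alloc_total + t->file_load;
}

/* Expression 2: file load overlaps with the remaining allocation. */
static double overlapped_ms(const struct load_times *t)
{
	assert(t->first_batch >= 0 && t->last_batch >= 0);
	return t->first_batch +
	       max_ms(t->file_load, t->alloc_remain) +
	       t->last_batch;
}
```

If one additionally assumes alloc_total == first_batch + alloc_remain, the
difference between the two reduces to
min(file_load, alloc_remain) - last_batch, so the overlap can hide at most
the smaller of the two costs.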
I have strong doubts about that. The method you describe above is
actually really inefficient.
First of all you create a memory mapping just to load data, that is
superfluous and TLB flushes are usually extremely costly. Both for
userspace as well as kernel.
I strongly suggest to try to use copy_file_range() instead. But could be
that copy_file_range() doesn't even work right now because of some
restrictions, never tried that on a DMA-buf.
When that works as far as I can see what could still be saved on
overhead is the following:
1. Clearing of memory on allocation. That could potentially be done with
delayed allocation or clear on free instead.
2. CPU copy between the I/O target buffer and the DMA-buf backing pages.
In theory it should be possible to avoid that by implementing the
copy_file_range() callback, but I'm not 100% sure.
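For reference, the userspace side of the copy_file_range() suggestion could
look roughly like the sketch below. It assumes both descriptors support the
call; whether a DMA-buf fd is accepted as the destination is exactly the
open question above. copy_fd_range() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to `len` bytes from in_fd to out_fd inside
 * the kernel. copy_file_range() may copy fewer bytes than requested, so
 * loop until done or EOF. Returns the number of bytes copied, or -1 with
 * errno set.
 */
static ssize_t copy_fd_range(int in_fd, int out_fd, size_t len)
{
	size_t done = 0;

	assert(in_fd >= 0 && out_fd >= 0);

	while (done < len) {
		/* NULL offsets: use and advance each fd's own file position. */
		ssize_t n = copy_file_range(in_fd, NULL, out_fd, NULL,
					    len - done, 0);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)	/* EOF on the source file */
			break;
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

A caller would open the source file, obtain the destination fd, and call
copy_fd_range(file_fd, dst_fd, fsz); a return of -1 with EINVAL or
EOPNOTSUPP would indicate that the destination does not support it.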
Regards,
Christian.
>
>>
>> Regards,
>> Christian.
>>
>>>
>>>
>>> DMA-BUF can increase to 300ms-1s. (The above test times can also
>>> demonstrate the difference.)
>>>
>>> But talk is cheap; I agree to look into implementing it with the
>>> existing calls and to test it.
>>>
>>> I'll share the results when I'm done.
>>>
>>> Thanks for your suggestions.
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> As the cover letter shows, here is a comparison of the normal path
>>>>> and this IOCTL under memory pressure; even buffered I/O through this
>>>>> ioctl improves by about 50% thanks to the parallelism.
>>>>>
>>>>> dd a 3GB file for test, 12G RAM phone, UFS4.0, stressapptest 4G
>>>>> memory pressure.
>>>>>
>>>>> 1. original
>>>>> ```shell
>>>>> # create a model file
>>>>> dd if=/dev/zero of=./model.txt bs=1M count=3072
>>>>> # drop page cache
>>>>> echo 3 > /proc/sys/vm/drop_caches
>>>>> ./dmabuf-heap-file-read mtk_mm-uncached normal
>>>>>
>>>>>> result is total cost 13087213847ns
>>>>>
>>>>> ```
>>>>>
>>>>> 2. DMA_HEAP_IOCTL_ALLOC_AND_READ O_DIRECT
>>>>> ```shell
>>>>> # create a model file
>>>>> dd if=/dev/zero of=./model.txt bs=1M count=3072
>>>>> # drop page cache
>>>>> echo 3 > /proc/sys/vm/drop_caches
>>>>> ./dmabuf-heap-file-read mtk_mm-uncached direct_io
>>>>>
>>>>>> result is total cost 2902386846ns
>>>>>
>>>>> # use direct_io_check to verify that the content matches the file.
>>>>> ```
>>>>>
>>>>> 3. DMA_HEAP_IOCTL_ALLOC_AND_READ BUFFER I/O
>>>>> ```shell
>>>>> # create a model file
>>>>> dd if=/dev/zero of=./model.txt bs=1M count=3072
>>>>> # drop page cache
>>>>> echo 3 > /proc/sys/vm/drop_caches
>>>>> ./dmabuf-heap-file-read mtk_mm-uncached normal_io
>>>>>
>>>>>> result is total cost 5735579385ns
>>>>>
>>>>> ```
>>>>>
>>>>>>
>>>>>> Perhaps simply returning the DMA-BUF file descriptor and then
>>>>>> implementing copy_file_range, while populating the memory and
>>>>>> content during the copy process, could achieve this? At present,
>>>>>> it seems that it would be quite complex: we would need to ensure
>>>>>> that operations on the returned DMA-BUF file descriptor (mmap, vmap,
>>>>>> attach, and so on) fail while the memory is not yet filled.
>>>>>>
>>>>>>>
>>>>>>> What we probably could do is to internally optimize those.
>>>>>>>
>>>>>>>> I am currently creating a new ioctl to make it clear to the user
>>>>>>>> that memory is being allocated and read, and I am also unsure
>>>>>>>> whether it is appropriate to add additional parameters to the
>>>>>>>> existing allocate behavior.
>>>>>>>>
>>>>>>>> Please give me more suggestions. Thanks.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> But IIRC there was a copy_file_range callback in the
>>>>>>>>> file_operations structure you could use for that. I'm just not
>>>>>>>>> sure when and how that's used with the copy_file_range()
>>>>>>>>> system call.
>>>>>>>>
>>>>>>>> Sorry, I'm not familiar with this, but I will look into it.
>>>>>>>> However, this type of callback function is not currently
>>>>>>>> implemented when exporting the dma_buf file, which means that
>>>>>>>> I need to implement the callback for it?
>>>>>>>
>>>>>>> If I'm not completely mistaken the copy_file_range, splice_read
>>>>>>> and splice_write callbacks on the struct file_operations
>>>>>>> (https://elixir.bootlin.com/linux/v6.10-rc7/source/include/linux/fs.h#L1999)
>>>>>>> can be used to implement what you want to do.
>>>>>> Yes.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Note that the behavior of file_fd depends on how the user
>>>>>>>>>> opened the file, so both buffered
>>>>>>>>>> I/O and direct I/O are supported.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Huan Yang <link(a)vivo.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/dma-buf/dma-heap.c | 525
>>>>>>>>>> +++++++++++++++++++++++++++++++++-
>>>>>>>>>> include/linux/dma-heap.h | 57 +++-
>>>>>>>>>> include/uapi/linux/dma-heap.h | 32 +++
>>>>>>>>>> 3 files changed, 611 insertions(+), 3 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/dma-buf/dma-heap.c
>>>>>>>>>> b/drivers/dma-buf/dma-heap.c
>>>>>>>>>> index 2298ca5e112e..abe17281adb8 100644
>>>>>>>>>> --- a/drivers/dma-buf/dma-heap.c
>>>>>>>>>> +++ b/drivers/dma-buf/dma-heap.c
>>>>>>>>>> @@ -15,9 +15,11 @@
>>>>>>>>>> #include <linux/list.h>
>>>>>>>>>> #include <linux/slab.h>
>>>>>>>>>> #include <linux/nospec.h>
>>>>>>>>>> +#include <linux/highmem.h>
>>>>>>>>>> #include <linux/uaccess.h>
>>>>>>>>>> #include <linux/syscalls.h>
>>>>>>>>>> #include <linux/dma-heap.h>
>>>>>>>>>> +#include <linux/vmalloc.h>
>>>>>>>>>> #include <uapi/linux/dma-heap.h>
>>>>>>>>>> #define DEVNAME "dma_heap"
>>>>>>>>>> @@ -43,12 +45,462 @@ struct dma_heap {
>>>>>>>>>> struct cdev heap_cdev;
>>>>>>>>>> };
>>>>>>>>>> +/**
>>>>>>>>>> + * struct dma_heap_file - wrap the file, read task for
>>>>>>>>>> dma_heap allocate use.
>>>>>>>>>> + * @file: file to read from.
>>>>>>>>>> + *
>>>>>>>>>> + * @cred: kthread use, user cred copy to use for the
>>>>>>>>>> read.
>>>>>>>>>> + *
>>>>>>>>>> + * @max_batch: maximum batch size to read, if collect
>>>>>>>>>> match batch,
>>>>>>>>>> + * trigger read, default 128MB, must below file
>>>>>>>>>> size.
>>>>>>>>>> + *
>>>>>>>>>> + * @fsz: file size.
>>>>>>>>>> + *
>>>>>>>>>> + * @direct: use direct IO?
>>>>>>>>>> + */
>>>>>>>>>> +struct dma_heap_file {
>>>>>>>>>> + struct file *file;
>>>>>>>>>> + struct cred *cred;
>>>>>>>>>> + size_t max_batch;
>>>>>>>>>> + size_t fsz;
>>>>>>>>>> + bool direct;
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * struct dma_heap_file_work - represents a dma_heap file
>>>>>>>>>> read real work.
>>>>>>>>>> + * @vaddr: contigous virtual address alloc by vmap,
>>>>>>>>>> file read need.
>>>>>>>>>> + *
>>>>>>>>>> + * @start_size: file read start offset, same to
>>>>>>>>>> @dma_heap_file_task->roffset.
>>>>>>>>>> + *
>>>>>>>>>> + * @need_size: file read need size, same to
>>>>>>>>>> @dma_heap_file_task->rsize.
>>>>>>>>>> + *
>>>>>>>>>> + * @heap_file: file wrapper.
>>>>>>>>>> + *
>>>>>>>>>> + * @list: child node of @dma_heap_file_control->works.
>>>>>>>>>> + *
>>>>>>>>>> + * @refp: same @dma_heap_file_task->ref, if end of
>>>>>>>>>> read, put ref.
>>>>>>>>>> + *
>>>>>>>>>> + * @failp: if any work io failed, set it true, pointp
>>>>>>>>>> @dma_heap_file_task->fail.
>>>>>>>>>> + */
>>>>>>>>>> +struct dma_heap_file_work {
>>>>>>>>>> + void *vaddr;
>>>>>>>>>> + ssize_t start_size;
>>>>>>>>>> + ssize_t need_size;
>>>>>>>>>> + struct dma_heap_file *heap_file;
>>>>>>>>>> + struct list_head list;
>>>>>>>>>> + atomic_t *refp;
>>>>>>>>>> + bool *failp;
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * struct dma_heap_file_task - represents a dma_heap file
>>>>>>>>>> read process
>>>>>>>>>> + * @ref: current file work counter, if zero, allocate
>>>>>>>>>> and read
>>>>>>>>>> + * done.
>>>>>>>>>> + *
>>>>>>>>>> + * @roffset: last read offset, current prepared work'
>>>>>>>>>> begin file
>>>>>>>>>> + * start offset.
>>>>>>>>>> + *
>>>>>>>>>> + * @rsize: current allocated page size use to read,
>>>>>>>>>> if reach rbatch,
>>>>>>>>>> + * trigger commit.
>>>>>>>>>> + *
>>>>>>>>>> + * @rbatch: current prepared work's batch, below
>>>>>>>>>> @dma_heap_file's
>>>>>>>>>> + * batch.
>>>>>>>>>> + *
>>>>>>>>>> + * @heap_file: current dma_heap_file
>>>>>>>>>> + *
>>>>>>>>>> + * @parray: used for vmap, size is @dma_heap_file's
>>>>>>>>>> batch's number
>>>>>>>>>> + * pages.(this is maximum). Due to single thread
>>>>>>>>>> file read,
>>>>>>>>>> + * one page array reuse each work prepare is OK.
>>>>>>>>>> + * Each index in parray is PAGE_SIZE.(vmap need)
>>>>>>>>>> + *
>>>>>>>>>> + * @pindex: current allocated page filled in
>>>>>>>>>> @parray's index.
>>>>>>>>>> + *
>>>>>>>>>> + * @fail: any work failed when file read?
>>>>>>>>>> + *
>>>>>>>>>> + * dma_heap_file_task is the production of file read, will
>>>>>>>>>> prepare each work
>>>>>>>>>> + * during allocate dma_buf pages, if match current batch,
>>>>>>>>>> then trigger commit
>>>>>>>>>> + * and prepare next work. After all batch queued, user going
>>>>>>>>>> on prepare dma_buf
>>>>>>>>>> + * and so on, but before return dma_buf fd, need to wait
>>>>>>>>>> file read end and
>>>>>>>>>> + * check read result.
>>>>>>>>>> + */
>>>>>>>>>> +struct dma_heap_file_task {
>>>>>>>>>> + atomic_t ref;
>>>>>>>>>> + size_t roffset;
>>>>>>>>>> + size_t rsize;
>>>>>>>>>> + size_t rbatch;
>>>>>>>>>> + struct dma_heap_file *heap_file;
>>>>>>>>>> + struct page **parray;
>>>>>>>>>> + unsigned int pindex;
>>>>>>>>>> + bool fail;
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * struct dma_heap_file_control - global control of dma_heap
>>>>>>>>>> file read.
>>>>>>>>>> + * @works: @dma_heap_file_work's list head.
>>>>>>>>>> + *
>>>>>>>>>> + * @lock: only lock for @works.
>>>>>>>>>> + *
>>>>>>>>>> + * @threadwq: wait queue for @work_thread, if commit
>>>>>>>>>> work, @work_thread
>>>>>>>>>> + * wakeup and read this work's file contains.
>>>>>>>>>> + *
>>>>>>>>>> + * @workwq: used for main thread wait for file read
>>>>>>>>>> end, if allocation
>>>>>>>>>> + * end before file read. @dma_heap_file_task ref
>>>>>>>>>> effect this.
>>>>>>>>>> + *
>>>>>>>>>> + * @work_thread: file read kthread. the
>>>>>>>>>> dma_heap_file_task work's consumer.
>>>>>>>>>> + *
>>>>>>>>>> + * @heap_fwork_cachep: @dma_heap_file_work's cachep, it's
>>>>>>>>>> alloc/free frequently.
>>>>>>>>>> + *
>>>>>>>>>> + * @nr_work: global number of how many work committed.
>>>>>>>>>> + */
>>>>>>>>>> +struct dma_heap_file_control {
>>>>>>>>>> + struct list_head works;
>>>>>>>>>> + spinlock_t lock;
>>>>>>>>>> + wait_queue_head_t threadwq;
>>>>>>>>>> + wait_queue_head_t workwq;
>>>>>>>>>> + struct task_struct *work_thread;
>>>>>>>>>> + struct kmem_cache *heap_fwork_cachep;
>>>>>>>>>> + atomic_t nr_work;
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +static struct dma_heap_file_control *heap_fctl;
>>>>>>>>>> static LIST_HEAD(heap_list);
>>>>>>>>>> static DEFINE_MUTEX(heap_list_lock);
>>>>>>>>>> static dev_t dma_heap_devt;
>>>>>>>>>> static struct class *dma_heap_class;
>>>>>>>>>> static DEFINE_XARRAY_ALLOC(dma_heap_minors);
>>>>>>>>>> +/**
>>>>>>>>>> + * map_pages_to_vaddr - map each scatter page into
>>>>>>>>>> contiguous virtual address.
>>>>>>>>>> + * @heap_ftask: prepared and need to commit's work.
>>>>>>>>>> + *
>>>>>>>>>> + * Cached pages need to trigger file read, this function map
>>>>>>>>>> each scatter page
>>>>>>>>>> + * into contiguous virtual address, so that file read can
>>>>>>>>>> easy use.
>>>>>>>>>> + * Now that we get vaddr page, cached pages can return to
>>>>>>>>>> original user, so we
>>>>>>>>>> + * will not effect dma-buf export even if file read not end.
>>>>>>>>>> + */
>>>>>>>>>> +static void *map_pages_to_vaddr(struct dma_heap_file_task
>>>>>>>>>> *heap_ftask)
>>>>>>>>>> +{
>>>>>>>>>> + return vmap(heap_ftask->parray, heap_ftask->pindex, VM_MAP,
>>>>>>>>>> + PAGE_KERNEL);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +bool dma_heap_prepare_file_read(struct dma_heap_file_task
>>>>>>>>>> *heap_ftask,
>>>>>>>>>> + struct page *page)
>>>>>>>>>> +{
>>>>>>>>>> + struct page **array = heap_ftask->parray;
>>>>>>>>>> + int index = heap_ftask->pindex;
>>>>>>>>>> + int num = compound_nr(page), i;
>>>>>>>>>> + unsigned long sz = page_size(page);
>>>>>>>>>> +
>>>>>>>>>> + heap_ftask->rsize += sz;
>>>>>>>>>> + for (i = 0; i < num; ++i)
>>>>>>>>>> + array[index++] = &page[i];
>>>>>>>>>> + heap_ftask->pindex = index;
>>>>>>>>>> +
>>>>>>>>>> + return heap_ftask->rsize >= heap_ftask->rbatch;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static struct dma_heap_file_work *
>>>>>>>>>> +init_file_work(struct dma_heap_file_task *heap_ftask)
>>>>>>>>>> +{
>>>>>>>>>> + struct dma_heap_file_work *heap_fwork;
>>>>>>>>>> + struct dma_heap_file *heap_file = heap_ftask->heap_file;
>>>>>>>>>> +
>>>>>>>>>> + if (READ_ONCE(heap_ftask->fail))
>>>>>>>>>> + return NULL;
>>>>>>>>>> +
>>>>>>>>>> + heap_fwork =
>>>>>>>>>> kmem_cache_alloc(heap_fctl->heap_fwork_cachep, GFP_KERNEL);
>>>>>>>>>> + if (unlikely(!heap_fwork))
>>>>>>>>>> + return NULL;
>>>>>>>>>> +
>>>>>>>>>> + heap_fwork->vaddr = map_pages_to_vaddr(heap_ftask);
>>>>>>>>>> + if (unlikely(!heap_fwork->vaddr)) {
>>>>>>>>>> + kmem_cache_free(heap_fctl->heap_fwork_cachep, heap_fwork);
>>>>>>>>>> + return NULL;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + heap_fwork->heap_file = heap_file;
>>>>>>>>>> + heap_fwork->start_size = heap_ftask->roffset;
>>>>>>>>>> + heap_fwork->need_size = heap_ftask->rsize;
>>>>>>>>>> + heap_fwork->refp = &heap_ftask->ref;
>>>>>>>>>> + heap_fwork->failp = &heap_ftask->fail;
>>>>>>>>>> + atomic_inc(&heap_ftask->ref);
>>>>>>>>>> + return heap_fwork;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static void destroy_file_work(struct dma_heap_file_work
>>>>>>>>>> *heap_fwork)
>>>>>>>>>> +{
>>>>>>>>>> + vunmap(heap_fwork->vaddr);
>>>>>>>>>> + atomic_dec(heap_fwork->refp);
>>>>>>>>>> + wake_up(&heap_fctl->workwq);
>>>>>>>>>> +
>>>>>>>>>> + kmem_cache_free(heap_fctl->heap_fwork_cachep, heap_fwork);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +int dma_heap_submit_file_read(struct dma_heap_file_task
>>>>>>>>>> *heap_ftask)
>>>>>>>>>> +{
>>>>>>>>>> + struct dma_heap_file_work *heap_fwork =
>>>>>>>>>> init_file_work(heap_ftask);
>>>>>>>>>> + struct page *last = NULL;
>>>>>>>>>> + struct dma_heap_file *heap_file = heap_ftask->heap_file;
>>>>>>>>>> + size_t start = heap_ftask->roffset;
>>>>>>>>>> + struct file *file = heap_file->file;
>>>>>>>>>> + size_t fsz = heap_file->fsz;
>>>>>>>>>> +
>>>>>>>>>> + if (unlikely(!heap_fwork))
>>>>>>>>>> + return -ENOMEM;
>>>>>>>>>> +
>>>>>>>>>> + /**
>>>>>>>>>> + * If file size is not page aligned, direct io can't
>>>>>>>>>> process the tail.
>>>>>>>>>> + * So, if reach to tail, remain the last page use buffer
>>>>>>>>>> read.
>>>>>>>>>> + */
>>>>>>>>>> + if (heap_file->direct && start + heap_ftask->rsize > fsz) {
>>>>>>>>>> + heap_fwork->need_size -= PAGE_SIZE;
>>>>>>>>>> + last = heap_ftask->parray[heap_ftask->pindex - 1];
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + spin_lock(&heap_fctl->lock);
>>>>>>>>>> + list_add_tail(&heap_fwork->list, &heap_fctl->works);
>>>>>>>>>> + spin_unlock(&heap_fctl->lock);
>>>>>>>>>> + atomic_inc(&heap_fctl->nr_work);
>>>>>>>>>> +
>>>>>>>>>> + wake_up(&heap_fctl->threadwq);
>>>>>>>>>> +
>>>>>>>>>> + if (last) {
>>>>>>>>>> + char *buf, *pathp;
>>>>>>>>>> + ssize_t err;
>>>>>>>>>> + void *buffer;
>>>>>>>>>> +
>>>>>>>>>> + buf = kmalloc(PATH_MAX, GFP_KERNEL);
>>>>>>>>>> + if (unlikely(!buf))
>>>>>>>>>> + return -ENOMEM;
>>>>>>>>>> +
>>>>>>>>>> + start = PAGE_ALIGN_DOWN(fsz);
>>>>>>>>>> +
>>>>>>>>>> + pathp = file_path(file, buf, PATH_MAX);
>>>>>>>>>> + if (IS_ERR(pathp)) {
>>>>>>>>>> + kfree(buf);
>>>>>>>>>> + return PTR_ERR(pathp);
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + buffer = kmap_local_page(last); // use page's kaddr.
>>>>>>>>>> + err = kernel_read_file_from_path(pathp, start, &buffer,
>>>>>>>>>> + fsz - start, &fsz,
>>>>>>>>>> + READING_POLICY);
>>>>>>>>>> + kunmap_local(buffer);
>>>>>>>>>> + kfree(buf);
>>>>>>>>>> + if (err < 0) {
>>>>>>>>>> + pr_err("failed to use buffer kernel_read_file
>>>>>>>>>> %s, err=%ld, [%ld, %ld], f_sz=%ld\n",
>>>>>>>>>> + pathp, err, start, fsz, fsz);
>>>>>>>>>> +
>>>>>>>>>> + return err;
>>>>>>>>>> + }
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + heap_ftask->roffset += heap_ftask->rsize;
>>>>>>>>>> + heap_ftask->rsize = 0;
>>>>>>>>>> + heap_ftask->pindex = 0;
>>>>>>>>>> + heap_ftask->rbatch = min_t(size_t,
>>>>>>>>>> + PAGE_ALIGN(fsz) - heap_ftask->roffset,
>>>>>>>>>> + heap_ftask->rbatch);
>>>>>>>>>> + return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +bool dma_heap_wait_for_file_read(struct dma_heap_file_task
>>>>>>>>>> *heap_ftask)
>>>>>>>>>> +{
>>>>>>>>>> + wait_event_freezable(heap_fctl->workwq,
>>>>>>>>>> + atomic_read(&heap_ftask->ref) == 0);
>>>>>>>>>> + return heap_ftask->fail;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +bool dma_heap_destroy_file_read(struct dma_heap_file_task
>>>>>>>>>> *heap_ftask)
>>>>>>>>>> +{
>>>>>>>>>> + bool fail;
>>>>>>>>>> +
>>>>>>>>>> + dma_heap_wait_for_file_read(heap_ftask);
>>>>>>>>>> + fail = heap_ftask->fail;
>>>>>>>>>> + kvfree(heap_ftask->parray);
>>>>>>>>>> + kfree(heap_ftask);
>>>>>>>>>> + return fail;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +struct dma_heap_file_task *
>>>>>>>>>> +dma_heap_declare_file_read(struct dma_heap_file *heap_file)
>>>>>>>>>> +{
>>>>>>>>>> + struct dma_heap_file_task *heap_ftask =
>>>>>>>>>> + kzalloc(sizeof(*heap_ftask), GFP_KERNEL);
>>>>>>>>>> + if (unlikely(!heap_ftask))
>>>>>>>>>> + return NULL;
>>>>>>>>>> +
>>>>>>>>>> + /**
>>>>>>>>>> + * Batch is the maximum size which we prepare work will
>>>>>>>>>> meet.
>>>>>>>>>> + * So, direct alloc this number's page array is OK.
>>>>>>>>>> + */
>>>>>>>>>> + heap_ftask->parray = kvmalloc_array(heap_file->max_batch
>>>>>>>>>> >> PAGE_SHIFT,
>>>>>>>>>> + sizeof(struct page *), GFP_KERNEL);
>>>>>>>>>> + if (unlikely(!heap_ftask->parray))
>>>>>>>>>> + goto put;
>>>>>>>>>> +
>>>>>>>>>> + heap_ftask->heap_file = heap_file;
>>>>>>>>>> + heap_ftask->rbatch = heap_file->max_batch;
>>>>>>>>>> + return heap_ftask;
>>>>>>>>>> +put:
>>>>>>>>>> + kfree(heap_ftask);
>>>>>>>>>> + return NULL;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static void __work_this_io(struct dma_heap_file_work
>>>>>>>>>> *heap_fwork)
>>>>>>>>>> +{
>>>>>>>>>> + struct dma_heap_file *heap_file = heap_fwork->heap_file;
>>>>>>>>>> + struct file *file = heap_file->file;
>>>>>>>>>> + ssize_t start = heap_fwork->start_size;
>>>>>>>>>> + ssize_t size = heap_fwork->need_size;
>>>>>>>>>> + void *buffer = heap_fwork->vaddr;
>>>>>>>>>> + const struct cred *old_cred;
>>>>>>>>>> + ssize_t err;
>>>>>>>>>> +
>>>>>>>>>> + // use real task's cred to read this file.
>>>>>>>>>> + old_cred = override_creds(heap_file->cred);
>>>>>>>>>> + err = kernel_read_file(file, start, &buffer, size,
>>>>>>>>>> &heap_file->fsz,
>>>>>>>>>> + READING_POLICY);
>>>>>>>>>> + if (err < 0) {
>>>>>>>>>> + pr_err("use kernel_read_file, err=%ld, [%ld, %ld],
>>>>>>>>>> f_sz=%ld\n",
>>>>>>>>>> + err, start, (start + size), heap_file->fsz);
>>>>>>>>>> + WRITE_ONCE(*heap_fwork->failp, true);
>>>>>>>>>> + }
>>>>>>>>>> + // recovery to my cred.
>>>>>>>>>> + revert_creds(old_cred);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static int dma_heap_file_control_thread(void *data)
>>>>>>>>>> +{
>>>>>>>>>> + struct dma_heap_file_control *heap_fctl =
>>>>>>>>>> + (struct dma_heap_file_control *)data;
>>>>>>>>>> + struct dma_heap_file_work *worker, *tmp;
>>>>>>>>>> + int nr_work;
>>>>>>>>>> +
>>>>>>>>>> + LIST_HEAD(pages);
>>>>>>>>>> + LIST_HEAD(workers);
>>>>>>>>>> +
>>>>>>>>>> + while (true) {
>>>>>>>>>> + wait_event_freezable(heap_fctl->threadwq,
>>>>>>>>>> + atomic_read(&heap_fctl->nr_work) > 0);
>>>>>>>>>> +recheck:
>>>>>>>>>> + spin_lock(&heap_fctl->lock);
>>>>>>>>>> + list_splice_init(&heap_fctl->works, &workers);
>>>>>>>>>> + spin_unlock(&heap_fctl->lock);
>>>>>>>>>> +
>>>>>>>>>> + if (unlikely(kthread_should_stop())) {
>>>>>>>>>> + list_for_each_entry_safe(worker, tmp, &workers,
>>>>>>>>>> list) {
>>>>>>>>>> + list_del(&worker->list);
>>>>>>>>>> + destroy_file_work(worker);
>>>>>>>>>> + }
>>>>>>>>>> + break;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + nr_work = 0;
>>>>>>>>>> + list_for_each_entry_safe(worker, tmp, &workers, list) {
>>>>>>>>>> + ++nr_work;
>>>>>>>>>> + list_del(&worker->list);
>>>>>>>>>> + __work_this_io(worker);
>>>>>>>>>> +
>>>>>>>>>> + destroy_file_work(worker);
>>>>>>>>>> + }
>>>>>>>>>> + atomic_sub(nr_work, &heap_fctl->nr_work);
>>>>>>>>>> +
>>>>>>>>>> + if (atomic_read(&heap_fctl->nr_work) > 0)
>>>>>>>>>> + goto recheck;
>>>>>>>>>> + }
>>>>>>>>>> + return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +size_t dma_heap_file_size(struct dma_heap_file *heap_file)
>>>>>>>>>> +{
>>>>>>>>>> + return heap_file->fsz;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static int prepare_dma_heap_file(struct dma_heap_file
>>>>>>>>>> *heap_file, int file_fd,
>>>>>>>>>> + size_t batch)
>>>>>>>>>> +{
>>>>>>>>>> + struct file *file;
>>>>>>>>>> + size_t fsz;
>>>>>>>>>> + int ret;
>>>>>>>>>> +
>>>>>>>>>> + file = fget(file_fd);
>>>>>>>>>> + if (!file)
>>>>>>>>>> + return -EINVAL;
>>>>>>>>>> +
>>>>>>>>>> + fsz = i_size_read(file_inode(file));
>>>>>>>>>> + if (fsz < batch) {
>>>>>>>>>> + ret = -EINVAL;
>>>>>>>>>> + goto err;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + /**
>>>>>>>>>> + * Selinux block our read, but actually we are reading
>>>>>>>>>> the stand-in
>>>>>>>>>> + * for this file.
>>>>>>>>>> + * So save current's cred and when going to read,
>>>>>>>>>> override mine, and
>>>>>>>>>> + * end of read, revert.
>>>>>>>>>> + */
>>>>>>>>>> + heap_file->cred = prepare_kernel_cred(current);
>>>>>>>>>> + if (unlikely(!heap_file->cred)) {
>>>>>>>>>> + ret = -ENOMEM;
>>>>>>>>>> + goto err;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + heap_file->file = file;
>>>>>>>>>> + heap_file->max_batch = batch;
>>>>>>>>>> + heap_file->fsz = fsz;
>>>>>>>>>> +
>>>>>>>>>> + heap_file->direct = file->f_flags & O_DIRECT;
>>>>>>>>>> +
>>>>>>>>>> +#define DMA_HEAP_SUGGEST_DIRECT_IO_SIZE (1UL << 30)
>>>>>>>>>> + if (!heap_file->direct && fsz >=
>>>>>>>>>> DMA_HEAP_SUGGEST_DIRECT_IO_SIZE)
>>>>>>>>>> + pr_warn("alloc read file better to use O_DIRECT to
>>>>>>>>>> read large file\n");
>>>>>>>>>> +
>>>>>>>>>> + return 0;
>>>>>>>>>> +
>>>>>>>>>> +err:
>>>>>>>>>> + fput(file);
>>>>>>>>>> + return ret;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static void destroy_dma_heap_file(struct dma_heap_file
>>>>>>>>>> *heap_file)
>>>>>>>>>> +{
>>>>>>>>>> + fput(heap_file->file);
>>>>>>>>>> + put_cred(heap_file->cred);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static int dma_heap_buffer_alloc_read_file(struct dma_heap
>>>>>>>>>> *heap, int file_fd,
>>>>>>>>>> + size_t batch, unsigned int fd_flags,
>>>>>>>>>> + unsigned int heap_flags)
>>>>>>>>>> +{
>>>>>>>>>> + struct dma_buf *dmabuf;
>>>>>>>>>> + int fd;
>>>>>>>>>> + struct dma_heap_file heap_file;
>>>>>>>>>> +
>>>>>>>>>> + fd = prepare_dma_heap_file(&heap_file, file_fd, batch);
>>>>>>>>>> + if (fd)
>>>>>>>>>> + goto error_file;
>>>>>>>>>> +
>>>>>>>>>> + dmabuf = heap->ops->allocate_read_file(heap, &heap_file,
>>>>>>>>>> fd_flags,
>>>>>>>>>> + heap_flags);
>>>>>>>>>> + if (IS_ERR(dmabuf)) {
>>>>>>>>>> + fd = PTR_ERR(dmabuf);
>>>>>>>>>> + goto error;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + fd = dma_buf_fd(dmabuf, fd_flags);
>>>>>>>>>> + if (fd < 0) {
>>>>>>>>>> + dma_buf_put(dmabuf);
>>>>>>>>>> + /* just return, as put will call release and that
>>>>>>>>>> will free */
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> +error:
>>>>>>>>>> + destroy_dma_heap_file(&heap_file);
>>>>>>>>>> +error_file:
>>>>>>>>>> + return fd;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> static int dma_heap_buffer_alloc(struct dma_heap *heap,
>>>>>>>>>> size_t len,
>>>>>>>>>> u32 fd_flags,
>>>>>>>>>> u64 heap_flags)
>>>>>>>>>> @@ -93,6 +545,38 @@ static int dma_heap_open(struct inode
>>>>>>>>>> *inode, struct file *file)
>>>>>>>>>> return 0;
>>>>>>>>>> }
>>>>>>>>>> +static long dma_heap_ioctl_allocate_read_file(struct file
>>>>>>>>>> *file, void *data)
>>>>>>>>>> +{
>>>>>>>>>> + struct dma_heap_allocation_file_data
>>>>>>>>>> *heap_allocation_file = data;
>>>>>>>>>> + struct dma_heap *heap = file->private_data;
>>>>>>>>>> + int fd;
>>>>>>>>>> +
>>>>>>>>>> + if (heap_allocation_file->fd ||
>>>>>>>>>> !heap_allocation_file->file_fd)
>>>>>>>>>> + return -EINVAL;
>>>>>>>>>> +
>>>>>>>>>> + if (heap_allocation_file->fd_flags &
>>>>>>>>>> ~DMA_HEAP_VALID_FD_FLAGS)
>>>>>>>>>> + return -EINVAL;
>>>>>>>>>> +
>>>>>>>>>> + if (heap_allocation_file->heap_flags &
>>>>>>>>>> ~DMA_HEAP_VALID_HEAP_FLAGS)
>>>>>>>>>> + return -EINVAL;
>>>>>>>>>> +
>>>>>>>>>> + if (!heap->ops->allocate_read_file)
>>>>>>>>>> + return -EINVAL;
>>>>>>>>>> +
>>>>>>>>>> + fd = dma_heap_buffer_alloc_read_file(
>>>>>>>>>> + heap, heap_allocation_file->file_fd,
>>>>>>>>>> + heap_allocation_file->batch ?
>>>>>>>>>> + PAGE_ALIGN(heap_allocation_file->batch) :
>>>>>>>>>> + DEFAULT_ADI_BATCH,
>>>>>>>>>> + heap_allocation_file->fd_flags,
>>>>>>>>>> + heap_allocation_file->heap_flags);
>>>>>>>>>> + if (fd < 0)
>>>>>>>>>> + return fd;
>>>>>>>>>> +
>>>>>>>>>> + heap_allocation_file->fd = fd;
>>>>>>>>>> + return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> static long dma_heap_ioctl_allocate(struct file *file, void
>>>>>>>>>> *data)
>>>>>>>>>> {
>>>>>>>>>> struct dma_heap_allocation_data *heap_allocation = data;
>>>>>>>>>> @@ -121,6 +605,7 @@ static long
>>>>>>>>>> dma_heap_ioctl_allocate(struct file *file, void *data)
>>>>>>>>>> static unsigned int dma_heap_ioctl_cmds[] = {
>>>>>>>>>> DMA_HEAP_IOCTL_ALLOC,
>>>>>>>>>> + DMA_HEAP_IOCTL_ALLOC_AND_READ,
>>>>>>>>>> };
>>>>>>>>>> static long dma_heap_ioctl(struct file *file, unsigned
>>>>>>>>>> int ucmd,
>>>>>>>>>> @@ -170,6 +655,9 @@ static long dma_heap_ioctl(struct file
>>>>>>>>>> *file, unsigned int ucmd,
>>>>>>>>>> case DMA_HEAP_IOCTL_ALLOC:
>>>>>>>>>> ret = dma_heap_ioctl_allocate(file, kdata);
>>>>>>>>>> break;
>>>>>>>>>> + case DMA_HEAP_IOCTL_ALLOC_AND_READ:
>>>>>>>>>> + ret = dma_heap_ioctl_allocate_read_file(file, kdata);
>>>>>>>>>> + break;
>>>>>>>>>> default:
>>>>>>>>>> ret = -ENOTTY;
>>>>>>>>>> goto err;
>>>>>>>>>> @@ -316,11 +804,44 @@ static int dma_heap_init(void)
>>>>>>>>>> dma_heap_class = class_create(DEVNAME);
>>>>>>>>>> if (IS_ERR(dma_heap_class)) {
>>>>>>>>>> - unregister_chrdev_region(dma_heap_devt,
>>>>>>>>>> NUM_HEAP_MINORS);
>>>>>>>>>> - return PTR_ERR(dma_heap_class);
>>>>>>>>>> + ret = PTR_ERR(dma_heap_class);
>>>>>>>>>> + goto fail_class;
>>>>>>>>>> }
>>>>>>>>>> dma_heap_class->devnode = dma_heap_devnode;
>>>>>>>>>> + heap_fctl = kzalloc(sizeof(*heap_fctl), GFP_KERNEL);
>>>>>>>>>> + if (unlikely(!heap_fctl)) {
>>>>>>>>>> + ret = -ENOMEM;
>>>>>>>>>> + goto fail_alloc;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + INIT_LIST_HEAD(&heap_fctl->works);
>>>>>>>>>> + init_waitqueue_head(&heap_fctl->threadwq);
>>>>>>>>>> + init_waitqueue_head(&heap_fctl->workwq);
>>>>>>>>>> +
>>>>>>>>>> + heap_fctl->work_thread =
>>>>>>>>>> kthread_run(dma_heap_file_control_thread,
>>>>>>>>>> + heap_fctl, "heap_fwork_t");
>>>>>>>>>> + if (IS_ERR(heap_fctl->work_thread)) {
>>>>>>>>>> + ret = -ENOMEM;
>>>>>>>>>> + goto fail_thread;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + heap_fctl->heap_fwork_cachep =
>>>>>>>>>> KMEM_CACHE(dma_heap_file_work, 0);
>>>>>>>>>> + if (unlikely(!heap_fctl->heap_fwork_cachep)) {
>>>>>>>>>> + ret = -ENOMEM;
>>>>>>>>>> + goto fail_cache;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> return 0;
>>>>>>>>>> +
>>>>>>>>>> +fail_cache:
>>>>>>>>>> + kthread_stop(heap_fctl->work_thread);
>>>>>>>>>> +fail_thread:
>>>>>>>>>> + kfree(heap_fctl);
>>>>>>>>>> +fail_alloc:
>>>>>>>>>> + class_destroy(dma_heap_class);
>>>>>>>>>> +fail_class:
>>>>>>>>>> + unregister_chrdev_region(dma_heap_devt, NUM_HEAP_MINORS);
>>>>>>>>>> + return ret;
>>>>>>>>>> }
>>>>>>>>>> subsys_initcall(dma_heap_init);
>>>>>>>>>> diff --git a/include/linux/dma-heap.h b/include/linux/dma-heap.h
>>>>>>>>>> index 064bad725061..9c25383f816c 100644
>>>>>>>>>> --- a/include/linux/dma-heap.h
>>>>>>>>>> +++ b/include/linux/dma-heap.h
>>>>>>>>>> @@ -12,12 +12,17 @@
>>>>>>>>>> #include <linux/cdev.h>
>>>>>>>>>> #include <linux/types.h>
>>>>>>>>>> +#define DEFAULT_ADI_BATCH (128 << 20)
>>>>>>>>>> +
>>>>>>>>>> struct dma_heap;
>>>>>>>>>> +struct dma_heap_file_task;
>>>>>>>>>> +struct dma_heap_file;
>>>>>>>>>> /**
>>>>>>>>>> * struct dma_heap_ops - ops to operate on a given heap
>>>>>>>>>> * @allocate: allocate dmabuf and return struct
>>>>>>>>>> dma_buf ptr
>>>>>>>>>> - *
>>>>>>>>>> + * @allocate_read_file: allocate dmabuf and read file, then
>>>>>>>>>> return struct
>>>>>>>>>> + * dma_buf ptr.
>>>>>>>>>> * allocate returns dmabuf on success, ERR_PTR(-errno) on
>>>>>>>>>> error.
>>>>>>>>>> */
>>>>>>>>>> struct dma_heap_ops {
>>>>>>>>>> @@ -25,6 +30,11 @@ struct dma_heap_ops {
>>>>>>>>>> unsigned long len,
>>>>>>>>>> u32 fd_flags,
>>>>>>>>>> u64 heap_flags);
>>>>>>>>>> +
>>>>>>>>>> + struct dma_buf *(*allocate_read_file)(struct dma_heap
>>>>>>>>>> *heap,
>>>>>>>>>> + struct dma_heap_file *heap_file,
>>>>>>>>>> + u32 fd_flags,
>>>>>>>>>> + u64 heap_flags);
>>>>>>>>>> };
>>>>>>>>>> /**
>>>>>>>>>> @@ -65,4 +75,49 @@ const char *dma_heap_get_name(struct
>>>>>>>>>> dma_heap *heap);
>>>>>>>>>> */
>>>>>>>>>> struct dma_heap *dma_heap_add(const struct
>>>>>>>>>> dma_heap_export_info *exp_info);
>>>>>>>>>> +/**
>>>>>>>>>> + * dma_heap_destroy_file_read - waits for a file read to
>>>>>>>>>> complete then destroy it
>>>>>>>>>> + * Returns: true if the file read failed, false otherwise
>>>>>>>>>> + */
>>>>>>>>>> +bool dma_heap_destroy_file_read(struct dma_heap_file_task
>>>>>>>>>> *heap_ftask);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * dma_heap_wait_for_file_read - waits for a file read to
>>>>>>>>>> complete
>>>>>>>>>> + * Returns: true if the file read failed, false otherwise
>>>>>>>>>> + */
>>>>>>>>>> +bool dma_heap_wait_for_file_read(struct dma_heap_file_task
>>>>>>>>>> *heap_ftask);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * dma_heap_declare_file_read - Declare a task to read file
>>>>>>>>>> when allocate pages.
>>>>>>>>>> + * @heap_file: target file to read
>>>>>>>>>> + *
>>>>>>>>>> + * Return NULL if failed, otherwise return a struct pointer.
>>>>>>>>>> + */
>>>>>>>>>> +struct dma_heap_file_task *
>>>>>>>>>> +dma_heap_declare_file_read(struct dma_heap_file *heap_file);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * dma_heap_prepare_file_read - cache each allocated page
>>>>>>>>>> until we meet this batch.
>>>>>>>>>> + * @heap_ftask: prepared and need to commit's work.
>>>>>>>>>> + * @page: current allocated page. don't care which
>>>>>>>>>> order.
>>>>>>>>>> + *
>>>>>>>>>> + * Returns true if reach to batch, false so go on prepare.
>>>>>>>>>> + */
>>>>>>>>>> +bool dma_heap_prepare_file_read(struct dma_heap_file_task
>>>>>>>>>> *heap_ftask,
>>>>>>>>>> + struct page *page);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * dma_heap_submit_file_read - prepare collect enough
>>>>>>>>>> memory, going to trigger IO
>>>>>>>>>> + * @heap_ftask: info that current IO needs
>>>>>>>>>> + *
>>>>>>>>>> + * This commit will also check if reach to tail read.
>>>>>>>>>> + * For direct I/O submissions, it is necessary to pay
>>>>>>>>>> attention to file reads
>>>>>>>>>> + * that are not page-aligned. For the unaligned portion of
>>>>>>>>>> the read, buffer IO
>>>>>>>>>> + * needs to be triggered.
>>>>>>>>>> + * Returns:
>>>>>>>>>> + * 0 if all right, -errno if something wrong
>>>>>>>>>> + */
>>>>>>>>>> +int dma_heap_submit_file_read(struct dma_heap_file_task
>>>>>>>>>> *heap_ftask);
>>>>>>>>>> +size_t dma_heap_file_size(struct dma_heap_file *heap_file);
>>>>>>>>>> +
>>>>>>>>>> #endif /* _DMA_HEAPS_H */
>>>>>>>>>> diff --git a/include/uapi/linux/dma-heap.h
>>>>>>>>>> b/include/uapi/linux/dma-heap.h
>>>>>>>>>> index a4cf716a49fa..8c20e8b74eed 100644
>>>>>>>>>> --- a/include/uapi/linux/dma-heap.h
>>>>>>>>>> +++ b/include/uapi/linux/dma-heap.h
>>>>>>>>>> @@ -39,6 +39,27 @@ struct dma_heap_allocation_data {
>>>>>>>>>> __u64 heap_flags;
>>>>>>>>>> };
>>>>>>>>>> +/**
>>>>>>>>>> + * struct dma_heap_allocation_file_data - metadata passed
>>>>>>>>>> from userspace for
>>>>>>>>>> + * allocations and read file
>>>>>>>>>> + * @fd: will be populated with a fd which
>>>>>>>>>> provides the
>>>>>>>>>> + * handle to the allocated dma-buf
>>>>>>>>>> + * @file_fd: file descriptor to read from(suggested
>>>>>>>>>> to use O_DIRECT open file)
>>>>>>>>>> + * @batch: how many memory alloced then file
>>>>>>>>>> read(bytes), default 128MB
>>>>>>>>>> + * will auto aligned to PAGE_SIZE
>>>>>>>>>> + * @fd_flags: file descriptor flags used when allocating
>>>>>>>>>> + * @heap_flags: flags passed to heap
>>>>>>>>>> + *
>>>>>>>>>> + * Provided by userspace as an argument to the ioctl
>>>>>>>>>> + */
>>>>>>>>>> +struct dma_heap_allocation_file_data {
>>>>>>>>>> + __u32 fd;
>>>>>>>>>> + __u32 file_fd;
>>>>>>>>>> + __u32 batch;
>>>>>>>>>> + __u32 fd_flags;
>>>>>>>>>> + __u64 heap_flags;
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> #define DMA_HEAP_IOC_MAGIC 'H'
>>>>>>>>>> /**
>>>>>>>>>> @@ -50,4 +71,15 @@ struct dma_heap_allocation_data {
>>>>>>>>>> #define DMA_HEAP_IOCTL_ALLOC _IOWR(DMA_HEAP_IOC_MAGIC, 0x0,\
>>>>>>>>>> struct dma_heap_allocation_data)
>>>>>>>>>> +/**
>>>>>>>>>> + * DOC: DMA_HEAP_IOCTL_ALLOC_AND_READ - allocate memory from
>>>>>>>>>> pool and both
>>>>>>>>>> + * read file when allocate memory.
>>>>>>>>>> + *
>>>>>>>>>> + * Takes a dma_heap_allocation_file_data struct and returns
>>>>>>>>>> it with the fd field
>>>>>>>>>> + * populated with the dmabuf handle of the allocation. When
>>>>>>>>>> return, the dma-buf
>>>>>>>>>> + * content is read from file.
>>>>>>>>>> + */
>>>>>>>>>> +#define DMA_HEAP_IOCTL_ALLOC_AND_READ \
>>>>>>>>>> + _IOWR(DMA_HEAP_IOC_MAGIC, 0x1, struct
>>>>>>>>>> dma_heap_allocation_file_data)
>>>>>>>>>> +
>>>>>>>>>> #endif /* _UAPI_LINUX_DMABUF_POOL_H */
>>>>>>>>>
>>>>>>>
>>>>
>>
Hi,
This series is the follow-up of the discussion that John and I had a few
months ago here:
https://lore.kernel.org/all/CANDhNCquJn6bH3KxKf65BWiTYLVqSd9892-xtFDHHqqyrr…
The initial problem we were discussing was that I'm currently working on
a platform which has a memory layout with ECC enabled. However, enabling
the ECC has a number of drawbacks on that platform: lower performance,
increased memory usage, etc. So for things like framebuffers, the
trade-off isn't great and thus there's a memory region with ECC disabled
to allocate from for such use cases.
After a suggestion from John, I chose to start using heap allocations
flags to allow for userspace to ask for a particular ECC setup. This is
then backed by a new heap type that runs from reserved memory chunks
flagged as such, and the existing DT properties to specify the ECC
properties.
We could also easily extend this mechanism to support more flags, or
through a new ioctl to discover which flags a given heap supports.
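As a rough illustration of how a heap could validate such flags, here is a minimal userspace sketch. The flag names, bit positions and the helper are hypothetical (this cover letter does not spell out the actual uAPI values); only the "reject unknown bits with -EINVAL" pattern mirrors the existing DMA_HEAP_VALID_HEAP_FLAGS check in dma-heap.c.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define EXAMPLE_HEAP_FLAG_ECC_PROTECTED   (1ULL << 0)
#define EXAMPLE_HEAP_FLAG_ECC_UNPROTECTED (1ULL << 1)
#define EXAMPLE_HEAP_VALID_FLAGS \
	(EXAMPLE_HEAP_FLAG_ECC_PROTECTED | EXAMPLE_HEAP_FLAG_ECC_UNPROTECTED)

/*
 * Decide whether a heap backed by a region with (or without) ECC can
 * satisfy the request. Returns 0 if it can, -EINVAL otherwise.
 */
static int example_check_ecc_flags(uint64_t heap_flags, bool region_has_ecc)
{
	bool want_ecc = heap_flags & EXAMPLE_HEAP_FLAG_ECC_PROTECTED;
	bool want_no_ecc = heap_flags & EXAMPLE_HEAP_FLAG_ECC_UNPROTECTED;

	/* Unknown bits are rejected, as for the existing heap flags. */
	if (heap_flags & ~EXAMPLE_HEAP_VALID_FLAGS)
		return -EINVAL;

	/* Asking for both setups at once is contradictory. */
	if (want_ecc && want_no_ecc)
		return -EINVAL;

	/* The request must match what the backing region provides. */
	if (want_ecc && !region_has_ecc)
		return -EINVAL;
	if (want_no_ecc && region_has_ecc)
		return -EINVAL;

	/* No ECC flag set: the caller does not care either way. */
	return 0;
}
```

With this shape, a heap backed by a non-ECC region would refuse a request that insists on ECC rather than silently handing out unprotected memory.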
I submitted a draft PR to the DT schema for the bindings used in this
PR:
https://github.com/devicetree-org/dt-schema/pull/138
Let me know what you think,
Maxime
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
Maxime Ripard (8):
dma-buf: heaps: Introduce a new heap for reserved memory
of: Add helper to retrieve ECC memory bits
dma-buf: heaps: Import uAPI header
dma-buf: heaps: Add ECC protection flags
dma-buf: heaps: system: Remove global variable
dma-buf: heaps: system: Handle ECC flags
dma-buf: heaps: cma: Handle ECC flags
dma-buf: heaps: carveout: Handle ECC flags
drivers/dma-buf/dma-heap.c | 4 +
drivers/dma-buf/heaps/Kconfig | 8 +
drivers/dma-buf/heaps/Makefile | 1 +
drivers/dma-buf/heaps/carveout_heap.c | 330 ++++++++++++++++++++++++++++++++++
drivers/dma-buf/heaps/cma_heap.c | 10 ++
drivers/dma-buf/heaps/system_heap.c | 29 ++-
include/linux/dma-heap.h | 2 +
include/linux/of.h | 25 +++
include/uapi/linux/dma-heap.h | 5 +-
9 files changed, 407 insertions(+), 7 deletions(-)
---
base-commit: a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
change-id: 20240515-dma-buf-ecc-heap-28a311d2c94e
Best regards,
--
Maxime Ripard <mripard(a)kernel.org>
On Wed, Jul 10, 2024 at 8:08 AM Lei Liu <liulei.rjpt(a)vivo.com> wrote:
>
>
> On 2024/7/10 22:48, Christian König wrote:
> > On 10.07.24 at 16:35, Lei Liu wrote:
> >>
> >> On 2024/7/10 22:14, Christian König wrote:
> >>> On 10.07.24 at 15:57, Lei Liu wrote:
> >>>> Use vm_insert_page to establish a mapping for the memory allocated
> >>>> by dmabuf, thus supporting direct I/O read and write; and fix the
> >>>> issue of incorrect memory statistics after mapping dmabuf memory.
> >>>
> >>> Well big NAK to that! Direct I/O is intentionally disabled on DMA-bufs.
> >>
> >> Hello! Could you explain why direct_io is disabled on DMABUF? Is
> >> there any historical reason for this?
> >
> > It's basically one of the most fundamental design decisions of DMA-Buf.
> > The attachment/map/fence model DMA-buf uses is not really compatible
> > with direct I/O on the underlying pages.
>
> Thank you! Is there any related documentation on this? I would like to
> understand and learn more about the fundamental reasons for the lack of
> support.
Hi Lei and Christian,
This is now the third request I've seen from three different companies
who are interested in this, but the others are not for reasons of read
performance that you mention in the commit message on your first
patch. Someone else at Google ran a comparison between a normal read()
and a direct I/O read() into a preallocated user buffer and found that
with large readahead (16 MB) the throughput can actually be slightly
higher than direct I/O. If you have concerns about read performance,
have you tried increasing the readahead size?
The other motivation is to load a gajillion byte file from disk into a
dmabuf without evicting the entire contents of pagecache while doing
so. Something like this (which does not currently work because read()
tries to GUP on the dmabuf memory as you mention):
static int dmabuf_heap_alloc(int heap_fd, size_t len)
{
    struct dma_heap_allocation_data data = {
        .len = len,
        .fd = 0,
        .fd_flags = O_RDWR | O_CLOEXEC,
        .heap_flags = 0,
    };
    int ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
    if (ret < 0)
        return ret;
    return data.fd;
}
int main(int, char **argv)
{
    const char *file_path = argv[1];
    printf("File: %s\n", file_path);
    int file_fd = open(file_path, O_RDONLY | O_DIRECT);
    struct stat st;
    stat(file_path, &st);
    ssize_t file_size = st.st_size;
    ssize_t aligned_size = (file_size + 4095) & ~4095;
    printf("File size: %zd Aligned size: %zd\n", file_size, aligned_size);
    int heap_fd = open("/dev/dma_heap/system", O_RDONLY);
    int dmabuf_fd = dmabuf_heap_alloc(heap_fd, aligned_size);
    void *vm = mmap(nullptr, aligned_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, dmabuf_fd, 0);
    printf("VM at 0x%lx\n", (unsigned long)vm);
    dma_buf_sync sync_flags { DMA_BUF_SYNC_START |
                              DMA_BUF_SYNC_READ | DMA_BUF_SYNC_WRITE };
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync_flags);
    ssize_t rc = read(file_fd, vm, file_size);
    printf("Read: %zd %s\n", rc, rc < 0 ? strerror(errno) : "");
    sync_flags.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ |
                       DMA_BUF_SYNC_WRITE;
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync_flags);
}
Or replace the mmap() + read() with sendfile().
So I would also like to see the above code (or something else similar)
be able to work and I understand some of the reasons why it currently
does not, but I don't understand why we should actively prevent this
type of behavior entirely.
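One small detail in the example above: the size rounding hard-codes a 4096-byte page. A minimal sketch of the same rounding with the alignment as a parameter follows; align_up() is a hypothetical helper name, and it assumes the alignment is a power of two.

```c
#include <stddef.h>

/*
 * Round len up to the next multiple of align.
 * Assumption: align is a non-zero power of two (true for page sizes).
 */
static size_t align_up(size_t len, size_t align)
{
	return (len + align - 1) & ~(align - 1);
}
```

In a real program the alignment could come from sysconf(_SC_PAGESIZE) rather than a constant, so the buffer size stays correct on systems with larger pages.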
Best,
T.J.
> >
> >>>
> >>> We already discussed enforcing that in the DMA-buf framework and
> >>> this patch probably means that we should really do that.
> >>>
> >>> Regards,
> >>> Christian.
> >>
> >> Thank you for your response. With the application of AI large model
> >> edgeification, we urgently need support for direct_io on DMABUF to
> >> read some very large files. Do you have any new solutions or plans
> >> for this?
> >
> > We have seen similar projects over the years and all of those turned
> > out to be complete shipwrecks.
> >
> > There is currently a patch set under discussion to give the network
> > subsystem DMA-buf support. If you are interested in network direct I/O
> > that could help.
>
> Is there a related introduction link for this patch?
>
> >
> > In addition to that, a lot of GPU drivers support userptr usage, e.g.
> > to import malloced memory into the GPU driver. You can then also do
> > direct I/O on that malloced memory and the kernel will enforce correct
> > handling with the GPU driver through MMU notifiers.
> >
> > But as far as I know a general DMA-buf based solution isn't possible.
>
> 1.The reason we need to use DMABUF memory here is that we need to share
> memory between the CPU and APU. Currently, only DMABUF memory is
> suitable for this purpose. Additionally, we need to read very large files.
>
> 2. Are there any other solutions for this? Also, do you have any plans
> to support direct_io for DMABUF memory in the future?
>
> >
> > Regards,
> > Christian.
> >
> >>
> >> Regards,
> >> Lei Liu.
> >>
> >>>
> >>>>
> >>>> Lei Liu (2):
> >>>> mm: dmabuf_direct_io: Support direct_io for memory allocated by
> >>>> dmabuf
> >>>> mm: dmabuf_direct_io: Fix memory statistics error for dmabuf
> >>>> allocated
> >>>> memory with direct_io support
> >>>>
> >>>> drivers/dma-buf/heaps/system_heap.c | 5 +++--
> >>>> fs/proc/task_mmu.c | 8 +++++++-
> >>>> include/linux/mm.h | 1 +
> >>>> mm/memory.c | 15 ++++++++++-----
> >>>> mm/rmap.c | 9 +++++----
> >>>> 5 files changed, 26 insertions(+), 12 deletions(-)
> >>>>
> >>>
> >
On Thu, Jun 20, 2024 at 3:52 PM Hans Verkuil <hverkuil-cisco(a)xs4all.nl> wrote:
>
> On 19/06/2024 06:19, Tomasz Figa wrote:
> > On Wed, Jun 19, 2024 at 1:24 AM Nicolas Dufresne <nicolas(a)ndufresne.ca> wrote:
> >>
> >> On Tuesday, 18 June 2024 at 16:47 +0900, Tomasz Figa wrote:
> >>> Hi TaoJiang,
> >>>
> >>> On Tue, Jun 18, 2024 at 4:30 PM TaoJiang <tao.jiang_2(a)nxp.com> wrote:
> >>>>
> >>>> From: Ming Qian <ming.qian(a)nxp.com>
> >>>>
> >>>> When the memory type is VB2_MEMORY_DMABUF, the v4l2 device can't know
> >>>> whether the dma buffer is coherent or synchronized.
> >>>>
> >>>> The videobuf2-core will skip cache syncs as it thinks the DMA exporter
> >>>> should take care of cache syncs.
> >>>>
> >>>> But in fact it's likely that the client doesn't
> >>>> synchronize the dma buf before qbuf() or after dqbuf(), and it's
> >>>> difficult to find this type of error directly.
> >>>>
> >>>> I think it's helpful that videobuf2-core can call
> >>>> dma_buf_end_cpu_access() and dma_buf_begin_cpu_access() to handle the
> >>>> cache syncs.
> >>>>
> >>>> Signed-off-by: Ming Qian <ming.qian(a)nxp.com>
> >>>> Signed-off-by: TaoJiang <tao.jiang_2(a)nxp.com>
> >>>> ---
> >>>> .../media/common/videobuf2/videobuf2-core.c | 22 +++++++++++++++++++
> >>>> 1 file changed, 22 insertions(+)
> >>>>
> >>>
> >>> Sorry, that patch is incorrect. I believe you're misunderstanding the
> >>> way DMA-buf buffers should be managed in the userspace. It's the
> >>> userspace responsibility to call the DMA_BUF_IOCTL_SYNC ioctl [1] to
> >>> signal start and end of CPU access to the kernel and imply necessary
> >>> cache synchronization.
> >>>
> >>> [1] https://docs.kernel.org/driver-api/dma-buf.html#dma-buffer-ioctls
> >>>
> >>> So, really sorry, but it's a NAK.
> >>
> >>
> >>
> >> This patch *could* make sense if it was inside the UVC driver, for example, as this
> >> driver can import a dmabuf, does a CPU memcpy, and omits the required sync calls
> >> (unless that got added recently, I can easily have missed it).
> >
> > Yeah, currently V4L2 drivers don't call the in-kernel
> > dma_buf_{begin,end}_cpu_access() when they need to access the buffers
> > from the CPU, while my quick grep [1] reveals that we have 68 files
> > retrieving plane vaddr by calling vb2_plane_vaddr() (not necessarily a
> > 100% guarantee of CPU access being done, but rather likely so).
> >
> > I also repeated the same thing with VB2_DMABUF [2] and tried to
> > attribute both lists to specific drivers (by retaining the path until
> > the first - or _ [3]; which seemed to be relatively accurate), leading
> > to the following drivers that claim support for DMABUF while also
> > retrieving plane vaddr (without proper synchronization - no drivers
> > currently call any begin/end CPU access):
> >
> > i2c/video
> > pci/bt8xx/bttv
> > pci/cobalt/cobalt
> > pci/cx18/cx18
> > pci/tw5864/tw5864
> > pci/tw686x/tw686x
> > platform/allegro
> > platform/amphion/vpu
> > platform/chips
> > platform/intel/pxa
> > platform/marvell/mcam
> > platform/mediatek/jpeg/mtk
> > platform/mediatek/vcodec/decoder/mtk
> > platform/mediatek/vcodec/encoder/mtk
> > platform/nuvoton/npcm
> > platform/nvidia/tegra
> > platform/nxp/imx
> > platform/renesas/rcar
> > platform/renesas/vsp1/vsp1
> > platform/rockchip/rkisp1/rkisp1
> > platform/samsung/exynos4
> > platform/samsung/s5p
> > platform/st/sti/delta/delta
> > platform/st/sti/hva/hva
> > platform/verisilicon/hantro
> > usb/au0828/au0828
> > usb/cx231xx/cx231xx
> > usb/dvb
> > usb/em28xx/em28xx
> > usb/gspca/gspca.c
> > usb/hackrf/hackrf.c
> > usb/stk1160/stk1160
> > usb/uvc/uvc
> >
> > which means we potentially have ~30 drivers which likely don't handle
> > imported DMABUFs correctly (there is still a chance that DMABUF is
> > advertised for one queue, while vaddr is used for another).
> >
> > I think we have two options:
> > 1) add vb2_{begin/end}_cpu_access() helpers, carefully audit each
> > driver and add calls to those
>
> I actually started on that 9 (!) years ago:
>
> https://git.linuxtv.org/hverkuil/media_tree.git/log/?h=vb2-cpu-access
>
> If memory serves, the main problem was that there were some drivers where
> it wasn't clear what should be done. In the end I never continued this
> work since nobody complained about it.
>
> This patch series adds vb2_plane_begin/end_cpu_access() functions,
> replaces all calls to vb2_plane_vaddr() in drivers to the new functions,
> and at the end removes vb2_plane_vaddr() altogether.
>
> > 2) take a heavy gun approach and just call vb2_begin_cpu_access()
> > whenever vb2_plane_vaddr() is called and then vb2_end_cpu_access()
> > whenever vb2_buffer_done() is called (if begin was called before).
> >
> > The latter has the disadvantage of drivers not having control over the
> > timing of the cache sync, so could end up with less than optimal
> > performance. Also there could be some more complex cases, where the
> > driver needs to mix DMA and CPU accesses to the buffer, so the fixed
> > sequence just wouldn't work for them. (But then they just wouldn't
> > work today either.)
> >
> > Hans, Marek, do you have any thoughts? (I'd personally just go with 2
> > and if any driver in the future needs something else, they could call
> > begin/end CPU access manually.)
>
> I prefer 1. If nothing else, that makes it easy to identify drivers
> that do such things.
>
> But perhaps a mix is possible: if a VB2 flag is set by the driver, then
> approach 2 is used. That might help with the drivers where it isn't clear
> what they should do. Although perhaps this can all be done in the driver
> itself: instead of vb2_plane_vaddr they call vb2_begin_cpu_access for the
> whole buffer, and at buffer_done time they call vb2_end_cpu_access. Should
> work just as well for the very few drivers that need this.
That's a good point. I guess we don't really need to dig so much into
those drivers in this case. Just mechanically do the same for all of
them (+/- maybe checking for some obvious corner cases which don't
need the extra calls). Let me see if I can give it a stab.
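To make the intended sequence concrete, here is a small userspace model of the "begin when the CPU mapping is requested, end at buffer-done time (if begin was called before)" idea. It is only a sketch: every name below is made up for illustration, none of it is videobuf2 code, and the counters merely stand in for the cache maintenance the real begin/end CPU access calls would trigger.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: these are not real videobuf2 symbols. */
struct model_buf {
	unsigned char data[16];
	bool cpu_access_open;	/* begin was called, end not yet */
	int begin_calls;	/* stands in for cache invalidate */
	int end_calls;		/* stands in for cache flush */
};

static void model_begin_cpu_access(struct model_buf *b)
{
	if (b->cpu_access_open)
		return;		/* already bracketed, nothing to do */
	b->cpu_access_open = true;
	b->begin_calls++;
}

static void model_end_cpu_access(struct model_buf *b)
{
	if (!b->cpu_access_open)
		return;		/* no CPU access happened for this buffer */
	b->cpu_access_open = false;
	b->end_calls++;
}

/* Asking for the CPU address implicitly opens the access window. */
static void *model_plane_vaddr(struct model_buf *b)
{
	model_begin_cpu_access(b);
	return b->data;
}

/* Completing the buffer closes the window, if one was opened. */
static void model_buffer_done(struct model_buf *b)
{
	model_end_cpu_access(b);
}
```

The point of the "if begin was called before" guard is that drivers which never touch the buffer from the CPU pay nothing, while repeated address lookups on one buffer still result in a single begin/end pair.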
Best,
Tomasz
>
> Regards,
>
> Hans
>
> >
> > [1] git grep vb2_plane_vaddr | cut -d":" -f 1 | sort | uniq
> > [2] git grep VB2_DMABUF | cut -d":" -f 1 | sort | uniq
> > [3] by running [1] and [2] through | cut -d"-" -f 1 | cut -d"_" -f 1 | uniq
> >
> > Best,
> > Tomasz
> >
> >>
> >> But generally speaking, bracketing all drivers with CPU access synchronization
> >> does not make sense indeed, so I second the rejection.
> >>
> >> Nicolas
> >>
> >>>
> >>> Best regards,
> >>> Tomasz
> >>>
> >>>> diff --git a/drivers/media/common/videobuf2/videobuf2-core.c b/drivers/media/common/videobuf2/videobuf2-core.c
> >>>> index 358f1fe42975..4734ff9cf3ce 100644
> >>>> --- a/drivers/media/common/videobuf2/videobuf2-core.c
> >>>> +++ b/drivers/media/common/videobuf2/videobuf2-core.c
> >>>> @@ -340,6 +340,17 @@ static void __vb2_buf_mem_prepare(struct vb2_buffer *vb)
> >>>> vb->synced = 1;
> >>>> for (plane = 0; plane < vb->num_planes; ++plane)
> >>>> call_void_memop(vb, prepare, vb->planes[plane].mem_priv);
> >>>> +
> >>>> + if (vb->memory != VB2_MEMORY_DMABUF)
> >>>> + return;
> >>>> + for (plane = 0; plane < vb->num_planes; ++plane) {
> >>>> + struct dma_buf *dbuf = vb->planes[plane].dbuf;
> >>>> +
> >>>> + if (!dbuf)
> >>>> + continue;
> >>>> +
> >>>> + dma_buf_end_cpu_access(dbuf, vb->vb2_queue->dma_dir);
> >>>> + }
> >>>> }
> >>>>
> >>>> /*
> >>>> @@ -356,6 +367,17 @@ static void __vb2_buf_mem_finish(struct vb2_buffer *vb)
> >>>> vb->synced = 0;
> >>>> for (plane = 0; plane < vb->num_planes; ++plane)
> >>>> call_void_memop(vb, finish, vb->planes[plane].mem_priv);
> >>>> +
> >>>> + if (vb->memory != VB2_MEMORY_DMABUF)
> >>>> + return;
> >>>> + for (plane = 0; plane < vb->num_planes; ++plane) {
> >>>> + struct dma_buf *dbuf = vb->planes[plane].dbuf;
> >>>> +
> >>>> + if (!dbuf)
> >>>> + continue;
> >>>> +
> >>>> + dma_buf_begin_cpu_access(dbuf, vb->vb2_queue->dma_dir);
> >>>> + }
> >>>> }
> >>>>
> >>>> /*
> >>>> --
> >>>> 2.43.0-rc1
> >>>>
> >>
> >
>
On 10.07.24 at 16:35, Lei Liu wrote:
>
> On 2024/7/10 22:14, Christian König wrote:
>> On 10.07.24 at 15:57, Lei Liu wrote:
>>> Use vm_insert_page to establish a mapping for the memory allocated
>>> by dmabuf, thus supporting direct I/O read and write; and fix the
>>> issue of incorrect memory statistics after mapping dmabuf memory.
>>
>> Well big NAK to that! Direct I/O is intentionally disabled on DMA-bufs.
>
> Hello! Could you explain why direct_io is disabled on DMABUF? Is there
> any historical reason for this?
It's basically one of the most fundamental design decisions of DMA-Buf.
The attachment/map/fence model DMA-buf uses is not really compatible
with direct I/O on the underlying pages.
>>
>> We already discussed enforcing that in the DMA-buf framework and this
>> patch probably means that we should really do that.
>>
>> Regards,
>> Christian.
>
> Thank you for your response. With the application of AI large model
> edgeification, we urgently need support for direct_io on DMABUF to
> read some very large files. Do you have any new solutions or plans for
> this?
We have seen similar projects over the years and all of those turned out
to be complete shipwrecks.
There is currently a patch set under discussion to give the network
subsystem DMA-buf support. If you are interested in network direct I/O
that could help.
In addition to that, a lot of GPU drivers support userptr usage, e.g. to
import malloced memory into the GPU driver. You can then also do direct
I/O on that malloced memory and the kernel will enforce correct handling
with the GPU driver through MMU notifiers.
But as far as I know a general DMA-buf based solution isn't possible.
Regards,
Christian.
>
> Regards,
> Lei Liu.
>
>>
>>>
>>> Lei Liu (2):
>>> mm: dmabuf_direct_io: Support direct_io for memory allocated by
>>> dmabuf
>>> mm: dmabuf_direct_io: Fix memory statistics error for dmabuf
>>> allocated
>>> memory with direct_io support
>>>
>>> drivers/dma-buf/heaps/system_heap.c | 5 +++--
>>> fs/proc/task_mmu.c | 8 +++++++-
>>> include/linux/mm.h | 1 +
>>> mm/memory.c | 15 ++++++++++-----
>>> mm/rmap.c | 9 +++++----
>>> 5 files changed, 26 insertions(+), 12 deletions(-)
>>>
>>
On Mon, Jul 8, 2024 at 6:47 AM Zenghui Yu <yuzenghui(a)huawei.com> wrote:
>
> Even if a vgem device is configured in, we will skip the import_vgem_fd()
> test almost every time.
>
> TAP version 13
> 1..11
> # Testing heap: system
> # =======================================
> # Testing allocation and importing:
> ok 1 # SKIP Could not open vgem -1
>
> The problem is that we use the DRM_IOCTL_VERSION ioctl to query the driver
> version information but leave the name field a non-null-terminated string.
> Terminate it properly to actually test against the vgem device.
Hm yeah. Looks like drm_copy_field resets version.name_len to the actual
length of the name in the case of truncation, so maybe worth checking
that too in case there is a name like "vgemfoo" that gets converted to
"vgem\0" by this?
>
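One way to cover the truncation case raised above is to compare the reported
length as well as the bytes. The following is only a sketch: the helper name
name_is_vgem is hypothetical and not part of the selftest, and it assumes the
caller passes the name buffer together with the name_len value as updated by
DRM_IOCTL_VERSION.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return true only when the driver name is exactly
 * "vgem". 'reported_len' is the length reported back for the full driver
 * name, which may be larger than what fit in 'buf'.
 */
bool name_is_vgem(const char *buf, size_t reported_len)
{
	static const char want[] = "vgem";
	const size_t want_len = sizeof(want) - 1;

	/* A longer driver name such as "vgemfoo" reports a length of 7. */
	if (reported_len != want_len)
		return false;

	/* Compare bytes without relying on a terminating NUL in 'buf'. */
	return memcmp(buf, want, want_len) == 0;
}
```

With a check of this shape, a truncated "vgemfoo" is rejected because its
reported length is 7, even though the first four bytes match.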
> Signed-off-by: Zenghui Yu <yuzenghui(a)huawei.com>
> ---
> tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
> index 5f541522364f..2fcc74998fa9 100644
> --- a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
> +++ b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
> @@ -32,6 +32,8 @@ static int check_vgem(int fd)
> if (ret)
> return 0;
>
> + name[4] = '\0';
> +
> return !strcmp(name, "vgem");
> }
>
> --
> 2.33.0
>
On Thu, 4 Jul 2024 at 00:40, Amirreza Zarrabi <quic_azarrabi(a)quicinc.com> wrote:
>
>
>
> On 7/3/2024 10:13 PM, Dmitry Baryshkov wrote:
> > On Tue, Jul 02, 2024 at 10:57:36PM GMT, Amirreza Zarrabi wrote:
> >> Qualcomm TEE hosts Trusted Applications and Services that run in the
> >> secure world. Access to these resources is provided using object
> >> capabilities. A TEE client with access to the capability can invoke
> >> the object and request a service. Similarly, TEE can request a service
> >> from nonsecure world with object capabilities that are exported to secure
> >> world.
> >>
> >> We provide qcom_tee_object which represents an object in both secure
> >> and nonsecure world. TEE clients can invoke an instance of qcom_tee_object
> >> to access TEE. TEE can issue a callback request to nonsecure world
> >> by invoking an instance of qcom_tee_object in nonsecure world.
> >
> > Please see Documentation/process/submitting-patches.rst on how to write
> > commit messages.
>
> Ack.
>
> >
> >>
> >> Any driver in nonsecure world that wants to export a struct (or a
> >> service object) to TEE must embed an instance of qcom_tee_object in
> >> the relevant struct and implement the dispatcher function, which is
> >> called when TEE invokes the service object.
> >>
> >> We also provide a simplified API which implements the Qualcomm TEE transport
> >> protocol. The implementation is independent from any services that may
> >> reside in nonsecure world.
> >
> > "also" usually means that it should go to a separate commit.
>
> I will split this patch to multiple smaller ones.
>
[...]
> >
> >> + } in, out;
> >> +};
> >> +
> >> +int qcom_tee_object_do_invoke(struct qcom_tee_object_invoke_ctx *oic,
> >> + struct qcom_tee_object *object, unsigned long op, struct qcom_tee_arg u[], int *result);
> >
> > > What's the difference between a result that gets returned by the
> > > function and the result that gets returned via the pointer?
>
> The function result is local to the kernel, for instance a memory allocation
> failure or a failure to issue the SMC call. The result in the pointer is the
> remote result, for instance the return value from a TA or from the TEE itself.
>
> I'll use a better name, e.g. 'remote_result'?
See how this is handled by other parties. For example, PSCI. If you
have a standard set of return codes, translate them to -ESOMETHING in
your framework and let everybody else see only the standard errors.
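A rough sketch of what such a translation layer could look like follows. The
remote status values and all names here are hypothetical, since the actual TEE
return codes are not shown in this thread; the PSCI driver does something
similar in psci_to_linux_errno().

```c
#include <errno.h>

/* Hypothetical remote status codes; the real TEE values may differ. */
enum qtee_remote_status {
	QTEE_REMOTE_OK = 0,
	QTEE_REMOTE_NOMEM = 1,
	QTEE_REMOTE_INVALID = 2,
	QTEE_REMOTE_BUSY = 3,
};

/* Translate a remote status into 0 or a negative errno value. */
int qtee_remote_to_errno(int status)
{
	switch (status) {
	case QTEE_REMOTE_OK:
		return 0;
	case QTEE_REMOTE_NOMEM:
		return -ENOMEM;
	case QTEE_REMOTE_INVALID:
		return -EINVAL;
	case QTEE_REMOTE_BUSY:
		return -EBUSY;
	default:
		/* Unknown remote codes collapse into a generic I/O error. */
		return -EIO;
	}
}
```

Callers would then see a single return value and would not need a separate
result pointer for the common cases.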
--
With best wishes
Dmitry
On Mon, Jul 01, 2024 at 11:26:34PM -0700, Andrew Morton wrote:
> No, I do think the cast is useful:
>
> struct page *page = dma_fence_chain_alloc();
>
> will presently generate a warning. We want this. Your change will
> remove that useful warning.
>
>
> Unrelatedly: there is no earthly reason why this is implemented as a
> macro. A static inline function would be so much better. Why do we
> keep doing this.
Agreed with all of the above. Adding the dmabuf maintainers.
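For illustration, the static inline form could look roughly like the userspace
sketch below. struct fence_chain and fence_chain_alloc are stand-ins invented
for this example (malloc replaces kmalloc); this is not the actual
dma_fence_chain code.

```c
#include <stdlib.h>

/* Stand-in for struct dma_fence_chain; the field is a placeholder. */
struct fence_chain {
	unsigned long long prev_seqno;
};

/*
 * Typed allocator: because the function has a declared return type,
 * assigning its result to an unrelated pointer type (for example
 * "struct page *") is diagnosed by the compiler, and no cast is needed.
 */
static inline struct fence_chain *fence_chain_alloc(void)
{
	return malloc(sizeof(struct fence_chain));
}
```

The function keeps the type check that the cast provides in the macro version
while also giving the compiler a real prototype to check.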
On Tue, Jul 02, 2024 at 10:57:36PM GMT, Amirreza Zarrabi wrote:
> Qualcomm TEE hosts Trusted Applications and Services that run in the
> secure world. Access to these resources is provided using object
> capabilities. A TEE client with access to the capability can invoke
> the object and request a service. Similarly, TEE can request a service
> from nonsecure world with object capabilities that are exported to secure
> world.
>
> We provide qcom_tee_object which represents an object in both secure
> and nonsecure world. TEE clients can invoke an instance of qcom_tee_object
> to access TEE. TEE can issue a callback request to nonsecure world
> by invoking an instance of qcom_tee_object in nonsecure world.
Please see Documentation/process/submitting-patches.rst on how to write
commit messages.
>
> Any driver in nonsecure world that wants to export a struct (or a
> service object) to TEE must embed an instance of qcom_tee_object in
> the relevant struct and implement the dispatcher function, which is
> called when TEE invokes the service object.
>
> We also provide a simplified API which implements the Qualcomm TEE transport
> protocol. The implementation is independent from any services that may
> reside in nonsecure world.
"also" usually means that it should go to a separate commit.
>
> Signed-off-by: Amirreza Zarrabi <quic_azarrabi(a)quicinc.com>
> ---
> drivers/firmware/qcom/Kconfig | 14 +
> drivers/firmware/qcom/Makefile | 2 +
> drivers/firmware/qcom/qcom_object_invoke/Makefile | 4 +
> drivers/firmware/qcom/qcom_object_invoke/async.c | 142 +++
> drivers/firmware/qcom/qcom_object_invoke/core.c | 1139 ++++++++++++++++++++
> drivers/firmware/qcom/qcom_object_invoke/core.h | 186 ++++
> .../qcom/qcom_object_invoke/qcom_scm_invoke.c | 22 +
> .../firmware/qcom/qcom_object_invoke/release_wq.c | 90 ++
> include/linux/firmware/qcom/qcom_object_invoke.h | 233 ++++
> 9 files changed, 1832 insertions(+)
>
> diff --git a/drivers/firmware/qcom/Kconfig b/drivers/firmware/qcom/Kconfig
> index 7f6eb4174734..103ab82bae9f 100644
> --- a/drivers/firmware/qcom/Kconfig
> +++ b/drivers/firmware/qcom/Kconfig
> @@ -84,4 +84,18 @@ config QCOM_QSEECOM_UEFISECAPP
> Select Y here to provide access to EFI variables on the aforementioned
> platforms.
>
> +config QCOM_OBJECT_INVOKE_CORE
> + bool "Secure TEE Communication Support"
tristate
> + help
> + Various Qualcomm SoCs have a Trusted Execution Environment (TEE) running
> + in the Trust Zone. This module provides an interface to that via the
> + capability based object invocation, using SMC calls.
> +
> + OBJECT_INVOKE_CORE allows capability based secure communication between
> + TEE and VMs. Using OBJECT_INVOKE_CORE, kernel can issue calls to TEE or
> + TAs to request a service or exposes services to TEE and TAs. It implements
> + the necessary marshaling of messages with TEE.
> +
> + Select Y here to provide access to TEE.
> +
> endmenu
> diff --git a/drivers/firmware/qcom/Makefile b/drivers/firmware/qcom/Makefile
> index 0be40a1abc13..dd5e00215b2e 100644
> --- a/drivers/firmware/qcom/Makefile
> +++ b/drivers/firmware/qcom/Makefile
> @@ -8,3 +8,5 @@ qcom-scm-objs += qcom_scm.o qcom_scm-smc.o qcom_scm-legacy.o
> obj-$(CONFIG_QCOM_TZMEM) += qcom_tzmem.o
> obj-$(CONFIG_QCOM_QSEECOM) += qcom_qseecom.o
> obj-$(CONFIG_QCOM_QSEECOM_UEFISECAPP) += qcom_qseecom_uefisecapp.o
> +
> +obj-y += qcom_object_invoke/
> diff --git a/drivers/firmware/qcom/qcom_object_invoke/Makefile b/drivers/firmware/qcom/qcom_object_invoke/Makefile
> new file mode 100644
> index 000000000000..6ef4d54891a5
> --- /dev/null
> +++ b/drivers/firmware/qcom/qcom_object_invoke/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +
> +obj-$(CONFIG_QCOM_OBJECT_INVOKE_CORE) += object-invoke-core.o
> +object-invoke-core-objs := qcom_scm_invoke.o release_wq.o async.o core.o
> diff --git a/drivers/firmware/qcom/qcom_object_invoke/async.c b/drivers/firmware/qcom/qcom_object_invoke/async.c
> new file mode 100644
> index 000000000000..dd022ec68d8b
> --- /dev/null
> +++ b/drivers/firmware/qcom/qcom_object_invoke/async.c
> @@ -0,0 +1,142 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved.
> + */
> +
> +#include <linux/kobject.h>
> +#include <linux/slab.h>
> +#include <linux/mutex.h>
> +
> +#include "core.h"
> +
> +/* Async handlers and providers. */
> +struct async_msg {
> + struct {
> + u32 version; /* Protocol version: top 16b major, lower 16b minor. */
> + u32 op; /* Async operation. */
> + } header;
> +
> + /* Format of the Async data field is defined by the specified operation. */
> +
> + struct {
> + u32 count; /* Number of objects that should be released. */
> + u32 obj[];
> + } op_release;
> +};
Another generic comment: please select some prefix (like QTEE_ / qtee_)
and use it for _all_ defines and all names in the driver.
`struct async_msg` means that it is some generic code that is applicable
to the whole kernel.
> +
> +/* Async Operations and header information. */
> +
> +#define ASYNC_HEADER_SIZE sizeof(((struct async_msg *)(0))->header)
Extract struct definition. Use sizeof(struct qtee_async_msg_header).
> +
> +/* ASYNC_OP_x: operation.
> + * ASYNC_OP_x_HDR_SIZE: header size for the operation.
> + * ASYNC_OP_x_SIZE: size of each entry in a message for the operation.
> + * ASYNC_OP_x_MSG_SIZE: size of a message with n entries.
> + */
> +
> +#define ASYNC_OP_RELEASE QCOM_TEE_OBJECT_OP_RELEASE /* Added in minor version 0x0000. **/
Anything before minor version 0x0000 ?
> +#define ASYNC_OP_RELEASE_HDR_SIZE offsetof(struct async_msg, op_release.obj)
> +#define ASYNC_OP_RELEASE_SIZE sizeof(((struct async_msg *)(0))->op_release.obj[0])
sizeof(u32) is much better
> +#define ASYNC_OP_RELEASE_MSG_SIZE(n) \
> + (ASYNC_OP_RELEASE_HDR_SIZE + ((n) * ASYNC_OP_RELEASE_SIZE))
struct_size(). But I think you should be able to inline and/or drop most
of these defines.
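To make the three suggestions above concrete, here is a userspace sketch. The
qtee_ names are hypothetical, the layout mirrors the struct quoted above, and
qtee_async_release_msg_size() only imitates what struct_size() would compute
in the kernel, including saturating on overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Header extracted into its own type, so sizeof() can be used directly. */
struct qtee_async_msg_header {
	uint32_t version;	/* top 16 bits major, lower 16 bits minor */
	uint32_t op;		/* async operation */
};

/* Release message: header, entry count, then a flexible array of IDs. */
struct qtee_async_release_msg {
	struct qtee_async_msg_header header;
	uint32_t count;
	uint32_t obj[];
};

/* Size of a release message with n entries; SIZE_MAX on overflow. */
size_t qtee_async_release_msg_size(size_t n)
{
	const size_t base = offsetof(struct qtee_async_release_msg, obj);

	if (n > (SIZE_MAX - base) / sizeof(uint32_t))
		return SIZE_MAX;

	return base + n * sizeof(uint32_t);
}
```

With the header as a named type and the size computed in one place, most of
the ASYNC_OP_x_* defines would no longer be needed.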
> +
> +/* async_qcom_tee_buffer return the available async buffer in the output buffer. */
> +
> +static struct qcom_tee_buffer async_qcom_tee_buffer(struct qcom_tee_object_invoke_ctx *oic)
Why do you need to return struct instance?
> +{
> + int i;
> + size_t offset;
> +
> + struct qcom_tee_callback *msg = (struct qcom_tee_callback *)oic->out.msg.addr;
> +
> + if (!(oic->flags & OIC_FLAG_BUSY))
> + return oic->out.msg;
> +
> + /* Async requests are appended to the output buffer after the CB message. */
> +
> + offset = OFFSET_TO_BUFFER_ARGS(msg, counts_total(msg->counts));
> +
> + for_each_input_buffer(i, msg->counts)
> + offset += align_offset(msg->args[i].b.size);
> +
> + for_each_output_buffer(i, msg->counts)
> + offset += align_offset(msg->args[i].b.size);
> +
> + if (oic->out.msg.size > offset) {
> + return (struct qcom_tee_buffer)
> + { { oic->out.msg.addr + offset }, oic->out.msg.size - offset };
> + }
> +
> + pr_err("no space left for async messages! or malformed message.\n");
No spamming on the kmsg.
> +
> + return (struct qcom_tee_buffer) { { 0 }, 0 };
This doesn't look correct.
> +}
> +
What does this function return?
> +static size_t async_release_handler(struct qcom_tee_object_invoke_ctx *oic,
> + struct async_msg *async_msg, size_t size)
Please indent the code properly, this should be aligned to the open
bracket.
> +{
> + int i;
> +
> + /* We need space for at least a single entry. */
> + if (size < ASYNC_OP_RELEASE_MSG_SIZE(1))
> + return 0;
> +
> + for (i = 0; i < async_msg->op_release.count; i++) {
> + struct qcom_tee_object *object;
> +
> + /* Remove the object from xa_qcom_tee_objects so that the object_id
> + * becomes invalid for further use. However, call put_qcom_tee_object
> + * to schedule the actual release if there is no user.
> + */
> +
> + object = erase_qcom_tee_object(async_msg->op_release.obj[i]);
> +
> + put_qcom_tee_object(object);
> + }
> +
> + return ASYNC_OP_RELEASE_MSG_SIZE(i);
> +}
> +
> +/* '__fetch__async_reqs' is a handler dispatcher (from TEE). */
> +
> +void __fetch__async_reqs(struct qcom_tee_object_invoke_ctx *oic)
> +{
> + size_t consumed, used = 0;
> +
> + struct qcom_tee_buffer async_buffer = async_qcom_tee_buffer(oic);
> +
> + while (async_buffer.size - used > ASYNC_HEADER_SIZE) {
> + struct async_msg *async_msg = (struct async_msg *)(async_buffer.addr + used);
> +
> + /* TEE assumes unused buffer is set to zero. */
> + if (!async_msg->header.version)
> + goto out;
> +
> + switch (async_msg->header.op) {
> + case ASYNC_OP_RELEASE:
> + consumed = async_release_handler(oic,
> + async_msg, async_buffer.size - used);
> +
> + break;
> + default: /* Unsupported operations. */
> + consumed = 0;
> + }
> +
> + used += align_offset(consumed);
> +
> + if (!consumed) {
> + pr_err("Drop async buffer (context_id %d): buffer %p, (%p, %zx), processed %zx\n",
Should it really go to the kmsg?
> + oic->context_id,
> + oic->out.msg.addr, /* Address of Output buffer. */
> + async_buffer.addr, /* Address of beginning of async buffer. */
> + async_buffer.size, /* Available size of async buffer. */
> + used); /* Processed async buffer. */
> +
> + goto out;
> + }
> + }
> +
> + out:
> +
> + memset(async_buffer.addr, 0, async_buffer.size);
Why?
> +}
> diff --git a/drivers/firmware/qcom/qcom_object_invoke/core.c b/drivers/firmware/qcom/qcom_object_invoke/core.c
> new file mode 100644
> index 000000000000..37dde8946b08
> --- /dev/null
> +++ b/drivers/firmware/qcom/qcom_object_invoke/core.c
> @@ -0,0 +1,1139 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved.
> + */
> +
> +#include <linux/kobject.h>
> +#include <linux/sysfs.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/slab.h>
> +#include <linux/delay.h>
> +#include <linux/mm.h>
> +#include <linux/xarray.h>
> +
> +#include "core.h"
> +
> +/* Static 'Primordial Object' operations. */
> +
> +#define OBJECT_OP_YIELD 1
> +#define OBJECT_OP_SLEEP 2
> +
> +/* static_qcom_tee_object_primordial always exists. */
> +/* primordial_object_register and primordial_object_release extends it. */
> +
> +static struct qcom_tee_object static_qcom_tee_object_primordial;
> +
> +static int primordial_object_register(struct qcom_tee_object *object);
> +static void primordial_object_release(struct qcom_tee_object *object);
> +
> +/* Marshaling API. */
> +/*
> + * prepare_msg - Prepares input buffer for sending to TEE.
> + * update_args - Parses TEE response in input buffer.
> + * prepare_args - Parses TEE request from output buffer.
> + * update_msg - Updates output buffer with response for TEE request.
> + *
> + * prepare_msg and update_args are used in direct TEE object invocation.
> + * prepare_args and update_msg are used for TEE requests (callback or async).
> + */
> +
> +static int prepare_msg(struct qcom_tee_object_invoke_ctx *oic,
> + struct qcom_tee_object *object, unsigned long op, struct qcom_tee_arg u[]);
> +static int update_args(struct qcom_tee_arg u[], struct qcom_tee_object_invoke_ctx *oic);
> +static int prepare_args(struct qcom_tee_object_invoke_ctx *oic);
> +static int update_msg(struct qcom_tee_object_invoke_ctx *oic);
Please reorder the functions so that you don't need forward
declarations.
> +
> +static int next_arg_type(struct qcom_tee_arg u[], int i, enum qcom_tee_arg_type type)
> +{
> + while (u[i].type != QCOM_TEE_ARG_TYPE_END && u[i].type != type)
> + i++;
> +
> + return i;
> +}
> +
> +/**
> + * args_for_each_type - Iterate over argument of given type.
> + * @i: index in @args.
> + * @args: array of arguments.
> + * @at: type of argument.
> + */
> +#define args_for_each_type(i, args, at) \
> + for (i = 0, i = next_arg_type(args, i, at); \
> + args[i].type != QCOM_TEE_ARG_TYPE_END; i = next_arg_type(args, ++i, at))
> +
> +#define arg_for_each_input_buffer(i, args) args_for_each_type(i, args, QCOM_TEE_ARG_TYPE_IB)
> +#define arg_for_each_output_buffer(i, args) args_for_each_type(i, args, QCOM_TEE_ARG_TYPE_OB)
> +#define arg_for_each_input_object(i, args) args_for_each_type(i, args, QCOM_TEE_ARG_TYPE_IO)
> +#define arg_for_each_output_object(i, args) args_for_each_type(i, args, QCOM_TEE_ARG_TYPE_OO)
> +
> +/* Outside this file we use struct qcom_tee_object to identify an object. */
> +
> +/* We only allocate IDs with QCOM_TEE_OBJ_NS_BIT set in range
> + * [QCOM_TEE_OBJECT_ID_START .. QCOM_TEE_OBJECT_ID_END]. qcom_tee_object
> + * represents non-secure object. The first ID with QCOM_TEE_OBJ_NS_BIT set is reserved
> + * for primordial object.
> + */
> +
> +#define QCOM_TEE_OBJECT_PRIMORDIAL (QCOM_TEE_OBJ_NS_BIT)
> +#define QCOM_TEE_OBJECT_ID_START (QCOM_TEE_OBJECT_PRIMORDIAL + 1)
> +#define QCOM_TEE_OBJECT_ID_END (UINT_MAX)
> +
> +#define SET_QCOM_TEE_OBJECT(p, type, ...) __SET_QCOM_TEE_OBJECT(p, type, ##__VA_ARGS__, 0UL)
> +#define __SET_QCOM_TEE_OBJECT(p, type, optr, ...) do { \
> + (p)->object_type = (type); \
> + (p)->info.object_ptr = (unsigned long)(optr); \
> + (p)->release = NULL; \
> + } while (0)
> +
> +/* ''TEE Object Table''. */
> +static DEFINE_XARRAY_ALLOC(xa_qcom_tee_objects);
> +
> +struct qcom_tee_object *allocate_qcom_tee_object(void)
> +{
> + struct qcom_tee_object *object;
I thought that struct qcom_tee_object should be embedded into other
struct definitions. Here you are just allocing it. Did I misunderstand
something?
> +
> + object = kzalloc(sizeof(*object), GFP_KERNEL);
> + if (object)
> + SET_QCOM_TEE_OBJECT(object, QCOM_TEE_OBJECT_TYPE_NULL);
> +
> + return object;
> +}
> +EXPORT_SYMBOL_GPL(allocate_qcom_tee_object);
kerneldoc for all exported functions.
> +
> +void free_qcom_tee_object(struct qcom_tee_object *object)
> +{
> + kfree(object);
> +}
> +EXPORT_SYMBOL_GPL(free_qcom_tee_object);
If qcom_tee_object is refcounted, then such API is definitely forbidden.
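The usual pattern is that the memory is only ever freed from the release
callback invoked by the final put. A simplified userspace sketch, with
hypothetical names and a plain (non-atomic) counter where the kernel would use
struct kref:

```c
#include <stdlib.h>

struct qtee_object {
	unsigned int refcount;
	void (*release)(struct qtee_object *object);
};

/* Example release callback: the only place where the object is freed. */
void qtee_object_free(struct qtee_object *object)
{
	free(object);
}

/* A new object starts with a single reference owned by its creator. */
void qtee_object_init(struct qtee_object *object,
		      void (*release)(struct qtee_object *object))
{
	object->refcount = 1;
	object->release = release;
}

void qtee_object_get(struct qtee_object *object)
{
	object->refcount++;
}

/* Drop a reference; returns 1 if this was the last one and release ran. */
int qtee_object_put(struct qtee_object *object)
{
	if (--object->refcount)
		return 0;

	object->release(object);
	return 1;
}
```

In this shape there is no exported free function at all, so a caller cannot
free an object that another user still holds a reference to.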
> +
> +/* 'get_qcom_tee_object' and 'put_qcom_tee_object'. */
> +
> +static int __free_qcom_tee_object(struct qcom_tee_object *object);
What is the difference between free_qcom_tee_object() and
__free_qcom_tee_object() ?
> +static void ____destroy_qcom_tee_object(struct kref *refcount)
> +{
> + struct qcom_tee_object *object = container_of(refcount, struct qcom_tee_object, refcount);
> +
> + __free_qcom_tee_object(object);
> +}
> +
> +int get_qcom_tee_object(struct qcom_tee_object *object)
> +{
> + if (object != NULL_QCOM_TEE_OBJECT &&
> + object != ROOT_QCOM_TEE_OBJECT)
> + return kref_get_unless_zero(&object->refcount);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(get_qcom_tee_object);
> +
> +static struct qcom_tee_object *qcom_tee__get_qcom_tee_object(unsigned int object_id)
> +{
> + XA_STATE(xas, &xa_qcom_tee_objects, object_id);
> + struct qcom_tee_object *object;
> +
> + rcu_read_lock();
> + do {
> + object = xas_load(&xas);
> + if (xa_is_zero(object))
> + object = NULL_QCOM_TEE_OBJECT;
> +
> + } while (xas_retry(&xas, object));
If you are just looping over the objects, why do you need XArray instead
of list?
> +
> + /* Sure object still exists. */
Why?
> + if (!get_qcom_tee_object(object))
> + object = NULL_QCOM_TEE_OBJECT;
> +
> + rcu_read_unlock();
> +
> + return object;
> +}
> +
> +struct qcom_tee_object *qcom_tee_get_qcom_tee_object(unsigned int object_id)
> +{
> + switch (object_id) {
> + case QCOM_TEE_OBJECT_PRIMORDIAL:
> + return &static_qcom_tee_object_primordial;
> +
> + default:
> + return qcom_tee__get_qcom_tee_object(object_id);
> + }
> +}
> +
> +void put_qcom_tee_object(struct qcom_tee_object *object)
> +{
> + if (object != &static_qcom_tee_object_primordial &&
> + object != NULL_QCOM_TEE_OBJECT &&
> + object != ROOT_QCOM_TEE_OBJECT)
misaligned
> + kref_put(&object->refcount, ____destroy_qcom_tee_object);
> +}
> +EXPORT_SYMBOL_GPL(put_qcom_tee_object);
> +
> +/* 'alloc_qcom_tee_object_id' and 'erase_qcom_tee_object'. */
huh?
I think I'm going to stop here. Please:
- Split the driver into logical chunks and more logical pieces.
- Rename functions and structures to follow generic scheme. Usually it
is prefix_object_do_something().
- Add documentation that would allow us to understand what is going on.
- Also document some general design decisions. Usage of XArray. API
choice. Refcounting.
[skipped]
> diff --git a/include/linux/firmware/qcom/qcom_object_invoke.h b/include/linux/firmware/qcom/qcom_object_invoke.h
> new file mode 100644
> index 000000000000..9e6acd0f4db0
> --- /dev/null
> +++ b/include/linux/firmware/qcom/qcom_object_invoke.h
> @@ -0,0 +1,233 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved.
> + */
> +
> +#ifndef __QCOM_OBJECT_INVOKE_H
> +#define __QCOM_OBJECT_INVOKE_H
> +
> +#include <linux/kref.h>
> +#include <linux/completion.h>
> +#include <linux/workqueue.h>
> +#include <uapi/misc/qcom_tee.h>
This header doesn't exist yet. This obviously means that you haven't
actually tried building this patch. Please make sure that kernel
compiles successfully after each commit.
> +
> +struct qcom_tee_object;
> +
> +/* Primordial Object */
> +
> +/* It is used for bootstrapping the IPC connection between a VM and TEE.
> + *
> + * Each side (both the VM and the TEE) starts up with no object received from the
> + * other side. They both ''assume'' the other side implements a permanent initial
> + * object in the object table.
> + *
> + * TEE's initial object is typically called the ''root client env'', and it's
> + * invoked by VMs when they want to get a new clientEnv. The initial object created
> + * by the VMs is invoked by TEE, it's typically called the ''primordial object''.
> + *
> + * VM can register a primordial object using 'init_qcom_tee_object_user' with
> + * 'QCOM_TEE_OBJECT_TYPE_ROOT' type.
> + */
> +
> +enum qcom_tee_object_type {
> + QCOM_TEE_OBJECT_TYPE_USER = 0x1, /* TEE object. */
> + QCOM_TEE_OBJECT_TYPE_CB_OBJECT = 0x2, /* Callback Object. */
> + QCOM_TEE_OBJECT_TYPE_ROOT = 0x8, /* ''Root client env.'' or 'primordial' Object. */
> + QCOM_TEE_OBJECT_TYPE_NULL = 0x10, /* NULL object. */
> +};
> +
> +enum qcom_tee_arg_type {
> + QCOM_TEE_ARG_TYPE_END = 0,
> + QCOM_TEE_ARG_TYPE_IB = 0x80, /* Input Buffer (IB). */
> + QCOM_TEE_ARG_TYPE_OB = 0x1, /* Output Buffer (OB). */
> + QCOM_TEE_ARG_TYPE_IO = 0x81, /* Input Object (IO). */
> + QCOM_TEE_ARG_TYPE_OO = 0x2 /* Output Object (OO). */
> +};
> +
> +#define QCOM_TEE_ARG_TYPE_INPUT_MASK 0x80
> +
> +/* Maximum specific type of arguments (i.e. IB, OB, IO, and OO) that can fit in a TEE message. */
> +#define QCOM_TEE_ARGS_PER_TYPE 16
Why is it 16?
> +
> +/* Maximum arguments that can fit in a TEE message. */
> +#define QCOM_TEE_ARGS_MAX (QCOM_TEE_ARGS_PER_TYPE * 4)
> +
> +/**
> + * struct qcom_tee_arg - Argument for TEE object invocation.
> + * @type: type of argument
> + * @flags: extra flags.
> + * @b: address and size if type of argument is buffer.
> + * @o: qcom_tee_object instance if type of argument is object.
> + *
> + * @flags only accept QCOM_TEE_ARG_FLAGS_UADDR for now which states that @b
> + * contains userspace address in uaddr.
> + *
> + */
> +struct qcom_tee_arg {
> + enum qcom_tee_arg_type type;
> +
> +/* 'uaddr' holds a __user address. */
> +#define QCOM_TEE_ARG_FLAGS_UADDR 1
> + char flags;
This is not a character.
> + union {
> + struct qcom_tee_buffer {
> + union {
> + void *addr;
> + void __user *uaddr;
> + };
> + size_t size;
> + } b;
> + struct qcom_tee_object *o;
> + };
How can the code distinguish between the qcom_tee_object and
qcom_tee_buffer here?
> +};
> +
> +static inline int size_of_arg(struct qcom_tee_arg u[])
length, not size.
> +{
> + int i = 0;
> +
> + while (u[i].type != QCOM_TEE_ARG_TYPE_END)
> + i++;
> +
> + return i;
> +}
> +
> +/* Context ID - It is a unique ID assigned to an invocation which is in progress.
> + * Object's dispatcher can use the ID to differentiate between concurrent calls.
> + * ID [0 .. 10) are reserved, i.e. never passed to object's dispatcher.
Is 10 included or excluded here? Why does it have a different bracket
type?
> + */
> +
> +struct qcom_tee_object_invoke_ctx {
> + unsigned int context_id;
> +
> +#define OIC_FLAG_BUSY 1 /* Context is busy, i.e. callback is in progress. */
> +#define OIC_FLAG_NOTIFY 2 /* Context needs to notify the current object. */
> +#define OIC_FLAG_QCOM_TEE 4 /* Context has objects shared with TEE. */
BIT(n)
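That is, something along these lines. The BIT() definition below is a
userspace stand-in for the kernel macro from <linux/bits.h>, included only so
that the fragment is self-contained:

```c
/* Userspace stand-in for the kernel's BIT() macro. */
#define BIT(n)			(1UL << (n))

#define OIC_FLAG_BUSY		BIT(0)	/* Callback is in progress. */
#define OIC_FLAG_NOTIFY		BIT(1)	/* Need to notify the current object. */
#define OIC_FLAG_QCOM_TEE	BIT(2)	/* Has objects shared with TEE. */
```

The values are unchanged (1, 2 and 4), but it becomes obvious that each flag
occupies a distinct bit.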
> + unsigned int flags;
> +
> + /* Current object invoked in this callback context. */
> + struct qcom_tee_object *object;
> +
> + /* Arguments passed to dispatch callback. */
> + struct qcom_tee_arg u[QCOM_TEE_ARGS_MAX + 1];
> +
> + int errno;
> +
> + /* Inbound and Outbound buffers shared with TEE. */
> + struct {
> + struct qcom_tee_buffer msg;
Please define struct qcom_tee_buffer in a top-level definition instead
of having it nested somewhere in another struct;
> + } in, out;
> +};
> +
> +int qcom_tee_object_do_invoke(struct qcom_tee_object_invoke_ctx *oic,
> + struct qcom_tee_object *object, unsigned long op, struct qcom_tee_arg u[], int *result);
What's the difference between a result that gets returned by the
function and the result that gets returned via the pointer?
> +
> +#define QCOM_TEE_OBJECT_OP_METHOD_MASK 0x0000FFFFU
> +#define QCOM_TEE_OBJECT_OP_METHOD_ID(op) ((op) & QCOM_TEE_OBJECT_OP_METHOD_MASK)
> +
> +/* Reserved Operations. */
> +
> +#define QCOM_TEE_OBJECT_OP_RELEASE (QCOM_TEE_OBJECT_OP_METHOD_MASK - 0)
> +#define QCOM_TEE_OBJECT_OP_RETAIN (QCOM_TEE_OBJECT_OP_METHOD_MASK - 1)
> +#define QCOM_TEE_OBJECT_OP_NO_OP (QCOM_TEE_OBJECT_OP_METHOD_MASK - 2)
> +
> +struct qcom_tee_object_operations {
> + void (*release)(struct qcom_tee_object *object);
> +
> + /**
> + * @op_supported:
> + *
> + * Query made to make sure the requested operation is supported. If defined,
> + * it is called before marshaling of the arguments (as optimisation).
> + */
> + int (*op_supported)(unsigned long op);
> +
> + /**
> + * @notify:
> + *
> + * After @dispatch returned, it is called to notify the status of the transport;
> + * i.e. transport errors or success. This allows the client to cleanup, if
> + * the transport fails after @dispatch submits a SUCCESS response.
> + */
> + void (*notify)(unsigned int context_id, struct qcom_tee_object *object, int status);
> +
> + int (*dispatch)(unsigned int context_id, struct qcom_tee_object *object,
> + unsigned long op, struct qcom_tee_arg args[]);
> +
> + /**
> + * @param_to_object:
> + *
> + * Called by core to do the object dependent marshaling from @param to an
> + * instance of @object (NOT IMPLEMENTED YET).
> + */
> + int (*param_to_object)(struct qcom_tee_param *param, struct qcom_tee_object *object);
> +
> + int (*object_to_param)(struct qcom_tee_object *object, struct qcom_tee_param *param);
> +};
> +
> +struct qcom_tee_object {
> + const char *name;
> + struct kref refcount;
> +
> + enum qcom_tee_object_type object_type;
> + union object_info {
> + unsigned long object_ptr;
> + } info;
> +
> + struct qcom_tee_object_operations *ops;
> +
> + /* see release_wq.c. */
> + struct work_struct work;
> +
> + /* Callback for any internal cleanup before the object's release. */
> + void (*release)(struct qcom_tee_object *object);
> +};
> +
> +/* Static instances of qcom_tee_object objects. */
> +
> +#define NULL_QCOM_TEE_OBJECT ((struct qcom_tee_object *)(0))
> +
> +/* ROOT_QCOM_TEE_OBJECT aka ''root client env''. */
> +#define ROOT_QCOM_TEE_OBJECT ((struct qcom_tee_object *)(1))
My gut feeling is that an invalid non-null pointer is a path to
disaster.
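One possible alternative, shown here only as a sketch with hypothetical names,
is to back the sentinel with a real static object so that an accidental
dereference reads valid memory:

```c
#include <stdbool.h>
#include <stddef.h>

struct qtee_object {
	const char *name;
};

/* A real, addressable object used as the "root client env" sentinel. */
struct qtee_object qtee_root_object = { .name = "root" };

bool qtee_object_is_root(const struct qtee_object *object)
{
	return object == &qtee_root_object;
}

const char *qtee_object_name(const struct qtee_object *object)
{
	if (!object)
		return "null";

	/* No special case for root: it is an ordinary object with a name. */
	return object->name ? object->name : "noname";
}
```

Identity checks keep working through the address comparison, while helpers
such as the name lookup no longer need a dedicated branch for the root object.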
> +
> +static inline enum qcom_tee_object_type typeof_qcom_tee_object(struct qcom_tee_object *object)
> +{
> + if (object == NULL_QCOM_TEE_OBJECT)
> + return QCOM_TEE_OBJECT_TYPE_NULL;
> +
> + if (object == ROOT_QCOM_TEE_OBJECT)
> + return QCOM_TEE_OBJECT_TYPE_ROOT;
> +
> + return object->object_type;
> +}
> +
> +static inline const char *qcom_tee_object_name(struct qcom_tee_object *object)
> +{
> + if (object == NULL_QCOM_TEE_OBJECT)
> + return "null";
> +
> + if (object == ROOT_QCOM_TEE_OBJECT)
> + return "root";
> +
> + if (!object->name)
> + return "noname";
> +
> + return object->name;
> +}
> +
> +struct qcom_tee_object *allocate_qcom_tee_object(void);
> +void free_qcom_tee_object(struct qcom_tee_object *object);
> +
> +/**
> + * init_qcom_tee_object_user - Initialize an instance of qcom_tee_object.
> + * @object: object being initialized.
> + * @ot: type of object.
> + * @ops: set of callback operations.
> + * @fmt: object name.
> + */
> +int init_qcom_tee_object_user(struct qcom_tee_object *object, enum qcom_tee_object_type ot,
> + struct qcom_tee_object_operations *ops, const char *fmt, ...);
> +
> +int get_qcom_tee_object(struct qcom_tee_object *object);
> +void put_qcom_tee_object(struct qcom_tee_object *object);
> +
> +#endif /* __QCOM_OBJECT_INVOKE_H */
>
> --
> 2.34.1
>
--
With best wishes
Dmitry
On Tue, Jul 02, 2024 at 10:57:35PM GMT, Amirreza Zarrabi wrote:
> Qualcomm TEE hosts Trusted Applications (TAs) and services that run in
> the secure world. Access to these resources is provided using MinkIPC.
> MinkIPC is a capability-based synchronous message passing facility. It
> allows code executing in one domain to invoke objects running in other
> domains. When a process holds a reference to an object that lives in
> another domain, that object reference is a capability. Capabilities
> allow us to separate implementation of policies from implementation of
> the transport.
>
> As part of the upstreaming of the object invoke driver (called SMC-Invoke
> driver), we need to provide a reasonable kernel API and UAPI. The clear
> option is to use TEE subsystem and write a back-end driver, however the
> TEE subsystem doesn't fit with the design of Qualcomm TEE.
>
> Does TEE subsystem fit requirements of a capability based system?
> -----------------------------------------------------------------
> In TEE subsystem, to invoke a function:
> - client should open a device file "/dev/teeX",
> - create a session with a TA, and
> - invoke the functions in that session.
>
> 1. The privilege to invoke a function is determined by a session. If a
> client has a session, it cannot share it with other clients. Even if
> it does, it is not fine-grained enough, i.e. either all accessible
> functions/resources in a session or none. Consider a scenario where a client
> wants to grant another client permission to invoke just one function that
> it has the rights to.
>
> The "all or nothing" for sharing sessions is not in line with our
> capability system: "if you own a capability, you should be able to grant
> or share it".
Can you please be more specific here? What kind of sharing is expected
on the user side of it?
> 2. In TEE subsystem, resources are managed in a context. Every time a
> client opens "/dev/teeX", a new context is created to keep track of
> the allocated resources, including opened sessions and remote objects. Any
> effort for sharing resources between two independent clients requires
> involvement of context manager, i.e. the back-end driver. This requires
> implementing some form of policy in the back-end driver.
What kind of resource sharing?
> 3. The TEE subsystem supports two types of memory sharing:
> - per-device memory pools, and
> - user defined memory references.
> User defined memory references are private to the application and cannot
> be shared. Memory allocated from per-device "shared" pools are accessible
> using a file descriptor. It can be mapped by any process if it has
> access to it. This means, we cannot provide the resource isolation
> between two clients. Assume a scenario when a client wants to allocate a
> memory (which is shared with TEE) from an "isolated" pool and share it
> with another client, without the right to access the contents of memory.
This doesn't explain why it would want to share such memory with
another client.
> 4. The kernel API provided by TEE subsystem does not support a kernel
> supplicant. Adding support requires an execution context (e.g. a
> kernel thread) due to the TEE subsystem design. tee_driver_ops supports
> only "send" and "receive" callbacks and to deliver a request, someone
> should wait on "receive".
There is nothing wrong here, but maybe I'm misunderstanding something.
> We need a callback to "dispatch" or "handle" a request in the context of
> the client thread. It should redirect a request to a kernel service or
> a user supplicant. In TEE subsystem such requirement should be implemented
> in TEE back-end driver, independent from the TEE subsystem.
>
> 5. The UAPI provided by TEE subsystem is similar to the GPTEE Client
> interface. This interface is not suitable for a capability system.
> For instance, there is no session in a capability system which means
> either it should not be used, or we should overload its definition.
General comment: a more detailed explanation of how the capabilities
are acquired and how they can be used might make sense here.
BTW. It might be my imperfect English, but each time I see the word
'capability' I think that someone is capable of doing something. I
find it hard to use 'capability' for a reference to another object.
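To make the question concrete, here is how I currently picture it, as a
toy userspace model. Every name below (struct cap, cap_grant() and so on)
is made up for illustration and nothing here is taken from the posted
patches; it only assumes that a capability is an object reference plus a
set of rights, and that a holder can pass on a subset of what it holds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rights bits, loosely following Read/Write/Grant above. */
enum { CAP_READ = 1u << 0, CAP_WRITE = 1u << 1, CAP_GRANT = 1u << 2 };

#define MAX_CAPS 8

struct cap {
	int object;      /* id of the remote object, -1 if the slot is free */
	uint32_t rights; /* operations the holder may perform on it */
};

/* Per-client table: a client only ever names a capability by its index. */
struct client {
	struct cap caps[MAX_CAPS];
};

static void client_init(struct client *c)
{
	assert(c != NULL);
	for (size_t i = 0; i < MAX_CAPS; i++)
		c->caps[i].object = -1;
}

/* Install a capability; returns its handle, or -1 if the table is full. */
static int cap_install(struct client *c, int object, uint32_t rights)
{
	for (int i = 0; i < MAX_CAPS; i++) {
		if (c->caps[i].object < 0) {
			c->caps[i].object = object;
			c->caps[i].rights = rights;
			return i;
		}
	}
	return -1;
}

/* True only if the handle is valid and carries every requested right. */
static int cap_check(const struct client *c, int handle, uint32_t right)
{
	if (handle < 0 || handle >= MAX_CAPS || c->caps[handle].object < 0)
		return 0;
	return (c->caps[handle].rights & right) == right;
}

/* Share one capability with another client, never with more rights
 * than the granter holds itself. Requires CAP_GRANT on the source. */
static int cap_grant(const struct client *from, int handle,
		     struct client *to, uint32_t rights)
{
	if (!cap_check(from, handle, CAP_GRANT))
		return -1;
	return cap_install(to, from->caps[handle].object,
			   rights & from->caps[handle].rights);
}
```

If that matches the intended model, then granting a read-only reference
to a single object is one cap_grant() call, which is the kind of
fine-grained sharing a session does not seem to offer. Please correct me
if the real semantics differ.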
>
> Can we use TEE subsystem?
> -------------------------
> There are workarounds for some of the issues above. The question is if we
> should define our own UAPI or try to use a hack-y way of fitting into
> the TEE subsystem. I am using the word hack-y, as most of the workarounds
> involve:
>
> - "diverging from the definition". For instance, ignoring the session
> open and close ioctl calls or use file descriptors for all remote
> resources (as fd is the closest to a capability) which undermines the
> isolation provided by the contexts,
>
> - "overloading the variables". For instance, passing object ID as file
> descriptors in a place of session ID, or
>
> - "bypass TEE subsystem". For instance, extensively rely on meta
> parameters or push everything (e.g. kernel services) to the back-end
> driver, which means leaving almost all TEE subsystem unused.
>
> We cannot take the full benefits of TEE subsystem and may need to
> implement most of the requirements in the back-end driver. Also, as
> discussed above, the UAPI is not suitable for capability-based use cases.
> We proposed a new set of ioctl calls for SMC-Invoke driver.
>
> In this series we posted three patches. We implemented a transport
> driver that provides qcom_tee_object. Any object on secure side is
> represented with an instance of qcom_tee_object and any struct exposed
> to TEE should embed an instance of qcom_tee_object. Support for any new
> services, e.g. memory object, RPMB, userspace clients or supplicants, is
> implemented independently from the driver.
>
> We have a simple memory object and a user driver that uses
> qcom_tee_object.
Could you please point out any user of the uAPI? I'd like to understand
how it looks from the userspace point of view.
>
> Signed-off-by: Amirreza Zarrabi <quic_azarrabi(a)quicinc.com>
> ---
> Amirreza Zarrabi (3):
> firmware: qcom: implement object invoke support
> firmware: qcom: implement memory object support for TEE
> firmware: qcom: implement ioctl for TEE object invocation
>
> drivers/firmware/qcom/Kconfig | 36 +
> drivers/firmware/qcom/Makefile | 2 +
> drivers/firmware/qcom/qcom_object_invoke/Makefile | 12 +
> drivers/firmware/qcom/qcom_object_invoke/async.c | 142 +++
> drivers/firmware/qcom/qcom_object_invoke/core.c | 1139 ++++++++++++++++++
> drivers/firmware/qcom/qcom_object_invoke/core.h | 186 +++
> .../qcom/qcom_object_invoke/qcom_scm_invoke.c | 22 +
> .../firmware/qcom/qcom_object_invoke/release_wq.c | 90 ++
> .../qcom/qcom_object_invoke/xts/mem_object.c | 406 +++++++
> .../qcom_object_invoke/xts/object_invoke_uapi.c | 1231 ++++++++++++++++++++
> include/linux/firmware/qcom/qcom_object_invoke.h | 233 ++++
> include/uapi/misc/qcom_tee.h | 117 ++
> 12 files changed, 3616 insertions(+)
> ---
> base-commit: 74564adfd3521d9e322cfc345fdc132df80f3c79
> change-id: 20240702-qcom-tee-object-and-ioctls-6f52fde03485
>
> Best regards,
> --
> Amirreza Zarrabi <quic_azarrabi(a)quicinc.com>
>
--
With best wishes
Dmitry
Am 27.06.24 um 05:21 schrieb Jason-JH Lin (林睿祥):
>
> On Wed, 2024-06-26 at 19:56 +0200, Daniel Vetter wrote:
> >
> > On Wed, Jun 26, 2024 at 12:49:02PM +0200, Christian König wrote:
> > > Am 26.06.24 um 10:05 schrieb Jason-JH Lin (林睿祥):
> > > > > > I think I have the same problem as the ECC_FLAG mention in:
> > > > > >
> > > > > > https://lore.kernel.org/linux-media/20240515-dma-buf-ecc-heap-v1-0-54cbbd04…
> > > > > >
> > > > > > I think it would be better to have the user configurable private
> > > > > > information in dma-buf, so all the drivers who have the same
> > > > > > requirement can get their private information from dma-buf directly
> > > > > > and no need to change or add the interface.
> > > > > >
> > > > > > What's your opinion in this point?
> > > > >
> > > > > Well off hand I don't see the need for that.
> > > > >
> > > > > What happens if you get a non-secure buffer imported in your secure
> > > > > device?
> > > >
> > > > We use the same mediatek-drm driver for secure and non-secure buffers.
> > > > If a non-secure buffer is imported into the mediatek-drm driver, it
> > > > goes through the normal flow with normal hardware settings.
> > > >
> > > > We use different configurations to give the hardware different
> > > > permissions to access the buffer it should access.
> > > >
> > > > So if we can't get the information of "the buffer is allocated from
> > > > restricted_mtk_cma" when importing the buffer into the driver, we won't
> > > > be able to configure the hardware correctly.
> > >
> > > Why can't you get this information from userspace?
> >
> > Same reason amd and i915/xe also pass this around internally in the
> > kernel, it's just that for those gpus the render and kms node are the
> > same driver so this is easy.
> >
The reason I ask is that encryption here looks just like another
parameter for the buffer, e.g. like format, stride, tiling etc..
So instead of this during buffer import:
	mtk_gem->secure = (!strncmp(attach->dmabuf->exp_name, "restricted", 10));
	mtk_gem->dma_addr = sg_dma_address(sg->sgl);
	mtk_gem->size = attach->dmabuf->size;
	mtk_gem->sg = sg;
You can trivially say during use "hey, this buffer is encrypted".
At least that's my 10 mile high view, maybe I'm missing some extensive
key exchange or something like that.
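To illustrate the difference I mean, here is a rough userspace sketch.
The structs and the FAKE_FB_ENCRYPTED bit are invented stand-ins, not the
real dma-buf, mediatek-drm or KMS definitions; only the strncmp() check
mirrors the import-time code quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the exporter information on a dma-buf. */
struct fake_dmabuf {
	const char *exp_name;
};

/* Hypothetical per-framebuffer flag supplied by userspace at use time. */
#define FAKE_FB_ENCRYPTED (1u << 0)

struct fake_fb {
	uint32_t flags;
};

/* Import-time approach: infer "secure" from the exporter's name. */
static bool secure_from_exporter(const struct fake_dmabuf *buf)
{
	assert(buf != NULL);
	return buf->exp_name && !strncmp(buf->exp_name, "restricted", 10);
}

/* Use-time approach: userspace states it explicitly, like any other
 * buffer parameter (format, stride, tiling). */
static bool secure_from_flags(const struct fake_fb *fb)
{
	assert(fb != NULL);
	return (fb->flags & FAKE_FB_ENCRYPTED) != 0;
}
```

The first variant ties the importer to the naming convention of one
particular heap, the second keeps the decision with whoever set up the
buffer.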
>
> > But on arm you have split designs everywhere and dma-buf import/export,
> > so something else is needed. And neither current kms uapi nor
> > protocols/extensions have provisions for this (afaik) because it works
> > on the big gpus, and on android it's just hacked up with backchannels.
> >
> > So yeah essentially I think we probably need something like this, as
> > much as it sucks. I see it somewhat similar to handling pcip2pdma
> > limitations in the kernel too.
> >
> > Not sure where/how it should be handled though, and maybe I've missed
> > something around protocols, in which case I guess we should add some
> > secure buffer flags to the ADDFB2 ioctl.
>
> Thanks for your hint, I'll try to add the secure flag to the ADDFB2
> ioctl. If it works, I'll send the patch.
Yeah, exactly what I would suggest as well.
I'm not an expert on that part, but as far as I know we already have
a bunch of device-specific tiling flags in there.
Adding an MTK_ENCRYPTED flag should be trivial.
Regards,
Christian.
>
> Regards,
> Jason-JH.Lin
>
> > -Sima
>
Hi Jonathan,
Here's the v12 of my patchset that introduces DMABUF support to IIO.
Apart from a small documentation fix, it reverts to using
mutex_lock/mutex_unlock in one particular place, which used cleanup
GOTOs (which don't play well with scope-managed cleanups).
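For readers not familiar with the pattern, here is a minimal userspace
analogue of explicit lock/unlock combined with goto-based unwinding. It
uses pthread mutexes and made-up names (struct fake_buffer, fake_attach())
and is not the code from iio_buffer_attach_dmabuf(); it only shows why
every exit path has to release the lock by hand in this style.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer that accepts a single attachment. */
struct fake_buffer {
	pthread_mutex_t lock;
	int attached;
};

/* Allocate private data, then mark the buffer attached under the lock.
 * Error paths jump to labels that undo the steps taken so far. */
static int fake_attach(struct fake_buffer *buf, size_t size, void **out)
{
	void *priv;
	int ret;

	assert(buf != NULL && out != NULL);

	priv = malloc(size ? size : 1);
	if (!priv)
		return -ENOMEM;

	pthread_mutex_lock(&buf->lock);
	if (buf->attached) {
		ret = -EBUSY;
		goto err_unlock;
	}
	buf->attached = 1;
	pthread_mutex_unlock(&buf->lock);

	*out = priv;
	return 0;

err_unlock:
	/* Explicit unlock: no scope-based guard releases it for us. */
	pthread_mutex_unlock(&buf->lock);
	free(priv);
	return ret;
}
```

A first call on a fresh buffer returns 0 and hands back the allocation; a
second call returns -EBUSY after releasing the lock and freeing its own
allocation.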
Changelog:
- [3/7]:
- Revert to mutex_lock/mutex_unlock in iio_buffer_attach_dmabuf(),
as it uses cleanup GOTOs
- [6/7]:
- "obtained using..." -> "which can be obtained using..."
This is based on next-20240619.
Cheers,
-Paul
Paul Cercueil (7):
dmaengine: Add API function dmaengine_prep_peripheral_dma_vec()
dmaengine: dma-axi-dmac: Implement device_prep_peripheral_dma_vec
iio: core: Add new DMABUF interface infrastructure
iio: buffer-dma: Enable support for DMABUFs
iio: buffer-dmaengine: Support new DMABUF based userspace API
Documentation: iio: Document high-speed DMABUF based API
Documentation: dmaengine: Document new dma_vec API
Documentation/driver-api/dmaengine/client.rst | 9 +
.../driver-api/dmaengine/provider.rst | 10 +
Documentation/iio/iio_dmabuf_api.rst | 54 +++
Documentation/iio/index.rst | 1 +
drivers/dma/dma-axi-dmac.c | 40 ++
drivers/iio/Kconfig | 1 +
drivers/iio/buffer/industrialio-buffer-dma.c | 178 ++++++-
.../buffer/industrialio-buffer-dmaengine.c | 62 ++-
drivers/iio/industrialio-buffer.c | 459 ++++++++++++++++++
include/linux/dmaengine.h | 33 ++
include/linux/iio/buffer-dma.h | 31 ++
include/linux/iio/buffer_impl.h | 33 ++
include/uapi/linux/iio/buffer.h | 22 +
13 files changed, 913 insertions(+), 20 deletions(-)
create mode 100644 Documentation/iio/iio_dmabuf_api.rst
--
2.43.0