Hello Jens,
Here are some comments on patch V4.
*smccc_call32* Do we only target aarch32 optee_os?
*Buffer management* It looks like temporary buffers are now managed in the client API, not in the driver anymore. The positive point is that it makes the driver lighter. But the drawback is the number of extra ioctls used when using a temporary buffer: 1 to allocate, 1 to map, and 1 to unmap. So a TEEC_InvokeCommand could lead to 13 ioctls. What about performance? Could it not be optimized in some way? Could we manage temporary buffers, dma buffers and "other" buffers in different ways?
*Kernel API* The GP API now looks to be managed in the client part, without a 1-to-1 mapping on the kernel side. That means that we have lost the Kernel API, which is useful when a given driver wants to use a TA. It looks like we are really missing this Kernel API. This is the most critical point of our review.
*Non-zero copy* When using RegisterSharedMemory(), it looks like we do not benefit from the fact that the original buffer was allocated using dma_buf. Hence we never have zero-copy in the current version when using RegisterSharedMemory, as we will further use the so-called shadow buffer. Can you confirm this limitation can be removed in future versions of the patch?
By the way, we think xtest 5006, which calls TEEC_RegisterSharedBuffer() twice, may lead to memory leaks (it could be the same with the current optee_linuxdriver).
*List of SHM* The list of allocated SHM is now contained in the device, whereas it used to be in the context in optee_linuxdriver. It is not clear to us whether this is an issue or not.
*tee_shm_pool_alloc_res_mem()* If we'd like to use the generic function for an LX backend, we would need to add the following arguments to the function:
- the alignment of the private region, and the one of the dma region
- a way to give the limits between both regions (the size of the 1st region, or anything else).
Emmanuel and I would like to check a few other points in the proposal. We'll come back to you if we have any other remarks.
Best regards, Jean Michel / Emmanuel / Etienne / Pascal
Hi Pascal and all,
Thanks for taking the time to do a thorough review.
On Tue, Jun 23, 2015 at 09:37:13PM +0200, Pascal Brand wrote:
*smccc_call32* Do we only target aarch32 optee_os?
No, this is for AArch64 optee_os too. It's just that we don't use the AArch64 SMC Calling Convention; the AArch32 SMC Calling Convention is enough to pass the information we need.
*Buffer management* It looks like temporary buffers are now managed in the client API, not in the driver anymore. The positive point is that it makes the driver lighter. But the drawback is the number of extra ioctls used when using a temporary buffer: 1 to allocate, 1 to map, and 1 to unmap. So a TEEC_InvokeCommand could lead to 13 ioctls. What about performance? Could it not be optimized in some way? Could we manage temporary buffers, dma buffers and "other" buffers in different ways?
Temporary buffers should be avoided completely if the client is concerned about performance. Allocated shared memory is the most efficient form of shared memory. DMA buffers should be treated in the same way as allocated shared memory.
*Kernel API* The GP API now looks to be managed in the client part, without a 1-to-1 mapping on the kernel side. That means that we have lost the Kernel API, which is useful when a given driver wants to use a TA. It looks like we are really missing this Kernel API. This is the most critical point of our review.
The kernel API isn't there right now because it's not used by anything. This patch-set is kept as small as possible while still being useful, to avoid making it harder than necessary to upstream. Once this patch-set is upstream we can start to add different features, for instance a kernel API.
*Non-zero copy* When using RegisterSharedMemory(), it looks like we do not benefit from the fact that the original buffer was allocated using dma_buf. Hence we never have zero-copy in the current version when using RegisterSharedMemory, as we will further use the so-called shadow buffer. Can you confirm this limitation can be removed in future versions of the patch?
How can TEEC_RegisterSharedMemory() tell if the supplied memory range is backed by a dma_buf? Perhaps we can add another TEE_IOC_XXX that, given an address, returns the file descriptor etc. needed to use that memory as shared memory. The supplied address must of course be within the range of an mmapped dma_buf for this to work.
Another option would be to extend the TEE Client API with a way of initializing a TEEC_SharedMemory based on a given dma_buf.
Yes, I think it should be possible to remove this limitation in future versions.
By the way, we think xtest 5006, which calls TEEC_RegisterSharedBuffer() twice, may lead to memory leaks (it could be the same with the current optee_linuxdriver).
Agreed, the test is broken in that sense. The memory will be reclaimed when the process dies, though.
*List of SHM* The list of allocated SHM is now contained in the device, whereas it used to be in the context in optee_linuxdriver. It is not clear to us whether this is an issue or not.
This list is only used by the currently unused functions tee_shm_find_by_va() and tee_shm_find_by_pa(). I don't think they will ever be needed by the OP-TEE driver. I added those functions because other drivers may need them. In the OP-TEE driver we avoid tee_shm_find_by_pa() by always passing a shm_ref to secure world and secure world passes it back when needed.
*tee_shm_pool_alloc_res_mem()* If we'd like to use the generic function for an LX backend, we would need to add the following arguments to the function:
- the alignment of the private region, and the one of the dma region
- a way to give the limits between both regions (the size of the 1st region, or anything else).
No problem, this is only a kernel-internal API so it can be changed whenever we need to. But we should try to get it as good as possible from the start. By alignment I guess you mean the alignment of the different areas, not the alignment of the buffers allocated from each area.
Would this be OK?
struct tee_shm_pool_res_mem_info {
	unsigned long priv_vaddr;
	unsigned long priv_paddr;
	size_t priv_size;
	unsigned long dma_buf_vaddr;
	unsigned long dma_buf_paddr;
	size_t dma_buf_size;
};
struct tee_shm_pool *tee_shm_pool_alloc_res_mem(struct device *dev,
			struct tee_shm_pool_res_mem_info *info);
Emmanuel and I would like to check a few other points in the proposal. We'll come back to you if we have any other remarks.
OK. I plan to have something ready to send to the kernel lists during week 28.
-- Thanks, Jens
Hi Jens,
On 2015-06-24 12:32, Jens Wiklander wrote:
*Non-zero copy* When using RegisterSharedMemory(), it looks like we do not benefit from the fact that the original buffer was allocated using dma_buf. Hence we never have zero-copy in the current version when using RegisterSharedMemory, as we will further use the so-called shadow buffer. Can you confirm this limitation can be removed in future versions of the patch?
How can TEEC_RegisterSharedMemory() tell if the supplied memory range is backed by a dma_buf? Perhaps we can add another TEE_IOC_XXX that, given an address, returns the file descriptor etc. needed to use that memory as shared memory. The supplied address must of course be within the range of an mmapped dma_buf for this to work.
Another option would be to extend the TEE Client API with a way of initializing a TEEC_SharedMemory based on a given dma_buf.
Yes, I think it should be possible to remove this limitation in future versions.
I agree here with Pascal, having the ability to do RegisterSharedMemory is important. I always assumed the RegisterSharedMemory API was meant to be called on any memory range inside the client memory space, not necessarily on dma-buf allocated memory. The point is that RegisterSharedMemory would be called on memory that we know is not a dma-buf (for dma-buf we already have AllocateSharedMemory). I was thinking maybe the driver implementation chooses how to handle the buffer:
- in some cases where 1:1 memory mapping (or contiguous memory space) is needed, copy
- in other cases, create some mapping table directly from the buffer (zero-copy option).
Regards, Valentin
Hi Valentin,
On Wed, Jun 24, 2015 at 03:18:24PM +0300, Valentin Manea wrote:
Hi Jens,
On 2015-06-24 12:32, Jens Wiklander wrote:
*Non-zero copy* When using RegisterSharedMemory(), it looks like we do not benefit from the fact that the original buffer was allocated using dma_buf. Hence we never have zero-copy in the current version when using RegisterSharedMemory, as we will further use the so-called shadow buffer. Can you confirm this limitation can be removed in future versions of the patch?
How can TEEC_RegisterSharedMemory() tell if the supplied memory range is backed by a dma_buf? Perhaps we can add another TEE_IOC_XXX that, given an address, returns the file descriptor etc. needed to use that memory as shared memory. The supplied address must of course be within the range of an mmapped dma_buf for this to work.
Another option would be to extend the TEE Client API with a way of initializing a TEEC_SharedMemory based on a given dma_buf.
Yes, I think it should be possible to remove this limitation in future versions.
I agree here with Pascal, having the ability to do RegisterSharedMemory is important. I always assumed the RegisterSharedMemory API was meant to be called on any memory range inside the client memory space, not necessarily on dma-buf allocated memory. The point is that RegisterSharedMemory would be called on memory that we know is not a dma-buf (for dma-buf we already have AllocateSharedMemory).
I guess that there are use cases where a client gets hold of a dma_buf without AllocateSharedMemory.
I was thinking maybe the driver implementation chooses how to handle the buffer:
- in some cases where 1:1 memory mapping (or contiguous memory space) is needed, copy
- in other cases, create some mapping table directly from the buffer (zero-copy option).
The proposed API in tee.h doesn't cover memory that isn't represented by a file descriptor. A memref is represented as:

struct tee_ioctl_param_memref {
	__u64 shm_offs;
	__u64 size;
	__s64 shm_fd;
};
I see two options here:
a) We add a way to represent registered shared memory with a file descriptor
b) We add another parameter type, regmem
In a) we can then map this memory into another process (having tee-supplicant in mind) if needed. In b) there's no easy way of mapping the memory into another process, but perhaps easier to implement if this isn't a problem.
Either case is out of scope for patch v4, but we shouldn't make it hard to add it to tee.h later.
A) would mean one more TEE_IOC_XXX while b) would need some more bits in TEE_IOCTL_PARAMS_ATTR_TYPE_XXX. Changing TEE_IOCTL_PARAM_ATTR_TYPE_MASK to 0xff would guarantee room for all future parameter types. Another bit from the TEE_GEN_CAP_XXX bits can be used to indicate if registered shared memory is supported by a specific driver.
-- Regards, Jens
Hi Jens,
On 2015-06-24 17:34, Jens Wiklander wrote:
The proposed API in tee.h doesn't cover memory that isn't represented by a file descriptor. A memref is represented as:

struct tee_ioctl_param_memref {
	__u64 shm_offs;
	__u64 size;
	__s64 shm_fd;
};
I see two options here:
a) We add a way to represent registered shared memory with a file descriptor
b) We add another parameter type, regmem
In a) we can then map this memory into another process (having tee-supplicant in mind) if needed. In b) there's no easy way of mapping the memory into another process, but perhaps easier to implement if this isn't a problem.
Either case is out of scope for patch v4, but we shouldn't make it hard to add it to tee.h later.
A) would mean one more TEE_IOC_XXX while b) would need some more bits in TEE_IOCTL_PARAMS_ATTR_TYPE_XXX. Changing TEE_IOCTL_PARAM_ATTR_TYPE_MASK to 0xff would guarantee room for all future parameter types. Another bit from the TEE_GEN_CAP_XXX bits can be used to indicate if registered shared memory is supported by a specific driver.
In the use case I was thinking about there wasn't a need to map the memory into a third party, because it could potentially generate some strange side effects: since the maps can only be done in 4k increments, too much memory would be shared from one process to the other. For me at least, option b) fits much better.
Regards, Valentin
Hello,
*Buffer management*
It looks like temporary buffers are now managed in the client API, not in the driver anymore. The positive point is that it makes the driver lighter. But the drawback is the number of extra ioctls used when using a temporary buffer: 1 to allocate, 1 to map, and 1 to unmap. So a TEEC_InvokeCommand could lead to 13 ioctls. What about performance? Could it not be optimized in some way? Could we manage temporary buffers, dma buffers and "other" buffers in different ways?
Temporary buffers should be avoided completely if the client is concerned about performance. Allocated shared memory is the most efficient form of shared memory. DMA buffers should be treated in the same way as allocated shared memory.
Well, I agree that temporary buffers should be avoided when performance is required. However, they are widely used when the user is passing small buffers, for example to pass a struct. In such a case, the struct may be on the stack, and you pass it in a temporary buffer. The extra copy does not really affect performance in our current optee_linuxdriver, but performance may be affected more with the Generic Driver.
The kernel API isn't there right now because it's not used by anything.
This patch-set is kept as small as possible while still being useful, to avoid making it harder than necessary to upstream. Once this patch-set is upstream we can start to add different features, for instance a kernel API.
All right. We should just not forget that this is something quite important for us.
Would this be OK?
struct tee_shm_pool_res_mem_info {
	unsigned long priv_vaddr;
	unsigned long priv_paddr;
	size_t priv_size;
	unsigned long dma_buf_vaddr;
	unsigned long dma_buf_paddr;
	size_t dma_buf_size;
};

struct tee_shm_pool *tee_shm_pool_alloc_res_mem(struct device *dev,
			struct tee_shm_pool_res_mem_info *info);
Yes.
The proposed API in tee.h doesn't cover memory that isn't represented by
a file descriptor. A memref is represented as:
struct tee_ioctl_param_memref {
__u64 shm_offs;
__u64 size;
__s64 shm_fd;
};
I see two options here:
a) We add a way to represent registered shared memory with a file
descriptor
b) We add another parameter type, regmem
In a) we can then map this memory into another process (having
tee-supplicant in mind) if needed. In b) there's no easy way of mapping
the memory into another process, but perhaps easier to implement if
this isn't a problem.
Either case is out of scope for patch v4, but we shouldn't make it hard to
add it to tee.h later.
A) would mean one more TEE_IOC_XXX while b) would need some more bits in
TEE_IOCTL_PARAMS_ATTR_TYPE_XXX. Changing TEE_IOCTL_PARAM_ATTR_TYPE_MASK
to 0xff would guarantee room for all future parameter types.
Another bit from the TEE_GEN_CAP_XXX bits can be used to indicate if
registered shared memory is supported by a specific driver.
In the use case I was thinking about there wasn't a need to map the memory into a third party, because it could potentially generate some strange side effects: since the maps can only be done in 4k increments, too much memory would be shared from one process to the other. For me at least, option b) fits much better.
+1 on b)
- it would also add more generic arguments
- I wonder if it could not be used to solve the temporary buffers perf issues.
For use cases where a client gets hold of a dma_buf without AllocateSharedMemory, we could implement it using an implementation-specific TEEC_MEM_DMABUF flag, as currently done in optee_client.
We also have an extra review comment, on optee_client. In the supplicant, the buffer lifetime in get_param() and free_param() is not really clear to us:
- get_param():
  - mmaps the buffer
  - sets params[idx].u.memref.shm_fd to -1. Why?
- free_param():
  - unmaps the buffer
  - closes it
The questions:
- why don't we have a "refcount++" of the buffer in get_param()?
- isn't it dangerous to close the buffer (roughly meaning "it is no more used") before it has been sent and processed by the driver?
Best regards, Pascal.
Hi,
On Thu, Jul 2, 2015 at 4:01 PM, Pascal Brand pascal.brand@linaro.org wrote:
Hello,
*Buffer management* It looks like temporary buffers are now managed in the client API, not in the driver anymore. The positive point is that it makes the driver lighter. But the drawback is the number of extra ioctls used when using a temporary buffer: 1 to allocate, 1 to map, and 1 to unmap. So a TEEC_InvokeCommand could lead to 13 ioctls. What about performance? Could it not be optimized in some way? Could we manage temporary buffers, dma buffers and "other" buffers in different ways?
Temporary buffers should be avoided completely if the client is concerned about performance. Allocated shared memory is the most efficient form of shared memory. DMA buffers should be treated in the same way as allocated shared memory.
Well, I agree that temporary buffers should be avoided when performance is required. However, they are widely used when the user is passing small buffers, for example to pass a struct. In such a case, the struct may be on the stack, and you pass it in a temporary buffer. The extra copy does not really affect performance in our current optee_linuxdriver, but performance may be affected more with the Generic Driver.
Yes, I agree that the old driver handles this more efficiently. As long as temporary buffers aren't used in performance-critical loops we should be OK, though. If it turns out to be a real problem I'd rather address it later, when the initial patch set has been accepted.
The kernel API isn't there right now because it's not used by anything. This patch-set is kept as small as possible while still being useful, to avoid making it harder than necessary to upstream. Once this patch-set is upstream we can start to add different features, for instance a kernel API.
All right. We should just not forget that this is something quite important for us.
Would this be OK?

struct tee_shm_pool_res_mem_info {
	unsigned long priv_vaddr;
	unsigned long priv_paddr;
	size_t priv_size;
	unsigned long dma_buf_vaddr;
	unsigned long dma_buf_paddr;
	size_t dma_buf_size;
};

struct tee_shm_pool *tee_shm_pool_alloc_res_mem(struct device *dev,
			struct tee_shm_pool_res_mem_info *info);
Yes.
Thanks
The proposed API in tee.h doesn't cover memory that isn't represented by
a file descriptor. A memref is represented as:
struct tee_ioctl_param_memref {
	__u64 shm_offs;
	__u64 size;
	__s64 shm_fd;
};
I see two options here:
a) We add a way to represent registered shared memory with a file
descriptor
b) We add another parameter type, regmem
In a) we can then map this memory into another process (having
tee-supplicant in mind) if needed. In b) there's no easy way of mapping
the memory into another process, but perhaps easier to implement if
this isn't a problem.
Either case is out of scope for patch v4, but we shouldn't make it hard to
add it to tee.h later.
A) would mean one more TEE_IOC_XXX while b) would need some more bits in
TEE_IOCTL_PARAMS_ATTR_TYPE_XXX. Changing TEE_IOCTL_PARAM_ATTR_TYPE_MASK
to 0xff would guarantee room for all future parameter types.
Another bit from the TEE_GEN_CAP_XXX bits can be used to indicate if
registered shared memory is supported by a specific driver.
In the use case I was thinking about there wasn't a need to map the memory into a third party, because it could potentially generate some strange side effects: since the maps can only be done in 4k increments, too much memory would be shared from one process to the other. For me at least, option b) fits much better.
+1 on b)
- it would also add more generic arguments
- I wonder if it could not be used to solve the temporary buffers perf issues.
Yes possibly, we should keep this in mind when fixing registered shared memory.
For use cases where a client gets hold of a dma_buf without AllocateSharedMemory, we could implement it using an implementation-specific TEEC_MEM_DMABUF flag, as currently done in optee_client.
Good idea.
We also have an extra review comment, on optee_client. In the supplicant, the buffer lifetime in get_param() and free_param() is not really clear to us:
get_param():
- mmaps the buffer
- sets params[idx].u.memref.shm_fd to -1. Why?
write_response() closes all file descriptors that haven't been claimed through get_param(). Since get_param() has created an shm with this file descriptor, it will be closed by free_param() instead. Closing it twice would be an error. Doing a dup() on the file descriptor instead would just be inefficient.
free_param():
- unmaps the buffer
- closes it
The questions:
- why don't we have a "refcount++" of the buffer in get_param()?
- isn't it dangerous to close the buffer (roughly meaning "it is no more used") before it has been sent and processed by the driver?
The refcount was already increased by the driver when creating a new file descriptor for this buffer. So when tee-supplicant gets the buffer, the refcount is already at least 2; closing it will not make it hit 0.
Thanks, Jens