Hi all,
In this email I would like to discuss a number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is complete isolation of different guests from each other. To implement this, I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is guest-agnostic data needed for core services. Examples: temporary stacks used by entry points (before thread_alloc_and_run() is invoked), the list of known guests, device driver state, and so on. This data is not guest-specific, so naturally it should exist in a single copy.
TEE data (or TEE state) is guest-bound information: threads (with their stacks and state), opened sessions, loaded TAs, mutexes, pobjs and the like. This data only has meaning in relation to a particular guest.
So, the memory layout can look like this:
+----------+
|  .reset  |
+----------+
|  .text   |
+----------+
|   .ro    |
+----------+
|  .kdata  |
+----------+
|  .kbss   |
+----------+
|  .kheap  |
+----------+

+----------+
|  .data   |
+----------+
|  .bss    |
+----------+
|  .heap   |
+----------+
| TA SPACE |
+----------+
(This is just an illustration; I am aware that the actual OP-TEE layout is more complex.)
Sections whose names start with "k" belong to kernel data. I also extended bget: it now supports multiple pools, so I can use kmalloc() to allocate memory from .kheap and plain malloc() to allocate memory from .heap.
This layout allows us to switch guest contexts with simple banking:
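To make the split concrete, here is a minimal, self-contained sketch of the idea (not the actual bget extension): two statically placed pools, one for kernel data and one for guest data, with a trivial bump allocator standing in for bget. All names here are made up for illustration.

#include <stddef.h>

struct pool {
	unsigned char *base;
	size_t size;
	size_t used;
};

/* Stand-ins for the .kheap (kernel, single copy) and .heap (banked per
 * guest) sections; in the real layout these come from the linker script. */
static unsigned char kheap_mem[16 * 1024];
static unsigned char heap_mem[64 * 1024];

static struct pool kpool = { kheap_mem, sizeof(kheap_mem), 0 };
static struct pool gpool = { heap_mem, sizeof(heap_mem), 0 };

static void *pool_alloc(struct pool *p, size_t len)
{
	void *res;

	len = (len + 7) & ~(size_t)7;	/* naive 8-byte alignment */
	if (len > p->size - p->used)
		return NULL;
	res = p->base + p->used;
	p->used += len;
	return res;
}

/* kmalloc() serves guest-agnostic kernel state... */
void *kmalloc(size_t len) { return pool_alloc(&kpool, len); }
/* ...while guest allocations come from the banked pool. In the real tree
 * this would simply stay malloc(); it is renamed here to keep the sketch
 * standalone. */
void *gmalloc(size_t len) { return pool_alloc(&gpool, len); }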
+----------+
|  .reset  |
+----------+
|  .text   |
+----------+
|   .ro    |
+----------+
|  .kdata  |
+----------+
|  .kbss   |
+----------+
|  .kheap  |
+----------+

============   ============   ============
==Guest 1===   ==Guest 2===   ==Guest 3===
============   ============   ============
|  .data   |   |  .data   |   |  .data   |
+----------+   +----------+   +----------+
|  .bss    |   |  .bss    |   |  .bss    |
+----------+   +----------+   +----------+
|  .heap   |   |  .heap   |   |  .heap   |
+----------+   +----------+   +----------+
| TA SPACE |   | TA SPACE |   | TA SPACE |
+----------+   +----------+   +----------+
If a guest suddenly dies, we can't clean up its resources (consider a mutex that will never be unlocked). Instead we can simply drop the whole guest context and forget about it. We will still need special cleanup code for kernel state, though, which is a reason to keep the kernel data footprint as small as possible.
I think it is clear now why I want to talk about virtual memory management :)
Right now (in the absence of the pager) OP-TEE is mostly mapped 1:1 and all CPUs use the same mappings. There is also a separate address space for dynamic shared memory mappings. The pager actually does two things: it breaks the 1:1 mapping and it does the actual paging.
My first intention was to reuse the pager and make it manage mappings for different guests. But even now it has overhead because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All of this means the pager would have to be severely rewritten. In addition, use of TTBR1 imposes restrictions on where guest contexts can live in the virtual address space.
So I think this is a good occasion to fully overhaul VM management in OP-TEE. What I propose:
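To illustrate why TTBR1 is attractive here, below is a rough sketch (AArch64, S-EL1) of what switching the banked guest mapping on a core could look like. The function name, the full TLB invalidation and the assumption that each guest has its own set of TTBR1 translation tables are mine; a real implementation would likely use ASIDs instead of flushing everything.

#include <stdint.h>

/* Point TTBR1 at the translation tables of the guest this core is about to
 * serve. The TTBR0 (kernel) mapping stays untouched. */
static inline void switch_guest_mapping(uint64_t guest_ttbr1_pa)
{
	__asm__ volatile(
		"msr	ttbr1_el1, %0\n"
		"isb\n"
		"tlbi	vmalle1\n"	/* crude: drop all stale translations */
		"dsb	nsh\n"
		"isb\n"
		: : "r" (guest_ttbr1_pa) : "memory");
}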
1. Let OP-TEE live in its own address space, not bound to the platform configuration (i.e. as Linux does). For example, most of OP-TEE would be mapped at 0x2000000, while TEE contexts would be mapped at 0x8000000.
2. Add a page allocator. Currently tee_mm acts both as a page allocator and as an address space allocator.
3. Promote pgt_cache to a full-scale address space manager.
4. Split the pager into two parts: the paging mechanism and a backend. A backend could encrypt pages, hash them, or use some platform-specific mechanism to protect paged-out pages (a sketch of such a backend interface follows this list).
5. Rename things. pgt_cache is no longer just a cache, and tee_mm does two different things at once.
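For item 4, a backend could be as small as a pair of hooks. Here is a hedged sketch of what such an interface might look like; the struct, its fields and the idea of per-platform registration are assumptions, not existing OP-TEE code:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pager backend: how a page is protected on its way out of
 * secure memory and verified on its way back in. */
struct pager_backend_ops {
	/* Protect a page that is being evicted; may encrypt and/or produce
	 * an integrity tag stored by the paging mechanism. */
	void (*page_out)(void *dst, const void *src, size_t len,
			 uint8_t tag[16]);
	/* Restore a page on a fault; returns false if verification fails. */
	bool (*page_in)(void *dst, const void *src, size_t len,
			const uint8_t tag[16]);
};

/* A platform with a crypto DMA engine could register its own ops here;
 * the default would be the current software encrypt/hash path. */
void pager_set_backend(const struct pager_backend_ops *ops);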
Jens, I am sure you have your own vision of how VM management should look. Would you share your thoughts, please? Maybe there is a way I can implement TEE context banking with less effort; I will be very happy in that case :-) My biggest concern is that a straightforward approach to context banking will complicate things a lot. So I'm ready to improve the VM part of OP-TEE for everyone's benefit if it eases integration of virtualization support.
I have also CC'ed Julien Grall from ARM (and, now, from Linaro). He is one of the ARM arch maintainers in the Xen project. I think he can share some valuable ideas about virtualization as a whole.
On 12/06/2017 05:40 PM, Volodymyr Babchuk wrote:
Hi all,
Hi Volodymyr,
I am not an OP-TEE expert, though I have one comment below on how Xen works.
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I don't think the mutex example is correct. When there is a sudden guest crash (e.g. when the guest misbehaves), Xen will notify all the guest's vCPUs by raising a softirq. This raises an SGI targeting the pCPU where each vCPU is running.
So the vCPU will finish what it is doing (such as handling a hypercall or an SMC) and will get unscheduled before returning to EL1/EL0.
Do you expect the mutex to stay locked across an SMC call?
Cheers,
Hi Julien,
On 7 December 2017 at 00:33, Julien Grall julien.grall@arm.com wrote:
Hi Volodymyr,
I am not an OP-TEE expert, thought I have one comment below on how Xen works.
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I don't think the mutex example is correct. When there are a sudden guest crash (e.g when the guest badly behave), Xen will notify all the guest vCPUs by raising a softirq. This will raise a SGI target the pCPU where the vCPU is running.
So the vCPU will finish what he is doing (such as handling an hypercall or SMC) and will get unscheduled before returning to EL1/EL0.
Yes, this is not a problem. If a pCPU is busy in the Secure World, OP-TEE will let it out only when it is relatively safe. E.g. it will not issue an RPC return to handle an SGI during an atomic operation. But it can issue an RPC return while holding a mutex. It is also possible that one of the OP-TEE threads is already blocked on a mutex and is waiting for its turn in the Normal World, sleeping on a completion in the Linux optee driver.
Do you expect the mutex to stay locked accross SMC call?
Yes, this is the case. As I said in the xen-devel discussion, OP-TEE is scheduled by the Normal World. So when an OP-TEE thread gets blocked on a mutex, it issues an RPC to exit to the Normal World. When the mutex owner unlocks it, it also issues an RPC to unblock the first thread, so that it can return back into the Secure World.
But as OP-TEE uses mutexes only in STD calls, it is possible to have mutex instances per guest. So when a guest dies, OP-TEE can just forget about all of that guest's threads, mutexes and other state.
Hi Volodymyr,
On Wed, Dec 6, 2017 at 6:40 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi all,
In this email I would like to discuss number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is a guest-agnostic data needed for a core services. Examples: temporary stacks used by entry points (used before thread_alloc_and_run() invocation), list of known guests, device drivers state and so on. This kind of data is not guest-specific, so naturally it should exist in one copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data have a meaning only regarding to a certain guest.
So, memory layout can look like this:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+ +----------+ | .data | +----------+ | .bss | +----------+ | .heap | +----------+ | TA SPACE | +----------+
(This is just an illustration, I aware that actual OP-TEE layout is more complex).
Sections starting with "k" belong to kernel data. I also extended bget. Now it supports multiple pools and I can use kmalloc() to allocated memory from .kheap and plain malloc() to allocate mempory from .heap.
This layout allows us to switch guest context with simple banking:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+
============ ============ ============ ==Guest 1=== ==Guest 2=== ==Guest 3=== ============ ============ ============ | .data | | .data | | .data | +----------+ +----------+ +----------+ | .bss | | .bss | | .bss | +----------+ +----------+ +----------+ | .heap | | .heap | | .heap | +----------+ +----------+ +----------+ | TA SPACE | | TA SPACE | | TA SPACE | +----------+ +----------+ +----------+
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I think, it is clear now, why I want to talk about virtual memory management :)
Right now (in absence of pager) OP-TEE is mostly mapped 1:1, all CPUs use the same mappings. Also there is a separate address space for a dynamic shared memory mappings. Pager actually does two things: it breaks 1:1 mapping and also actually does paging.
My first intention was to reuse pager and to make it manage mappings for a different guests. But ever now it have an overhead, because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that pager should be severely rewritten. Also, use of TTBR1 imposes restrictions on guest context location in virtual address space.
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
- Let OP-TEE to live in own address space, not bound to platform configuration (i.e. as linux does). For example most of the OP-TEE will be mapped at 0x2000000, when TEE contexts will be mapped at 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
- Add page allocator. Currently tee_mm acts both as a page allocator and as a address space allocator.
tee_mm is actually more of a resource allocator.
Promote pgt_cache to full-scale address space manager.
Split pager into two parts: paging mechanism and backend. Backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages
I don't think the pager needs a backend; we're only swapping out to memory, so it's still a fast operation (relative to writing to flash or rotating media).
- Rename things. pgt_cache is no more just a cache. tee_mm does two different things at once.
Jens, I am sure, you have your own vision how VM management should like. Would you share your thoughts, please? Maybe there is a way, how I can implement TEE context banking with less effort. I will be very happy in this case :-) My biggest concern that if I'll take straight approach to context banking, this will complicate things a lot. So, I'm ready to improve VM part of OP-TEE for everyone's benefit, if it will ease up integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables. How would what should go into the guest vmspace be identified?
Thanks, Jens
Hi Jens,
On 7 December 2017 at 09:37, Jens Wiklander jens.wiklander@linaro.org wrote:
In this email I would like to discuss number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is a guest-agnostic data needed for a core services. Examples: temporary stacks used by entry points (used before thread_alloc_and_run() invocation), list of known guests, device drivers state and so on. This kind of data is not guest-specific, so naturally it should exist in one copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data have a meaning only regarding to a certain guest.
So, memory layout can look like this:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+ +----------+ | .data | +----------+ | .bss | +----------+ | .heap | +----------+ | TA SPACE | +----------+
(This is just an illustration, I aware that actual OP-TEE layout is more complex).
Sections starting with "k" belong to kernel data. I also extended bget. Now it supports multiple pools and I can use kmalloc() to allocated memory from .kheap and plain malloc() to allocate mempory from .heap.
This layout allows us to switch guest context with simple banking:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+
============ ============ ============ ==Guest 1=== ==Guest 2=== ==Guest 3=== ============ ============ ============ | .data | | .data | | .data | +----------+ +----------+ +----------+ | .bss | | .bss | | .bss | +----------+ +----------+ +----------+ | .heap | | .heap | | .heap | +----------+ +----------+ +----------+ | TA SPACE | | TA SPACE | | TA SPACE | +----------+ +----------+ +----------+
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I think, it is clear now, why I want to talk about virtual memory management :)
Right now (in absence of pager) OP-TEE is mostly mapped 1:1, all CPUs use the same mappings. Also there is a separate address space for a dynamic shared memory mappings. Pager actually does two things: it breaks 1:1 mapping and also actually does paging.
My first intention was to reuse pager and to make it manage mappings for a different guests. But ever now it have an overhead, because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that pager should be severely rewritten. Also, use of TTBR1 imposes restrictions on guest context location in virtual address space.
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
- Let OP-TEE to live in own address space, not bound to platform configuration (i.e. as linux does). For example most of the OP-TEE will be mapped at 0x2000000, when TEE contexts will be mapped at 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
Yes. It is not that I insist on such a rework. But if we conclude that it is good to have for virtualization, then I can assist there.
- Add page allocator. Currently tee_mm acts both as a page allocator and as a address space allocator.
tee_mm is actually more of a resource allocator.
Indeed. I wasn't entirely precise there.
Promote pgt_cache to full-scale address space manager.
Split pager into two parts: paging mechanism and backend. Backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages
I don't think the pager need a backend, we're only swapping out to memory so it's still a fast operation (relative to writing to flash or rotating media).
I don't insist on this. I just thought it was a good idea. For example, there are platforms with crypto DMA engines, which could greatly increase paging speed.
- Rename things. pgt_cache is no more just a cache. tee_mm does two different things at once.
Jens, I am sure, you have your own vision how VM management should like. Would you share your thoughts, please? Maybe there is a way, how I can implement TEE context banking with less effort. I will be very happy in this case :-) My biggest concern that if I'll take straight approach to context banking, this will complicate things a lot. So, I'm ready to improve VM part of OP-TEE for everyone's benefit, if it will ease up integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
One of the reasons, yes. And not only global variables, but also the malloc pool, mapped shared buffers and other resources.
Consider entries in tee_mmu or pgt state.
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
Actually, there are lots of things that should go into a guest context (or require their own cleanup code): the session list, the list of open TAs, pgt_cache state, mutexes, condvars, pobjs, pTA states, and so on. Just look at tee_ta_manager.c, for example:
struct mutex tee_ta_mutex = MUTEX_INITIALIZER;
static struct condvar tee_ta_cv = CONDVAR_INITIALIZER;
static int tee_ta_single_instance_thread = THREAD_ID_INVALID;
static size_t tee_ta_single_instance_count;
struct tee_ta_ctx_head tee_ctxes = TAILQ_HEAD_INITIALIZER(tee_ctxes);
All of this should be stored in a guest context.
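As an illustration, the globals quoted above could be grouped roughly like this. This is a hedged sketch: the struct names and the field set are mine, and it assumes the quoted types are visible via the usual kernel/mutex.h and kernel/tee_ta_manager.h headers.

#include <stddef.h>
#include <stdint.h>
#include <kernel/mutex.h>
#include <kernel/tee_ta_manager.h>

struct guest_ta_state {
	struct mutex tee_ta_mutex;
	struct condvar tee_ta_cv;
	int tee_ta_single_instance_thread;
	size_t tee_ta_single_instance_count;
	struct tee_ta_ctx_head tee_ctxes;
};

struct guest_state {
	uint16_t guest_id;
	struct guest_ta_state ta_state;
	/* ...sessions, pobjs, pgt_cache state, per-guest malloc pool... */
};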
Almost every file in core/tee has at least a couple of globals. They should also be moved into a guest context. This is doable, but it means a lot of changes.
What I propose is to make one big change in the VM part and leave all other code as is. On the other hand, it is probably a good thing to have a context for every OP-TEE part. This could make it more modular.
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
Yes. I have such a structure (you have probably seen it in [1]). I also have a get_client_context() function that returns a pointer to the context of the current caller.
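For reference, a get_client_context()-style accessor can stay very small. This is a hypothetical sketch rather than the code from [1]: MAX_GUESTS, NB_CORES, the per-core current-guest table and the way it is filled at SMC entry are all assumptions.

#include <stddef.h>

#define MAX_GUESTS	8
#define NB_CORES	4	/* would be CFG_TEE_CORE_NB_CORE in the tree */

struct guest_state;	/* as sketched earlier in the thread */

static struct guest_state *guests[MAX_GUESTS];
/* Filled in by the SMC entry path before any guest-bound code runs */
static unsigned int current_guest[NB_CORES];

struct guest_state *get_client_context(unsigned int core)
{
	return guests[current_guest[core]];
}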
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables.
Yes. But I tried another approach - tagging all resources with a guest_id. That was even more complicated. Your idea looks simpler in a way. Could you please take a look at [1]? I think this is what you are talking about. I'll play with it further, but currently I'm worried about the amount of changes needed across the whole of OP-TEE.
On the other hand, a separate vmspace guarantees that at any moment OP-TEE code works only with data belonging to one guest. Also, it naturally provides a quota mechanism for guests.
How would what should go into the guest vmspace be identified?
Everything that does not go into the kernel vmspace goes into the guest vmspace. For the kernel vmspace I created macros like "__kdata" and "__kbss" to put variables into the right sections. There will be a lot more guest-tied variables than kernel ones, so it is preferable to leave guest data in the default sections.
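For illustration, the macros can be plain section attributes. This is a sketch of my approach rather than upstream code, and the matching .kdata/.kbss output sections of course have to be collected by the linker script:

/* Place guest-agnostic kernel state into dedicated sections so that it is
 * excluded from the banked (per-guest) .data/.bss. */
#define __kdata	__attribute__((__section__(".kdata")))
#define __kbss	__attribute__((__section__(".kbss")))

/* Example: the number of known guests is kernel state, one copy only. */
static unsigned int num_known_guests __kdata = 0;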
So, to sum up:
* My (implicit context) approach with vmspace:
  Pros:
  * Most of the OP-TEE code can be left intact.
  * A bit faster: it is faster to access a global variable directly.
  * Authors of new features need not worry about virtualization. I hope that separate vmspaces will look fairly transparent.
  Cons:
  * Huge changes in VM management.
  * Due to its implicit nature, it can cause problems in corner cases.
* Your (explicit context) approach with struct guest_state:
  Pros:
  * Does not require big changes in the VM code. Maybe no changes at all.
  * Easier to understand.
  * Better code shape due to the absence of global variables (I'm not quite sure about this one).
  * Changes in subsystems can be introduced gradually. You don't need to switch to a completely new scheme at once.
  Cons:
  * A bit slower. Code has to access the context somehow, either via get_guest_context() or via function parameters.
  * A lot of changes across the whole of OP-TEE.
  * Authors of new code must keep virtualization in mind and not use global state at all, or provide reliable recovery mechanisms.
[1] https://github.com/OP-TEE/optee_os/pull/1910
On Thu, Dec 7, 2017 at 7:25 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 7 December 2017 at 09:37, Jens Wiklander jens.wiklander@linaro.org wrote:
In this email I would like to discuss number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is a guest-agnostic data needed for a core services. Examples: temporary stacks used by entry points (used before thread_alloc_and_run() invocation), list of known guests, device drivers state and so on. This kind of data is not guest-specific, so naturally it should exist in one copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data have a meaning only regarding to a certain guest.
So, memory layout can look like this:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+ +----------+ | .data | +----------+ | .bss | +----------+ | .heap | +----------+ | TA SPACE | +----------+
(This is just an illustration, I aware that actual OP-TEE layout is more complex).
Sections starting with "k" belong to kernel data. I also extended bget. Now it supports multiple pools and I can use kmalloc() to allocated memory from .kheap and plain malloc() to allocate mempory from .heap.
This layout allows us to switch guest context with simple banking:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+
============ ============ ============ ==Guest 1=== ==Guest 2=== ==Guest 3=== ============ ============ ============ | .data | | .data | | .data | +----------+ +----------+ +----------+ | .bss | | .bss | | .bss | +----------+ +----------+ +----------+ | .heap | | .heap | | .heap | +----------+ +----------+ +----------+ | TA SPACE | | TA SPACE | | TA SPACE | +----------+ +----------+ +----------+
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I think, it is clear now, why I want to talk about virtual memory management :)
Right now (in absence of pager) OP-TEE is mostly mapped 1:1, all CPUs use the same mappings. Also there is a separate address space for a dynamic shared memory mappings. Pager actually does two things: it breaks 1:1 mapping and also actually does paging.
My first intention was to reuse pager and to make it manage mappings for a different guests. But ever now it have an overhead, because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that pager should be severely rewritten. Also, use of TTBR1 imposes restrictions on guest context location in virtual address space.
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
- Let OP-TEE to live in own address space, not bound to platform configuration (i.e. as linux does). For example most of the OP-TEE will be mapped at 0x2000000, when TEE contexts will be mapped at 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
Yes. It is not as I insist on a such rework. But if we'll conclude, that it is good to have for the virtualization, then I can assist there.
- Add page allocator. Currently tee_mm acts both as a page allocator and as a address space allocator.
tee_mm is actually more of a resource allocator.
Indeed. I wasn't precisely correct there.
Promote pgt_cache to full-scale address space manager.
Split pager into two parts: paging mechanism and backend. Backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages
I don't think the pager need a backend, we're only swapping out to memory so it's still a fast operation (relative to writing to flash or rotating media).
I'm not insist on this. I just thought that it a good idea. For example, there are platforms with cryptoDMA, which can greatly increase paging speed.
I'd leave this alone until there's actual hardware where you get any gain. This doesn't have anything to do with virtualization either.
- Rename things. pgt_cache is no more just a cache. tee_mm does two different things at once.
Jens, I am sure, you have your own vision how VM management should like. Would you share your thoughts, please? Maybe there is a way, how I can implement TEE context banking with less effort. I will be very happy in this case :-) My biggest concern that if I'll take straight approach to context banking, this will complicate things a lot. So, I'm ready to improve VM part of OP-TEE for everyone's benefit, if it will ease up integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
One of the reasons, yes. Plus, not only global variables, but also malloc pool, mapped shared buffers and other resources.
Consider entries in tee_mmu or pgt state.
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
Actually, there are lots of things that should go to a guest context (or require own cleanup code): session list, open TAs list, pgt_cache state, mutexes, condvars, pobjs, pTA states, and so on, so on. Just look at tee_ta_manager.c for example:
struct mutex tee_ta_mutex = MUTEX_INITIALIZER; static struct condvar tee_ta_cv = CONDVAR_INITIALIZER; static int tee_ta_single_instance_thread = THREAD_ID_INVALID; static size_t tee_ta_single_instance_count; struct tee_ta_ctx_head tee_ctxes = TAILQ_HEAD_INITIALIZER(tee_ctxes);
All this should be stored in guest context.
Yes and it would probably need to be analyzed a bit too, to see if there's any cleanup to be done.
Almost any file in core/tee has at least couple of globals. They also should be moved to a guest context. This is doable, but this is a lots of changes.
What is I propose is to do one big chance in VM part and leave all other code as is. On other hand, it is probably good thing to have a context for every OP-TEE part. This can make it more modular.
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
Yes. I have such structure (probably, you seen it in [1]). Also I have get_client_context() function that returns pointer to context for a current caller.
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables.
Yes. But I tried another approach - to tag all resources with guest_id. That was even more complicated. But your idea looks simpler in a way. Could you please take a look at [1]? I think, this is what you are talking about. I'll try to play with it further, but currently I'm worried about amount of needed changes on the whole OP-TEE.
I took a quick look, and this is the direction I'd prefer.
On other hand, separate vmspace guarantees that in any moment OP-TEE code will work with data belonging to one guest. Also, it is naturally provides quota mechanism for guests.
Yes, you'd get some of that with a separate vmspace, but I think it would be too complicated. Just think about debugging... But it's not like we can't manage quotas without a vmspace. A separate bget pool for each guest is a good start.
How would what should go into the guest vmspace be identified?
All what is not goes to to kernel vmspace, goes to guest vmspace. For kernel vmspace I created macros like "__kdata" or "__kbss" to put variables into right sections. There will be a lot more guest-tied variables, than kernel ones. So it is preferred to leave guest data in default sections.
So, to sum up:
- My (implicit context) approach with vmspace: Pros:
- Most of OP-TEE code can be leaved intact.
- A bit faster: it is faster to access global variable directly.
I doubt that you'd be able to measure it. In fact it wouldn't surprise me that much if it became slower instead.
- Authors of new features should not worry about virtualization. I hope, that separate vmspaces will look pretty transparent.
Yes, this is an advantage, but there are ways to guard against adding global variables.
Cons:
- Huge changes in VM management.
Indeed
Due to implicit nature, it can cause problems in corner cases
Your (explicit context) approach with struct guest_state Pros:
- Does not require big changes in VM. Maybe no changes at all.
- Easier to understand.
Easier to understand is quite important.
- Better code shape due to absence of global variables (I don't quite sure there).
Agree, this is a weak argument.
- Changes in subsystems can be introduced gradually. You don't need to switch to completely new scheme at once.
Yes, this is also quite important. Big bang isn't that fun.
Cons:
- A bit slower. Code should access context somehow. Either via get_guest_context() or via function parameters.
Again, I doubt you'd be able to measure it. We can store the pointer to thread_core_local in TPIDRPRW to speed up the access to thread_core_local in the cases where we don't pass some pointer as a parameter.
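For the record, here is a rough sketch of the TPIDRPRW idea (AArch32 shown; on AArch64 the natural counterpart would be TPIDR_EL1). The helper names are made up; only the register usage is the point.

/* Keep a per-core pointer (e.g. to thread_core_local) in TPIDRPRW so hot
 * paths can fetch it in one instruction instead of recomputing an index. */
static inline void set_core_local_ptr(void *p)
{
	__asm__ volatile("mcr p15, 0, %0, c13, c0, 4" : : "r" (p));
}

static inline void *get_core_local_ptr(void)
{
	void *p;

	__asm__ volatile("mrc p15, 0, %0, c13, c0, 4" : "=r" (p));
	return p;
}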
- A lot of changes in all code across whole OP-TEE.
Yes, there will be changes, but they don't all have to be done in a single pull request. The different parts can be fixed one by one.
- Authors of new code should remember about virtualization and don't use global state at all. Or provide reliable mechanisms of recovery.
We can add something that checks that only whitelisted files may add global variables (.data or .bss)
Thanks, Jens
On Fri, Dec 8, 2017 at 2:12 PM, Jens Wiklander jens.wiklander@linaro.org wrote:
On Thu, Dec 7, 2017 at 7:25 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 7 December 2017 at 09:37, Jens Wiklander jens.wiklander@linaro.org wrote:
In this email I would like to discuss number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is a guest-agnostic data needed for a core services. Examples: temporary stacks used by entry points (used before thread_alloc_and_run() invocation), list of known guests, device drivers state and so on. This kind of data is not guest-specific, so naturally it should exist in one copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data have a meaning only regarding to a certain guest.
So, memory layout can look like this:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+ +----------+ | .data | +----------+ | .bss | +----------+ | .heap | +----------+ | TA SPACE | +----------+
(This is just an illustration, I aware that actual OP-TEE layout is more complex).
Sections starting with "k" belong to kernel data. I also extended bget. Now it supports multiple pools and I can use kmalloc() to allocated memory from .kheap and plain malloc() to allocate mempory from .heap.
This layout allows us to switch guest context with simple banking:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+
============ ============ ============ ==Guest 1=== ==Guest 2=== ==Guest 3=== ============ ============ ============ | .data | | .data | | .data | +----------+ +----------+ +----------+ | .bss | | .bss | | .bss | +----------+ +----------+ +----------+ | .heap | | .heap | | .heap | +----------+ +----------+ +----------+ | TA SPACE | | TA SPACE | | TA SPACE | +----------+ +----------+ +----------+
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I think, it is clear now, why I want to talk about virtual memory management :)
Right now (in absence of pager) OP-TEE is mostly mapped 1:1, all CPUs use the same mappings. Also there is a separate address space for a dynamic shared memory mappings. Pager actually does two things: it breaks 1:1 mapping and also actually does paging.
My first intention was to reuse pager and to make it manage mappings for a different guests. But ever now it have an overhead, because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that pager should be severely rewritten. Also, use of TTBR1 imposes restrictions on guest context location in virtual address space.
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
- Let OP-TEE to live in own address space, not bound to platform configuration (i.e. as linux does). For example most of the OP-TEE will be mapped at 0x2000000, when TEE contexts will be mapped at 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
Yes. It is not as I insist on a such rework. But if we'll conclude, that it is good to have for the virtualization, then I can assist there.
- Add page allocator. Currently tee_mm acts both as a page allocator and as a address space allocator.
tee_mm is actually more of a resource allocator.
Indeed. I wasn't precisely correct there.
Promote pgt_cache to full-scale address space manager.
Split pager into two parts: paging mechanism and backend. Backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages
I don't think the pager need a backend, we're only swapping out to memory so it's still a fast operation (relative to writing to flash or rotating media).
I'm not insist on this. I just thought that it a good idea. For example, there are platforms with cryptoDMA, which can greatly increase paging speed.
I'd leave this alone until there's an actual hardware where you get any gain. This doesn't have anything to do with virtualization either.
- Rename things. pgt_cache is no more just a cache. tee_mm does two different things at once.
Jens, I am sure, you have your own vision how VM management should like. Would you share your thoughts, please? Maybe there is a way, how I can implement TEE context banking with less effort. I will be very happy in this case :-) My biggest concern that if I'll take straight approach to context banking, this will complicate things a lot. So, I'm ready to improve VM part of OP-TEE for everyone's benefit, if it will ease up integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
One of the reasons, yes. Plus, not only global variables, but also malloc pool, mapped shared buffers and other resources.
Consider entries in tee_mmu or pgt state.
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
Actually, there are lots of things that should go to a guest context (or require own cleanup code): session list, open TAs list, pgt_cache state, mutexes, condvars, pobjs, pTA states, and so on, so on. Just look at tee_ta_manager.c for example:
struct mutex tee_ta_mutex = MUTEX_INITIALIZER; static struct condvar tee_ta_cv = CONDVAR_INITIALIZER; static int tee_ta_single_instance_thread = THREAD_ID_INVALID; static size_t tee_ta_single_instance_count; struct tee_ta_ctx_head tee_ctxes = TAILQ_HEAD_INITIALIZER(tee_ctxes);
All this should be stored in guest context.
Yes and it would probably need to be analyzed a bit too, to see if there's any cleanup to be done.
Almost any file in core/tee has at least couple of globals. They also should be moved to a guest context. This is doable, but this is a lots of changes.
What is I propose is to do one big chance in VM part and leave all other code as is. On other hand, it is probably good thing to have a context for every OP-TEE part. This can make it more modular.
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
Yes. I have such structure (probably, you seen it in [1]). Also I have get_client_context() function that returns pointer to context for a current caller.
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables.
Yes. But I tried another approach - to tag all resources with guest_id. That was even more complicated. But your idea looks simpler in a way. Could you please take a look at [1]? I think, this is what you are talking about. I'll try to play with it further, but currently I'm worried about amount of needed changes on the whole OP-TEE.
I took a quick look, and this is the direction I'd prefer.
On other hand, separate vmspace guarantees that in any moment OP-TEE code will work with data belonging to one guest. Also, it is naturally provides quota mechanism for guests.
Yes, you'd get some with a separate vmspace, but I think it will be too complicated. Just think about debugging... But it's not like we can't manage quota without a vmspace. A separate bget-pool for each guest is a good start.
How would what should go into the guest vmspace be identified?
All what is not goes to to kernel vmspace, goes to guest vmspace. For kernel vmspace I created macros like "__kdata" or "__kbss" to put variables into right sections. There will be a lot more guest-tied variables, than kernel ones. So it is preferred to leave guest data in default sections.
So, to sum up:
- My (implicit context) approach with vmspace: Pros:
- Most of OP-TEE code can be leaved intact.
- A bit faster: it is faster to access global variable directly.
I doubt that you'd be able to measure it. In fact it wouldn't surprise me that much if it became slower instead.
- Authors of new features should not worry about virtualization. I hope, that separate vmspaces will look pretty transparent.
Yes, this is an advantage, but there's ways to guard against adding global variables.
Here's another Pro:
* In case OP-TEE isn't configured for virtualization there's almost zero impact.
Cons:
- Huge changes in VM management.
Indeed
Hmm, come to think of it, it doesn't have to be that bad. There will be some read/write mapped pages that need to change depending on which guest is being served. Making a proof of concept shouldn't be that hard.
The selling point for me would be virtually no impact unless configured for virtualization.
Thanks, Jens
Due to implicit nature, it can cause problems in corner cases
Your (explicit context) approach with struct guest_state Pros:
- Does not require big changes in VM. Maybe no changes at all.
- Easier to understand.
Easier to understand is quite important.
- Better code shape due to absence of global variables (I don't quite sure there).
Agree, this is a weak argument.
- Changes in subsystems can be introduced gradually. You don't need to switch to completely new scheme at once.
Yes, this is also quite important. Big bang isn't that fun.
Cons:
- A bit slower. Code should access context somehow. Either via get_guest_context() or via function parameters.
Again, I doubt you'd be able to measure it. We can store the pointer to thread_core_local in TPIDRPRW to speed up the access to thread_core_local in the cases where we don't pass some pointer as a parameter.
- A lot of changes in all code across whole OP-TEE.
Yes, there will be changes, but all doesn't have to be done in a single pull request. The different parts can be fixed one, by one.
- Authors of new code should remember about virtualization and don't use global state at all. Or provide reliable mechanisms of recovery.
We can add something that checks that only whitelisted files may add global variables (.data or .bss)
Thanks, Jens
Hi Jens,
On 8 December 2017 at 16:36, Jens Wiklander jens.wiklander@linaro.org wrote:
On Fri, Dec 8, 2017 at 2:12 PM, Jens Wiklander jens.wiklander@linaro.org wrote:
On Thu, Dec 7, 2017 at 7:25 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 7 December 2017 at 09:37, Jens Wiklander jens.wiklander@linaro.org wrote:
In this email I would like to discuss number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is a guest-agnostic data needed for a core services. Examples: temporary stacks used by entry points (used before thread_alloc_and_run() invocation), list of known guests, device drivers state and so on. This kind of data is not guest-specific, so naturally it should exist in one copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data have a meaning only regarding to a certain guest.
So, memory layout can look like this:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+ +----------+ | .data | +----------+ | .bss | +----------+ | .heap | +----------+ | TA SPACE | +----------+
(This is just an illustration, I aware that actual OP-TEE layout is more complex).
Sections starting with "k" belong to kernel data. I also extended bget. Now it supports multiple pools and I can use kmalloc() to allocated memory from .kheap and plain malloc() to allocate mempory from .heap.
This layout allows us to switch guest context with simple banking:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+
============ ============ ============ ==Guest 1=== ==Guest 2=== ==Guest 3=== ============ ============ ============ | .data | | .data | | .data | +----------+ +----------+ +----------+ | .bss | | .bss | | .bss | +----------+ +----------+ +----------+ | .heap | | .heap | | .heap | +----------+ +----------+ +----------+ | TA SPACE | | TA SPACE | | TA SPACE | +----------+ +----------+ +----------+
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I think, it is clear now, why I want to talk about virtual memory management :)
Right now (in absence of pager) OP-TEE is mostly mapped 1:1, all CPUs use the same mappings. Also there is a separate address space for a dynamic shared memory mappings. Pager actually does two things: it breaks 1:1 mapping and also actually does paging.
My first intention was to reuse pager and to make it manage mappings for a different guests. But ever now it have an overhead, because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that pager should be severely rewritten. Also, use of TTBR1 imposes restrictions on guest context location in virtual address space.
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
- Let OP-TEE to live in own address space, not bound to platform configuration (i.e. as linux does). For example most of the OP-TEE will be mapped at 0x2000000, when TEE contexts will be mapped at 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
Yes. It is not as I insist on a such rework. But if we'll conclude, that it is good to have for the virtualization, then I can assist there.
- Add page allocator. Currently tee_mm acts both as a page allocator and as a address space allocator.
tee_mm is actually more of a resource allocator.
Indeed. I wasn't precisely correct there.
Promote pgt_cache to full-scale address space manager.
Split pager into two parts: paging mechanism and backend. Backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages
I don't think the pager need a backend, we're only swapping out to memory so it's still a fast operation (relative to writing to flash or rotating media).
I'm not insist on this. I just thought that it a good idea. For example, there are platforms with cryptoDMA, which can greatly increase paging speed.
I'd leave this alone until there's an actual hardware where you get any gain. This doesn't have anything to do with virtualization either.
- Rename things. pgt_cache is no more just a cache. tee_mm does two different things at once.
Jens, I am sure, you have your own vision how VM management should like. Would you share your thoughts, please? Maybe there is a way, how I can implement TEE context banking with less effort. I will be very happy in this case :-) My biggest concern that if I'll take straight approach to context banking, this will complicate things a lot. So, I'm ready to improve VM part of OP-TEE for everyone's benefit, if it will ease up integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
One of the reasons, yes. Plus, not only global variables, but also malloc pool, mapped shared buffers and other resources.
Consider entries in tee_mmu or pgt state.
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
Actually, there are lots of things that should go to a guest context (or require own cleanup code): session list, open TAs list, pgt_cache state, mutexes, condvars, pobjs, pTA states, and so on, so on. Just look at tee_ta_manager.c for example:
struct mutex tee_ta_mutex = MUTEX_INITIALIZER; static struct condvar tee_ta_cv = CONDVAR_INITIALIZER; static int tee_ta_single_instance_thread = THREAD_ID_INVALID; static size_t tee_ta_single_instance_count; struct tee_ta_ctx_head tee_ctxes = TAILQ_HEAD_INITIALIZER(tee_ctxes);
All this should be stored in guest context.
Yes and it would probably need to be analyzed a bit too, to see if there's any cleanup to be done.
Almost any file in core/tee has at least couple of globals. They also should be moved to a guest context. This is doable, but this is a lots of changes.
What is I propose is to do one big chance in VM part and leave all other code as is. On other hand, it is probably good thing to have a context for every OP-TEE part. This can make it more modular.
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
Yes. I have such structure (probably, you seen it in [1]). Also I have get_client_context() function that returns pointer to context for a current caller.
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables.
Yes. But I tried another approach - to tag all resources with guest_id. That was even more complicated. But your idea looks simpler in a way. Could you please take a look at [1]? I think, this is what you are talking about. I'll try to play with it further, but currently I'm worried about amount of needed changes on the whole OP-TEE.
I took a quick look, and this is the direction I'd prefer.
On other hand, separate vmspace guarantees that in any moment OP-TEE code will work with data belonging to one guest. Also, it is naturally provides quota mechanism for guests.
Yes, you'd get some with a separate vmspace, but I think it will be too complicated. Just think about debugging... But it's not like we can't manage quota without a vmspace. A separate bget-pool for each guest is a good start.
How would what should go into the guest vmspace be identified?
All what is not goes to to kernel vmspace, goes to guest vmspace. For kernel vmspace I created macros like "__kdata" or "__kbss" to put variables into right sections. There will be a lot more guest-tied variables, than kernel ones. So it is preferred to leave guest data in default sections.
So, to sum up:
- My (implicit context) approach with vmspace: Pros:
- Most of OP-TEE code can be leaved intact.
- A bit faster: it is faster to access global variable directly.
I doubt that you'd be able to measure it. In fact it wouldn't surprise me that much if it became slower instead.
- Authors of new features should not worry about virtualization. I hope, that separate vmspaces will look pretty transparent.
Yes, this is an advantage, but there's ways to guard against adding global variables.
Here's another Pro:
- In case OP-TEE isn't configured for virtualization there's almost zero impact.
Basically yes. The memory footprint will get bigger in any case, but besides that I don't see any additional overhead.
Cons:
- Huge changes in VM management.
Indeed
Hmm, come to think of it. It doesn't have to be that bad. There will be some read/write mapped pages that need to change depending on which guest is being served. Making a proof of concept shouldn't be that hard.
Yep. At first I wanted to implement the PoC as a completely separate feature, but then I realized that it would be incomplete with the pager, and the pager, on the other hand, provides some vaguely similar services. Then I realized that I need a separate memory mapping for every core, which led me to TTBR1. But if I want to use TTBR1, I can't place the vmspace at an arbitrary address. And so we are here :) Actually, I can implement the PoC in a quick and dirty way, just to see how it turns out and what minimal set of changes is needed.
The selling point for me would be virtually no impact unless configured for virtualization.
I just realized that with a bit of #ifdef magic I can merge the bget pools back into one, and also that I don't need to put the kernel and TEE data sections into separate pages. So the memory footprint overhead will be even smaller than I expected.
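Something along these lines is what I have in mind (a sketch; CFG_VIRTUALIZATION is just a placeholder name for the build option):

#include <stdlib.h>

#ifdef CFG_VIRTUALIZATION
/* Separate kernel-only pool backed by .kheap */
void *kmalloc(size_t size);
void kfree(void *ptr);
#else
/* Without virtualization there is a single pool: kmalloc() collapses into
 * plain malloc() and no extra sections or page alignment are needed. */
#define kmalloc(size)	malloc(size)
#define kfree(ptr)	free(ptr)
#endif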
Thanks, Jens
Due to implicit nature, it can cause problems in corner cases
Your (explicit context) approach with struct guest_state Pros:
- Does not require big changes in VM. Maybe no changes at all.
- Easier to understand.
Easier to understand is quite important.
- Better code shape due to absence of global variables (I don't quite sure there).
Agree, this is a weak argument.
- Changes in subsystems can be introduced gradually. You don't need to switch to completely new scheme at once.
Yes, this is also quite important. Big bang isn't that fun.
Cons:
- A bit slower. Code should access context somehow. Either via get_guest_context() or via function parameters.
Again, I doubt you'd be able to measure it. We can store the pointer to thread_core_local in TPIDRPRW to speed up the access to thread_core_local in the cases where we don't pass some pointer as a parameter.
- A lot of changes in all code across whole OP-TEE.
Yes, there will be changes, but all doesn't have to be done in a single pull request. The different parts can be fixed one, by one.
- Authors of new code should remember about virtualization and don't use global state at all. Or provide reliable mechanisms of recovery.
We can add something that checks that only whitelisted files may add global variables (.data or .bss)
Thanks, Jens
On Fri, Dec 8, 2017 at 4:41 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 8 December 2017 at 16:36, Jens Wiklander jens.wiklander@linaro.org wrote:
On Fri, Dec 8, 2017 at 2:12 PM, Jens Wiklander jens.wiklander@linaro.org wrote:
On Thu, Dec 7, 2017 at 7:25 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 7 December 2017 at 09:37, Jens Wiklander jens.wiklander@linaro.org wrote:
In this email I would like to discuss number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is a guest-agnostic data needed for a core services. Examples: temporary stacks used by entry points (used before thread_alloc_and_run() invocation), list of known guests, device drivers state and so on. This kind of data is not guest-specific, so naturally it should exist in one copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data have a meaning only regarding to a certain guest.
So, memory layout can look like this:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+ +----------+ | .data | +----------+ | .bss | +----------+ | .heap | +----------+ | TA SPACE | +----------+
(This is just an illustration, I aware that actual OP-TEE layout is more complex).
Sections starting with "k" belong to kernel data. I also extended bget. Now it supports multiple pools and I can use kmalloc() to allocated memory from .kheap and plain malloc() to allocate mempory from .heap.
This layout allows us to switch guest context with simple banking:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+
============ ============ ============ ==Guest 1=== ==Guest 2=== ==Guest 3=== ============ ============ ============ | .data | | .data | | .data | +----------+ +----------+ +----------+ | .bss | | .bss | | .bss | +----------+ +----------+ +----------+ | .heap | | .heap | | .heap | +----------+ +----------+ +----------+ | TA SPACE | | TA SPACE | | TA SPACE | +----------+ +----------+ +----------+
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I think, it is clear now, why I want to talk about virtual memory management :)
Right now (in absence of pager) OP-TEE is mostly mapped 1:1, all CPUs use the same mappings. Also there is a separate address space for a dynamic shared memory mappings. Pager actually does two things: it breaks 1:1 mapping and also actually does paging.
My first intention was to reuse pager and to make it manage mappings for a different guests. But ever now it have an overhead, because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that pager should be severely rewritten. Also, use of TTBR1 imposes restrictions on guest context location in virtual address space.
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
- Let OP-TEE to live in own address space, not bound to platform configuration (i.e. as linux does). For example most of the OP-TEE will be mapped at 0x2000000, when TEE contexts will be mapped at 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
Yes. It is not as I insist on a such rework. But if we'll conclude, that it is good to have for the virtualization, then I can assist there.
- Add page allocator. Currently tee_mm acts both as a page allocator and as a address space allocator.
tee_mm is actually more of a resource allocator.
Indeed. I wasn't precisely correct there.
Promote pgt_cache to full-scale address space manager.
Split pager into two parts: paging mechanism and backend. Backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages
I don't think the pager need a backend, we're only swapping out to memory so it's still a fast operation (relative to writing to flash or rotating media).
I'm not insist on this. I just thought that it a good idea. For example, there are platforms with cryptoDMA, which can greatly increase paging speed.
I'd leave this alone until there's an actual hardware where you get any gain. This doesn't have anything to do with virtualization either.
- Rename things. pgt_cache is no more just a cache. tee_mm does two different things at once.
Jens, I am sure, you have your own vision how VM management should like. Would you share your thoughts, please? Maybe there is a way, how I can implement TEE context banking with less effort. I will be very happy in this case :-) My biggest concern that if I'll take straight approach to context banking, this will complicate things a lot. So, I'm ready to improve VM part of OP-TEE for everyone's benefit, if it will ease up integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
One of the reasons, yes. Plus, not only global variables, but also malloc pool, mapped shared buffers and other resources.
Consider entries in tee_mmu or pgt state.
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
Actually, there are lots of things that should go to a guest context (or require own cleanup code): session list, open TAs list, pgt_cache state, mutexes, condvars, pobjs, pTA states, and so on, so on. Just look at tee_ta_manager.c for example:
struct mutex tee_ta_mutex = MUTEX_INITIALIZER; static struct condvar tee_ta_cv = CONDVAR_INITIALIZER; static int tee_ta_single_instance_thread = THREAD_ID_INVALID; static size_t tee_ta_single_instance_count; struct tee_ta_ctx_head tee_ctxes = TAILQ_HEAD_INITIALIZER(tee_ctxes);
All this should be stored in guest context.
Yes and it would probably need to be analyzed a bit too, to see if there's any cleanup to be done.
Almost any file in core/tee has at least couple of globals. They also should be moved to a guest context. This is doable, but this is a lots of changes.
What is I propose is to do one big chance in VM part and leave all other code as is. On other hand, it is probably good thing to have a context for every OP-TEE part. This can make it more modular.
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
Yes. I have such structure (probably, you seen it in [1]). Also I have get_client_context() function that returns pointer to context for a current caller.
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables.
Yes. But I tried another approach - to tag all resources with guest_id. That was even more complicated. But your idea looks simpler in a way. Could you please take a look at [1]? I think, this is what you are talking about. I'll try to play with it further, but currently I'm worried about amount of needed changes on the whole OP-TEE.
I took a quick look, and this is the direction I'd prefer.
On other hand, separate vmspace guarantees that in any moment OP-TEE code will work with data belonging to one guest. Also, it is naturally provides quota mechanism for guests.
Yes, you'd get some with a separate vmspace, but I think it will be too complicated. Just think about debugging... But it's not like we can't manage quota without a vmspace. A separate bget-pool for each guest is a good start.
How would what should go into the guest vmspace be identified?
All what is not goes to to kernel vmspace, goes to guest vmspace. For kernel vmspace I created macros like "__kdata" or "__kbss" to put variables into right sections. There will be a lot more guest-tied variables, than kernel ones. So it is preferred to leave guest data in default sections.
So, to sum up:
- My (implicit context) approach with vmspace: Pros:
- Most of OP-TEE code can be leaved intact.
- A bit faster: it is faster to access global variable directly.
I doubt that you'd be able to measure it. In fact it wouldn't surprise me that much if it became slower instead.
- Authors of new features should not worry about virtualization. I hope, that separate vmspaces will look pretty transparent.
Yes, this is an advantage, but there's ways to guard against adding global variables.
Here's another Pro:
- In case OP-TEE isn't configured for virtualization there's almost zero impact.
Basically yes. Memory footprint will get bigger in any case. But besides that, I don't see any additional overhead.
I was actually expecting none worth mentioning, a completely optional feature.
Cons:
- Huge changes in VM management.
Indeed
Hmm, come to think of it. It doesn't have to be that bad. There will be some read/write mapped pages that need to change depending on which guest is being served. Making a proof of concept shouldn't be that hard.
Yep. At first I wanted to implement PoC as a completely separate feature, but then I realized that it would be incomplete wirth pager, and pager, on other hand provides some vaguely similar services. Then I realized that I need separate memory mapping for every core, which leaded me to TTBR1. But if I want to use TTBR1, then I can't place vmspace at arbitrary address. And so we are here :) Actually, I can implement PoC in quick and dirty way, just to see how it comes and what minimal set of changes is needed.
We already have separate memory mapping for each core...
Thanks, Jens
The selling point for me would be virtually no impact unless configured for virtualization.
I just realized, that with bit of #ifdef magic I can merge bget pools back into one and also, I don't need to put kernel and tee data sections into separate pages. So memory footprint overhead will be even smaller, than I expected.
Thanks, Jens
Due to implicit nature, it can cause problems in corner cases
Your (explicit context) approach with struct guest_state Pros:
- Does not require big changes in VM. Maybe no changes at all.
- Easier to understand.
Easier to understand is quite important.
- Better code shape due to absence of global variables (I don't quite sure there).
Agree, this is a weak argument.
- Changes in subsystems can be introduced gradually. You don't need to switch to completely new scheme at once.
Yes, this is also quite important. Big bang isn't that fun.
Cons:
- A bit slower. Code should access context somehow. Either via get_guest_context() or via function parameters.
Again, I doubt you'd be able to measure it. We can store the pointer to thread_core_local in TPIDRPRW to speed up the access to thread_core_local in the cases where we don't pass some pointer as a parameter.
- A lot of changes in all code across whole OP-TEE.
Yes, there will be changes, but all doesn't have to be done in a single pull request. The different parts can be fixed one, by one.
- Authors of new code should remember about virtualization and don't use global state at all. Or provide reliable mechanisms of recovery.
We can add something that checks that only whitelisted files may add global variables (.data or .bss)
Thanks, Jens
-- WBR Volodymyr Babchuk aka lorc [+380976646013] mailto: vlad.babchuk@gmail.com
On 8 December 2017 at 17:53, Jens Wiklander jens.wiklander@linaro.org wrote:
On Fri, Dec 8, 2017 at 4:41 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 8 December 2017 at 16:36, Jens Wiklander jens.wiklander@linaro.org wrote:
On Fri, Dec 8, 2017 at 2:12 PM, Jens Wiklander jens.wiklander@linaro.org wrote:
On Thu, Dec 7, 2017 at 7:25 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 7 December 2017 at 09:37, Jens Wiklander jens.wiklander@linaro.org wrote:
> In this email I would like to discuss number of things related to > OP-TEE virtual memory management (and virtualization). > > (Note about terminology: I will use "VM" for "virtual memory" > and "guest" for "virtual machine" in this email) > > I want to begin with motivation. As you know, I'm working on > virtualization support in OP-TEE. My current approach is total > isolation of different guests from each other. To implement this I > want to divide all OP-TEE program state into two big parts: kernel and > TEE. > > Kernel data (or kernel state) is a guest-agnostic data needed for a > core services. Examples: temporary stacks used by entry points (used > before thread_alloc_and_run() invocation), list of known guests, > device drivers state and so on. This kind of data is not > guest-specific, so naturally it should exist in one copy. > > TEE data (or TEE state) is guest-bound information: threads (with > stack and state), opened sessions, loaded TAs, mutexes, pobjs and > such. This kind of data have a meaning only regarding to a certain > guest. > > So, memory layout can look like this: > > +----------+ > | .reset | > +----------+ > | .text | > +----------+ > | .ro | > +----------+ > | .kdata | > +----------+ > | .kbss | > +----------+ > | .kheap | > +----------+ > +----------+ > | .data | > +----------+ > | .bss | > +----------+ > | .heap | > +----------+ > | TA SPACE | > +----------+ > > (This is just an illustration, I aware that actual OP-TEE layout is > more complex). > > Sections starting with "k" belong to kernel data. I also extended > bget. Now it supports multiple pools and I can use kmalloc() to > allocated memory from .kheap and plain malloc() to allocate mempory > from .heap. > > This layout allows us to switch guest context with simple banking: > > +----------+ > | .reset | > +----------+ > | .text | > +----------+ > | .ro | > +----------+ > | .kdata | > +----------+ > | .kbss | > +----------+ > | .kheap | > +----------+ > > ============ ============ ============ > ==Guest 1=== ==Guest 2=== ==Guest 3=== > ============ ============ ============ > | .data | | .data | | .data | > +----------+ +----------+ +----------+ > | .bss | | .bss | | .bss | > +----------+ +----------+ +----------+ > | .heap | | .heap | | .heap | > +----------+ +----------+ +----------+ > | TA SPACE | | TA SPACE | | TA SPACE | > +----------+ +----------+ +----------+ > > > If guest suddenly dies, we can't cleanup resources (consider mutex that > will be never unlocked). Instead we can just drop whole guest context > and forged about it. But we will need special cleanup code for kernel > state, though. This is a reason to keep kernel data footprint as > small as possible. > > I think, it is clear now, why I want to talk about virtual memory > management :) > > Right now (in absence of pager) OP-TEE is mostly mapped 1:1, all CPUs > use the same mappings. Also there is a separate address space for a > dynamic shared memory mappings. Pager actually does two things: > it breaks 1:1 mapping and also actually does paging. > > My first intention was to reuse pager and to make it manage mappings > for a different guests. But ever now it have an overhead, because of > page encryption/hashing. Also, for virtualization it is crucial > to have different mappings on different CPUs. Plus, for efficient > guest context switching it is good to use TTBR1. All this means > that pager should be severely rewritten. Also, use of TTBR1 imposes > restrictions on guest context location in virtual address space. 
> > So, I think it is a good occasion to fully overhaul VM in OP-TEE. > What I propose: > > 1. Let OP-TEE to live in own address space, not bound to platform > configuration (i.e. as linux does). For example most of the OP-TEE > will be mapped at 0x2000000, when TEE contexts will be mapped at > 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
Yes. It is not that I insist on such a rework. But if we conclude that it is good to have for virtualization, then I can assist there.
> > 2. Add page allocator. Currently tee_mm acts both as a page allocator > and as a address space allocator.
tee_mm is actually more of a resource allocator.
Indeed. I wasn't precisely correct there.
> > 3. Promote pgt_cache to full-scale address space manager. > > 4. Split pager into two parts: paging mechanism and backend. Backend > can encrypt pages, hash them, or just use some platform-specific > mechanism to protect paged-out pages
I don't think the pager needs a backend; we're only swapping out to memory, so it's still a fast operation (relative to writing to flash or rotating media).
I don't insist on this. I just thought that it was a good idea. For example, there are platforms with cryptoDMA, which can greatly increase paging speed.
I'd leave this alone until there's actual hardware where you get any gain. This doesn't have anything to do with virtualization either.
> > 5. Rename things. pgt_cache is no more just a cache. tee_mm does > two different things at once. > > Jens, I am sure, you have your own vision how VM management should > like. Would you share your thoughts, please? Maybe there is a way, how > I can implement TEE context banking with less effort. I will be very > happy in this case :-) My biggest concern that if I'll take straight > approach to context banking, this will complicate things a lot. > So, I'm ready to improve VM part of OP-TEE for everyone's benefit, if it > will ease up integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
One of the reasons, yes. Plus, it is not only the global variables, but also the malloc pool, mapped shared buffers and other resources.
Consider entries in tee_mmu or pgt state.
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
Actually, there are lots of things that should go to a guest context (or require their own cleanup code): the session list, the list of open TAs, pgt_cache state, mutexes, condvars, pobjs, pTA states, and so on. Just look at tee_ta_manager.c for example:
struct mutex tee_ta_mutex = MUTEX_INITIALIZER;
static struct condvar tee_ta_cv = CONDVAR_INITIALIZER;
static int tee_ta_single_instance_thread = THREAD_ID_INVALID;
static size_t tee_ta_single_instance_count;
struct tee_ta_ctx_head tee_ctxes = TAILQ_HEAD_INITIALIZER(tee_ctxes);
All of this should be stored in the guest context.
Yes and it would probably need to be analyzed a bit too, to see if there's any cleanup to be done.
Almost any file in core/tee has at least a couple of globals. They should also be moved to a guest context. This is doable, but it is a lot of changes.
What I propose is to do one big change in the VM part and leave all other code as is. On the other hand, it is probably a good thing to have a context for every OP-TEE part. This could make it more modular.
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
Yes. I have such a structure (you have probably seen it in [1]). I also have a get_client_context() function that returns a pointer to the context of the current caller.
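For illustration, a minimal sketch of what I mean (the members simply mirror the globals quoted from tee_ta_manager.c above; get_client_context() and the lookup behind it are placeholders, not actual code):

/* Illustration only: an "explicit context" holding what used to be globals */
#include <kernel/mutex.h>
#include <kernel/thread.h>
#include <kernel/tee_ta_manager.h>

struct guest_state {
        struct mutex tee_ta_mutex;
        struct condvar tee_ta_cv;
        int tee_ta_single_instance_thread;
        size_t tee_ta_single_instance_count;
        struct tee_ta_ctx_head tee_ctxes;
        /* ... sessions, pobjs, pTA state, malloc pool, ... */
};

/* Placeholder: returns the context of the guest behind the current call */
struct guest_state *get_client_context(void);

static void example(void)
{
        struct guest_state *gs = get_client_context();

        mutex_lock(&gs->tee_ta_mutex);
        /* operate on gs->tee_ctxes instead of a global list */
        mutex_unlock(&gs->tee_ta_mutex);
}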
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables.
Yes. But I tried another approach - tagging all resources with a guest_id. That was even more complicated. Your idea looks simpler in a way. Could you please take a look at [1]? I think this is what you are talking about. I'll try to play with it further, but currently I'm worried about the amount of changes needed across the whole of OP-TEE.
I took a quick look, and this is the direction I'd prefer.
On the other hand, a separate vmspace guarantees that at any moment OP-TEE code works only with data belonging to one guest. It also naturally provides a quota mechanism for guests.
Yes, you'd get some with a separate vmspace, but I think it will be too complicated. Just think about debugging... But it's not like we can't manage quota without a vmspace. A separate bget-pool for each guest is a good start.
How would you identify what should go into the guest vmspace?
Everything that does not go into the kernel vmspace goes into the guest vmspace. For the kernel vmspace I created macros like "__kdata" and "__kbss" to put variables into the right sections. There will be a lot more guest-tied variables than kernel ones, so it is preferable to leave guest data in the default sections.
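Just to illustrate the tagging (a sketch; the actual macro names and attributes in my tree may differ slightly):

/* Explicitly tagged variables go to the kernel ("k") sections; everything
 * left untagged stays in the default .data/.bss, i.e. in the banked
 * per-guest sections. Assumes .kdata/.kbss exist in the linker script. */
#define __kdata __attribute__((section(".kdata")))
#define __kbss  __attribute__((section(".kbss")))

/* Guest-agnostic kernel state is tagged explicitly: */
static unsigned int num_known_guests __kbss;
static unsigned long kernel_flags __kdata = 0x1;

/* Everything else, like this, ends up in the per-guest .data/.bss: */
static unsigned int open_session_count;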
So, to sum up:
- My (implicit context) approach with vmspace: Pros:
- Most of the OP-TEE code can be left intact.
- A bit faster: it is faster to access a global variable directly.
I doubt that you'd be able to measure it. In fact it wouldn't surprise me that much if it became slower instead.
- Authors of new features do not need to worry about virtualization. I hope that separate vmspaces will be pretty transparent.
Yes, this is an advantage, but there are ways to guard against adding global variables.
Here's another Pro:
- In case OP-TEE isn't configured for virtualization there's almost zero impact.
Basically yes. The memory footprint will get bigger in any case. But besides that, I don't see any additional overhead.
I was actually expecting none worth mentioning, a completely optional feature.
It will require some changes here and there. But yes, I don't expect any major overhead for the case when virtualization is disabled.
Cons:
- Huge changes in VM management.
Indeed
Hmm, come to think of it. It doesn't have to be that bad. There will be some read/write mapped pages that need to change depending on which guest is being served. Making a proof of concept shouldn't be that hard.
Yep. At first I wanted to implement the PoC as a completely separate feature, but then I realized that it would be incomplete with the pager, and the pager, on the other hand, provides some vaguely similar services. Then I realized that I need a separate memory mapping for every core, which led me to TTBR1. But if I want to use TTBR1, then I can't place the vmspace at an arbitrary address. And so here we are :) Actually, I can implement the PoC in a quick and dirty way, just to see how it goes and what minimal set of changes is needed.
We already have a separate memory mapping for each core...
Ah yes, sorry. I completely forgot about that :-( I think the comments in tee_pager_handle_fault() confused me. Now I remember that the L1 page tables are different for different cores. So I can actually try to switch vmspaces by manipulating entries in the L1 table.
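In case it helps, here is a rough sketch of the switch itself as I picture it (all names, the short-descriptor format and the 1 MiB section granularity are assumptions for illustration, not actual code):

/* Bank the guest sections by rewriting the L1 (short-descriptor) entries
 * covering the guest VA window on the current core. */
#include <stdint.h>
#include <string.h>

#define SECTION_SHIFT   20                      /* 1 MiB L1 sections */
#define GUEST_VA_BASE   0x08000000UL            /* banked guest window */
#define GUEST_VA_SIZE   0x00800000UL
#define GUEST_L1_ENTRIES (GUEST_VA_SIZE >> SECTION_SHIFT)

struct guest_ctx {
        uint32_t l1_entries[GUEST_L1_ENTRIES];  /* pre-computed descriptors */
};

uint32_t *core_l1_table(void);                  /* assumed helper */
void tlb_inval_va_range(unsigned long va, unsigned long size); /* assumed */

static void switch_guest_ctx(struct guest_ctx *g)
{
        uint32_t *l1 = core_l1_table();

        memcpy(&l1[GUEST_VA_BASE >> SECTION_SHIFT], g->l1_entries,
               sizeof(g->l1_entries));
        tlb_inval_va_range(GUEST_VA_BASE, GUEST_VA_SIZE);
}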
It would be cool to support the pager. But I think I can begin with a simple implementation that is incompatible with the pager.
It just occurred to me that the hypervisor could provide backing memory for guests, so the number of supported guests would not be limited by the amount of secure RAM. But this requires the pager, so I'd better postpone this feature.
The selling point for me would be virtually no impact unless configured for virtualization.
I just realized that with a bit of #ifdef magic I can merge the bget pools back into one, and also that I don't need to put the kernel and TEE data sections into separate pages. So the memory footprint overhead will be even smaller than I expected.
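To show the kind of #ifdef magic I mean (CFG_VIRT_GUESTS is a made-up option name here; the point is only that kmalloc() collapses into plain malloc() when virtualization is off):

/* With virtualization disabled the kernel and TEE heaps are one pool,
 * so kmalloc()/kfree() can simply alias malloc()/free(). */
#include <stddef.h>
#include <stdlib.h>

#ifdef CFG_VIRT_GUESTS
void *kmalloc(size_t size);     /* allocates from the kernel pool (.kheap) */
void kfree(void *ptr);
#else
#define kmalloc(size)   malloc(size)
#define kfree(ptr)      free(ptr)
#endif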
Thanks, Jens
Due to its implicit nature, it can cause problems in corner cases.
- Your (explicit context) approach with struct guest_state: Pros:
- Does not require big changes in VM. Maybe no changes at all.
- Easier to understand.
Easier to understand is quite important.
- Better code shape due to the absence of global variables (I'm not quite sure there).
Agree, this is a weak argument.
- Changes in subsystems can be introduced gradually. You don't need to switch to a completely new scheme at once.
Yes, this is also quite important. Big bang isn't that fun.
Cons:
- A bit slower. Code has to access the context somehow, either via get_guest_context() or via function parameters.
Again, I doubt you'd be able to measure it. We can store the pointer to thread_core_local in TPIDRPRW to speed up access to it in the cases where we don't pass a pointer as a parameter (see the sketch after this list).
- A lot of changes across the whole of OP-TEE.
Yes, there will be changes, but it all doesn't have to be done in a single pull request. The different parts can be fixed one by one.
- Authors of new code have to keep virtualization in mind and either not use global state at all or provide reliable recovery mechanisms.
We can add something that checks that only whitelisted files may add global variables (.data or .bss)
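To illustrate the TPIDRPRW idea mentioned above, a minimal AArch32 sketch (the function name is made up; boot code would have to store the per-core pointer in the register first, and AArch64 would use TPIDR_EL1 via mrs instead):

struct thread_core_local;       /* existing OP-TEE type */

static inline struct thread_core_local *get_core_local_fast(void)
{
        struct thread_core_local *l;

        /* TPIDRPRW: the PL1-only software thread ID register */
        asm volatile ("mrc p15, 0, %0, c13, c0, 4" : "=r" (l));
        return l;
}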
Hello all,
Just a small update and one question.
Currently I have a PoC for the proposed approach. Right now it half consists of hacks, but anyway, I'm able to run xtest in two domains in parallel. And it even passes :) If someone is really interested, you can find this PoC at [1]. But it is in a very sorry state. I'm reworking it into something that can be pushed for review.
And so I have a question. One of my changes introduces a new VA space, where I map the whole secure RAM. This simplifies guest page table management and some other tasks. Something similar is done in the pager code, but the pager creates alias mappings at runtime, only for the pages it wants to access. For me it is easier to have a static view of the whole secure RAM than to remap the needed pages at runtime. So, this is my question: was that design decision in the pager intentional? Like, from a security standpoint it is better not to have the whole secure RAM mapped, or something like that...
There are two options before me: I can leave the whole secure RAM mapped and make the pager use this mapping, or I can make my code behave like the pager (i.e. map the needed pages dynamically).
What do you think?
[1] https://github.com/lorc/optee_os/tree/virt_hard
Hi,
On Wed, Dec 27, 2017 at 1:46 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hello all,
Just a small update and one question.
Currently I have a PoC for the proposed approach. Right now it half consists of hacks, but anyway, I'm able to run xtest in two domains in parallel. And it even passes :)
Congrats! :-)
If someone is really interested, you can find this PoC at [1]. But it is in a very sorry state. I'm reworking it into something that can be pushed for review.
And so I have a question. One of my changes introduces a new VA space, where I map the whole secure RAM. This simplifies guest page table management and some other tasks. Something similar is done in the pager code, but the pager creates alias mappings at runtime, only for the pages it wants to access. For me it is easier to have a static view of the whole secure RAM than to remap the needed pages at runtime. So, this is my question: was that design decision in the pager intentional? Like, from a security standpoint it is better not to have the whole secure RAM mapped, or something like that...
The pager does that to avoid leaving read-write aliases of pages that are mapped read-only-exec at the place where they're used.
There are two options before me: I can leave the whole secure RAM mapped and make the pager use this mapping, or I can make my code behave like the pager (i.e. map the needed pages dynamically).
I think you should map needed pages dynamically.
Also I'm not that happy about tagging __kdata and __kbss directly in the code; it's easy to miss applying one. The way we're partitioning the memory for the pager and init code could be applied here too. That isn't 100% robust either, but it usually results in a build error instead of some hard-to-find corruption. If you don't feel comfortable with implementing that yourself I can do that once everything else is in place.
What do you think?
It sounds like you're progressing well. I'll wait with looking at the code until you think it's ready for review.
[1] https://github.com/lorc/optee_os/tree/virt_hard
-- WBR Volodymyr Babchuk aka lorc [+380976646013] mailto: vlad.babchuk@gmail.com
On 27 December 2017 at 17:24, Jens Wiklander jens.wiklander@linaro.org wrote:
Currently I have a PoC for the proposed approach. Right now it half consists of hacks, but anyway, I'm able to run xtest in two domains in parallel. And it even passes :)
Congrats! :-)
Thanks!
If someone is really interested, you can find this PoC at [1]. But it is in a very sorry state. I'm reworking it into something that can be pushed for review.
And so I have a question. One of my changes introduces a new VA space, where I map the whole secure RAM. This simplifies guest page table management and some other tasks. Something similar is done in the pager code, but the pager creates alias mappings at runtime, only for the pages it wants to access. For me it is easier to have a static view of the whole secure RAM than to remap the needed pages at runtime. So, this is my question: was that design decision in the pager intentional? Like, from a security standpoint it is better not to have the whole secure RAM mapped, or something like that...
The pager does that to avoid leaving read-write aliases of pages that are mapped read-only-exec at the place where they're used.
Ah, I see.
There are two options before me: I can leave the whole secure RAM mapped and make the pager use this mapping, or I can make my code behave like the pager (i.e. map the needed pages dynamically).
I think you should map needed pages dynamically.
Yeah, I think that fits better with the OP-TEE way.
Also I'm not that happy about tagging __kdata and __kbss directly in the code; it's easy to miss applying one. The way we're partitioning the memory for the pager and init code could be applied here too. That isn't 100% robust either, but it usually results in a build error instead of some hard-to-find corruption. If you don't feel comfortable with implementing that yourself I can do that once everything else is in place.
I have concerns there. In the pager case, a mistake in partitioning would lead to a crash or to a bloated pager section. But in the case of virtualization it would lead to hard-to-catch problems with isolation.
Imagine that during automatic partitioning a guest-related symbol (e.g. the sessions list) ends up in the shared space. This kind of bug would be revealed only if guests issue simultaneous calls to OP-TEE. I can imagine more subtle cases where the problem would not be so obvious, but could still break isolation.
Thus, I decided to use explicit tagging. In this case you at least see what you are doing. Also, the approach of explicitly telling what belongs to the OP-TEE core ensures that any new symbols go into the guest sections by default.