Hi all,
In this email I would like to discuss a number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this, I want to divide all of OP-TEE's program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is guest-agnostic data needed for core services. Examples: temporary stacks used by entry points (used before thread_alloc_and_run() invocation), the list of known guests, device driver state and so on. This kind of data is not guest-specific, so naturally it should exist in one copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data has meaning only with regard to a certain guest.
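To make the split more concrete, here is a rough sketch of the two kinds of state. All names and sizes below are made up for illustration, they are not actual OP-TEE structures:

    #include <stdint.h>

    /* Illustrative sizes, not actual OP-TEE config values */
    #define NB_CORES    4
    #define STACK_SIZE  2048

    struct guest;   /* opaque here */
    struct thread;
    struct session;
    struct ta;

    /* Kernel (guest-agnostic) state: exists in exactly one copy */
    struct kernel_state {
            uint8_t tmp_stacks[NB_CORES][STACK_SIZE]; /* entry-point stacks */
            struct guest *known_guests;               /* list of known guests */
            /* device driver state, ... */
    };

    /* TEE (guest-bound) state: one copy per guest, banked on switch */
    struct tee_state {
            struct thread *threads;    /* threads with stacks and state */
            struct session *sessions;  /* opened sessions */
            struct ta *tas;            /* loaded TAs */
            /* mutexes, pobjs, ... */
    };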
So, the memory layout can look like this:
+----------+
|  .reset  |
+----------+
|  .text   |
+----------+
|   .ro    |
+----------+
|  .kdata  |
+----------+
|  .kbss   |
+----------+
|  .kheap  |
+----------+

+----------+
|  .data   |
+----------+
|   .bss   |
+----------+
|  .heap   |
+----------+
| TA SPACE |
+----------+
(This is just an illustration, I am aware that the actual OP-TEE layout is more complex.)
Sections starting with "k" belong to kernel data. I also extended bget: now it supports multiple pools, so I can use kmalloc() to allocate memory from .kheap and plain malloc() to allocate memory from .heap.
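Roughly, the two allocators could look like this (the pool handles and pool_alloc() are hypothetical stand-ins for what the extended bget really provides):

    #include <stddef.h>

    /* Hypothetical pool handles for the extended bget; pool_alloc()
     * stands in for whatever the real multi-pool entry point is. */
    struct bpool;
    extern struct bpool kheap_pool; /* backed by .kheap (kernel data) */
    extern struct bpool heap_pool;  /* backed by .heap (banked per guest) */
    void *pool_alloc(struct bpool *pool, size_t size);

    void *kmalloc(size_t size)
    {
            /* kernel data: one copy, survives guest context switches */
            return pool_alloc(&kheap_pool, size);
    }

    void *malloc(size_t size)
    {
            /* TEE data: comes from the current guest's .heap bank */
            return pool_alloc(&heap_pool, size);
    }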
This layout allows us to switch guest context with simple banking:
+----------+
|  .reset  |
+----------+
|  .text   |
+----------+
|   .ro    |
+----------+
|  .kdata  |
+----------+
|  .kbss   |
+----------+
|  .kheap  |
+----------+
============   ============   ============
==Guest 1===   ==Guest 2===   ==Guest 3===
============   ============   ============
|  .data   |   |  .data   |   |  .data   |
+----------+   +----------+   +----------+
|   .bss   |   |   .bss   |   |   .bss   |
+----------+   +----------+   +----------+
|  .heap   |   |  .heap   |   |  .heap   |
+----------+   +----------+   +----------+
| TA SPACE |   | TA SPACE |   | TA SPACE |
+----------+   +----------+   +----------+
If a guest suddenly dies, we can't clean up its resources one by one (consider a mutex that will never be unlocked). Instead, we can just drop the whole guest context and forget about it. We will still need special cleanup code for the kernel state, though. This is a reason to keep the kernel data footprint as small as possible.
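In code, handling a dying guest could be as simple as something like this (all helper names are invented for the sketch):

    struct guest;

    /* Hypothetical helpers, all invented for this sketch */
    void unmap_guest_bank(struct guest *g);      /* tear down mappings */
    void free_guest_bank_pages(struct guest *g); /* reclaim physical pages */
    void list_remove_guest(struct guest *g);     /* kernel list of guests */
    void kfree(void *p);                         /* kernel heap free */

    /* Dropping a dead guest: the banked TEE context goes away wholesale,
     * no per-mutex/per-session teardown, only kernel state is cleaned up */
    void guest_destroy(struct guest *g)
    {
            unmap_guest_bank(g);      /* .data/.bss/.heap/TA space vanish */
            free_guest_bank_pages(g);
            list_remove_guest(g);     /* kernel state: list of known guests */
            kfree(g);                 /* the record itself lives in .kheap */
    }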
I think it is clear now why I want to talk about virtual memory management :)
Right now (in the absence of the pager) OP-TEE is mostly mapped 1:1 and all CPUs use the same mappings. There is also a separate address space for dynamic shared memory mappings. The pager actually does two things: it breaks the 1:1 mapping and it does the actual paging.
My first intention was to reuse the pager and make it manage the mappings for the different guests. But even now it has an overhead, because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that the pager would have to be severely rewritten. Also, the use of TTBR1 imposes restrictions on where guest contexts can live in the virtual address space.
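To show why TTBR1 is attractive: with all guest-bound sections in the TTBR1-covered half of the address space, the switch could boil down to repointing one register. A minimal sketch, assuming per-guest translation tables and helpers in the style of core/arch/arm/include/arm64.h:

    #include <stdint.h>

    struct guest_ctx {
            uint64_t ttbr1_pa; /* physical address of this guest's tables */
    };

    /* Assumed helpers, named after the arm64.h register accessors */
    void write_ttbr1_el1(uint64_t v);
    void isb(void);
    void tlbi_all(void);

    static void tee_ctx_switch(struct guest_ctx *next)
    {
            /* kernel sections stay mapped via TTBR0 and are untouched */
            write_ttbr1_el1(next->ttbr1_pa); /* .data/.bss/.heap/TA space */
            isb();
            /*
             * Without per-guest ASIDs the old guest's TLB entries must be
             * invalidated; with ASIDs the full invalidate could be avoided.
             */
            tlbi_all();
            isb();
    }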
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
1. Let OP-TEE live in its own address space, not bound to the platform configuration (i.e. as Linux does). For example, most of OP-TEE would be mapped at 0x2000000, while TEE contexts would be mapped at 0x8000000.
2. Add a page allocator. Currently tee_mm acts both as a page allocator and as an address space allocator.
3. Promote pgt_cache to a full-scale address space manager.
4. Split the pager into two parts: the paging mechanism and a backend. A backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages (see the interface sketch after this list).
5. Rename things. pgt_cache is no longer just a cache. tee_mm does two different things at once.
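For item 4, the boundary between the paging mechanism and a backend could be a small ops table. A sketch only, with invented names:

    #include <stdbool.h>

    struct pmem; /* a physical page tracked by the pager */

    /* Hypothetical ops table for the backend half of the split pager */
    struct pager_backend_ops {
            /* protect a page on page-out: encrypt, hash, or rely on a
             * platform-specific mechanism */
            void (*page_out)(void *va, struct pmem *p);
            /* restore/verify a page on page-in; false means tampered */
            bool (*page_in)(void *va, struct pmem *p);
    };

    /* e.g. the current encrypt+hash behaviour as one backend... */
    extern const struct pager_backend_ops pager_crypt_ops;
    /* ...and a backend for platforms that can protect pages in place */
    extern const struct pager_backend_ops pager_plat_ops;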
Jens, I am sure you have your own vision of how VM management should look. Would you share your thoughts, please? Maybe there is a way I can implement TEE context banking with less effort. I will be very happy in that case :-) My biggest concern is that if I take the straightforward approach to context banking, it will complicate things a lot. So, I'm ready to improve the VM part of OP-TEE for everyone's benefit, if it will ease the integration of virtualization support.
I also CC'ed Julien Grall from ARM (and, now, from Linaro). He is one of the ARM arch maintainers in the XEN project. I think he can share some valuable ideas about virtualization as a whole.