Hi all,
In this email I would like to discuss a number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is complete isolation of different guests from each other. To implement this, I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is guest-agnostic data needed for core services. Examples: temporary stacks used by entry points (before thread_alloc_and_run() is invoked), the list of known guests, device driver state, and so on. This data is not guest-specific, so naturally it should exist in a single copy.
TEE data (or TEE state) is guest-bound information: threads (with their stacks and state), opened sessions, loaded TAs, mutexes, pobjs and the like. This data only has meaning in relation to a particular guest.
So, the memory layout can look like this:
+----------+
|  .reset  |
+----------+
|  .text   |
+----------+
|   .ro    |
+----------+
|  .kdata  |
+----------+
|  .kbss   |
+----------+
|  .kheap  |
+----------+

+----------+
|  .data   |
+----------+
|  .bss    |
+----------+
|  .heap   |
+----------+
| TA SPACE |
+----------+
(This is just an illustration; I am aware that the actual OP-TEE layout is more complex.)
Sections whose names start with "k" belong to kernel data. I also extended bget: it now supports multiple pools, so I can use kmalloc() to allocate memory from .kheap and plain malloc() to allocate memory from .heap.
This layout allows us to switch guest contexts with simple banking:
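To make the split concrete, here is a minimal, self-contained sketch of the idea (not the actual bget extension): two statically placed pools, one for kernel data and one for guest data, with a trivial bump allocator standing in for bget. All names here are made up for illustration.

#include <stddef.h>

struct pool {
	unsigned char *base;
	size_t size;
	size_t used;
};

/* Stand-ins for the .kheap (kernel, single copy) and .heap (banked per
 * guest) sections; in the real layout these come from the linker script. */
static unsigned char kheap_mem[16 * 1024];
static unsigned char heap_mem[64 * 1024];

static struct pool kpool = { kheap_mem, sizeof(kheap_mem), 0 };
static struct pool gpool = { heap_mem, sizeof(heap_mem), 0 };

static void *pool_alloc(struct pool *p, size_t len)
{
	void *res;

	len = (len + 7) & ~(size_t)7;	/* naive 8-byte alignment */
	if (len > p->size - p->used)
		return NULL;
	res = p->base + p->used;
	p->used += len;
	return res;
}

/* kmalloc() serves guest-agnostic kernel state... */
void *kmalloc(size_t len) { return pool_alloc(&kpool, len); }
/* ...while guest allocations come from the banked pool. In the real tree
 * this would simply stay malloc(); it is renamed here to keep the sketch
 * standalone. */
void *gmalloc(size_t len) { return pool_alloc(&gpool, len); }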
+----------+
|  .reset  |
+----------+
|  .text   |
+----------+
|   .ro    |
+----------+
|  .kdata  |
+----------+
|  .kbss   |
+----------+
|  .kheap  |
+----------+

============   ============   ============
==Guest 1===   ==Guest 2===   ==Guest 3===
============   ============   ============
|  .data   |   |  .data   |   |  .data   |
+----------+   +----------+   +----------+
|  .bss    |   |  .bss    |   |  .bss    |
+----------+   +----------+   +----------+
|  .heap   |   |  .heap   |   |  .heap   |
+----------+   +----------+   +----------+
| TA SPACE |   | TA SPACE |   | TA SPACE |
+----------+   +----------+   +----------+
If a guest suddenly dies, we can't clean up its resources (consider a mutex that will never be unlocked). Instead we can simply drop the whole guest context and forget about it. We will still need special cleanup code for kernel state, though, which is a reason to keep the kernel data footprint as small as possible.
I think it is clear now why I want to talk about virtual memory management :)
Right now (in the absence of the pager) OP-TEE is mostly mapped 1:1 and all CPUs use the same mappings. There is also a separate address space for dynamic shared memory mappings. The pager actually does two things: it breaks the 1:1 mapping and it does the actual paging.
My first intention was to reuse the pager and make it manage mappings for different guests. But even now it has overhead because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All of this means the pager would have to be severely rewritten. In addition, use of TTBR1 imposes restrictions on where guest contexts can live in the virtual address space.
So I think this is a good occasion to fully overhaul VM management in OP-TEE. What I propose:
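To illustrate why TTBR1 is attractive here, below is a rough sketch (AArch64, S-EL1) of what switching the banked guest mapping on a core could look like. The function name, the full TLB invalidation and the assumption that each guest has its own set of TTBR1 translation tables are mine; a real implementation would likely use ASIDs instead of flushing everything.

#include <stdint.h>

/* Point TTBR1 at the translation tables of the guest this core is about to
 * serve. The TTBR0 (kernel) mapping stays untouched. */
static inline void switch_guest_mapping(uint64_t guest_ttbr1_pa)
{
	__asm__ volatile(
		"msr	ttbr1_el1, %0\n"
		"isb\n"
		"tlbi	vmalle1\n"	/* crude: drop all stale translations */
		"dsb	nsh\n"
		"isb\n"
		: : "r" (guest_ttbr1_pa) : "memory");
}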
1. Let OP-TEE live in its own address space, not bound to the platform configuration (i.e. as Linux does). For example, most of OP-TEE would be mapped at 0x2000000, while TEE contexts would be mapped at 0x8000000.
2. Add a page allocator. Currently tee_mm acts both as a page allocator and as an address space allocator.
3. Promote pgt_cache to a full-scale address space manager.
4. Split the pager into two parts: the paging mechanism and a backend. A backend could encrypt pages, hash them, or use some platform-specific mechanism to protect paged-out pages (a sketch of such a backend interface follows this list).
5. Rename things. pgt_cache is no longer just a cache, and tee_mm does two different things at once.
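For item 4, a backend could be as small as a pair of hooks. Here is a hedged sketch of what such an interface might look like; the struct, its fields and the idea of per-platform registration are assumptions, not existing OP-TEE code:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pager backend: how a page is protected on its way out of
 * secure memory and verified on its way back in. */
struct pager_backend_ops {
	/* Protect a page that is being evicted; may encrypt and/or produce
	 * an integrity tag stored by the paging mechanism. */
	void (*page_out)(void *dst, const void *src, size_t len,
			 uint8_t tag[16]);
	/* Restore a page on a fault; returns false if verification fails. */
	bool (*page_in)(void *dst, const void *src, size_t len,
			const uint8_t tag[16]);
};

/* A platform with a crypto DMA engine could register its own ops here;
 * the default would be the current software encrypt/hash path. */
void pager_set_backend(const struct pager_backend_ops *ops);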
Jens, I am sure you have your own vision of how VM management should look. Would you share your thoughts, please? Maybe there is a way I can implement TEE context banking with less effort; I will be very happy in that case :-) My biggest concern is that a straightforward approach to context banking will complicate things a lot. So I'm ready to improve the VM part of OP-TEE for everyone's benefit if it eases integration of virtualization support.
I have also CC'ed Julien Grall from ARM (and, now, from Linaro). He is one of the ARM arch maintainers in the Xen project. I think he can share some valuable ideas about virtualization as a whole.
On 12/06/2017 05:40 PM, Volodymyr Babchuk wrote:
Hi all,
Hi Volodymyr,
I am not an OP-TEE expert, though I have one comment below on how Xen works.
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I don't think the mutex example is correct. When there is a sudden guest crash (e.g. when the guest misbehaves), Xen will notify all the guest's vCPUs by raising a softirq. This raises an SGI targeting the pCPU where each vCPU is running.
So the vCPU will finish what it is doing (such as handling a hypercall or an SMC) and will get unscheduled before returning to EL1/EL0.
Do you expect the mutex to stay locked across an SMC call?
Cheers,
Hi Julien,
On 7 December 2017 at 00:33, Julien Grall julien.grall@arm.com wrote:
Hi Volodymyr,
I am not an OP-TEE expert, thought I have one comment below on how Xen works.
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I don't think the mutex example is correct. When there are a sudden guest crash (e.g when the guest badly behave), Xen will notify all the guest vCPUs by raising a softirq. This will raise a SGI target the pCPU where the vCPU is running.
So the vCPU will finish what he is doing (such as handling an hypercall or SMC) and will get unscheduled before returning to EL1/EL0.
Yes, this is not a problem. If a pCPU is busy in the Secure World, OP-TEE will let it out only when it is relatively safe. E.g. it will not issue an RPC return to handle an SGI during an atomic operation. But it can issue an RPC return while holding a mutex. It is also possible that one of the OP-TEE threads is already blocked on a mutex and is waiting for its turn in the Normal World, sleeping on a completion in the Linux optee driver.
Do you expect the mutex to stay locked accross SMC call?
Yes, this is the case. As I said in the xen-devel discussion, OP-TEE is scheduled by the Normal World. So when an OP-TEE thread gets blocked on a mutex, it issues an RPC to exit to the Normal World. When the mutex owner unlocks it, it also issues an RPC to unblock the first thread, so that it can return back into the Secure World.
But as OP-TEE uses mutexes only in STD calls, it is possible to have mutex instances per guest. So when a guest dies, OP-TEE can just forget about all of that guest's threads, mutexes and other state.
Hi Volodymyr,
On Wed, Dec 6, 2017 at 6:40 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi all,
In this email I would like to discuss number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is a guest-agnostic data needed for a core services. Examples: temporary stacks used by entry points (used before thread_alloc_and_run() invocation), list of known guests, device drivers state and so on. This kind of data is not guest-specific, so naturally it should exist in one copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data have a meaning only regarding to a certain guest.
So, memory layout can look like this:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+ +----------+ | .data | +----------+ | .bss | +----------+ | .heap | +----------+ | TA SPACE | +----------+
(This is just an illustration, I aware that actual OP-TEE layout is more complex).
Sections starting with "k" belong to kernel data. I also extended bget. Now it supports multiple pools and I can use kmalloc() to allocated memory from .kheap and plain malloc() to allocate mempory from .heap.
This layout allows us to switch guest context with simple banking:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+
============ ============ ============ ==Guest 1=== ==Guest 2=== ==Guest 3=== ============ ============ ============ | .data | | .data | | .data | +----------+ +----------+ +----------+ | .bss | | .bss | | .bss | +----------+ +----------+ +----------+ | .heap | | .heap | | .heap | +----------+ +----------+ +----------+ | TA SPACE | | TA SPACE | | TA SPACE | +----------+ +----------+ +----------+
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I think, it is clear now, why I want to talk about virtual memory management :)
Right now (in absence of pager) OP-TEE is mostly mapped 1:1, all CPUs use the same mappings. Also there is a separate address space for a dynamic shared memory mappings. Pager actually does two things: it breaks 1:1 mapping and also actually does paging.
My first intention was to reuse pager and to make it manage mappings for a different guests. But ever now it have an overhead, because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that pager should be severely rewritten. Also, use of TTBR1 imposes restrictions on guest context location in virtual address space.
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
- Let OP-TEE to live in own address space, not bound to platform configuration (i.e. as linux does). For example most of the OP-TEE will be mapped at 0x2000000, when TEE contexts will be mapped at 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
- Add page allocator. Currently tee_mm acts both as a page allocator and as a address space allocator.
tee_mm is actually more of a resource allocator.
Promote pgt_cache to full-scale address space manager.
Split pager into two parts: paging mechanism and backend. Backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages
I don't think the pager needs a backend; we're only swapping out to memory, so it's still a fast operation (relative to writing to flash or rotating media).
- Rename things. pgt_cache is no more just a cache. tee_mm does two different things at once.
Jens, I am sure, you have your own vision how VM management should like. Would you share your thoughts, please? Maybe there is a way, how I can implement TEE context banking with less effort. I will be very happy in this case :-) My biggest concern that if I'll take straight approach to context banking, this will complicate things a lot. So, I'm ready to improve VM part of OP-TEE for everyone's benefit, if it will ease up integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables. How would what should go into the guest vmspace be identified?
Thanks, Jens
Hi Jens,
On 7 December 2017 at 09:37, Jens Wiklander jens.wiklander@linaro.org wrote:
In this email I would like to discuss number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is a guest-agnostic data needed for a core services. Examples: temporary stacks used by entry points (used before thread_alloc_and_run() invocation), list of known guests, device drivers state and so on. This kind of data is not guest-specific, so naturally it should exist in one copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data have a meaning only regarding to a certain guest.
So, memory layout can look like this:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+ +----------+ | .data | +----------+ | .bss | +----------+ | .heap | +----------+ | TA SPACE | +----------+
(This is just an illustration, I aware that actual OP-TEE layout is more complex).
Sections starting with "k" belong to kernel data. I also extended bget. Now it supports multiple pools and I can use kmalloc() to allocated memory from .kheap and plain malloc() to allocate mempory from .heap.
This layout allows us to switch guest context with simple banking:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+
============ ============ ============ ==Guest 1=== ==Guest 2=== ==Guest 3=== ============ ============ ============ | .data | | .data | | .data | +----------+ +----------+ +----------+ | .bss | | .bss | | .bss | +----------+ +----------+ +----------+ | .heap | | .heap | | .heap | +----------+ +----------+ +----------+ | TA SPACE | | TA SPACE | | TA SPACE | +----------+ +----------+ +----------+
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I think, it is clear now, why I want to talk about virtual memory management :)
Right now (in absence of pager) OP-TEE is mostly mapped 1:1, all CPUs use the same mappings. Also there is a separate address space for a dynamic shared memory mappings. Pager actually does two things: it breaks 1:1 mapping and also actually does paging.
My first intention was to reuse pager and to make it manage mappings for a different guests. But ever now it have an overhead, because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that pager should be severely rewritten. Also, use of TTBR1 imposes restrictions on guest context location in virtual address space.
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
- Let OP-TEE to live in own address space, not bound to platform configuration (i.e. as linux does). For example most of the OP-TEE will be mapped at 0x2000000, when TEE contexts will be mapped at 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
Yes. It is not that I insist on such a rework. But if we conclude that it is good to have for virtualization, then I can assist there.
- Add page allocator. Currently tee_mm acts both as a page allocator and as a address space allocator.
tee_mm is actually more of a resource allocator.
Indeed. I wasn't entirely precise there.
Promote pgt_cache to full-scale address space manager.
Split pager into two parts: paging mechanism and backend. Backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages
I don't think the pager need a backend, we're only swapping out to memory so it's still a fast operation (relative to writing to flash or rotating media).
I don't insist on this. I just thought it was a good idea. For example, there are platforms with crypto DMA engines, which could greatly increase paging speed.
- Rename things. pgt_cache is no more just a cache. tee_mm does two different things at once.
Jens, I am sure, you have your own vision how VM management should like. Would you share your thoughts, please? Maybe there is a way, how I can implement TEE context banking with less effort. I will be very happy in this case :-) My biggest concern that if I'll take straight approach to context banking, this will complicate things a lot. So, I'm ready to improve VM part of OP-TEE for everyone's benefit, if it will ease up integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
One of the reasons, yes. And not only global variables, but also the malloc pool, mapped shared buffers and other resources.
Consider entries in tee_mmu or pgt state.
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
Actually, there are lots of things that should go into a guest context (or require their own cleanup code): the session list, the list of open TAs, pgt_cache state, mutexes, condvars, pobjs, pTA states, and so on. Just look at tee_ta_manager.c, for example:
struct mutex tee_ta_mutex = MUTEX_INITIALIZER;
static struct condvar tee_ta_cv = CONDVAR_INITIALIZER;
static int tee_ta_single_instance_thread = THREAD_ID_INVALID;
static size_t tee_ta_single_instance_count;
struct tee_ta_ctx_head tee_ctxes = TAILQ_HEAD_INITIALIZER(tee_ctxes);
All of this should be stored in a guest context.
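As an illustration, the globals quoted above could be grouped roughly like this. This is a hedged sketch: the struct names and the field set are mine, and it assumes the quoted types are visible via the usual kernel/mutex.h and kernel/tee_ta_manager.h headers.

#include <stddef.h>
#include <stdint.h>
#include <kernel/mutex.h>
#include <kernel/tee_ta_manager.h>

struct guest_ta_state {
	struct mutex tee_ta_mutex;
	struct condvar tee_ta_cv;
	int tee_ta_single_instance_thread;
	size_t tee_ta_single_instance_count;
	struct tee_ta_ctx_head tee_ctxes;
};

struct guest_state {
	uint16_t guest_id;
	struct guest_ta_state ta_state;
	/* ...sessions, pobjs, pgt_cache state, per-guest malloc pool... */
};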
Almost every file in core/tee has at least a couple of globals. They should also be moved into a guest context. This is doable, but it means a lot of changes.
What I propose is to make one big change in the VM part and leave all other code as is. On the other hand, it is probably a good thing to have a context for every OP-TEE part. This could make it more modular.
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
Yes. I have such a structure (you have probably seen it in [1]). I also have a get_client_context() function that returns a pointer to the context of the current caller.
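For reference, a get_client_context()-style accessor can stay very small. This is a hypothetical sketch rather than the code from [1]: MAX_GUESTS, NB_CORES, the per-core current-guest table and the way it is filled at SMC entry are all assumptions.

#include <stddef.h>

#define MAX_GUESTS	8
#define NB_CORES	4	/* would be CFG_TEE_CORE_NB_CORE in the tree */

struct guest_state;	/* as sketched earlier in the thread */

static struct guest_state *guests[MAX_GUESTS];
/* Filled in by the SMC entry path before any guest-bound code runs */
static unsigned int current_guest[NB_CORES];

struct guest_state *get_client_context(unsigned int core)
{
	return guests[current_guest[core]];
}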
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables.
Yes. But I tried another approach - tagging all resources with a guest_id. That was even more complicated. Your idea looks simpler in a way. Could you please take a look at [1]? I think this is what you are talking about. I'll play with it further, but currently I'm worried about the amount of changes needed across the whole of OP-TEE.
On the other hand, a separate vmspace guarantees that at any moment OP-TEE code works only with data belonging to one guest. Also, it naturally provides a quota mechanism for guests.
How would what should go into the guest vmspace be identified?
Everything that does not go into the kernel vmspace goes into the guest vmspace. For the kernel vmspace I created macros like "__kdata" and "__kbss" to put variables into the right sections. There will be a lot more guest-tied variables than kernel ones, so it is preferable to leave guest data in the default sections.
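For illustration, the macros can be plain section attributes. This is a sketch of my approach rather than upstream code, and the matching .kdata/.kbss output sections of course have to be collected by the linker script:

/* Place guest-agnostic kernel state into dedicated sections so that it is
 * excluded from the banked (per-guest) .data/.bss. */
#define __kdata	__attribute__((__section__(".kdata")))
#define __kbss	__attribute__((__section__(".kbss")))

/* Example: the number of known guests is kernel state, one copy only. */
static unsigned int num_known_guests __kdata = 0;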
So, to sum up:
* My (implicit context) approach with vmspace:
  Pros:
  * Most of the OP-TEE code can be left intact.
  * A bit faster: it is faster to access a global variable directly.
  * Authors of new features need not worry about virtualization. I hope that separate vmspaces will look fairly transparent.
  Cons:
  * Huge changes in VM management.
  * Due to its implicit nature, it can cause problems in corner cases.
* Your (explicit context) approach with struct guest_state:
  Pros:
  * Does not require big changes in the VM code. Maybe no changes at all.
  * Easier to understand.
  * Better code shape due to the absence of global variables (I'm not quite sure about this one).
  * Changes in subsystems can be introduced gradually. You don't need to switch to a completely new scheme at once.
  Cons:
  * A bit slower. Code has to access the context somehow, either via get_guest_context() or via function parameters.
  * A lot of changes across the whole of OP-TEE.
  * Authors of new code must keep virtualization in mind and not use global state at all, or provide reliable recovery mechanisms.
[1] https://github.com/OP-TEE/optee_os/pull/1910
On Thu, Dec 7, 2017 at 7:25 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 7 December 2017 at 09:37, Jens Wiklander jens.wiklander@linaro.org wrote:
In this email I would like to discuss number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is a guest-agnostic data needed for a core services. Examples: temporary stacks used by entry points (used before thread_alloc_and_run() invocation), list of known guests, device drivers state and so on. This kind of data is not guest-specific, so naturally it should exist in one copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data have a meaning only regarding to a certain guest.
So, memory layout can look like this:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+ +----------+ | .data | +----------+ | .bss | +----------+ | .heap | +----------+ | TA SPACE | +----------+
(This is just an illustration, I aware that actual OP-TEE layout is more complex).
Sections starting with "k" belong to kernel data. I also extended bget. Now it supports multiple pools and I can use kmalloc() to allocated memory from .kheap and plain malloc() to allocate mempory from .heap.
This layout allows us to switch guest context with simple banking:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+
============ ============ ============ ==Guest 1=== ==Guest 2=== ==Guest 3=== ============ ============ ============ | .data | | .data | | .data | +----------+ +----------+ +----------+ | .bss | | .bss | | .bss | +----------+ +----------+ +----------+ | .heap | | .heap | | .heap | +----------+ +----------+ +----------+ | TA SPACE | | TA SPACE | | TA SPACE | +----------+ +----------+ +----------+
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I think, it is clear now, why I want to talk about virtual memory management :)
Right now (in absence of pager) OP-TEE is mostly mapped 1:1, all CPUs use the same mappings. Also there is a separate address space for a dynamic shared memory mappings. Pager actually does two things: it breaks 1:1 mapping and also actually does paging.
My first intention was to reuse pager and to make it manage mappings for a different guests. But ever now it have an overhead, because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that pager should be severely rewritten. Also, use of TTBR1 imposes restrictions on guest context location in virtual address space.
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
- Let OP-TEE to live in own address space, not bound to platform configuration (i.e. as linux does). For example most of the OP-TEE will be mapped at 0x2000000, when TEE contexts will be mapped at 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
Yes. It is not as I insist on a such rework. But if we'll conclude, that it is good to have for the virtualization, then I can assist there.
- Add page allocator. Currently tee_mm acts both as a page allocator and as a address space allocator.
tee_mm is actually more of a resource allocator.
Indeed. I wasn't precisely correct there.
Promote pgt_cache to full-scale address space manager.
Split pager into two parts: paging mechanism and backend. Backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages
I don't think the pager need a backend, we're only swapping out to memory so it's still a fast operation (relative to writing to flash or rotating media).
I'm not insist on this. I just thought that it a good idea. For example, there are platforms with cryptoDMA, which can greatly increase paging speed.
I'd leave this alone until there's actual hardware where you get any gain. This doesn't have anything to do with virtualization either.
- Rename things. pgt_cache is no more just a cache. tee_mm does two different things at once.
Jens, I am sure, you have your own vision how VM management should like. Would you share your thoughts, please? Maybe there is a way, how I can implement TEE context banking with less effort. I will be very happy in this case :-) My biggest concern that if I'll take straight approach to context banking, this will complicate things a lot. So, I'm ready to improve VM part of OP-TEE for everyone's benefit, if it will ease up integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
One of the reasons, yes. Plus, not only global variables, but also malloc pool, mapped shared buffers and other resources.
Consider entries in tee_mmu or pgt state.
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
Actually, there are lots of things that should go to a guest context (or require own cleanup code): session list, open TAs list, pgt_cache state, mutexes, condvars, pobjs, pTA states, and so on, so on. Just look at tee_ta_manager.c for example:
struct mutex tee_ta_mutex = MUTEX_INITIALIZER; static struct condvar tee_ta_cv = CONDVAR_INITIALIZER; static int tee_ta_single_instance_thread = THREAD_ID_INVALID; static size_t tee_ta_single_instance_count; struct tee_ta_ctx_head tee_ctxes = TAILQ_HEAD_INITIALIZER(tee_ctxes);
All this should be stored in guest context.
Yes and it would probably need to be analyzed a bit too, to see if there's any cleanup to be done.
Almost any file in core/tee has at least couple of globals. They also should be moved to a guest context. This is doable, but this is a lots of changes.
What is I propose is to do one big chance in VM part and leave all other code as is. On other hand, it is probably good thing to have a context for every OP-TEE part. This can make it more modular.
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
Yes. I have such structure (probably, you seen it in [1]). Also I have get_client_context() function that returns pointer to context for a current caller.
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables.
Yes. But I tried another approach - to tag all resources with guest_id. That was even more complicated. But your idea looks simpler in a way. Could you please take a look at [1]? I think, this is what you are talking about. I'll try to play with it further, but currently I'm worried about amount of needed changes on the whole OP-TEE.
I took a quick look, and this is the direction I'd prefer.
On other hand, separate vmspace guarantees that in any moment OP-TEE code will work with data belonging to one guest. Also, it is naturally provides quota mechanism for guests.
Yes, you'd get some of that with a separate vmspace, but I think it would be too complicated. Just think about debugging... But it's not like we can't manage quotas without a vmspace. A separate bget pool for each guest is a good start.
How would what should go into the guest vmspace be identified?
All what is not goes to to kernel vmspace, goes to guest vmspace. For kernel vmspace I created macros like "__kdata" or "__kbss" to put variables into right sections. There will be a lot more guest-tied variables, than kernel ones. So it is preferred to leave guest data in default sections.
So, to sum up:
- My (implicit context) approach with vmspace: Pros:
- Most of OP-TEE code can be leaved intact.
- A bit faster: it is faster to access global variable directly.
I doubt that you'd be able to measure it. In fact it wouldn't surprise me that much if it became slower instead.
- Authors of new features should not worry about virtualization. I hope, that separate vmspaces will look pretty transparent.
Yes, this is an advantage, but there are ways to guard against adding global variables.
Cons:
- Huge changes in VM management.
Indeed
Due to implicit nature, it can cause problems in corner cases
Your (explicit context) approach with struct guest_state Pros:
- Does not require big changes in VM. Maybe no changes at all.
- Easier to understand.
Easier to understand is quite important.
- Better code shape due to absence of global variables (I don't quite sure there).
Agree, this is a weak argument.
- Changes in subsystems can be introduced gradually. You don't need to switch to completely new scheme at once.
Yes, this is also quite important. Big bang isn't that fun.
Cons:
- A bit slower. Code should access context somehow. Either via get_guest_context() or via function parameters.
Again, I doubt you'd be able to measure it. We can store the pointer to thread_core_local in TPIDRPRW to speed up the access to thread_core_local in the cases where we don't pass some pointer as a parameter.
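For the record, here is a rough sketch of the TPIDRPRW idea (AArch32 shown; on AArch64 the natural counterpart would be TPIDR_EL1). The helper names are made up; only the register usage is the point.

/* Keep a per-core pointer (e.g. to thread_core_local) in TPIDRPRW so hot
 * paths can fetch it in one instruction instead of recomputing an index. */
static inline void set_core_local_ptr(void *p)
{
	__asm__ volatile("mcr p15, 0, %0, c13, c0, 4" : : "r" (p));
}

static inline void *get_core_local_ptr(void)
{
	void *p;

	__asm__ volatile("mrc p15, 0, %0, c13, c0, 4" : "=r" (p));
	return p;
}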
- A lot of changes in all code across whole OP-TEE.
Yes, there will be changes, but they don't all have to be done in a single pull request. The different parts can be fixed one by one.
- Authors of new code should remember about virtualization and don't use global state at all. Or provide reliable mechanisms of recovery.
We can add something that checks that only whitelisted files may add global variables (.data or .bss)
Thanks, Jens
On Fri, Dec 8, 2017 at 2:12 PM, Jens Wiklander jens.wiklander@linaro.org wrote:
On Thu, Dec 7, 2017 at 7:25 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 7 December 2017 at 09:37, Jens Wiklander jens.wiklander@linaro.org wrote:
In this email I would like to discuss number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is a guest-agnostic data needed for a core services. Examples: temporary stacks used by entry points (used before thread_alloc_and_run() invocation), list of known guests, device drivers state and so on. This kind of data is not guest-specific, so naturally it should exist in one copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data have a meaning only regarding to a certain guest.
So, memory layout can look like this:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+ +----------+ | .data | +----------+ | .bss | +----------+ | .heap | +----------+ | TA SPACE | +----------+
(This is just an illustration, I aware that actual OP-TEE layout is more complex).
Sections starting with "k" belong to kernel data. I also extended bget. Now it supports multiple pools and I can use kmalloc() to allocated memory from .kheap and plain malloc() to allocate mempory from .heap.
This layout allows us to switch guest context with simple banking:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+
============ ============ ============ ==Guest 1=== ==Guest 2=== ==Guest 3=== ============ ============ ============ | .data | | .data | | .data | +----------+ +----------+ +----------+ | .bss | | .bss | | .bss | +----------+ +----------+ +----------+ | .heap | | .heap | | .heap | +----------+ +----------+ +----------+ | TA SPACE | | TA SPACE | | TA SPACE | +----------+ +----------+ +----------+
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I think, it is clear now, why I want to talk about virtual memory management :)
Right now (in absence of pager) OP-TEE is mostly mapped 1:1, all CPUs use the same mappings. Also there is a separate address space for a dynamic shared memory mappings. Pager actually does two things: it breaks 1:1 mapping and also actually does paging.
My first intention was to reuse pager and to make it manage mappings for a different guests. But ever now it have an overhead, because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that pager should be severely rewritten. Also, use of TTBR1 imposes restrictions on guest context location in virtual address space.
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
- Let OP-TEE to live in own address space, not bound to platform configuration (i.e. as linux does). For example most of the OP-TEE will be mapped at 0x2000000, when TEE contexts will be mapped at 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
Yes. It is not as I insist on a such rework. But if we'll conclude, that it is good to have for the virtualization, then I can assist there.
- Add page allocator. Currently tee_mm acts both as a page allocator and as a address space allocator.
tee_mm is actually more of a resource allocator.
Indeed. I wasn't precisely correct there.
Promote pgt_cache to full-scale address space manager.
Split pager into two parts: paging mechanism and backend. Backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages
I don't think the pager need a backend, we're only swapping out to memory so it's still a fast operation (relative to writing to flash or rotating media).
I'm not insist on this. I just thought that it a good idea. For example, there are platforms with cryptoDMA, which can greatly increase paging speed.
I'd leave this alone until there's an actual hardware where you get any gain. This doesn't have anything to do with virtualization either.
- Rename things. pgt_cache is no more just a cache. tee_mm does two different things at once.
Jens, I am sure, you have your own vision how VM management should like. Would you share your thoughts, please? Maybe there is a way, how I can implement TEE context banking with less effort. I will be very happy in this case :-) My biggest concern that if I'll take straight approach to context banking, this will complicate things a lot. So, I'm ready to improve VM part of OP-TEE for everyone's benefit, if it will ease up integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
One of the reasons, yes. Plus, not only global variables, but also malloc pool, mapped shared buffers and other resources.
Consider entries in tee_mmu or pgt state.
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
Actually, there are lots of things that should go to a guest context (or require own cleanup code): session list, open TAs list, pgt_cache state, mutexes, condvars, pobjs, pTA states, and so on, so on. Just look at tee_ta_manager.c for example:
struct mutex tee_ta_mutex = MUTEX_INITIALIZER; static struct condvar tee_ta_cv = CONDVAR_INITIALIZER; static int tee_ta_single_instance_thread = THREAD_ID_INVALID; static size_t tee_ta_single_instance_count; struct tee_ta_ctx_head tee_ctxes = TAILQ_HEAD_INITIALIZER(tee_ctxes);
All this should be stored in guest context.
Yes and it would probably need to be analyzed a bit too, to see if there's any cleanup to be done.
Almost any file in core/tee has at least couple of globals. They also should be moved to a guest context. This is doable, but this is a lots of changes.
What is I propose is to do one big chance in VM part and leave all other code as is. On other hand, it is probably good thing to have a context for every OP-TEE part. This can make it more modular.
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
Yes. I have such structure (probably, you seen it in [1]). Also I have get_client_context() function that returns pointer to context for a current caller.
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables.
Yes. But I tried another approach - to tag all resources with guest_id. That was even more complicated. But your idea looks simpler in a way. Could you please take a look at [1]? I think, this is what you are talking about. I'll try to play with it further, but currently I'm worried about amount of needed changes on the whole OP-TEE.
I took a quick look, and this is the direction I'd prefer.
On other hand, separate vmspace guarantees that in any moment OP-TEE code will work with data belonging to one guest. Also, it is naturally provides quota mechanism for guests.
Yes, you'd get some with a separate vmspace, but I think it will be too complicated. Just think about debugging... But it's not like we can't manage quota without a vmspace. A separate bget-pool for each guest is a good start.
How would what should go into the guest vmspace be identified?
All what is not goes to to kernel vmspace, goes to guest vmspace. For kernel vmspace I created macros like "__kdata" or "__kbss" to put variables into right sections. There will be a lot more guest-tied variables, than kernel ones. So it is preferred to leave guest data in default sections.
So, to sum up:
- My (implicit context) approach with vmspace: Pros:
- Most of OP-TEE code can be leaved intact.
- A bit faster: it is faster to access global variable directly.
I doubt that you'd be able to measure it. In fact it wouldn't surprise me that much if it became slower instead.
- Authors of new features should not worry about virtualization. I hope, that separate vmspaces will look pretty transparent.
Yes, this is an advantage, but there's ways to guard against adding global variables.
Here's another Pro:
* In case OP-TEE isn't configured for virtualization there's almost zero impact.
Cons:
- Huge changes in VM management.
Indeed
Hmm, come to think of it, it doesn't have to be that bad. There will be some read/write mapped pages that need to change depending on which guest is being served. Making a proof of concept shouldn't be that hard.
The selling point for me would be virtually no impact unless configured for virtualization.
Thanks, Jens
Due to implicit nature, it can cause problems in corner cases
Your (explicit context) approach with struct guest_state Pros:
- Does not require big changes in VM. Maybe no changes at all.
- Easier to understand.
Easier to understand is quite important.
- Better code shape due to absence of global variables (I don't quite sure there).
Agree, this is a weak argument.
- Changes in subsystems can be introduced gradually. You don't need to switch to completely new scheme at once.
Yes, this is also quite important. Big bang isn't that fun.
Cons:
- A bit slower. Code should access context somehow. Either via get_guest_context() or via function parameters.
Again, I doubt you'd be able to measure it. We can store the pointer to thread_core_local in TPIDRPRW to speed up the access to thread_core_local in the cases where we don't pass some pointer as a parameter.
- A lot of changes in all code across whole OP-TEE.
Yes, there will be changes, but all doesn't have to be done in a single pull request. The different parts can be fixed one, by one.
- Authors of new code should remember about virtualization and don't use global state at all. Or provide reliable mechanisms of recovery.
We can add something that checks that only whitelisted files may add global variables (.data or .bss)
Thanks, Jens
Hi Jens,
On 8 December 2017 at 16:36, Jens Wiklander jens.wiklander@linaro.org wrote:
On Fri, Dec 8, 2017 at 2:12 PM, Jens Wiklander jens.wiklander@linaro.org wrote:
On Thu, Dec 7, 2017 at 7:25 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 7 December 2017 at 09:37, Jens Wiklander jens.wiklander@linaro.org wrote:
In this email I would like to discuss number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is a guest-agnostic data needed for a core services. Examples: temporary stacks used by entry points (used before thread_alloc_and_run() invocation), list of known guests, device drivers state and so on. This kind of data is not guest-specific, so naturally it should exist in one copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data have a meaning only regarding to a certain guest.
So, memory layout can look like this:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+ +----------+ | .data | +----------+ | .bss | +----------+ | .heap | +----------+ | TA SPACE | +----------+
(This is just an illustration, I aware that actual OP-TEE layout is more complex).
Sections starting with "k" belong to kernel data. I also extended bget. Now it supports multiple pools and I can use kmalloc() to allocated memory from .kheap and plain malloc() to allocate mempory from .heap.
This layout allows us to switch guest context with simple banking:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+
============ ============ ============ ==Guest 1=== ==Guest 2=== ==Guest 3=== ============ ============ ============ | .data | | .data | | .data | +----------+ +----------+ +----------+ | .bss | | .bss | | .bss | +----------+ +----------+ +----------+ | .heap | | .heap | | .heap | +----------+ +----------+ +----------+ | TA SPACE | | TA SPACE | | TA SPACE | +----------+ +----------+ +----------+
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I think, it is clear now, why I want to talk about virtual memory management :)
Right now (in absence of pager) OP-TEE is mostly mapped 1:1, all CPUs use the same mappings. Also there is a separate address space for a dynamic shared memory mappings. Pager actually does two things: it breaks 1:1 mapping and also actually does paging.
My first intention was to reuse pager and to make it manage mappings for a different guests. But ever now it have an overhead, because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that pager should be severely rewritten. Also, use of TTBR1 imposes restrictions on guest context location in virtual address space.
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
- Let OP-TEE to live in own address space, not bound to platform configuration (i.e. as linux does). For example most of the OP-TEE will be mapped at 0x2000000, when TEE contexts will be mapped at 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
Yes. It is not as I insist on a such rework. But if we'll conclude, that it is good to have for the virtualization, then I can assist there.
- Add page allocator. Currently tee_mm acts both as a page allocator and as a address space allocator.
tee_mm is actually more of a resource allocator.
Indeed. I wasn't precisely correct there.
Promote pgt_cache to full-scale address space manager.
Split pager into two parts: paging mechanism and backend. Backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages
I don't think the pager need a backend, we're only swapping out to memory so it's still a fast operation (relative to writing to flash or rotating media).
I'm not insist on this. I just thought that it a good idea. For example, there are platforms with cryptoDMA, which can greatly increase paging speed.
I'd leave this alone until there's an actual hardware where you get any gain. This doesn't have anything to do with virtualization either.
- Rename things. pgt_cache is no more just a cache. tee_mm does two different things at once.
Jens, I am sure, you have your own vision how VM management should like. Would you share your thoughts, please? Maybe there is a way, how I can implement TEE context banking with less effort. I will be very happy in this case :-) My biggest concern that if I'll take straight approach to context banking, this will complicate things a lot. So, I'm ready to improve VM part of OP-TEE for everyone's benefit, if it will ease up integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
One of the reasons, yes. Plus, not only global variables, but also malloc pool, mapped shared buffers and other resources.
Consider entries in tee_mmu or pgt state.
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
Actually, there are lots of things that should go to a guest context (or require own cleanup code): session list, open TAs list, pgt_cache state, mutexes, condvars, pobjs, pTA states, and so on, so on. Just look at tee_ta_manager.c for example:
struct mutex tee_ta_mutex = MUTEX_INITIALIZER; static struct condvar tee_ta_cv = CONDVAR_INITIALIZER; static int tee_ta_single_instance_thread = THREAD_ID_INVALID; static size_t tee_ta_single_instance_count; struct tee_ta_ctx_head tee_ctxes = TAILQ_HEAD_INITIALIZER(tee_ctxes);
All this should be stored in guest context.
Yes and it would probably need to be analyzed a bit too, to see if there's any cleanup to be done.
Almost any file in core/tee has at least couple of globals. They also should be moved to a guest context. This is doable, but this is a lots of changes.
What is I propose is to do one big chance in VM part and leave all other code as is. On other hand, it is probably good thing to have a context for every OP-TEE part. This can make it more modular.
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
Yes. I have such structure (probably, you seen it in [1]). Also I have get_client_context() function that returns pointer to context for a current caller.
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables.
Yes. But I tried another approach - to tag all resources with guest_id. That was even more complicated. But your idea looks simpler in a way. Could you please take a look at [1]? I think, this is what you are talking about. I'll try to play with it further, but currently I'm worried about amount of needed changes on the whole OP-TEE.
I took a quick look, and this is the direction I'd prefer.
On other hand, separate vmspace guarantees that in any moment OP-TEE code will work with data belonging to one guest. Also, it is naturally provides quota mechanism for guests.
Yes, you'd get some with a separate vmspace, but I think it will be too complicated. Just think about debugging... But it's not like we can't manage quota without a vmspace. A separate bget-pool for each guest is a good start.
How would what should go into the guest vmspace be identified?
All what is not goes to to kernel vmspace, goes to guest vmspace. For kernel vmspace I created macros like "__kdata" or "__kbss" to put variables into right sections. There will be a lot more guest-tied variables, than kernel ones. So it is preferred to leave guest data in default sections.
So, to sum up:
- My (implicit context) approach with vmspace: Pros:
- Most of OP-TEE code can be leaved intact.
- A bit faster: it is faster to access global variable directly.
I doubt that you'd be able to measure it. In fact it wouldn't surprise me that much if it became slower instead.
- Authors of new features should not worry about virtualization. I hope, that separate vmspaces will look pretty transparent.
Yes, this is an advantage, but there's ways to guard against adding global variables.
Here's another Pro:
- In case OP-TEE isn't configured for virtualization there's almost zero impact.
Basically yes. The memory footprint will get bigger in any case, but besides that I don't see any additional overhead.
Cons:
- Huge changes in VM management.
Indeed
Hmm, come to think of it. It doesn't have to be that bad. There will be some read/write mapped pages that need to change depending on which guest is being served. Making a proof of concept shouldn't be that hard.
Yep. At first I wanted to implement the PoC as a completely separate feature, but then I realized that it would be incomplete with the pager, and the pager, on the other hand, provides some vaguely similar services. Then I realized that I need a separate memory mapping for every core, which led me to TTBR1. But if I want to use TTBR1, I can't place the vmspace at an arbitrary address. And so we are here :) Actually, I can implement the PoC in a quick and dirty way, just to see how it turns out and what minimal set of changes is needed.
The selling point for me would be virtually no impact unless configured for virtualization.
I just realized that with a bit of #ifdef magic I can merge the bget pools back into one, and also that I don't need to put the kernel and TEE data sections into separate pages. So the memory footprint overhead will be even smaller than I expected.
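Something along these lines is what I have in mind (a sketch; CFG_VIRTUALIZATION is just a placeholder name for the build option):

#include <stdlib.h>

#ifdef CFG_VIRTUALIZATION
/* Separate kernel-only pool backed by .kheap */
void *kmalloc(size_t size);
void kfree(void *ptr);
#else
/* Without virtualization there is a single pool: kmalloc() collapses into
 * plain malloc() and no extra sections or page alignment are needed. */
#define kmalloc(size)	malloc(size)
#define kfree(ptr)	free(ptr)
#endif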
Thanks, Jens
Due to implicit nature, it can cause problems in corner cases
Your (explicit context) approach with struct guest_state Pros:
- Does not require big changes in VM. Maybe no changes at all.
- Easier to understand.
Easier to understand is quite important.
- Better code shape due to absence of global variables (I don't quite sure there).
Agree, this is a weak argument.
- Changes in subsystems can be introduced gradually. You don't need to switch to completely new scheme at once.
Yes, this is also quite important. Big bang isn't that fun.
Cons:
- A bit slower. Code should access context somehow. Either via get_guest_context() or via function parameters.
Again, I doubt you'd be able to measure it. We can store the pointer to thread_core_local in TPIDRPRW to speed up the access to thread_core_local in the cases where we don't pass some pointer as a parameter.
- A lot of changes in all code across whole OP-TEE.
Yes, there will be changes, but all doesn't have to be done in a single pull request. The different parts can be fixed one, by one.
- Authors of new code should remember about virtualization and don't use global state at all. Or provide reliable mechanisms of recovery.
We can add something that checks that only whitelisted files may add global variables (.data or .bss)
Thanks, Jens
On Fri, Dec 8, 2017 at 4:41 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 8 December 2017 at 16:36, Jens Wiklander jens.wiklander@linaro.org wrote:
On Fri, Dec 8, 2017 at 2:12 PM, Jens Wiklander jens.wiklander@linaro.org wrote:
On Thu, Dec 7, 2017 at 7:25 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 7 December 2017 at 09:37, Jens Wiklander jens.wiklander@linaro.org wrote:
In this email I would like to discuss number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is a guest-agnostic data needed for a core services. Examples: temporary stacks used by entry points (used before thread_alloc_and_run() invocation), list of known guests, device drivers state and so on. This kind of data is not guest-specific, so naturally it should exist in one copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data have a meaning only regarding to a certain guest.
So, memory layout can look like this:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+ +----------+ | .data | +----------+ | .bss | +----------+ | .heap | +----------+ | TA SPACE | +----------+
(This is just an illustration, I aware that actual OP-TEE layout is more complex).
Sections starting with "k" belong to kernel data. I also extended bget. Now it supports multiple pools and I can use kmalloc() to allocated memory from .kheap and plain malloc() to allocate mempory from .heap.
This layout allows us to switch guest context with simple banking:
+----------+ | .reset | +----------+ | .text | +----------+ | .ro | +----------+ | .kdata | +----------+ | .kbss | +----------+ | .kheap | +----------+
============ ============ ============ ==Guest 1=== ==Guest 2=== ==Guest 3=== ============ ============ ============ | .data | | .data | | .data | +----------+ +----------+ +----------+ | .bss | | .bss | | .bss | +----------+ +----------+ +----------+ | .heap | | .heap | | .heap | +----------+ +----------+ +----------+ | TA SPACE | | TA SPACE | | TA SPACE | +----------+ +----------+ +----------+
If guest suddenly dies, we can't cleanup resources (consider mutex that will be never unlocked). Instead we can just drop whole guest context and forged about it. But we will need special cleanup code for kernel state, though. This is a reason to keep kernel data footprint as small as possible.
I think, it is clear now, why I want to talk about virtual memory management :)
Right now (in absence of pager) OP-TEE is mostly mapped 1:1, all CPUs use the same mappings. Also there is a separate address space for a dynamic shared memory mappings. Pager actually does two things: it breaks 1:1 mapping and also actually does paging.
My first intention was to reuse pager and to make it manage mappings for a different guests. But ever now it have an overhead, because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that pager should be severely rewritten. Also, use of TTBR1 imposes restrictions on guest context location in virtual address space.
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
- Let OP-TEE to live in own address space, not bound to platform configuration (i.e. as linux does). For example most of the OP-TEE will be mapped at 0x2000000, when TEE contexts will be mapped at 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
Yes. It is not as I insist on a such rework. But if we'll conclude, that it is good to have for the virtualization, then I can assist there.
- Add page allocator. Currently tee_mm acts both as a page allocator and as a address space allocator.
tee_mm is actually more of a resource allocator.
Indeed. I wasn't precisely correct there.
Promote pgt_cache to full-scale address space manager.
Split pager into two parts: paging mechanism and backend. Backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages
I don't think the pager need a backend, we're only swapping out to memory so it's still a fast operation (relative to writing to flash or rotating media).
I'm not insist on this. I just thought that it a good idea. For example, there are platforms with cryptoDMA, which can greatly increase paging speed.
I'd leave this alone until there's an actual hardware where you get any gain. This doesn't have anything to do with virtualization either.
- Rename things. pgt_cache is no more just a cache. tee_mm does two different things at once.
Jens, I am sure, you have your own vision how VM management should like. Would you share your thoughts, please? Maybe there is a way, how I can implement TEE context banking with less effort. I will be very happy in this case :-) My biggest concern that if I'll take straight approach to context banking, this will complicate things a lot. So, I'm ready to improve VM part of OP-TEE for everyone's benefit, if it will ease up integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
One of the reasons, yes. Plus, not only global variables, but also malloc pool, mapped shared buffers and other resources.
Consider entries in tee_mmu or pgt state.
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
Actually, there are lots of things that should go to a guest context (or require own cleanup code): session list, open TAs list, pgt_cache state, mutexes, condvars, pobjs, pTA states, and so on, so on. Just look at tee_ta_manager.c for example:
struct mutex tee_ta_mutex = MUTEX_INITIALIZER; static struct condvar tee_ta_cv = CONDVAR_INITIALIZER; static int tee_ta_single_instance_thread = THREAD_ID_INVALID; static size_t tee_ta_single_instance_count; struct tee_ta_ctx_head tee_ctxes = TAILQ_HEAD_INITIALIZER(tee_ctxes);
All this should be stored in guest context.
Yes and it would probably need to be analyzed a bit too, to see if there's any cleanup to be done.
Almost any file in core/tee has at least couple of globals. They also should be moved to a guest context. This is doable, but this is a lots of changes.
What is I propose is to do one big chance in VM part and leave all other code as is. On other hand, it is probably good thing to have a context for every OP-TEE part. This can make it more modular.
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
Yes. I have such structure (probably, you seen it in [1]). Also I have get_client_context() function that returns pointer to context for a current caller.
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables.
Yes. But I tried another approach - to tag all resources with guest_id. That was even more complicated. But your idea looks simpler in a way. Could you please take a look at [1]? I think, this is what you are talking about. I'll try to play with it further, but currently I'm worried about amount of needed changes on the whole OP-TEE.
I took a quick look, and this is the direction I'd prefer.
On other hand, separate vmspace guarantees that in any moment OP-TEE code will work with data belonging to one guest. Also, it is naturally provides quota mechanism for guests.
Yes, you'd get some with a separate vmspace, but I think it will be too complicated. Just think about debugging... But it's not like we can't manage quota without a vmspace. A separate bget-pool for each guest is a good start.
How would what should go into the guest vmspace be identified?
All what is not goes to to kernel vmspace, goes to guest vmspace. For kernel vmspace I created macros like "__kdata" or "__kbss" to put variables into right sections. There will be a lot more guest-tied variables, than kernel ones. So it is preferred to leave guest data in default sections.
So, to sum up:
- My (implicit context) approach with vmspace: Pros:
- Most of OP-TEE code can be leaved intact.
- A bit faster: it is faster to access global variable directly.
I doubt that you'd be able to measure it. In fact it wouldn't surprise me that much if it became slower instead.
- Authors of new features should not worry about virtualization. I hope, that separate vmspaces will look pretty transparent.
Yes, this is an advantage, but there's ways to guard against adding global variables.
Here's another Pro:
- In case OP-TEE isn't configured for virtualization there's almost zero impact.
Basically yes. Memory footprint will get bigger in any case. But besides that, I don't see any additional overhead.
I was actually expecting none worth mentioning, a completely optional feature.
Cons:
- Huge changes in VM management.
Indeed
Hmm, come to think of it. It doesn't have to be that bad. There will be some read/write mapped pages that need to change depending on which guest is being served. Making a proof of concept shouldn't be that hard.
Yep. At first I wanted to implement PoC as a completely separate feature, but then I realized that it would be incomplete wirth pager, and pager, on other hand provides some vaguely similar services. Then I realized that I need separate memory mapping for every core, which leaded me to TTBR1. But if I want to use TTBR1, then I can't place vmspace at arbitrary address. And so we are here :) Actually, I can implement PoC in quick and dirty way, just to see how it comes and what minimal set of changes is needed.
We already have separate memory mapping for each core...
Thanks, Jens
The selling point for me would be virtually no impact unless configured for virtualization.
I just realized, that with bit of #ifdef magic I can merge bget pools back into one and also, I don't need to put kernel and tee data sections into separate pages. So memory footprint overhead will be even smaller, than I expected.
Thanks, Jens
Due to implicit nature, it can cause problems in corner cases
Your (explicit context) approach with struct guest_state Pros:
- Does not require big changes in VM. Maybe no changes at all.
- Easier to understand.
Easier to understand is quite important.
- Better code shape due to absence of global variables (I don't quite sure there).
Agree, this is a weak argument.
- Changes in subsystems can be introduced gradually. You don't need to switch to completely new scheme at once.
Yes, this is also quite important. Big bang isn't that fun.
Cons:
- A bit slower. Code should access context somehow. Either via get_guest_context() or via function parameters.
Again, I doubt you'd be able to measure it. We can store the pointer to thread_core_local in TPIDRPRW to speed up the access to thread_core_local in the cases where we don't pass some pointer as a parameter.
- A lot of changes in all code across whole OP-TEE.
Yes, there will be changes, but all doesn't have to be done in a single pull request. The different parts can be fixed one, by one.
- Authors of new code should remember about virtualization and don't use global state at all. Or provide reliable mechanisms of recovery.
We can add something that checks that only whitelisted files may add global variables (.data or .bss)
Thanks, Jens
-- WBR Volodymyr Babchuk aka lorc [+380976646013] mailto: vlad.babchuk@gmail.com
On 8 December 2017 at 17:53, Jens Wiklander jens.wiklander@linaro.org wrote:
On Fri, Dec 8, 2017 at 4:41 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 8 December 2017 at 16:36, Jens Wiklander jens.wiklander@linaro.org wrote:
On Fri, Dec 8, 2017 at 2:12 PM, Jens Wiklander jens.wiklander@linaro.org wrote:
On Thu, Dec 7, 2017 at 7:25 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 7 December 2017 at 09:37, Jens Wiklander jens.wiklander@linaro.org wrote:
> In this email I would like to discuss number of things related to > OP-TEE virtual memory management (and virtualization). > > (Note about terminology: I will use "VM" for "virtual memory" > and "guest" for "virtual machine" in this email) > > I want to begin with motivation. As you know, I'm working on > virtualization support in OP-TEE. My current approach is total > isolation of different guests from each other. To implement this I > want to divide all OP-TEE program state into two big parts: kernel and > TEE. > > Kernel data (or kernel state) is a guest-agnostic data needed for a > core services. Examples: temporary stacks used by entry points (used > before thread_alloc_and_run() invocation), list of known guests, > device drivers state and so on. This kind of data is not > guest-specific, so naturally it should exist in one copy. > > TEE data (or TEE state) is guest-bound information: threads (with > stack and state), opened sessions, loaded TAs, mutexes, pobjs and > such. This kind of data have a meaning only regarding to a certain > guest. > > So, memory layout can look like this: > > +----------+ > | .reset | > +----------+ > | .text | > +----------+ > | .ro | > +----------+ > | .kdata | > +----------+ > | .kbss | > +----------+ > | .kheap | > +----------+ > +----------+ > | .data | > +----------+ > | .bss | > +----------+ > | .heap | > +----------+ > | TA SPACE | > +----------+ > > (This is just an illustration, I aware that actual OP-TEE layout is > more complex). > > Sections starting with "k" belong to kernel data. I also extended > bget. Now it supports multiple pools and I can use kmalloc() to > allocated memory from .kheap and plain malloc() to allocate mempory > from .heap. > > This layout allows us to switch guest context with simple banking: > > +----------+ > | .reset | > +----------+ > | .text | > +----------+ > | .ro | > +----------+ > | .kdata | > +----------+ > | .kbss | > +----------+ > | .kheap | > +----------+ > > ============ ============ ============ > ==Guest 1=== ==Guest 2=== ==Guest 3=== > ============ ============ ============ > | .data | | .data | | .data | > +----------+ +----------+ +----------+ > | .bss | | .bss | | .bss | > +----------+ +----------+ +----------+ > | .heap | | .heap | | .heap | > +----------+ +----------+ +----------+ > | TA SPACE | | TA SPACE | | TA SPACE | > +----------+ +----------+ +----------+ > > > If guest suddenly dies, we can't cleanup resources (consider mutex that > will be never unlocked). Instead we can just drop whole guest context > and forged about it. But we will need special cleanup code for kernel > state, though. This is a reason to keep kernel data footprint as > small as possible. > > I think, it is clear now, why I want to talk about virtual memory > management :) > > Right now (in absence of pager) OP-TEE is mostly mapped 1:1, all CPUs > use the same mappings. Also there is a separate address space for a > dynamic shared memory mappings. Pager actually does two things: > it breaks 1:1 mapping and also actually does paging. > > My first intention was to reuse pager and to make it manage mappings > for a different guests. But ever now it have an overhead, because of > page encryption/hashing. Also, for virtualization it is crucial > to have different mappings on different CPUs. Plus, for efficient > guest context switching it is good to use TTBR1. All this means > that pager should be severely rewritten. Also, use of TTBR1 imposes > restrictions on guest context location in virtual address space. 
> > So, I think it is a good occasion to fully overhaul VM in OP-TEE. > What I propose: > > 1. Let OP-TEE to live in own address space, not bound to platform > configuration (i.e. as linux does). For example most of the OP-TEE > will be mapped at 0x2000000, when TEE contexts will be mapped at > 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
Yes. It is not that I insist on such a rework. But if we conclude that it is good to have for virtualization, then I can assist there.
> > 2. Add page allocator. Currently tee_mm acts both as a page allocator > and as a address space allocator.
tee_mm is actually more of a resource allocator.
Indeed. I wasn't precisely correct there.
> > 3. Promote pgt_cache to full-scale address space manager. > > 4. Split pager into two parts: paging mechanism and backend. Backend > can encrypt pages, hash them, or just use some platform-specific > mechanism to protect paged-out pages
I don't think the pager needs a backend; we're only swapping out to memory, so it's still a fast operation (relative to writing to flash or rotating media).
I don't insist on this. I just thought that it was a good idea. For example, there are platforms with cryptoDMA, which can greatly increase paging speed.
I'd leave this alone until there's actual hardware where you get any gain. This doesn't have anything to do with virtualization either.
> > 5. Rename things. pgt_cache is no more just a cache. tee_mm does > two different things at once. > > Jens, I am sure, you have your own vision how VM management should > like. Would you share your thoughts, please? Maybe there is a way, how > I can implement TEE context banking with less effort. I will be very > happy in this case :-) My biggest concern that if I'll take straight > approach to context banking, this will complicate things a lot. > So, I'm ready to improve VM part of OP-TEE for everyone's benefit, if it > will ease up integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
One of the reasons, yes. Plus, it is not only the global variables, but also the malloc pool, mapped shared buffers and other resources.
Consider entries in tee_mmu or pgt state.
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
Actually, there are lots of things that should go to a guest context (or require their own cleanup code): the session list, the list of open TAs, pgt_cache state, mutexes, condvars, pobjs, pTA states, and so on. Just look at tee_ta_manager.c for example:
struct mutex tee_ta_mutex = MUTEX_INITIALIZER;
static struct condvar tee_ta_cv = CONDVAR_INITIALIZER;
static int tee_ta_single_instance_thread = THREAD_ID_INVALID;
static size_t tee_ta_single_instance_count;
struct tee_ta_ctx_head tee_ctxes = TAILQ_HEAD_INITIALIZER(tee_ctxes);
All of this should be stored in the guest context.
Yes and it would probably need to be analyzed a bit too, to see if there's any cleanup to be done.
Almost any file in core/tee has at least a couple of globals. They should also be moved to a guest context. This is doable, but it is a lot of changes.
What I propose is to do one big change in the VM part and leave all other code as is. On the other hand, it is probably a good thing to have a context for every OP-TEE part. This could make it more modular.
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
Yes. I have such a structure (you have probably seen it in [1]). I also have a get_client_context() function that returns a pointer to the context of the current caller.
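For illustration, a minimal sketch of what I mean (the members simply mirror the globals quoted from tee_ta_manager.c above; get_client_context() and the lookup behind it are placeholders, not actual code):

/* Illustration only: an "explicit context" holding what used to be globals */
#include <kernel/mutex.h>
#include <kernel/thread.h>
#include <kernel/tee_ta_manager.h>

struct guest_state {
        struct mutex tee_ta_mutex;
        struct condvar tee_ta_cv;
        int tee_ta_single_instance_thread;
        size_t tee_ta_single_instance_count;
        struct tee_ta_ctx_head tee_ctxes;
        /* ... sessions, pobjs, pTA state, malloc pool, ... */
};

/* Placeholder: returns the context of the guest behind the current call */
struct guest_state *get_client_context(void);

static void example(void)
{
        struct guest_state *gs = get_client_context();

        mutex_lock(&gs->tee_ta_mutex);
        /* operate on gs->tee_ctxes instead of a global list */
        mutex_unlock(&gs->tee_ta_mutex);
}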
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables.
Yes. But I tried another approach - tagging all resources with a guest_id. That was even more complicated. Your idea looks simpler in a way. Could you please take a look at [1]? I think this is what you are talking about. I'll try to play with it further, but currently I'm worried about the amount of changes needed across the whole of OP-TEE.
I took a quick look, and this is the direction I'd prefer.
On the other hand, a separate vmspace guarantees that at any moment OP-TEE code works only with data belonging to one guest. It also naturally provides a quota mechanism for guests.
Yes, you'd get some with a separate vmspace, but I think it will be too complicated. Just think about debugging... But it's not like we can't manage quota without a vmspace. A separate bget-pool for each guest is a good start.
How would you identify what should go into the guest vmspace?
Everything that does not go into the kernel vmspace goes into the guest vmspace. For the kernel vmspace I created macros like "__kdata" and "__kbss" to put variables into the right sections. There will be a lot more guest-tied variables than kernel ones, so it is preferable to leave guest data in the default sections.
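Just to illustrate the tagging (a sketch; the actual macro names and attributes in my tree may differ slightly):

/* Explicitly tagged variables go to the kernel ("k") sections; everything
 * left untagged stays in the default .data/.bss, i.e. in the banked
 * per-guest sections. Assumes .kdata/.kbss exist in the linker script. */
#define __kdata __attribute__((section(".kdata")))
#define __kbss  __attribute__((section(".kbss")))

/* Guest-agnostic kernel state is tagged explicitly: */
static unsigned int num_known_guests __kbss;
static unsigned long kernel_flags __kdata = 0x1;

/* Everything else, like this, ends up in the per-guest .data/.bss: */
static unsigned int open_session_count;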
So, to sum up:
- My (implicit context) approach with vmspace: Pros:
- Most of the OP-TEE code can be left intact.
- A bit faster: it is faster to access a global variable directly.
I doubt that you'd be able to measure it. In fact it wouldn't surprise me that much if it became slower instead.
- Authors of new features do not need to worry about virtualization. I hope that separate vmspaces will be pretty transparent.
Yes, this is an advantage, but there are ways to guard against adding global variables.
Here's another Pro:
- In case OP-TEE isn't configured for virtualization there's almost zero impact.
Basically yes. The memory footprint will get bigger in any case. But besides that, I don't see any additional overhead.
I was actually expecting none worth mentioning, a completely optional feature.
It will require some changes here and there. But yes, I don't expect any major overhead for the case when virtualization is disabled.
Cons:
- Huge changes in VM management.
Indeed
Hmm, come to think of it. It doesn't have to be that bad. There will be some read/write mapped pages that need to change depending on which guest is being served. Making a proof of concept shouldn't be that hard.
Yep. At first I wanted to implement the PoC as a completely separate feature, but then I realized that it would be incomplete with the pager, and the pager, on the other hand, provides some vaguely similar services. Then I realized that I need a separate memory mapping for every core, which led me to TTBR1. But if I want to use TTBR1, then I can't place the vmspace at an arbitrary address. And so here we are :) Actually, I can implement the PoC in a quick and dirty way, just to see how it goes and what minimal set of changes is needed.
We already have a separate memory mapping for each core...
Ah yes, sorry. I completely forgot about that :-( I think the comments in tee_pager_handle_fault() confused me. Now I remember that the L1 page tables are different for different cores. So I can actually try to switch vmspaces by manipulating entries in the L1 table.
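In case it helps, here is a rough sketch of the switch itself as I picture it (all names, the short-descriptor format and the 1 MiB section granularity are assumptions for illustration, not actual code):

/* Bank the guest sections by rewriting the L1 (short-descriptor) entries
 * covering the guest VA window on the current core. */
#include <stdint.h>
#include <string.h>

#define SECTION_SHIFT   20                      /* 1 MiB L1 sections */
#define GUEST_VA_BASE   0x08000000UL            /* banked guest window */
#define GUEST_VA_SIZE   0x00800000UL
#define GUEST_L1_ENTRIES (GUEST_VA_SIZE >> SECTION_SHIFT)

struct guest_ctx {
        uint32_t l1_entries[GUEST_L1_ENTRIES];  /* pre-computed descriptors */
};

uint32_t *core_l1_table(void);                  /* assumed helper */
void tlb_inval_va_range(unsigned long va, unsigned long size); /* assumed */

static void switch_guest_ctx(struct guest_ctx *g)
{
        uint32_t *l1 = core_l1_table();

        memcpy(&l1[GUEST_VA_BASE >> SECTION_SHIFT], g->l1_entries,
               sizeof(g->l1_entries));
        tlb_inval_va_range(GUEST_VA_BASE, GUEST_VA_SIZE);
}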
It would be cool to support the pager. But I think I can begin with a simple implementation that is incompatible with the pager.
It just occurred to me that the hypervisor could provide backing memory for guests, so the number of supported guests would not be limited by the amount of secure RAM. But this requires the pager, so I'd better postpone this feature.
The selling point for me would be virtually no impact unless configured for virtualization.
I just realized that with a bit of #ifdef magic I can merge the bget pools back into one, and also that I don't need to put the kernel and TEE data sections into separate pages. So the memory footprint overhead will be even smaller than I expected.
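To show the kind of #ifdef magic I mean (CFG_VIRT_GUESTS is a made-up option name here; the point is only that kmalloc() collapses into plain malloc() when virtualization is off):

/* With virtualization disabled the kernel and TEE heaps are one pool,
 * so kmalloc()/kfree() can simply alias malloc()/free(). */
#include <stddef.h>
#include <stdlib.h>

#ifdef CFG_VIRT_GUESTS
void *kmalloc(size_t size);     /* allocates from the kernel pool (.kheap) */
void kfree(void *ptr);
#else
#define kmalloc(size)   malloc(size)
#define kfree(ptr)      free(ptr)
#endif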
Thanks, Jens
Due to its implicit nature, it can cause problems in corner cases.
- Your (explicit context) approach with struct guest_state: Pros:
- Does not require big changes in VM. Maybe no changes at all.
- Easier to understand.
Easier to understand is quite important.
- Better code shape due to the absence of global variables (I'm not quite sure there).
Agree, this is a weak argument.
- Changes in subsystems can be introduced gradually. You don't need to switch to a completely new scheme at once.
Yes, this is also quite important. Big bang isn't that fun.
Cons:
- A bit slower. Code has to access the context somehow, either via get_guest_context() or via function parameters.
Again, I doubt you'd be able to measure it. We can store the pointer to thread_core_local in TPIDRPRW to speed up access to it in the cases where we don't pass a pointer as a parameter (see the sketch after this list).
- A lot of changes across the whole of OP-TEE.
Yes, there will be changes, but it all doesn't have to be done in a single pull request. The different parts can be fixed one by one.
- Authors of new code have to keep virtualization in mind and either not use global state at all or provide reliable recovery mechanisms.
We can add something that checks that only whitelisted files may add global variables (.data or .bss)
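To illustrate the TPIDRPRW idea mentioned above, a minimal AArch32 sketch (the function name is made up; boot code would have to store the per-core pointer in the register first, and AArch64 would use TPIDR_EL1 via mrs instead):

struct thread_core_local;       /* existing OP-TEE type */

static inline struct thread_core_local *get_core_local_fast(void)
{
        struct thread_core_local *l;

        /* TPIDRPRW: the PL1-only software thread ID register */
        asm volatile ("mrc p15, 0, %0, c13, c0, 4" : "=r" (l));
        return l;
}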
Hello all,
Just a small update and one question.
Currently I have a PoC for the proposed approach. Right now it half consists of hacks, but anyway, I'm able to run xtest in two domains in parallel. And it even passes :) If someone is really interested, you can find this PoC at [1]. But it is in a very sorry state. I'm reworking it into something that can be pushed for review.
And so I have a question. One of my changes introduces a new VA space, where I map the whole secure RAM. This simplifies guest page table management and some other tasks. Something similar is done in the pager code, but the pager creates alias mappings at runtime, only for the pages it wants to access. For me it is easier to have a static view of the whole secure RAM than to remap the needed pages at runtime. So, this is my question: was that design decision in the pager intentional? Like, from a security standpoint it is better not to have the whole secure RAM mapped, or something like that...
There are two options before me: I can leave the whole secure RAM mapped and make the pager use this mapping, or I can make my code behave like the pager (i.e. map the needed pages dynamically).
What do you think?
[1] https://github.com/lorc/optee_os/tree/virt_hard
Hi,
On Wed, Dec 27, 2017 at 1:46 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hello all,
Just a small update and one question.
Currently I have a PoC for the proposed approach. Right now it half consists of hacks, but anyway, I'm able to run xtest in two domains in parallel. And it even passes :)
Congrats! :-)
If someone is really interested, you can find this PoC at [1]. But it is in a very sorry state. I'm reworking it into something that can be pushed for review.
And so I have a question. One of my changes introduces a new VA space, where I map the whole secure RAM. This simplifies guest page table management and some other tasks. Something similar is done in the pager code, but the pager creates alias mappings at runtime, only for the pages it wants to access. For me it is easier to have a static view of the whole secure RAM than to remap the needed pages at runtime. So, this is my question: was that design decision in the pager intentional? Like, from a security standpoint it is better not to have the whole secure RAM mapped, or something like that...
The pager does that to avoid leaving read-write aliases of pages that are mapped read-only-exec at the place where they're used.
There are two options before me: I can leave the whole secure RAM mapped and make the pager use this mapping, or I can make my code behave like the pager (i.e. map the needed pages dynamically).
I think you should map needed pages dynamically.
Also I'm not that happy about tagging __kdata and __kbss directly in the code; it's easy to miss applying one. The way we're partitioning the memory for the pager and init code could be applied here too. That isn't 100% robust either, but it usually results in a build error instead of some hard-to-find corruption. If you don't feel comfortable with implementing that yourself I can do that once everything else is in place.
What do you think?
It sounds like you're progressing well. I'll wait with looking at the code until you think it's ready for review.
[1] https://github.com/lorc/optee_os/tree/virt_hard
-- WBR Volodymyr Babchuk aka lorc [+380976646013] mailto: vlad.babchuk@gmail.com
On 27 December 2017 at 17:24, Jens Wiklander jens.wiklander@linaro.org wrote:
Currently I have a PoC for the proposed approach. Right now it half consists of hacks, but anyway, I'm able to run xtest in two domains in parallel. And it even passes :)
Congrats! :-)
Thanks!
If someone is really interested, you can find this PoC at [1]. But it is in a very sorry state. I'm reworking it into something that can be pushed for review.
And so I have a question. One of my changes introduces a new VA space, where I map the whole secure RAM. This simplifies guest page table management and some other tasks. Something similar is done in the pager code, but the pager creates alias mappings at runtime, only for the pages it wants to access. For me it is easier to have a static view of the whole secure RAM than to remap the needed pages at runtime. So, this is my question: was that design decision in the pager intentional? Like, from a security standpoint it is better not to have the whole secure RAM mapped, or something like that...
The pager does that to avoid leaving read-write aliases of pages that are mapped read-only-exec at the place where they're used.
Ah, I see.
There are two options before me: I can leave the whole secure RAM mapped and make the pager use this mapping, or I can make my code behave like the pager (i.e. map the needed pages dynamically).
I think you should map needed pages dynamically.
Yeah, I think that fits better with the OP-TEE way.
Also I'm not that happy about tagging __kdata and __kbss directly in the code; it's easy to miss applying one. The way we're partitioning the memory for the pager and init code could be applied here too. That isn't 100% robust either, but it usually results in a build error instead of some hard-to-find corruption. If you don't feel comfortable with implementing that yourself I can do that once everything else is in place.
I have concerns there. In the pager case, a mistake in partitioning would lead to a crash or to a bloated pager section. But in the case of virtualization it would lead to hard-to-catch problems with isolation.
Imagine that during automatic partitioning a guest-related symbol (e.g. the sessions list) ends up in the shared space. This kind of bug would be revealed only if guests issue simultaneous calls to OP-TEE. I can imagine more subtle cases where the problem would not be so obvious, but could still break isolation.
Thus, I decided to use explicit tagging. In this case you at least see what you are doing. Also, the approach of explicitly telling what belongs to the OP-TEE core ensures that any new symbols go into the guest sections by default.