On Fri, Dec 8, 2017 at 2:12 PM, Jens Wiklander jens.wiklander@linaro.org wrote:
On Thu, Dec 7, 2017 at 7:25 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 7 December 2017 at 09:37, Jens Wiklander jens.wiklander@linaro.org wrote:
In this email I would like to discuss a number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is guest-agnostic data needed for core services. Examples: temporary stacks used by entry points (before thread_alloc_and_run() is invoked), the list of known guests, device driver state and so on. This kind of data is not guest-specific, so naturally it should exist in a single copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data has meaning only in the context of a certain guest.
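To make the split concrete, the two kinds of state could be grouped roughly like this (a sketch only; apart from tee_ta_ctx_head, which exists today, the names are illustrative):

    /* Kernel state: guest-agnostic, exists in a single copy. */
    struct kernel_state {
        void *tmp_stacks;          /* entry stacks used before thread_alloc_and_run() */
        struct guest_list guests;  /* list of known guests (hypothetical type) */
        /* device driver state, ... */
    };

    /* TEE state: guest-bound, one banked copy per guest. */
    struct tee_state {
        struct tee_ta_ctx_head tee_ctxes;  /* loaded TAs */
        /* threads, opened sessions, mutexes, pobjs, ... */
    };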
So, the memory layout can look like this:
+----------+
| .reset   |
+----------+
| .text    |
+----------+
| .ro      |
+----------+
| .kdata   |
+----------+
| .kbss    |
+----------+
| .kheap   |
+----------+

+----------+
| .data    |
+----------+
| .bss     |
+----------+
| .heap    |
+----------+
| TA SPACE |
+----------+
(This is just an illustration, I'm aware that the actual OP-TEE layout is more complex).
Sections starting with "k" belong to kernel data. I also extended bget. It now supports multiple pools, so I can use kmalloc() to allocate memory from .kheap and plain malloc() to allocate memory from .heap.
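The allocation entry points then look roughly like this (a sketch; struct bpoolset and bget_from() are placeholder names for the multi-pool extension, not upstream bget API):

    #include <stddef.h>

    /* Two separate pools: one backed by .kheap, one backed by .heap. */
    static struct bpoolset kheap_pool;  /* kernel state, single copy */
    static struct bpoolset heap_pool;   /* guest state, banked per guest */

    void *kmalloc(size_t size)
    {
        return bget_from(&kheap_pool, size);  /* kernel allocation */
    }

    void *malloc(size_t size)
    {
        return bget_from(&heap_pool, size);   /* guest allocation */
    }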
This layout allows us to switch guest context with simple banking:
+----------+
| .reset   |
+----------+
| .text    |
+----------+
| .ro      |
+----------+
| .kdata   |
+----------+
| .kbss    |
+----------+
| .kheap   |
+----------+
============   ============   ============
==Guest 1===   ==Guest 2===   ==Guest 3===
============   ============   ============
| .data    |   | .data    |   | .data    |
+----------+   +----------+   +----------+
| .bss     |   | .bss     |   | .bss     |
+----------+   +----------+   +----------+
| .heap    |   | .heap    |   | .heap    |
+----------+   +----------+   +----------+
| TA SPACE |   | TA SPACE |   | TA SPACE |
+----------+   +----------+   +----------+
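A guest switch then boils down to remapping the banked region, along these lines (every name in this sketch is hypothetical, for illustration only):

    #include <stdint.h>

    /* Called on entry from the normal world, before any guest data is touched. */
    static void switch_to_guest(uint16_t guest_id)
    {
        struct guest_bank *bank = find_guest_bank(guest_id);

        /*
         * Point the translation entries covering .data/.bss/.heap/TA space
         * at this guest's physical bank, then invalidate stale TLB entries.
         */
        map_guest_bank(bank);
        flush_guest_tlb();
    }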
If a guest suddenly dies, we can't clean up its resources (consider a mutex that will never be unlocked). Instead, we can just drop the whole guest context and forget about it. We will still need special cleanup code for kernel state, though. This is a reason to keep the kernel data footprint as small as possible.
I think it is clear now why I want to talk about virtual memory management :)
Right now (in the absence of the pager) OP-TEE is mostly mapped 1:1 and all CPUs use the same mappings. There is also a separate address space for dynamic shared memory mappings. The pager actually does two things: it breaks the 1:1 mapping and it does the actual paging.
My first intention was to reuse the pager and make it manage mappings for the different guests. But even now it has overhead because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that the pager would have to be severely rewritten. Also, the use of TTBR1 imposes restrictions on where guest contexts can be placed in the virtual address space.
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
- Let OP-TEE live in its own address space, not bound to the platform configuration (i.e. as Linux does). For example, most of OP-TEE would be mapped at 0x2000000, while TEE contexts would be mapped at 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
Yes. It is not that I insist on such a rework. But if we conclude that it is good to have for virtualization, then I can assist there.
- Add a page allocator. Currently tee_mm acts both as a page allocator and as an address space allocator.
tee_mm is actually more of a resource allocator.
Indeed. I wasn't precisely correct there.
- Promote pgt_cache to a full-scale address space manager.
- Split the pager into two parts: the paging mechanism and a backend. The backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages.
I don't think the pager needs a backend, we're only swapping out to memory so it's still a fast operation (relative to writing to flash or rotating media).
I don't insist on this. I just thought it was a good idea. For example, there are platforms with crypto DMA, which can greatly increase paging speed.
I'd leave this alone until there's actual hardware where you get any gain. This doesn't have anything to do with virtualization either.
- Rename things. pgt_cache is no longer just a cache. tee_mm does two different things at once.
Jens, I am sure you have your own vision of how VM management should look. Would you share your thoughts, please? Maybe there is a way I can implement TEE context banking with less effort; I would be very happy in that case :-) My biggest concern is that if I take the straightforward approach to context banking, it will complicate things a lot. So, I'm ready to improve the VM part of OP-TEE for everyone's benefit, if it eases the integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
One of the reasons, yes. Plus, it is not only global variables, but also the malloc pool, mapped shared buffers and other resources.
Consider entries in tee_mmu or pgt state.
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
Actually, there are lots of things that should go into a guest context (or require their own cleanup code): the session list, the open TA list, pgt_cache state, mutexes, condvars, pobjs, pTA states, and so on. Just look at tee_ta_manager.c for example:
struct mutex tee_ta_mutex = MUTEX_INITIALIZER;
static struct condvar tee_ta_cv = CONDVAR_INITIALIZER;
static int tee_ta_single_instance_thread = THREAD_ID_INVALID;
static size_t tee_ta_single_instance_count;
struct tee_ta_ctx_head tee_ctxes = TAILQ_HEAD_INITIALIZER(tee_ctxes);
All this should be stored in guest context.
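For illustration, they could be folded into something like this (guest_context is a hypothetical container name; the members are the real globals from tee_ta_manager.c):

    struct guest_context {
        struct mutex tee_ta_mutex;
        struct condvar tee_ta_cv;
        int tee_ta_single_instance_thread;
        size_t tee_ta_single_instance_count;
        struct tee_ta_ctx_head tee_ctxes;
        /* ... session list, pgt_cache state, pobjs, pTA states, ... */
    };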
Yes and it would probably need to be analyzed a bit too, to see if there's any cleanup to be done.
Almost any file in core/tee has at least a couple of globals. They should also be moved to a guest context. This is doable, but it is a lot of changes.
What I propose is to do one big change in the VM part and leave all other code as is. On the other hand, it is probably a good thing to have a context for every OP-TEE part. This could make it more modular.
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
Yes, I have such a structure (you have probably seen it in [1]). I also have a get_client_context() function that returns a pointer to the context of the current caller.
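Usage then looks roughly like this (a sketch; get_client_context() is the function from [1], the locking detail is just for illustration):

    struct guest_context *ctx = get_client_context();

    mutex_lock(&ctx->tee_ta_mutex);
    /* Walk ctx->tee_ctxes instead of the global tee_ctxes. */
    mutex_unlock(&ctx->tee_ta_mutex);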
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables.
Yes. But I tried another approach - to tag all resources with a guest_id. That was even more complicated. But your idea looks simpler in a way. Could you please take a look at [1]? I think this is what you are talking about. I'll try to play with it further, but currently I'm worried about the amount of changes needed across the whole of OP-TEE.
I took a quick look, and this is the direction I'd prefer.
On the other hand, a separate vmspace guarantees that at any moment OP-TEE code will work only with data belonging to one guest. Also, it naturally provides a quota mechanism for guests.
Yes, you'd get some of that with a separate vmspace, but I think it would be too complicated. Just think about debugging... But it's not like we can't manage quotas without a vmspace. A separate bget pool for each guest is a good start.
How would you identify what should go into the guest vmspace?
Everything that does not go into the kernel vmspace goes into the guest vmspace. For the kernel vmspace I created macros like "__kdata" and "__kbss" to put variables into the right sections. There will be many more guest-tied variables than kernel ones, so it is preferable to leave guest data in the default sections.
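A minimal sketch of the section-placement macros (the matching linker script entries are omitted and the example variable is hypothetical):

    /* Place a variable in the kernel (guest-agnostic) sections. */
    #define __kdata __attribute__((section(".kdata")))
    #define __kbss  __attribute__((section(".kbss")))

    /* Example: the list of known guests is kernel state. */
    static struct guest_head known_guests __kbss;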
So, to sum up:
- My (implicit context) approach with vmspace:
Pros:
- Most of the OP-TEE code can be left intact.
- A bit faster: it is faster to access a global variable directly.
I doubt that you'd be able to measure it. In fact it wouldn't surprise me that much if it became slower instead.
- Authors of new features would not need to worry about virtualization. I hope separate vmspaces will be pretty transparent.
Yes, this is an advantage, but there's ways to guard against adding global variables.
Here's another Pro:
* In case OP-TEE isn't configured for virtualization there's almost zero impact.
Cons:
- Huge changes in VM management.
Indeed
Hmm, come to think of it, it doesn't have to be that bad. There will be some read/write mapped pages that need to change depending on which guest is being served. Making a proof of concept shouldn't be that hard.
The selling point for me would be virtually no impact unless configured for virtualization.
Thanks,
Jens
- Due to its implicit nature, it can cause problems in corner cases.
- Your (explicit context) approach with struct guest_state:
Pros:
- Does not require big changes in VM. Maybe no changes at all.
- Easier to understand.
Easier to understand is quite important.
- Better code shape due to the absence of global variables (I'm not quite sure about this one).
Agree, this is a weak argument.
- Changes in subsystems can be introduced gradually. You don't need to switch to a completely new scheme at once.
Yes, this is also quite important. Big bang isn't that fun.
Cons:
- A bit slower. Code has to access the context somehow, either via get_guest_context() or via function parameters.
Again, I doubt you'd be able to measure it. We can store the pointer to thread_core_local in TPIDRPRW to speed up the access to thread_core_local in the cases where we don't pass some pointer as a parameter.
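For reference, reading a pointer cached in TPIDRPRW is a single instruction (an AArch32 sketch; the exact register assignment is an assumption here):

    static inline struct thread_core_local *get_core_local(void)
    {
        struct thread_core_local *p;

        /* TPIDRPRW is the PL1-only software thread ID register. */
        asm volatile("mrc p15, 0, %0, c13, c0, 4" : "=r"(p));
        return p;
    }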
- A lot of changes across the whole of OP-TEE.
Yes, there will be changes, but it all doesn't have to be done in a single pull request. The different parts can be fixed one by one.
- Authors of new code have to keep virtualization in mind and not use global state at all, or provide reliable recovery mechanisms.
We can add something that checks that only whitelisted files may add global variables (.data or .bss)
Thanks,
Jens