On Fri, Dec 8, 2017 at 2:12 PM, Jens Wiklander jens.wiklander@linaro.org wrote:
On Thu, Dec 7, 2017 at 7:25 PM, Volodymyr Babchuk vlad.babchuk@gmail.com wrote:
Hi Jens,
On 7 December 2017 at 09:37, Jens Wiklander jens.wiklander@linaro.org wrote:
In this email I would like to discuss a number of things related to OP-TEE virtual memory management (and virtualization).
(Note about terminology: I will use "VM" for "virtual memory" and "guest" for "virtual machine" in this email)
I want to begin with motivation. As you know, I'm working on virtualization support in OP-TEE. My current approach is total isolation of different guests from each other. To implement this I want to divide all OP-TEE program state into two big parts: kernel and TEE.
Kernel data (or kernel state) is guest-agnostic data needed for core services. Examples: temporary stacks used by entry points (before thread_alloc_and_run() is invoked), the list of known guests, device driver state and so on. This kind of data is not guest-specific, so naturally it should exist in a single copy.
TEE data (or TEE state) is guest-bound information: threads (with stack and state), opened sessions, loaded TAs, mutexes, pobjs and such. This kind of data has meaning only in the context of a certain guest.
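To make the split concrete, the two kinds of state could be grouped roughly like this (a sketch only; apart from tee_ta_ctx_head, which exists today, the names are illustrative):

    /* Kernel state: guest-agnostic, exists in a single copy. */
    struct kernel_state {
        void *tmp_stacks;          /* entry stacks used before thread_alloc_and_run() */
        struct guest_list guests;  /* list of known guests (hypothetical type) */
        /* device driver state, ... */
    };

    /* TEE state: guest-bound, one banked copy per guest. */
    struct tee_state {
        struct tee_ta_ctx_head tee_ctxes;  /* loaded TAs */
        /* threads, opened sessions, mutexes, pobjs, ... */
    };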
So, the memory layout can look like this:
+----------+
| .reset   |
+----------+
| .text    |
+----------+
| .ro      |
+----------+
| .kdata   |
+----------+
| .kbss    |
+----------+
| .kheap   |
+----------+

+----------+
| .data    |
+----------+
| .bss     |
+----------+
| .heap    |
+----------+
| TA SPACE |
+----------+
(This is just an illustration, I'm aware that the actual OP-TEE layout is more complex).
Sections starting with "k" belong to kernel data. I also extended bget. It now supports multiple pools, so I can use kmalloc() to allocate memory from .kheap and plain malloc() to allocate memory from .heap.
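The allocation entry points then look roughly like this (a sketch; struct bpoolset and bget_from() are placeholder names for the multi-pool extension, not upstream bget API):

    #include <stddef.h>

    /* Two separate pools: one backed by .kheap, one backed by .heap. */
    static struct bpoolset kheap_pool;  /* kernel state, single copy */
    static struct bpoolset heap_pool;   /* guest state, banked per guest */

    void *kmalloc(size_t size)
    {
        return bget_from(&kheap_pool, size);  /* kernel allocation */
    }

    void *malloc(size_t size)
    {
        return bget_from(&heap_pool, size);   /* guest allocation */
    }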
This layout allows us to switch guest context with simple banking:
+----------+
| .reset   |
+----------+
| .text    |
+----------+
| .ro      |
+----------+
| .kdata   |
+----------+
| .kbss    |
+----------+
| .kheap   |
+----------+
============   ============   ============
==Guest 1===   ==Guest 2===   ==Guest 3===
============   ============   ============
| .data    |   | .data    |   | .data    |
+----------+   +----------+   +----------+
| .bss     |   | .bss     |   | .bss     |
+----------+   +----------+   +----------+
| .heap    |   | .heap    |   | .heap    |
+----------+   +----------+   +----------+
| TA SPACE |   | TA SPACE |   | TA SPACE |
+----------+   +----------+   +----------+
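A guest switch then boils down to remapping the banked region, along these lines (every name in this sketch is hypothetical, for illustration only):

    #include <stdint.h>

    /* Called on entry from the normal world, before any guest data is touched. */
    static void switch_to_guest(uint16_t guest_id)
    {
        struct guest_bank *bank = find_guest_bank(guest_id);

        /*
         * Point the translation entries covering .data/.bss/.heap/TA space
         * at this guest's physical bank, then invalidate stale TLB entries.
         */
        map_guest_bank(bank);
        flush_guest_tlb();
    }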
If a guest suddenly dies, we can't clean up its resources (consider a mutex that will never be unlocked). Instead, we can just drop the whole guest context and forget about it. We will still need special cleanup code for kernel state, though. This is a reason to keep the kernel data footprint as small as possible.
I think it is clear now why I want to talk about virtual memory management :)
Right now (in the absence of the pager) OP-TEE is mostly mapped 1:1 and all CPUs use the same mappings. There is also a separate address space for dynamic shared memory mappings. The pager actually does two things: it breaks the 1:1 mapping and it does the actual paging.
My first intention was to reuse the pager and make it manage mappings for the different guests. But even now it has overhead because of page encryption/hashing. Also, for virtualization it is crucial to have different mappings on different CPUs. Plus, for efficient guest context switching it is good to use TTBR1. All this means that the pager would have to be severely rewritten. Also, the use of TTBR1 imposes restrictions on where guest contexts can be placed in the virtual address space.
So, I think it is a good occasion to fully overhaul VM in OP-TEE. What I propose:
- Let OP-TEE live in its own address space, not bound to the platform configuration (i.e. as Linux does). For example, most of OP-TEE would be mapped at 0x2000000, while TEE contexts would be mapped at 0x8000000.
You're not the first one proposing this, and I believe this is a direction we'll take sooner or later.
Yes. It is not that I insist on such a rework. But if we conclude that it is good to have for virtualization, then I can assist there.
- Add a page allocator. Currently tee_mm acts both as a page allocator and as an address space allocator.
tee_mm is actually more of a resource allocator.
Indeed. I wasn't precisely correct there.
- Promote pgt_cache to a full-scale address space manager.
- Split the pager into two parts: the paging mechanism and a backend. The backend can encrypt pages, hash them, or just use some platform-specific mechanism to protect paged-out pages.
I don't think the pager needs a backend, we're only swapping out to memory so it's still a fast operation (relative to writing to flash or rotating media).
I don't insist on this. I just thought it was a good idea. For example, there are platforms with crypto DMA, which can greatly increase paging speed.
I'd leave this alone until there's actual hardware where you get any gain. This doesn't have anything to do with virtualization either.
- Rename things. pgt_cache is no longer just a cache. tee_mm does two different things at once.
Jens, I am sure you have your own vision of how VM management should look. Would you share your thoughts, please? Maybe there is a way I can implement TEE context banking with less effort; I would be very happy in that case :-) My biggest concern is that if I take the straightforward approach to context banking, it will complicate things a lot. So, I'm ready to improve the VM part of OP-TEE for everyone's benefit, if it eases the integration of virtualization support.
If I understand it correctly you'd like to put the guest state in a virtual memory space, which is switched for each guest. And the reason for this is to deal with global variables?
One of the reasons, yes. Plus, it is not only global variables, but also the malloc pool, mapped shared buffers and other resources.
Consider entries in tee_mmu or pgt state.
I don't think we have that many global variables. grep 'static struct mutex' **/*.c|wc -l gives 7
Actually, there are lots of things that should go into a guest context (or require their own cleanup code): the session list, the open TA list, pgt_cache state, mutexes, condvars, pobjs, pTA states, and so on. Just look at tee_ta_manager.c for example:
struct mutex tee_ta_mutex = MUTEX_INITIALIZER;
static struct condvar tee_ta_cv = CONDVAR_INITIALIZER;
static int tee_ta_single_instance_thread = THREAD_ID_INVALID;
static size_t tee_ta_single_instance_count;
struct tee_ta_ctx_head tee_ctxes = TAILQ_HEAD_INITIALIZER(tee_ctxes);
All this should be stored in guest context.
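For illustration, they could be folded into something like this (guest_context is a hypothetical container name; the members are the real globals from tee_ta_manager.c):

    struct guest_context {
        struct mutex tee_ta_mutex;
        struct condvar tee_ta_cv;
        int tee_ta_single_instance_thread;
        size_t tee_ta_single_instance_count;
        struct tee_ta_ctx_head tee_ctxes;
        /* ... session list, pgt_cache state, pobjs, pTA states, ... */
    };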
Yes and it would probably need to be analyzed a bit too, to see if there's any cleanup to be done.
Almost any file in core/tee has at least a couple of globals. They should also be moved to a guest context. This is doable, but it is a lot of changes.
What I propose is to do one big change in the VM part and leave all other code as is. On the other hand, it is probably a good thing to have a context for every OP-TEE part. This could make it more modular.
I'm probably missing some place, in LTC for instance.
This is clearly manageable by other means, like having a "struct guest_state" containing this.
Yes, I have such a structure (you have probably seen it in [1]). I also have a get_client_context() function that returns a pointer to the context of the current caller.
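Usage then looks roughly like this (a sketch; get_client_context() is the function from [1], the locking detail is just for illustration):

    struct guest_context *ctx = get_client_context();

    mutex_lock(&ctx->tee_ta_mutex);
    /* Walk ctx->tee_ctxes instead of the global tee_ctxes. */
    mutex_unlock(&ctx->tee_ta_mutex);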
I think it's overkill and also a bit complicated to introduce a guest specific vmspace just to contain a few global variables.
Yes. But I tried another approach - to tag all resources with a guest_id. That was even more complicated. But your idea looks simpler in a way. Could you please take a look at [1]? I think this is what you are talking about. I'll try to play with it further, but currently I'm worried about the amount of changes needed across the whole of OP-TEE.
I took a quick look, and this is the direction I'd prefer.
On the other hand, a separate vmspace guarantees that at any moment OP-TEE code will work only with data belonging to one guest. Also, it naturally provides a quota mechanism for guests.
Yes, you'd get some of that with a separate vmspace, but I think it would be too complicated. Just think about debugging... But it's not like we can't manage quotas without a vmspace. A separate bget pool for each guest is a good start.
How would you identify what should go into the guest vmspace?
Everything that does not go into the kernel vmspace goes into the guest vmspace. For the kernel vmspace I created macros like "__kdata" and "__kbss" to put variables into the right sections. There will be many more guest-tied variables than kernel ones, so it is preferable to leave guest data in the default sections.
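A minimal sketch of the section-placement macros (the matching linker script entries are omitted and the example variable is hypothetical):

    /* Place a variable in the kernel (guest-agnostic) sections. */
    #define __kdata __attribute__((section(".kdata")))
    #define __kbss  __attribute__((section(".kbss")))

    /* Example: the list of known guests is kernel state. */
    static struct guest_head known_guests __kbss;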
So, to sum up:
- My (implicit context) approach with vmspace:
Pros:
- Most of the OP-TEE code can be left intact.
- A bit faster: it is faster to access a global variable directly.
I doubt that you'd be able to measure it. In fact it wouldn't surprise me that much if it became slower instead.
- Authors of new features would not need to worry about virtualization. I hope separate vmspaces will be pretty transparent.
Yes, this is an advantage, but there's ways to guard against adding global variables.
Here's another Pro:
* In case OP-TEE isn't configured for virtualization there's almost zero impact.
Cons:
- Huge changes in VM management.
Indeed
Hmm, come to think of it, it doesn't have to be that bad. There will be some read/write mapped pages that need to change depending on which guest is being served. Making a proof of concept shouldn't be that hard.
The selling point for me would be virtually no impact unless configured for virtualization.
Thanks,
Jens
- Due to its implicit nature, it can cause problems in corner cases.
- Your (explicit context) approach with struct guest_state:
Pros:
- Does not require big changes in VM. Maybe no changes at all.
- Easier to understand.
Easier to understand is quite important.
- Better code shape due to the absence of global variables (I'm not quite sure about this one).
Agree, this is a weak argument.
- Changes in subsystems can be introduced gradually. You don't need to switch to a completely new scheme at once.
Yes, this is also quite important. Big bang isn't that fun.
Cons:
- A bit slower. Code has to access the context somehow, either via get_guest_context() or via function parameters.
Again, I doubt you'd be able to measure it. We can store the pointer to thread_core_local in TPIDRPRW to speed up the access to thread_core_local in the cases where we don't pass some pointer as a parameter.
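For reference, reading a pointer cached in TPIDRPRW is a single instruction (an AArch32 sketch; the exact register assignment is an assumption here):

    static inline struct thread_core_local *get_core_local(void)
    {
        struct thread_core_local *p;

        /* TPIDRPRW is the PL1-only software thread ID register. */
        asm volatile("mrc p15, 0, %0, c13, c0, 4" : "=r"(p));
        return p;
    }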
- A lot of changes across the whole of OP-TEE.
Yes, there will be changes, but it all doesn't have to be done in a single pull request. The different parts can be fixed one by one.
- Authors of new code have to keep virtualization in mind and not use global state at all, or provide reliable recovery mechanisms.
We can add something that checks that only whitelisted files may add global variables (.data or .bss)
Thanks,
Jens