There is a limited amount of SGX memory (EPC) on each system. When that memory is used up, SGX has its own swapping mechanism which is similar in concept but totally separate from the core mm/* code. Instead of swapping to disk, SGX swaps from EPC to normal RAM. That normal RAM comes from a shared memory pseudo-file and can itself be swapped by the core mm code. There is a hierarchy like this:
EPC <-> shmem <-> disk
After data is swapped back in from shmem to EPC, the shmem backing storage needs to be freed. Currently, the backing shmem is not freed. This effectively wastes the shmem while the enclave is running. The memory is recovered when the enclave is destroyed and the backing storage freed.
Sort this out by freeing memory with shmem_truncate_range(), as soon as a page is faulted back to the EPC. In addition, free the memory for PCMD pages as soon as all PCMD's in a page have been marked as unused by zeroing its contents.
Reported-by: Dave Hansen dave.hansen@linux.intel.com Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Signed-off-by: Jarkko Sakkinen jarkko@kernel.org --- v4: * Sanitized the offset calculations. v3: * Resend. v2: * Rewrite commit message as proposed by Dave. * Truncate PCMD pages (Dave). --- arch/x86/kernel/cpu/sgx/encl.c | 44 +++++++++++++++++++++++++++------- 1 file changed, 36 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index 8be6f0592bdc..94319190efb1 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -12,6 +12,17 @@ #include "encls.h" #include "sgx.h"
+ +/* + * Free a page from the backing storage in the given page index. + */ +static inline void sgx_encl_truncate_backing_page(struct sgx_encl *encl, pgoff_t index) +{ + struct inode *inode = file_inode(encl->backing); + + shmem_truncate_range(inode, PFN_PHYS(index), PFN_PHYS(index) + PAGE_SIZE - 1); +} + /* * ELDU: Load an EPC page as unblocked. For more info, see "OS Management of EPC * Pages" in the SDM. @@ -24,7 +35,9 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, struct sgx_encl *encl = encl_page->encl; struct sgx_pageinfo pginfo; struct sgx_backing b; + bool pcmd_page_empty; pgoff_t page_index; + u8 *pcmd_page; int ret;
if (secs_page) @@ -38,8 +51,8 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
pginfo.addr = encl_page->desc & PAGE_MASK; pginfo.contents = (unsigned long)kmap_atomic(b.contents); - pginfo.metadata = (unsigned long)kmap_atomic(b.pcmd) + - b.pcmd_offset; + pcmd_page = kmap_atomic(b.pcmd); + pginfo.metadata = (unsigned long)pcmd_page + b.pcmd_offset;
if (secs_page) pginfo.secs = (u64)sgx_get_epc_virt_addr(secs_page); @@ -55,11 +68,28 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, ret = -EFAULT; }
- kunmap_atomic((void *)(unsigned long)(pginfo.metadata - b.pcmd_offset)); + memset(pcmd_page + b.pcmd_offset, 0, sizeof(struct sgx_pcmd)); + + /* + * The area for the PCMD in the page was zeroed above. Check if the + * whole page is now empty meaning that all PCMD's have been zeroed: + */ + pcmd_page_empty = !memchr_inv(pcmd_page, 0, PAGE_SIZE); + + kunmap_atomic(pcmd_page); kunmap_atomic((void *)(unsigned long)pginfo.contents);
sgx_encl_put_backing(&b, false);
+ sgx_encl_truncate_backing_page(encl, page_index); + + if (pcmd_page_empty) { + pgoff_t pcmd_off = encl->size + PAGE_SIZE /* SECS */ + + page_index * sizeof(struct sgx_pcmd); + + sgx_encl_truncate_backing_page(encl, PFN_DOWN(pcmd_off)); + } + return ret; }
@@ -583,7 +613,7 @@ static struct page *sgx_encl_get_backing_page(struct sgx_encl *encl, static int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index, struct sgx_backing *backing) { - pgoff_t pcmd_index = PFN_DOWN(encl->size) + 1 + (page_index >> 5); + pgoff_t pcmd_off = encl->size + PAGE_SIZE /* SECS */ + page_index * sizeof(struct sgx_pcmd); struct page *contents; struct page *pcmd;
@@ -591,7 +621,7 @@ static int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index, if (IS_ERR(contents)) return PTR_ERR(contents);
- pcmd = sgx_encl_get_backing_page(encl, pcmd_index); + pcmd = sgx_encl_get_backing_page(encl, PFN_DOWN(pcmd_off)); if (IS_ERR(pcmd)) { put_page(contents); return PTR_ERR(pcmd); @@ -600,9 +630,7 @@ static int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index, backing->page_index = page_index; backing->contents = contents; backing->pcmd = pcmd; - backing->pcmd_offset = - (page_index & (PAGE_SIZE / sizeof(struct sgx_pcmd) - 1)) * - sizeof(struct sgx_pcmd); + backing->pcmd_offset = pcmd_off & (PAGE_SIZE - 1);
return 0; }
On Tue, Feb 22, 2022 at 01:03:41PM +0100, Jarkko Sakkinen wrote:
There is a limited amount of SGX memory (EPC) on each system. When that memory is used up, SGX has its own swapping mechanism which is similar in concept but totally separate from the core mm/* code. Instead of swapping to disk, SGX swaps from EPC to normal RAM. That normal RAM comes from a shared memory pseudo-file and can itself be swapped by the core mm code. There is a hierarchy like this:
EPC <-> shmem <-> disk
After data is swapped back in from shmem to EPC, the shmem backing storage needs to be freed. Currently, the backing shmem is not freed. This effectively wastes the shmem while the enclave is running. The memory is recovered when the enclave is destroyed and the backing storage freed.
Sort this out by freeing memory with shmem_truncate_range(), as soon as a page is faulted back to the EPC. In addition, free the memory for PCMD pages as soon as all PCMD's in a page have been marked as unused by zeroing its contents.
Reported-by: Dave Hansen dave.hansen@linux.intel.com Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Signed-off-by: Jarkko Sakkinen jarkko@kernel.org
v4:
- Sanitized the offset calculations.
v3:
- Resend.
v2:
- Rewrite commit message as proposed by Dave.
- Truncate PCMD pages (Dave).
arch/x86/kernel/cpu/sgx/encl.c | 44 +++++++++++++++++++++++++++------- 1 file changed, 36 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index 8be6f0592bdc..94319190efb1 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -12,6 +12,17 @@ #include "encls.h" #include "sgx.h"
+/*
- Free a page from the backing storage in the given page index.
- */
+static inline void sgx_encl_truncate_backing_page(struct sgx_encl *encl, pgoff_t index) +{
- struct inode *inode = file_inode(encl->backing);
- shmem_truncate_range(inode, PFN_PHYS(index), PFN_PHYS(index) + PAGE_SIZE - 1);
+}
/*
- ELDU: Load an EPC page as unblocked. For more info, see "OS Management of EPC
- Pages" in the SDM.
@@ -24,7 +35,9 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, struct sgx_encl *encl = encl_page->encl; struct sgx_pageinfo pginfo; struct sgx_backing b;
- bool pcmd_page_empty; pgoff_t page_index;
- u8 *pcmd_page; int ret;
if (secs_page) @@ -38,8 +51,8 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, pginfo.addr = encl_page->desc & PAGE_MASK; pginfo.contents = (unsigned long)kmap_atomic(b.contents);
- pginfo.metadata = (unsigned long)kmap_atomic(b.pcmd) +
b.pcmd_offset;
- pcmd_page = kmap_atomic(b.pcmd);
- pginfo.metadata = (unsigned long)pcmd_page + b.pcmd_offset;
if (secs_page) pginfo.secs = (u64)sgx_get_epc_virt_addr(secs_page); @@ -55,11 +68,28 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, ret = -EFAULT; }
- kunmap_atomic((void *)(unsigned long)(pginfo.metadata - b.pcmd_offset));
- memset(pcmd_page + b.pcmd_offset, 0, sizeof(struct sgx_pcmd));
- /*
* The area for the PCMD in the page was zeroed above. Check if the
* whole page is now empty meaning that all PCMD's have been zeroed:
*/
- pcmd_page_empty = !memchr_inv(pcmd_page, 0, PAGE_SIZE);
- kunmap_atomic(pcmd_page); kunmap_atomic((void *)(unsigned long)pginfo.contents);
sgx_encl_put_backing(&b, false);
- sgx_encl_truncate_backing_page(encl, page_index);
- if (pcmd_page_empty) {
pgoff_t pcmd_off = encl->size + PAGE_SIZE /* SECS */ +
page_index * sizeof(struct sgx_pcmd);
sgx_encl_truncate_backing_page(encl, PFN_DOWN(pcmd_off));
- }
- return ret;
} @@ -583,7 +613,7 @@ static struct page *sgx_encl_get_backing_page(struct sgx_encl *encl, static int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index, struct sgx_backing *backing) {
- pgoff_t pcmd_index = PFN_DOWN(encl->size) + 1 + (page_index >> 5);
- pgoff_t pcmd_off = encl->size + PAGE_SIZE /* SECS */ + page_index * sizeof(struct sgx_pcmd); struct page *contents; struct page *pcmd;
@@ -591,7 +621,7 @@ static int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index, if (IS_ERR(contents)) return PTR_ERR(contents);
- pcmd = sgx_encl_get_backing_page(encl, pcmd_index);
- pcmd = sgx_encl_get_backing_page(encl, PFN_DOWN(pcmd_off)); if (IS_ERR(pcmd)) { put_page(contents); return PTR_ERR(pcmd);
@@ -600,9 +630,7 @@ static int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index, backing->page_index = page_index; backing->contents = contents; backing->pcmd = pcmd;
- backing->pcmd_offset =
(page_index & (PAGE_SIZE / sizeof(struct sgx_pcmd) - 1)) *
sizeof(struct sgx_pcmd);
- backing->pcmd_offset = pcmd_off & (PAGE_SIZE - 1);
return 0; } -- 2.35.1
Tested this in my XPS13 2020 model: i7-1065G7
I did the testing in bare metal to minimize any possible side-effects.
$ ./test_sgx TAP version 13 1..6 # Starting 6 tests from 2 test cases. # RUN enclave.unclobbered_vdso ... # OK enclave.unclobbered_vdso ok 1 enclave.unclobbered_vdso # RUN enclave.unclobbered_vdso_oversubscribed ... # OK enclave.unclobbered_vdso_oversubscribed ok 2 enclave.unclobbered_vdso_oversubscribed # RUN enclave.clobbered_vdso ... # OK enclave.clobbered_vdso ok 3 enclave.clobbered_vdso # RUN enclave.clobbered_vdso_and_user_function ... # OK enclave.clobbered_vdso_and_user_function ok 4 enclave.clobbered_vdso_and_user_function # RUN enclave.tcs_entry ... # OK enclave.tcs_entry ok 5 enclave.tcs_entry # RUN enclave.pte_permissions ... # OK enclave.pte_permissions ok 6 enclave.pte_permissions # PASSED: 6 / 6 tests passed. # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0
BR, Jarkko
On 2/22/22 04:03, Jarkko Sakkinen wrote:
- if (pcmd_page_empty) {
pgoff_t pcmd_off = encl->size + PAGE_SIZE /* SECS */ +
page_index * sizeof(struct sgx_pcmd);
sgx_encl_truncate_backing_page(encl, PFN_DOWN(pcmd_off));
- }
- return ret;
} @@ -583,7 +613,7 @@ static struct page *sgx_encl_get_backing_page(struct sgx_encl *encl, static int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index, struct sgx_backing *backing) {
- pgoff_t pcmd_index = PFN_DOWN(encl->size) + 1 + (page_index >> 5);
- pgoff_t pcmd_off = encl->size + PAGE_SIZE /* SECS */ + page_index * sizeof(struct sgx_pcmd);
Jarkko, I really don't like how this looks. The '/* SECS */' thing is pretty ugly and the comment in the middle of an arithmetic operation is just really hard to read.
Then, there's the fact that this gem is copied-and-pasted. Oh, and it looks a wee bit over 80 columns.
I went to the trouble of writing a nice, fully-fleshed-out helper function for this with a comment included:
https://lore.kernel.org/all/8afec431-4dfc-d8df-152b-76cca0e17ccb@intel.com/
Was there a problem using that? The change from the last version is:
* Sanitized the offset calculations.
Given that there have been multiple different calculations over the four versions so far, which version was right? v3 or v4?
On Thu, Feb 24, 2022 at 09:14:05AM -0800, Dave Hansen wrote:
On 2/22/22 04:03, Jarkko Sakkinen wrote:
- if (pcmd_page_empty) {
pgoff_t pcmd_off = encl->size + PAGE_SIZE /* SECS */ +
page_index * sizeof(struct sgx_pcmd);
sgx_encl_truncate_backing_page(encl, PFN_DOWN(pcmd_off));
- }
- return ret;
} @@ -583,7 +613,7 @@ static struct page *sgx_encl_get_backing_page(struct sgx_encl *encl, static int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index, struct sgx_backing *backing) {
- pgoff_t pcmd_index = PFN_DOWN(encl->size) + 1 + (page_index >> 5);
- pgoff_t pcmd_off = encl->size + PAGE_SIZE /* SECS */ + page_index * sizeof(struct sgx_pcmd);
Jarkko, I really don't like how this looks. The '/* SECS */' thing is pretty ugly and the comment in the middle of an arithmetic operation is just really hard to read.
Then, there's the fact that this gem is copied-and-pasted. Oh, and it looks a wee bit over 80 columns.
Today you can have 100.
I went to the trouble of writing a nice, fully-fleshed-out helper function for this with a comment included:
https://lore.kernel.org/all/8afec431-4dfc-d8df-152b-76cca0e17ccb@intel.com/
Keeping full byte offset up until parts of it are required for something makes the formula just a simple equation of additions and multiplications, e.g. nothing like "/ sizeof(struct sgx_pcmd)" is required.
Then you get the PCMD page index will be just PFN_DOWN(pcmd_off) and offset inside that page is pcmd_off & PAGE_MASK. At least fro me this is more intuitive way to do the calculations.
I thought that the formula is so simple that it does not matter if it is just in two sites open coded but I can wrap it too, if required, e.g.
/* * Calculate byte offset of a PCMD struct associated to an enclave page. * PCMD's follow right after the EPC data in the backing storage. In * addition to the visible enclave pages, there's one extra page slot * for SECS, before PCMD data. */ static pgoff_t *sgx_encl_page_index_to_pcmd_offset(struct sgx_encl *encl, unsigned long page_index) { return encl->size + PAGE_SIZE + page_index * sizeof(struct sgx_pcmd); }
Was there a problem using that? The change from the last version is:
- Sanitized the offset calculations.
Given that there have been multiple different calculations over the four versions so far, which version was right? v3 or v4?
This one has correct and tested calculations but for peer test probably Reinette should verify that. I tested this with my laptop in bare metal.
BR, Jarkko
On 2/28/22 04:55, Jarkko Sakkinen wrote:
I thought that the formula is so simple that it does not matter if it is just in two sites open coded but I can wrap it too, if required, e.g.
/*
- Calculate byte offset of a PCMD struct associated to an enclave page.
- PCMD's follow right after the EPC data in the backing storage. In
- addition to the visible enclave pages, there's one extra page slot
- for SECS, before PCMD data.
*/ static pgoff_t *sgx_encl_page_index_to_pcmd_offset(struct sgx_encl *encl, unsigned long page_index) { return encl->size + PAGE_SIZE + page_index * sizeof(struct sgx_pcmd); }
Yes, it's required. Please wrap it.
There's also nothing wrong with spreading that calculation across several lines. It may be arithmetically simple, but it's combining three or four logical steps. There's no shame in separating and commenting some of those separately.
On Mon, Feb 28, 2022 at 07:32:02AM -0800, Dave Hansen wrote:
On 2/28/22 04:55, Jarkko Sakkinen wrote:
I thought that the formula is so simple that it does not matter if it is just in two sites open coded but I can wrap it too, if required, e.g.
/*
- Calculate byte offset of a PCMD struct associated to an enclave page.
- PCMD's follow right after the EPC data in the backing storage. In
- addition to the visible enclave pages, there's one extra page slot
- for SECS, before PCMD data.
*/ static pgoff_t *sgx_encl_page_index_to_pcmd_offset(struct sgx_encl *encl, unsigned long page_index) { return encl->size + PAGE_SIZE + page_index * sizeof(struct sgx_pcmd); }
Yes, it's required. Please wrap it.
There's also nothing wrong with spreading that calculation across several lines. It may be arithmetically simple, but it's combining three or four logical steps. There's no shame in separating and commenting some of those separately.
I can do that but one thing that would make this more documentative would be to describe the formula as "encl->size + sizeof(struct sgx_secs) + sizeof(struct sgx_pcmd)", i.e. PAGE_SIZE is a magic number so the end result would be:
/* * Calculate byte offset of a PCMD struct associated to an enclave page. PCMD's * follow right after the EPC data in the backing storage. In addition to the * visible enclave pages, there's one extra page slot for SECS, before PCMD * structs. */ static inline pgoff_t sgx_encl_page_index_to_pcmd_offset(struct sgx_encl *encl, unsigned long page_index) { pgoff_t epc_end_off = encl->size + sizeof(struct sgx_secs);
return epc_end_off + page_index * sizeof(struct sgx_pcmd); }
Now it is hard to get this wrong and also the file offset has the nice quality of packing the page index for PCMD page, and offset within that page. IMHO, this will nice and clean long-term way to sort this out.
/Jarkko
linux-stable-mirror@lists.linaro.org