The backing memory was not freed, after loading it back to the SGX reserved memory. This problem was not prevalent with the systems that were available at the time when SGX was first upstreamed, as the swapped memory could grow only up to 128 MB. However, Icelake Server can have gigabytes of SGX memory, and thus this has become a real bottleneck.
Free the swap space for other use by calling shmem_truncate_range(), when a page is faulted back.
Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Signed-off-by: Jarkko Sakkinen jarkko@kernel.org --- arch/x86/kernel/cpu/sgx/encl.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index 001808e3901c..f2d3f2e5028f 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -22,6 +22,7 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, { unsigned long va_offset = encl_page->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK; struct sgx_encl *encl = encl_page->encl; + struct inode *inode = file_inode(encl->backing); struct sgx_pageinfo pginfo; struct sgx_backing b; pgoff_t page_index; @@ -60,6 +61,9 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
sgx_encl_put_backing(&b, false);
+ /* Free the backing memory. */ + shmem_truncate_range(inode, PFN_PHYS(page_index), PFN_PHYS(page_index) + PAGE_SIZE - 1); + return ret; }
On 11/3/21 4:22 PM, Jarkko Sakkinen wrote:
The backing memory was not freed, after loading it back to the SGX reserved memory. This problem was not prevalent with the systems that were available at the time when SGX was first upstreamed, as the swapped memory could grow only up to 128 MB. However, Icelake Server can have gigabytes of SGX memory, and thus this has become a real bottleneck.
Free the swap space for other use by calling shmem_truncate_range(), when a page is faulted back.
This needs a _bit_ more context. Could you include a few sentences about what backing memory is?
It's also not a big deal but it is nice to include something like: This was found by inspection.
and:
Reported-by: Dave Hansen dave.hansen@linux.intel.com
Also, what is the end-user-visible effect of this bug? What would a user see if they were impacted by this? How did you determine that this fixed the issue? This memory will be recovered when the enclave is torn down, right?
Do we also need to deal with truncating the PCMD? (For those watching along at home, there are two things SGX swaps to RAM: the actual page data and also some metadata that ensures page integrity and helps prevent things like rolling back to old versions of swapped pages)
On Thu, 2021-11-04 at 06:50 -0700, Dave Hansen wrote:
On 11/3/21 4:22 PM, Jarkko Sakkinen wrote:
The backing memory was not freed, after loading it back to the SGX reserved memory. This problem was not prevalent with the systems that were available at the time when SGX was first upstreamed, as the swapped memory could grow only up to 128 MB. However, Icelake Server can have gigabytes of SGX memory, and thus this has become a real bottleneck.
Free the swap space for other use by calling shmem_truncate_range(), when a page is faulted back.
This needs a _bit_ more context. Could you include a few sentences about what backing memory is?
It's also not a big deal but it is nice to include something like: This was found by inspection.
and:
Reported-by: Dave Hansen dave.hansen@linux.intel.com
Oops, I'm sorry, I was planning to add this but forgot to do it.
Also, what is the end-user-visible effect of this bug? What would a user see if they were impacted by this? How did you determine that this fixed the issue? This memory will be recovered when the enclave is torn down, right?
You're absolutely right.
I just realized how to make the whole thing concrete and reproduce OOM with the 128 MB EPC that I have in my i5-9600KF desktop. I'll just reconfigure my test VM to have sufficiently small amount of RAM, and come up with an appropriate workload.
Do we also need to deal with truncating the PCMD? (For those watching along at home, there are two things SGX swaps to RAM: the actual page data and also some metadata that ensures page integrity and helps prevent things like rolling back to old versions of swapped pages)
Yes.
This can be achieved by iterating through all of the enclave pages, which share the same shmem page for storing their PCMD's, as the one being faulted back. If none of those pages is swapped, the PCMD page can safely truncated.
I guess it makes sense to put this into patch of its own.
/Jarkko
On Thu, 2021-11-04 at 17:04 +0200, Jarkko Sakkinen wrote:
This can be achieved by iterating through all of the enclave pages, which share the same shmem page for storing their PCMD's, as the one being faulted back. If none of those pages is swapped, the PCMD page can safely truncated.
We have bookkeeping in place for this: encl->page_array.
/Jarkko
On 11/4/21 8:04 AM, Jarkko Sakkinen wrote:
Do we also need to deal with truncating the PCMD? (For those watching along at home, there are two things SGX swaps to RAM: the actual page data and also some metadata that ensures page integrity and helps prevent things like rolling back to old versions of swapped pages)
Yes.
This can be achieved by iterating through all of the enclave pages, which share the same shmem page for storing their PCMD's, as the one being faulted back. If none of those pages is swapped, the PCMD page can safely truncated.
I was thinking we could just read the page. If it's all 0's, truncate it.
On Thu, 2021-11-04 at 08:13 -0700, Dave Hansen wrote:
On 11/4/21 8:04 AM, Jarkko Sakkinen wrote:
Do we also need to deal with truncating the PCMD? (For those watching along at home, there are two things SGX swaps to RAM: the actual page data and also some metadata that ensures page integrity and helps prevent things like rolling back to old versions of swapped pages)
Yes.
This can be achieved by iterating through all of the enclave pages, which share the same shmem page for storing their PCMD's, as the one being faulted back. If none of those pages is swapped, the PCMD page can safely truncated.
I was thinking we could just read the page. If it's all 0's, truncate it.
Hmm... did ELDU zero PCMD as a side-effect?
It should be fairly effecient just to check the pages by using encl->page_tree.
/Jarkko
On 11/4/21 8:25 AM, Jarkko Sakkinen wrote:
On Thu, 2021-11-04 at 08:13 -0700, Dave Hansen wrote:
On 11/4/21 8:04 AM, Jarkko Sakkinen wrote:
Do we also need to deal with truncating the PCMD? (For those watching along at home, there are two things SGX swaps to RAM: the actual page data and also some metadata that ensures page integrity and helps prevent things like rolling back to old versions of swapped pages)
Yes.
This can be achieved by iterating through all of the enclave pages, which share the same shmem page for storing their PCMD's, as the one being faulted back. If none of those pages is swapped, the PCMD page can safely truncated.
I was thinking we could just read the page. If it's all 0's, truncate it.
Hmm... did ELDU zero PCMD as a side-effect?
I don't think so, but there's nothing stopping us from doing it ourselves.
It should be fairly effecient just to check the pages by using encl->page_tree.
That sounds more complicated and slower than what I suggested. You could even just check the refcount on the page. I _think_ page cache pages have a refcount of 2. So, look for the refcount that means "no more PCMD in this page", and just free it if so.
On Thu, Nov 04, 2021 at 08:29:49AM -0700, Dave Hansen wrote:
On 11/4/21 8:25 AM, Jarkko Sakkinen wrote:
On Thu, 2021-11-04 at 08:13 -0700, Dave Hansen wrote:
On 11/4/21 8:04 AM, Jarkko Sakkinen wrote:
Do we also need to deal with truncating the PCMD? (For those watching along at home, there are two things SGX swaps to RAM: the actual page data and also some metadata that ensures page integrity and helps prevent things like rolling back to old versions of swapped pages)
Yes.
This can be achieved by iterating through all of the enclave pages, which share the same shmem page for storing their PCMD's, as the one being faulted back. If none of those pages is swapped, the PCMD page can safely truncated.
I was thinking we could just read the page. If it's all 0's, truncate it.
Hmm... did ELDU zero PCMD as a side-effect?
I don't think so, but there's nothing stopping us from doing it ourselves.
Ok.
It should be fairly effecient just to check the pages by using encl->page_tree.
That sounds more complicated and slower than what I suggested. You could even just check the refcount on the page. I _think_ page cache pages have a refcount of 2. So, look for the refcount that means "no more PCMD in this page", and just free it if so.
Umh, so... there is total 32 PCMD's per one page.
/Jarkko
On 11/7/21 11:42 AM, Jarkko Sakkinen wrote:
It should be fairly effecient just to check the pages by using encl->page_tree.
That sounds more complicated and slower than what I suggested. You could even just check the refcount on the page. I _think_ page cache pages have a refcount of 2. So, look for the refcount that means "no more PCMD in this page", and just free it if so.
Umh, so... there is total 32 PCMD's per one page.
When you place PCMD in a page, you do a get_page(). The refcount goes up by one. So, a PCMD page with one PCMD will (I think) have a refcount of 3. If you totally fill it up with 31 *more* PCMD entries, it will have a refcount of 34. You do *not* do a put_page() on the PCMD page at the end of the allocation phase.
When the backing storage is freed, you drop the refcount. So, going from 32 PCMD entries to 31 entries in a page, you go from 34->33.
When that refcount drops to 2, you know there is no more data in the page that's useful. At that point you can truncate it out of the backing storage.
There's no reason to scan the page, or a boatload of other metadata. Just keep a refcount. Just use the *existing* 'struct page' refcount.
On Sun, 2021-11-07 at 11:51 -0800, Dave Hansen wrote:
On 11/7/21 11:42 AM, Jarkko Sakkinen wrote:
It should be fairly effecient just to check the pages by using encl->page_tree.
That sounds more complicated and slower than what I suggested. You could even just check the refcount on the page. I _think_ page cache pages have a refcount of 2. So, look for the refcount that means "no more PCMD in this page", and just free it if so.
Umh, so... there is total 32 PCMD's per one page.
When you place PCMD in a page, you do a get_page(). The refcount goes up by one. So, a PCMD page with one PCMD will (I think) have a refcount of 3. If you totally fill it up with 31 *more* PCMD entries, it will have a refcount of 34. You do *not* do a put_page() on the PCMD page at the end of the allocation phase.
When the backing storage is freed, you drop the refcount. So, going from 32 PCMD entries to 31 entries in a page, you go from 34->33.
When that refcount drops to 2, you know there is no more data in the page that's useful. At that point you can truncate it out of the backing storage.
There's no reason to scan the page, or a boatload of other metadata. Just keep a refcount. Just use the *existing* 'struct page' refcount.
Right! Thank you, I'll use this approach, and check that the refcount actually behaves that way you described.
/Jarkko
On 11/7/21 2:28 PM, Jarkko Sakkinen wrote:
When you place PCMD in a page, you do a get_page(). The refcount goes up by one. So, a PCMD page with one PCMD will (I think) have a refcount of 3. If you totally fill it up with 31 *more* PCMD entries, it will have a refcount of 34. You do *not* do a put_page() on the PCMD page at the end of the allocation phase.
When the backing storage is freed, you drop the refcount. So, going from 32 PCMD entries to 31 entries in a page, you go from 34->33.
When that refcount drops to 2, you know there is no more data in the page that's useful. At that point you can truncate it out of the backing storage.
There's no reason to scan the page, or a boatload of other metadata. Just keep a refcount. Just use the *existing* 'struct page' refcount.
Right! Thank you, I'll use this approach, and check that the refcount actually behaves that way you described.
Thinking about this a bit more... We don't want to use the normal get/put_page() refcount for this. If we do, it will pin the page while there is any data in it, preventing it from being reclaimed (swapped).
That isn't to say that we can't keep *a* refcount, just that we can't use the page refcount for this.
I still like the idea of just scanning the whole page for zeros.
On Sun, Nov 07, 2021 at 10:34:02PM -0800, Dave Hansen wrote:
On 11/7/21 2:28 PM, Jarkko Sakkinen wrote:
When you place PCMD in a page, you do a get_page(). The refcount goes up by one. So, a PCMD page with one PCMD will (I think) have a refcount of 3. If you totally fill it up with 31 *more* PCMD entries, it will have a refcount of 34. You do *not* do a put_page() on the PCMD page at the end of the allocation phase.
When the backing storage is freed, you drop the refcount. So, going from 32 PCMD entries to 31 entries in a page, you go from 34->33.
When that refcount drops to 2, you know there is no more data in the page that's useful. At that point you can truncate it out of the backing storage.
There's no reason to scan the page, or a boatload of other metadata. Just keep a refcount. Just use the *existing* 'struct page' refcount.
Right! Thank you, I'll use this approach, and check that the refcount actually behaves that way you described.
Thinking about this a bit more... We don't want to use the normal get/put_page() refcount for this. If we do, it will pin the page while there is any data in it, preventing it from being reclaimed (swapped).
That isn't to say that we can't keep *a* refcount, just that we can't use the page refcount for this.
I still like the idea of just scanning the whole page for zeros.
I can try that route first. I like the property in zeroing that it has predicatable O(1) cost.
/Jarkko
On 11/3/21 4:22 PM, Jarkko Sakkinen wrote:
--- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -22,6 +22,7 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, { unsigned long va_offset = encl_page->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK; struct sgx_encl *encl = encl_page->encl;
- struct inode *inode = file_inode(encl->backing); struct sgx_pageinfo pginfo; struct sgx_backing b; pgoff_t page_index;
@@ -60,6 +61,9 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, sgx_encl_put_backing(&b, false);
- /* Free the backing memory. */
- shmem_truncate_range(inode, PFN_PHYS(page_index), PFN_PHYS(page_index) + PAGE_SIZE - 1);
- return ret;
}
This also misses tearing down the backing storage if it is in place at sgx_encl_release().
Does a entry->epc_page==NULL page in there guarantee that it has backing storage?
On Thu, Nov 04, 2021 at 03:38:55PM -0700, Dave Hansen wrote:
On 11/3/21 4:22 PM, Jarkko Sakkinen wrote:
--- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -22,6 +22,7 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, { unsigned long va_offset = encl_page->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK; struct sgx_encl *encl = encl_page->encl;
- struct inode *inode = file_inode(encl->backing); struct sgx_pageinfo pginfo; struct sgx_backing b; pgoff_t page_index;
@@ -60,6 +61,9 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, sgx_encl_put_backing(&b, false);
- /* Free the backing memory. */
- shmem_truncate_range(inode, PFN_PHYS(page_index), PFN_PHYS(page_index) + PAGE_SIZE - 1);
- return ret;
}
This also misses tearing down the backing storage if it is in place at sgx_encl_release().
Hmm... sgx_encl_release() does fput(). Isn't that enough to tear it down, or does it require explicit truncate, i.e. something like
shmem_truncate_range(file_inode(encl->backing), encl->base, encl->size - 1);
Does a entry->epc_page==NULL page in there guarantee that it has backing storage?
Yes, it is an invariant. That what I was thinking to use for PCMD: iterate 32 pages and check if they have a faulted page.
/Jarkko
On 11/7/21 10:06 AM, Jarkko Sakkinen wrote:
On Thu, Nov 04, 2021 at 03:38:55PM -0700, Dave Hansen wrote:
On 11/3/21 4:22 PM, Jarkko Sakkinen wrote:
--- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -22,6 +22,7 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, { unsigned long va_offset = encl_page->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK; struct sgx_encl *encl = encl_page->encl;
- struct inode *inode = file_inode(encl->backing); struct sgx_pageinfo pginfo; struct sgx_backing b; pgoff_t page_index;
@@ -60,6 +61,9 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, sgx_encl_put_backing(&b, false);
- /* Free the backing memory. */
- shmem_truncate_range(inode, PFN_PHYS(page_index), PFN_PHYS(page_index) + PAGE_SIZE - 1);
- return ret;
}
This also misses tearing down the backing storage if it is in place at sgx_encl_release().
Hmm... sgx_encl_release() does fput(). Isn't that enough to tear it down, or does it require explicit truncate, i.e. something like
shmem_truncate_range(file_inode(encl->backing), encl->base, encl->size - 1);
That's true, the page cache should all be torn down along with the fput(). *But*, it would be a very nice property if the backing storage was empty by this point. It essentially ensures that no enclave-runtime cases missed truncating the backing storage away.
Does a entry->epc_page==NULL page in there guarantee that it has backing storage?
Yes, it is an invariant. That what I was thinking to use for PCMD: iterate 32 pages and check if they have a faulted page.
I think the rule should be that entry->epc_page==NULL enclave pages have backing storage. All entry->epc_page!=NULL do *not* have backing storage.
On Sun, Nov 07, 2021 at 11:06:01AM -0800, Dave Hansen wrote:
On 11/7/21 10:06 AM, Jarkko Sakkinen wrote:
On Thu, Nov 04, 2021 at 03:38:55PM -0700, Dave Hansen wrote:
On 11/3/21 4:22 PM, Jarkko Sakkinen wrote:
--- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -22,6 +22,7 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, { unsigned long va_offset = encl_page->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK; struct sgx_encl *encl = encl_page->encl;
- struct inode *inode = file_inode(encl->backing); struct sgx_pageinfo pginfo; struct sgx_backing b; pgoff_t page_index;
@@ -60,6 +61,9 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, sgx_encl_put_backing(&b, false);
- /* Free the backing memory. */
- shmem_truncate_range(inode, PFN_PHYS(page_index), PFN_PHYS(page_index) + PAGE_SIZE - 1);
- return ret;
}
This also misses tearing down the backing storage if it is in place at sgx_encl_release().
Hmm... sgx_encl_release() does fput(). Isn't that enough to tear it down, or does it require explicit truncate, i.e. something like
shmem_truncate_range(file_inode(encl->backing), encl->base, encl->size - 1);
That's true, the page cache should all be torn down along with the fput(). *But*, it would be a very nice property if the backing storage was empty by this point. It essentially ensures that no enclave-runtime cases missed truncating the backing storage away.
What if an enclave is released a point when all of its pages are swapped out? Or even simpler case would an enclave that is larger than all of EPC.
What can be made sure is that for all pages, which are in EPC, the backing page is truncated.
Does a entry->epc_page==NULL page in there guarantee that it has backing storage?
Yes, it is an invariant. That what I was thinking to use for PCMD: iterate 32 pages and check if they have a faulted page.
I think the rule should be that entry->epc_page==NULL enclave pages have backing storage. All entry->epc_page!=NULL do *not* have backing storage.
Yes, that is the goal of this patch.
/Jarkko
On 11/7/21 11:45 AM, Jarkko Sakkinen wrote:
On Sun, Nov 07, 2021 at 11:06:01AM -0800, Dave Hansen wrote:
That's true, the page cache should all be torn down along with the fput(). *But*, it would be a very nice property if the backing storage was empty by this point. It essentially ensures that no enclave-runtime cases missed truncating the backing storage away.
What if an enclave is released a point when all of its pages are swapped out? Or even simpler case would an enclave that is larger than all of EPC.
In this loop:
void sgx_encl_release(struct kref *ref) { ... xa_for_each(&encl->page_array, index, entry) { if (entry->epc_page == NULL) // truncate backing storage
What can be made sure is that for all pages, which are in EPC, the backing page is truncated.
Right, and that should be utterly trivial to do.
linux-stable-mirror@lists.linaro.org