Re: [PATCH 1/3] userfaultfd/selftests: fix feature support detection

24 Sep 2021


On Wed, Sep 22, 2021 at 10:43:40PM -0700, Jue Wang wrote:
[...]
Could I know what's the workaround?  Normally if the workaround works solidly,
then there's less need to introduce a kernel interface for that.  Otherwise I'm
glad to look into such a formal proposal.
The workaround is, for the region that you want to zap, run through
this sequence of syscalls: munmap, mmap, and re-register with
userfaultfd if it was registered before. If we're using tmpfs, we can
use madvise(MADV_DONTNEED) instead, but this is kind of an abuse of the
API. I don't think there's a guarantee that the PTEs will get zapped,
but currently they will always get zapped if we're using tmpfs. I
really like the idea of adding a new madvise() mode that is guaranteed
to zap the PTEs.
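
For readers following along, the munmap/mmap/re-register sequence described
above could look roughly like the sketch below. This is only an illustration
under stated assumptions: zap_by_remap() and zap_by_remap_demo() are
hypothetical names, the region is assumed to be a MAP_SHARED file mapping (so
the contents survive in the page cache), and tmpfile() merely stands in for a
tmpfs file or memfd. Note that between munmap and mmap the address range is
briefly unmapped, so another thread could race with it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: drop the PTEs for [addr, addr + len) by unmapping and
 * remapping the same range of a shared file mapping. munmap also drops any
 * userfaultfd registration on the range, so if uffd >= 0 the range is
 * registered again with the given mode (e.g. UFFDIO_REGISTER_MODE_MISSING).
 * Returns 0 on success, -1 on failure with errno set. If mmap fails after
 * munmap succeeded, the range is left unmapped.
 */
static int zap_by_remap(void *addr, size_t len, int fd, off_t offset,
                        int uffd, unsigned long long uffd_mode)
{
    assert(addr != NULL && len > 0);

    if (munmap(addr, len) != 0)
        return -1;

    /* MAP_FIXED places the new mapping back at the original address. */
    void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, offset);
    if (p == MAP_FAILED)
        return -1;

    if (uffd >= 0) {
        struct uffdio_register reg = {
            .range = { .start = (unsigned long)addr, .len = len },
            .mode = uffd_mode,
        };
        if (ioctl(uffd, UFFDIO_REGISTER, &reg) != 0)
            return -1;
    }
    return 0;
}

/*
 * Demo without userfaultfd (uffd = -1): returns 0 if the data written before
 * the remap is still visible afterwards, i.e. only the mapping was rebuilt.
 */
static int zap_by_remap_demo(void)
{
    const size_t len = 1 << 16; /* multiple of 4K, 16K and 64K page sizes */
    int rc = -1;
    FILE *f = tmpfile();
    if (f == NULL)
        return -1;
    int fd = fileno(f);

    if (ftruncate(fd, (off_t)len) == 0) {
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            p[0] = 'x';
            if (zap_by_remap(p, len, fd, 0, -1, 0) == 0 && p[0] == 'x')
                rc = 0;
            munmap(p, len);
        }
    }
    fclose(f);
    return rc;
}
```

A caller with a live userfaultfd would pass its descriptor and the mode it
originally registered with instead of -1 and 0.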
I see.
...
It's also useful for memory poisoning, I think, if the host
decides some page(s) are "bad" and wants to intercept any future guest
accesses to those page(s).
Curious: doesn't hwpoison information come from MCEs, i.e. from the host kernel
side?  Then I'd think the host kernel has full control of it already.
Or is there some other way for the host to detect that some pages are going to
be rotten, so that userspace can do something before the kernel handles those
exceptions?
Here's a general idea of how we would like to use userfaultfd to support MPR:
If a guest accesses a poisoned page for the first time, we will get an
MCE through the host kernel and send an MCE to the guest. The guest
will now no longer be able to access this page, and we have to enforce
this. After a live migration, the pages that were poisoned before
probably won't still be poisoned (from the host's perspective), so we
can't rely on the host kernel's MCE handling path. This is where
userfaultfd and this new madvise mode come in: we can just
madvise(MADV_ZAP) the poisoned page(s) on the target during a
migration. Now all accesses will be routed to the VMM and we can
inject an MCE. We don't *need* the new madvise mode, as we can also
use fallocate(FALLOC_FL_PUNCH_HOLE) (works for tmpfs and hugetlbfs), but it
would be more convenient if we didn't have to use fallocate.
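
As an illustration of the fallocate alternative mentioned above, here is a
minimal sketch, assuming the guest memory is backed by a tmpfs-style file.
punch_hole() and punch_hole_demo() are hypothetical names, and the demo uses a
memfd as a stand-in for the real backing file. It does not register
userfaultfd, so the access after the punch is simply served with a fresh zero
page; with the range registered for missing faults, that access would
presumably be delivered to the VMM instead.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: free the backing pages for [offset, offset + len) of
 * the file without changing its size. Existing mappings of that range lose
 * their pages, so the next access faults again.
 */
static int punch_hole(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}

/* Demo on a memfd: returns 0 if the punched range reads back as zeros. */
static int punch_hole_demo(void)
{
    const size_t len = 1 << 16; /* multiple of 4K, 16K and 64K page sizes */
    int fd = memfd_create("poison-demo", 0);
    if (fd < 0)
        return -1;

    int rc = -1;
    if (ftruncate(fd, (off_t)len) == 0) {
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            memset(p, 0xab, len);          /* populate the pages */
            rc = punch_hole(fd, 0, (off_t)len);
            if (rc == 0 && p[0] != 0)      /* old contents must be gone */
                rc = -1;
            munmap(p, len);
        }
    }
    close(fd);
    return rc;
}
```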
Jue Wang can provide more context here, so I've cc'd him. There may be
some things I'm wrong about, so Jue feel free to correct me.
James is right.
The page is marked PG_HWPoison in the source VM host's kernel. The need
to intercept guest accesses to it exists on the target VM host, where
the same physical page is no longer poisoned.
On the target host, the hypervisor needs to intercept all guest accesses
to pages poisoned from the source VM host.
Thanks for this information, James, Jue, Axel.  I'm not familiar with memory
failures yet, so please bear with me on a few naive questions.
So now I understand that hw-poisoned pages on the src host do not mean these
pages will be hw-poisoned on the dest host too, but I may have missed the reason
why the dest host needs to trap them with the pgtable entries removed.
AFAIU after pages get hw-poisoned on src, and after the vmm injects MCEs into the
guest, the guest shouldn't be accessing these pages any more, am I right?  Then
after migration completes, IIUC the guest shouldn't be accessing these pages
either.  My current understanding is that, instead of trapping these pages on
dest, we should just (somehow, I have no real idea how...) un-hw-poison these
pages after migration, because they are very likely normal pages there.  When
real hw-poisoned pages are reported on the dst host, we should inject MCEs into
the guest again, for that other set of pages.
Could you tell me what I missed?
Thanks,
-- 
Peter Xu
