Re: BUG selftests/mm]

12 Mar 2024


On Mon, Mar 11, 2024 at 03:28:28PM -0700, Jiaqi Yan wrote:
...
On Mon, Mar 11, 2024 at 2:27 PM James Houghton <jthoughton@google.com> wrote:
...
On Mon, Mar 11, 2024 at 12:28 PM Peter Xu <peterx@redhat.com> wrote:
...
On Mon, Mar 11, 2024 at 11:59:59AM -0700, Axel Rasmussen wrote:
...
I'd prefer not to require root or CAP_SYS_ADMIN or similar for
UFFDIO_POISON, because those control access to lots more things
besides, which we don't necessarily want the process using UFFD to be
able to do. :/
I agree; UFFDIO_POISON should not require CAP_SYS_ADMIN.
+1.
...
...
...
Ratelimiting seems fairly reasonable to me. I do see the concern about
dropping some addresses though.
Do you know how much an admin could rely on such addresses?  How frequently
would MCEs normally be generated on a sane system?
I'm not sure how much admins rely on the addresses themselves. +cc
Jiaqi Yan
I think admins mostly care about MCEs from **real** hardware. For
example they may choose to perform some maintenance if the number of
hardware DIMM errors, keyed by PFN, exceeds some threshold. And I
think mcelog or /sys/devices/system/node/node${X}/memory_failure are
better tools than dmesg. In the case where all memory errors are emulated
by the hypervisor after a live migration, these dmesg lines may mislead
admins into thinking there is a DIMM error on the host when there is none.
In this sense, silencing the errors emulated by UFFDIO_POISON makes sense
(if not too complicated to do).
Now we have three types of such error: (1) PFN poisoned, (2) swapin error,
(3) emulated.  Both 1+2 should deserve a global message dump, while (3)
should be process-internal, and nobody else should need to care except the
process itself (via the signal + meta info).
If we want to differentiate (2) vs. (3), we may need one more pte marker bit
to show whether such poison is "global" or "local" (as of now 2+3 share the
same PTE_MARKER_POISONED bit); a swapin error can still be seen as a
"global" error (instead of a mem error, it can be a disk error, and the err
msg still applies to it, describing a corrupted VA).
Another VM_FAULT_* flag is also needed to reflect that locality, so that the
global broadcast can be skipped for "local" poison faults.
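The global/local split proposed above can be modelled in user space. This is only a rough sketch of the idea, not kernel code: PTE_MARKER_POISONED is the one name taken from the mail, and every identifier below (including the bit values and the "local" bit) is hypothetical.

```c
#include <stdbool.h>

/* The three error types listed above. */
enum poison_source {
    POISON_PFN,           /* (1) real PFN poisoned */
    POISON_SWAPIN_ERROR,  /* (2) swapin error */
    POISON_EMULATED       /* (3) emulated, e.g. via UFFDIO_POISON */
};

/* Hypothetical marker bits; the values are made up for this model. */
#define MODEL_MARKER_POISONED      (1u << 0) /* stands in for PTE_MARKER_POISONED */
#define MODEL_MARKER_POISON_LOCAL  (1u << 1) /* the proposed extra "local" bit */

/* Pick the marker bits this model would install for a given source. */
static unsigned int model_marker_for(enum poison_source src)
{
    switch (src) {
    case POISON_SWAPIN_ERROR:
        return MODEL_MARKER_POISONED;
    case POISON_EMULATED:
        return MODEL_MARKER_POISONED | MODEL_MARKER_POISON_LOCAL;
    default:
        return 0; /* in this model, real PFN poison carries no marker */
    }
}

/* A fault deserves a global message unless it is marked local. */
static bool model_should_broadcast(unsigned int marker)
{
    return !(marker & MODEL_MARKER_POISON_LOCAL);
}
```

With this model, cases (1) and (2) still produce the global message, while case (3) is reported only to the faulting process through the signal.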
...
The SIGBUS (and the logged "MCE: Killing %s:%d due to hardware memory
corruption fault at %lx\n") emitted by the fault handler due to
UFFDIO_POISON is less useful to admins AFAIK. It is for sure crucial to
userspace / vmm / hypervisor, but the SIGBUS sent already contains the
poisoned address (in si_addr from force_sig_mceerr).
...
It's possible for a sane hypervisor dealing with a buggy guest / guest
userspace to trigger lots of these pr_errs. Consider the case where a
guest userspace uses HugeTLB-1G, finds poison (which HugeTLB used to
ignore), and then ignores SIGBUS. It will keep getting MCEs /
SIGBUSes.
The sane hypervisor will use UFFDIO_POISON to prevent the guest from
re-accessing *real* poison, but we will still get the pr_err, and we
still keep injecting MCEs into the guest. We have observed scenarios
like this before.
...
...
Perhaps we can mitigate that concern by defining our own ratelimit
interval/burst configuration?
Any details?
...
Another idea would be to only ratelimit it if !CONFIG_DEBUG_VM or
similar. Not sure if that's considered valid or not. :)
This, OTOH, sounds like an overkill..
I just checked the ratelimit code again; by default it has:
#define DEFAULT_RATELIMIT_INTERVAL      (5 * HZ)
#define DEFAULT_RATELIMIT_BURST         10
So it allows a burst of 10 rather than 2.  IIUC, that means even 10
continuous MCEs won't get suppressed; suppression only starts when the 11th
arrives within the 5-second interval.  I think that makes it even less of a
concern to use pr_err_ratelimited() directly.
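To make the burst arithmetic concrete, here is a small user-space model of interval/burst rate limiting. It is a sketch written for this discussion, not the kernel implementation: the struct and function names are hypothetical, and HZ is assumed to be 1000 purely for the model.

```c
#include <stdbool.h>

#define HZ 1000 /* assumption for this model; the kernel value is config-dependent */
#define DEFAULT_RATELIMIT_INTERVAL (5 * HZ)
#define DEFAULT_RATELIMIT_BURST 10

struct ratelimit_model {
    long interval; /* window length in ticks */
    int burst;     /* messages allowed per window */
    long begin;    /* tick at which the current window started */
    int printed;   /* messages emitted in the current window */
    int missed;    /* messages suppressed so far */
    bool started;  /* whether a window has been opened yet */
};

/* Returns true if a message at tick `now` may be printed. */
static bool ratelimit_model_allow(struct ratelimit_model *rs, long now)
{
    /* Open a new window on first use or once the interval has passed. */
    if (!rs->started || now > rs->begin + rs->interval) {
        rs->begin = now;
        rs->printed = 0;
        rs->started = true;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return true;
    }
    rs->missed++;
    return false;
}
```

Feeding 11 back-to-back events into a model initialised with the defaults lets the first 10 through and suppresses only the 11th; once the 5-second window has passed, messages are allowed again.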
I'm okay with any rate limiting everyone agrees on. IMO, silencing
these pr_errs if they came from UFFDIO_POISON (or, perhaps, if they
did not come from real hardware MCE events) sounds like the most
correct thing to do, but I don't mind. Just don't make UFFDIO_POISON
require CAP_SYS_ADMIN. :)
Thanks.
-- 
Peter Xu
