On Thu, Oct 06, 2022 at 11:29:29AM +0200, David Hildenbrand wrote:
On 05.10.22 22:35, Peter Xu wrote:
On Fri, Sep 30, 2022 at 04:19:28PM +0200, David Hildenbrand wrote:
Let's stop breaking COW via a fake write fault and let's use FAULT_FLAG_UNSHARE instead. This avoids any wrong side effects of the fake write fault, such as mapping the PTE writable and marking the pte dirty/softdirty.
Also, this fixes KSM interaction with userfaultfd-wp: when we have a KSM page that's write-protected by userfaultfd, break_ksm()->handle_mm_fault() will fail with VM_FAULT_SIGBUS and will simpy return in break_ksm() with 0. The warning in dmesg indicates this wrong handling:
[ 230.096368] FAULT_FLAG_ALLOW_RETRY missing 881 [ 230.100822] CPU: 1 PID: 1643 Comm: ksm-uffd-wp [...] [ 230.110124] Hardware name: [...] [ 230.117775] Call Trace: [ 230.120227] <TASK> [ 230.122334] dump_stack_lvl+0x44/0x5c [ 230.126010] handle_userfault.cold+0x14/0x19 [ 230.130281] ? tlb_finish_mmu+0x65/0x170 [ 230.134207] ? uffd_wp_range+0x65/0xa0 [ 230.137959] ? _raw_spin_unlock+0x15/0x30 [ 230.141972] ? do_wp_page+0x50/0x590 [ 230.145551] __handle_mm_fault+0x9f5/0xf50 [ 230.149652] ? mmput+0x1f/0x40 [ 230.152712] handle_mm_fault+0xb9/0x2a0 [ 230.156550] break_ksm+0x141/0x180 [ 230.159964] unmerge_ksm_pages+0x60/0x90 [ 230.163890] ksm_madvise+0x3c/0xb0 [ 230.167295] do_madvise.part.0+0x10c/0xeb0 [ 230.171396] ? do_syscall_64+0x67/0x80 [ 230.175157] __x64_sys_madvise+0x5a/0x70 [ 230.179082] do_syscall_64+0x58/0x80 [ 230.182661] ? do_syscall_64+0x67/0x80 [ 230.186413] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Since it's already there, worth adding the test into ksm_test.c?
Yes, I can give it a try. What I dislike about ksm_test is that it's a mixture of benchmarks and test cases that have to explicitly triggered by parameters. It's not a simple "run all available test cases" tests as we know it. So maybe something separate (or having it as part of the uffd tests) makes more sense.
We can add an entry into run_vmtests.sh. That's also what current ksm_test does.
Yes adding into uffd test would work too, but I do have a plan that we should move functional tests out of userfaultfd.c, leaving that with the stress test only. Not really a big deal, though.
Consequently, we will no longer trigger a fake write fault and break COW without any such side-effects.
This is primarily a fix for KSM+userfaultfd-wp, however, the fake write fault was always questionable. As this fix is not easy to backport and it's not very critical, let's not cc stable.
A patch to cc most of the stable would probably need to still go with the old write approach but attaching ALLOW_RETRY. But I agree maybe that may not need to bother, or a report should have arrived earlier.. The unshare approach looks much cleaner indeed.
A fix without FAULT_FLAG_UNSHARE is not straight forward. We really don't want to notify user space about write events here (because there is none). And there is no way around the uffd handling in WP code without that.
FAULT_FLAG_ALLOW_RETRY would rely on userfaultfd triggering and having to resolve the WP event.
Right it'll be very much a false positive, but the userspace should be fine with it e.g. for live snapshot we need to copy page earlier; it still won't stop the process from running along the way. But I agree that's not ideal.