Greetings,
I trust you are well. I sent you an email yesterday, I just want to confirm if you received it.
Please let me know as soon as possible,
Regard
Mrs Alice Walton
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong as it shows more soft-dirty
pages than the actual soft-dirty pages. There is no interest in
correcting it [A][B] as this is how the feature was written years ago.
It shouldn't be updated to changed behaviour. Peter Xu has suggested
using the async version of the UFFD WP [C] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
[A] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[B] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[C] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
*Changes in v6:*
- Updated the interface and made cosmetic changes
*Cover Letter in v5:*
Hello,
This patch series implements IOCTL on the pagemap procfs file to get the
information about the page table entries (PTEs). The following operations
are supported in this ioctl:
- Get the information if the pages are soft-dirty, file mapped, present
or swapped.
- Clear the soft-dirty PTE bit of the pages.
- Get and clear the soft-dirty PTE bit of the pages atomically.
Soft-dirty PTE bit of the memory pages can be read by using the pagemap
procfs file. The soft-dirty PTE bit for the whole memory range of the
process can be cleared by writing to the clear_refs file. There are other
methods to mimic this information entirely in userspace with poor
performance:
- The mprotect syscall and SIGSEGV handler for bookkeeping
- The userfaultfd syscall with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get soft-dirty PTE bit status and clear operation
possible.
- The soft-dirty PTE bit of only a part of memory cannot be cleared.
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
getWriteWatch() syscall of Windows. This syscall is used by games to
keep track of dirty pages to process only the dirty pages.
The information related to pages if the page is file mapped, present and
swapped is required for the CRIU project[2][3]. The addition of the
required mask, any mask, excluded mask and return masks are also required
for the CRIU project[2].
The IOCTL returns the addresses of the pages which match the specific masks.
The page addresses are returned in struct page_region in a compact form.
The max_pages is needed to support a use case where user only wants to get
a specific number of pages. So there is no need to find all the pages of
interest in the range when max_pages is specified. The IOCTL returns when
the maximum number of the pages are found. The max_pages is optional. If
max_pages is specified, it must be equal or greater than the vec_size.
This restriction is needed to handle worse case when one page_region only
contains info of one page and it cannot be compacted. This is needed to
emulate the Windows getWriteWatch() syscall.
Some non-dirty pages get marked as dirty because of the kernel's
internal activity (such as VMA merging as soft-dirty bit difference isn't
considered while deciding to merge VMAs). The dirty bit of the pages is
stored in the VMA flags and in the per page flags. If any of these two bits
are set, the page is considered to be soft dirty. Suppose you have cleared
the soft dirty bit of half of VMA which will be done by splitting the VMA
and clearing soft dirty bit flag in the half VMA and the pages in it. Now
kernel may decide to merge the VMAs again. So the half VMA becomes dirty
again. This splitting/merging costs performance. The application receives
a lot of pages which aren't dirty in reality but marked as dirty.
Performance is lost again here. Also sometimes user doesn't want the newly
allocated memory to be marked as dirty. PAGEMAP_NO_REUSED_REGIONS flag
solves both the problems. It is used to not depend on the soft dirty flag
in the VMA flags. So VMA splitting and merging doesn't happen. It only
depends on the soft dirty bit of the individual pages. Thus by using this
flag, there may be a scenerio such that the new memory regions which are
just created, doesn't look dirty when seen with the IOCTL, but look dirty
when seen from procfs. This seems okay as the user of this flag know the
implication of using it.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[3] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
userfaultfd: Add UFFD WP Async support
userfaultfd: split mwriteprotect_range()
fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about
PTEs
selftests: vm: add pagemap ioctl tests
fs/proc/task_mmu.c | 300 +++++++
fs/userfaultfd.c | 161 ++--
include/linux/userfaultfd_k.h | 10 +
include/uapi/linux/fs.h | 50 ++
include/uapi/linux/userfaultfd.h | 6 +
mm/userfaultfd.c | 40 +-
tools/include/uapi/linux/fs.h | 50 ++
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 5 +-
tools/testing/selftests/vm/pagemap_ioctl.c | 884 +++++++++++++++++++++
10 files changed, 1424 insertions(+), 83 deletions(-)
create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
--
2.30.2
Before these patches, the in-kernel Path-Manager would not allow, for
the same MPTCP connection, having a mix of subflows in v4 and v6.
MPTCP's RFC 8684 doesn't forbid that and it is even recommended to do so
as the path in v4 and v6 are likely different. Some networks are also
v4 or v6 only, we cannot assume they all have both v4 and v6 support.
Patch 1 then removes this artificial constraint in the in-kernel PM
currently enforcing there are no mixed subflows in place, either in
address announcement or in subflow creation areas.
Patch 2 makes sure the sk_ipv6only attribute is also propagated to
subflows, just in case a new PM wouldn't respect it.
Some selftests have also been added for the in-kernel PM (patch 3).
Patches 4 to 8 are just some cleanups and small improvements in the
printed messages in the userspace PM. It is not linked to the rest but
identified when working on a related patch modifying this selftest,
already in -net:
commit 4656d72c1efa ("selftests: mptcp: userspace: validate v4-v6 subflows mix")
---
Matthieu Baerts (6):
mptcp: propagate sk_ipv6only to subflows
mptcp: userspace pm: use a single point of exit
selftests: mptcp: userspace: print titles
selftests: mptcp: userspace: refactor asserts
selftests: mptcp: userspace: print error details if any
selftests: mptcp: userspace: avoid read errors
Paolo Abeni (2):
mptcp: let the in-kernel PM use mixed IPv4 and IPv6 addresses
selftests: mptcp: add test-cases for mixed v4/v6 subflows
net/mptcp/pm_netlink.c | 58 ++++----
net/mptcp/pm_userspace.c | 5 +-
net/mptcp/sockopt.c | 1 +
tools/testing/selftests/net/mptcp/mptcp_join.sh | 53 ++++++--
tools/testing/selftests/net/mptcp/userspace_pm.sh | 153 +++++++++++++---------
5 files changed, 171 insertions(+), 99 deletions(-)
---
base-commit: 4373a023e0388fc19e27d37f61401bce6ff4c9d7
change-id: 20230123-upstream-net-next-pm-v4-v6-b186481a4b00
Best regards,
--
Matthieu Baerts <matthieu.baerts(a)tessares.net>
From: Andrei <andrei.gherzan(a)canonical.com>
The udpgro_frglist.sh uses nat6to4.o which is tested for existence in
bpf/nat6to4.o (relative to the script). This is where the object is
compiled. Even so, the script attempts to use it as part of tc with a
different path (../bpf/nat6to4.o). As a consequence, this fails the script:
Error opening object ../bpf/nat6to4.o: No such file or directory
Cannot initialize ELF context!
Unable to load program
This change refactors these references to use a variable for consistency
and also reformats two long lines.
Signed-off-by: Andrei <andrei.gherzan(a)canonical.com>
---
tools/testing/selftests/net/udpgro_frglist.sh | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/net/udpgro_frglist.sh b/tools/testing/selftests/net/udpgro_frglist.sh
index c9c4b9d65839..1fdf2d53944d 100755
--- a/tools/testing/selftests/net/udpgro_frglist.sh
+++ b/tools/testing/selftests/net/udpgro_frglist.sh
@@ -6,6 +6,7 @@
readonly PEER_NS="ns-peer-$(mktemp -u XXXXXX)"
BPF_FILE="../bpf/xdp_dummy.bpf.o"
+BPF_NAT6TO4_FILE="./bpf/nat6to4.o"
cleanup() {
local -r jobs="$(jobs -p)"
@@ -40,8 +41,12 @@ run_one() {
ip -n "${PEER_NS}" link set veth1 xdp object ${BPF_FILE} section xdp
tc -n "${PEER_NS}" qdisc add dev veth1 clsact
- tc -n "${PEER_NS}" filter add dev veth1 ingress prio 4 protocol ipv6 bpf object-file ../bpf/nat6to4.o section schedcls/ingress6/nat_6 direct-action
- tc -n "${PEER_NS}" filter add dev veth1 egress prio 4 protocol ip bpf object-file ../bpf/nat6to4.o section schedcls/egress4/snat4 direct-action
+ tc -n "${PEER_NS}" filter add dev veth1 ingress prio 4 protocol \
+ ipv6 bpf object-file "$BPF_NAT6TO4_FILE" section \
+ schedcls/ingress6/nat_6 direct-action
+ tc -n "${PEER_NS}" filter add dev veth1 egress prio 4 protocol \
+ ip bpf object-file "$BPF_NAT6TO4_FILE" section \
+ schedcls/egress4/snat4 direct-action
echo ${rx_args}
ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${rx_args} -r &
@@ -88,7 +93,7 @@ if [ ! -f ${BPF_FILE} ]; then
exit -1
fi
-if [ ! -f bpf/nat6to4.o ]; then
+if [ ! -f "$BPF_NAT6TO4_FILE" ]; then
echo "Missing nat6to4 helper. Build bpfnat6to4.o selftest first"
exit -1
fi
--
2.34.1