Currently, the test decides whether or not to test certain features (e.g., writeprotect support) essentially by examining command-line arguments. For example, if we're testing anonymous memory, then we should test writeprotect support as well (since it generally is supported for anonymous).
This is broken, however. Take writeprotect support as an example: sure it's supported for anon, but it also requires that we have CONFIG_HAVE_ARCH_USERFAULTFD_WP. I.e., it is not supported at all on aarch64. So, running the test on such an arch fails: it tries to test writeprotect for anon, but since it isn't *actually* supported, it fails.
So, instead of checking command-line arguments to the test, check the features the way the UFFD API intends: when we open a new userfaultfd, pass in the feature(s) this test case would like to try to exercise. The kernel reports back a subset of those features which are actually supported: check these returned flags to see if the features are *actually* supported.
(For a couple of cases, where *registration* would fail [with -EINVAL] even though UFFDIO_API reports the feature as supported, we have to check test_type as well as the feature flag.)
In some cases, we check immediately after opening the userfaultfd, and if the features are missing, we skip the entire test. In some other cases, we can proceed with "most" of the test, only skipping a few pieces.
This lets us remove the global test_uffdio_wp and test_uffdio_minor variables entirely.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- tools/testing/selftests/vm/userfaultfd.c | 94 +++++++++++------------- 1 file changed, 43 insertions(+), 51 deletions(-)
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 10ab56c2484a..2366caf90435 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -79,10 +79,6 @@ static int test_type; #define ALARM_INTERVAL_SECS 10 static volatile bool test_uffdio_copy_eexist = true; static volatile bool test_uffdio_zeropage_eexist = true; -/* Whether to test uffd write-protection */ -static bool test_uffdio_wp = false; -/* Whether to test uffd minor faults */ -static bool test_uffdio_minor = false;
static bool map_shared; static int shm_fd; @@ -90,6 +86,7 @@ static int huge_fd; static char *huge_fd_off0; static unsigned long long *count_verify; static int uffd = -1; +static uint64_t uffd_features; static int uffd_flags, finished, *pipefd; static char *area_src, *area_src_alias, *area_dst, *area_dst_alias; static char *zeropage; @@ -345,7 +342,7 @@ static struct uffd_test_ops hugetlb_uffd_test_ops = {
static struct uffd_test_ops *uffd_test_ops;
-static void userfaultfd_open(uint64_t *features) +static void userfaultfd_open(uint64_t features) { struct uffdio_api uffdio_api;
@@ -355,14 +352,20 @@ static void userfaultfd_open(uint64_t *features) uffd_flags = fcntl(uffd, F_GETFD, NULL);
uffdio_api.api = UFFD_API; - uffdio_api.features = *features; + uffdio_api.features = features; if (ioctl(uffd, UFFDIO_API, &uffdio_api)) err("UFFDIO_API failed.\nPlease make sure to " "run with either root or ptrace capability."); if (uffdio_api.api != UFFD_API) err("UFFDIO_API error: %" PRIu64, (uint64_t)uffdio_api.api);
- *features = uffdio_api.features; + uffd_features = uffdio_api.features; +} + +static inline bool uffd_wp_supported(void) +{ + return test_type == TEST_ANON && + (uffd_features & UFFD_FEATURE_PAGEFAULT_FLAG_WP); }
static inline void munmap_area(void **area) @@ -397,6 +400,7 @@ static void uffd_test_ctx_clear(void) err("close uffd"); uffd = -1; } + uffd_features = 0;
huge_fd_off0 = NULL; munmap_area((void **)&area_src); @@ -405,7 +409,7 @@ static void uffd_test_ctx_clear(void) munmap_area((void **)&area_dst_alias); }
-static void uffd_test_ctx_init_ext(uint64_t *features) +static void uffd_test_ctx_init(uint64_t features) { unsigned long nr, cpu;
@@ -445,11 +449,6 @@ static void uffd_test_ctx_init_ext(uint64_t *features) err("pipe"); }
-static inline void uffd_test_ctx_init(uint64_t features) -{ - uffd_test_ctx_init_ext(&features); -} - static int my_bcmp(char *str1, char *str2, size_t n) { unsigned long i; @@ -587,7 +586,7 @@ static int __copy_page(int ufd, unsigned long offset, bool retry) uffdio_copy.dst = (unsigned long) area_dst + offset; uffdio_copy.src = (unsigned long) area_src + offset; uffdio_copy.len = page_size; - if (test_uffdio_wp) + if (uffd_wp_supported()) uffdio_copy.mode = UFFDIO_COPY_MODE_WP; else uffdio_copy.mode = 0; @@ -778,7 +777,7 @@ static void *background_thread(void *arg) * at least the first half of the pages mapped already which * can be write-protected for testing */ - if (test_uffdio_wp) + if (uffd_wp_supported()) wp_range(uffd, (unsigned long)area_dst + start_nr * page_size, nr_pages_per_cpu * page_size, true);
@@ -1062,12 +1061,12 @@ static int userfaultfd_zeropage_test(void) printf("testing UFFDIO_ZEROPAGE: "); fflush(stdout);
- uffd_test_ctx_init(0); + uffd_test_ctx_init(UFFD_FEATURE_PAGEFAULT_FLAG_WP);
uffdio_register.range.start = (unsigned long) area_dst; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; - if (test_uffdio_wp) + if (uffd_wp_supported()) uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) err("register failure"); @@ -1089,7 +1088,7 @@ static int userfaultfd_events_test(void) struct uffdio_register uffdio_register; unsigned long expected_ioctls; pthread_t uffd_mon; - int err, features; + int err; pid_t pid; char c; struct uffd_stats stats = { 0 }; @@ -1097,16 +1096,15 @@ static int userfaultfd_events_test(void) printf("testing events (fork, remap, remove): "); fflush(stdout);
- features = UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP | - UFFD_FEATURE_EVENT_REMOVE; - uffd_test_ctx_init(features); + uffd_test_ctx_init(UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP | + UFFD_FEATURE_EVENT_REMOVE | UFFD_FEATURE_PAGEFAULT_FLAG_WP);
fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
uffdio_register.range.start = (unsigned long) area_dst; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; - if (test_uffdio_wp) + if (uffd_wp_supported()) uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) err("register failure"); @@ -1144,7 +1142,7 @@ static int userfaultfd_sig_test(void) unsigned long expected_ioctls; unsigned long userfaults; pthread_t uffd_mon; - int err, features; + int err; pid_t pid; char c; struct uffd_stats stats = { 0 }; @@ -1152,15 +1150,15 @@ static int userfaultfd_sig_test(void) printf("testing signal delivery: "); fflush(stdout);
- features = UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_SIGBUS; - uffd_test_ctx_init(features); + uffd_test_ctx_init(UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_SIGBUS | + UFFD_FEATURE_PAGEFAULT_FLAG_WP);
fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
uffdio_register.range.start = (unsigned long) area_dst; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; - if (test_uffdio_wp) + if (uffd_wp_supported()) uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) err("register failure"); @@ -1209,25 +1207,23 @@ static int userfaultfd_minor_test(void) void *expected_page; char c; struct uffd_stats stats = { 0 }; - uint64_t req_features, features_out; - - if (!test_uffdio_minor) - return 0; + uint64_t features;
printf("testing minor faults: "); fflush(stdout);
- if (test_type == TEST_HUGETLB) - req_features = UFFD_FEATURE_MINOR_HUGETLBFS; + if (test_type == TEST_HUGETLB && map_shared) + features = UFFD_FEATURE_MINOR_HUGETLBFS; else if (test_type == TEST_SHMEM) - req_features = UFFD_FEATURE_MINOR_SHMEM; - else - return 1; + features = UFFD_FEATURE_MINOR_SHMEM; + else { + printf("skipping test due to unsupported memory type\n"); + return 0; + }
- features_out = req_features; - uffd_test_ctx_init_ext(&features_out); + uffd_test_ctx_init(features); /* If kernel reports required features aren't supported, skip test. */ - if ((features_out & req_features) != req_features) { + if ((uffd_features & features) != features) { printf("skipping test due to lack of feature support\n"); fflush(stdout); return 0; @@ -1349,10 +1345,6 @@ static void userfaultfd_pagemap_test(unsigned int test_pgsize) int pagemap_fd; uint64_t value;
- /* Pagemap tests uffd-wp only */ - if (!test_uffdio_wp) - return; - /* Not enough memory to test this page size */ if (test_pgsize > nr_pages * page_size) return; @@ -1361,7 +1353,12 @@ static void userfaultfd_pagemap_test(unsigned int test_pgsize) /* Flush so it doesn't flush twice in parent/child later */ fflush(stdout);
- uffd_test_ctx_init(0); + uffd_test_ctx_init(UFFD_FEATURE_PAGEFAULT_FLAG_WP); + /* Pagemap tests uffd-wp only */ + if (!uffd_wp_supported()) { + printf("skipping test due to lack of feature support\n"); + return; + }
if (test_pgsize > page_size) { /* This is a thp test */ @@ -1426,7 +1423,7 @@ static int userfaultfd_stress(void) struct uffdio_register uffdio_register; struct uffd_stats uffd_stats[nr_cpus];
- uffd_test_ctx_init(0); + uffd_test_ctx_init(UFFD_FEATURE_PAGEFAULT_FLAG_WP);
if (posix_memalign(&area, page_size, page_size)) err("out of memory"); @@ -1464,7 +1461,7 @@ static int userfaultfd_stress(void) uffdio_register.range.start = (unsigned long) area_dst; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; - if (test_uffdio_wp) + if (uffd_wp_supported()) uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) err("register failure"); @@ -1513,7 +1510,7 @@ static int userfaultfd_stress(void) return 1;
/* Clear all the write protections if there is any */ - if (test_uffdio_wp) + if (uffd_wp_supported()) wp_range(uffd, (unsigned long)area_dst, nr_pages * page_size, false);
@@ -1595,8 +1592,6 @@ static void set_test_type(const char *type) if (!strcmp(type, "anon")) { test_type = TEST_ANON; uffd_test_ops = &anon_uffd_test_ops; - /* Only enable write-protect test for anonymous test */ - test_uffdio_wp = true; } else if (!strcmp(type, "hugetlb")) { test_type = TEST_HUGETLB; uffd_test_ops = &hugetlb_uffd_test_ops; @@ -1604,13 +1599,10 @@ static void set_test_type(const char *type) map_shared = true; test_type = TEST_HUGETLB; uffd_test_ops = &hugetlb_uffd_test_ops; - /* Minor faults require shared hugetlb; only enable here. */ - test_uffdio_minor = true; } else if (!strcmp(type, "shmem")) { map_shared = true; test_type = TEST_SHMEM; uffd_test_ops = &shmem_uffd_test_ops; - test_uffdio_minor = true; } else { err("Unknown test type: %s", type); }
Currently, the list of expected ioctls is stored in uffd_test_ops. This design comes with the implicit assumption that the set of expected ioctls is a function of the memory type (anon, shmem, or hugetlb).
However, this is not the case. For example:
- UFFDIO_WRITEPROTECT is supported for anon in general, *but not on aarch64* (which lacks CONFIG_HAVE_ARCH_USERFAULTFD_WP). - UFFDIO_WRITEPROTECT is supported for anon, but only if one registers the region with UFFD_REGISTER_MODE_WP. - We want to test (and expect) UFFDIO_CONTINUE, but only for *shared* hugetlb, not private.
So, instead, remove the set of expected ioctls from this structure. Define a new function which can compute the correct set by examining the memory type, the userfaultfd registration mode(s), whether map_shared is set, and what set of features the kernel reports as supported (from UFFDIO_API).
In addition to being more correct, this improves code re-use - less boilerplate in each test trying to perform this assertion.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- tools/testing/selftests/vm/userfaultfd.c | 103 ++++++++++++----------- 1 file changed, 56 insertions(+), 47 deletions(-)
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 2366caf90435..aad5211f5012 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -304,37 +304,24 @@ static void shmem_alias_mapping(__u64 *start, size_t len, unsigned long offset) }
struct uffd_test_ops { - unsigned long expected_ioctls; void (*allocate_area)(void **alloc_area); void (*release_pages)(char *rel_area); void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset); };
-#define SHMEM_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \ - (1 << _UFFDIO_COPY) | \ - (1 << _UFFDIO_ZEROPAGE)) - -#define ANON_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \ - (1 << _UFFDIO_COPY) | \ - (1 << _UFFDIO_ZEROPAGE) | \ - (1 << _UFFDIO_WRITEPROTECT)) - static struct uffd_test_ops anon_uffd_test_ops = { - .expected_ioctls = ANON_EXPECTED_IOCTLS, .allocate_area = anon_allocate_area, .release_pages = anon_release_pages, .alias_mapping = noop_alias_mapping, };
static struct uffd_test_ops shmem_uffd_test_ops = { - .expected_ioctls = SHMEM_EXPECTED_IOCTLS, .allocate_area = shmem_allocate_area, .release_pages = shmem_release_pages, .alias_mapping = shmem_alias_mapping, };
static struct uffd_test_ops hugetlb_uffd_test_ops = { - .expected_ioctls = UFFD_API_RANGE_IOCTLS_BASIC & ~(1 << _UFFDIO_CONTINUE), .allocate_area = hugetlb_allocate_area, .release_pages = hugetlb_release_pages, .alias_mapping = hugetlb_alias_mapping, @@ -368,6 +355,47 @@ static inline bool uffd_wp_supported(void) (uffd_features & UFFD_FEATURE_PAGEFAULT_FLAG_WP); }
+static inline uint64_t uffd_minor_feature(void) +{ + if (test_type == TEST_HUGETLB && map_shared) + return UFFD_FEATURE_MINOR_HUGETLBFS; + else if (test_type == TEST_SHMEM) + return UFFD_FEATURE_MINOR_SHMEM; + else + return 0; +} + +static inline bool uffd_minor_supported(void) +{ + return uffd_features & uffd_minor_feature(); +} + +static unsigned long get_expected_ioctls(uint64_t mode) +{ + unsigned long ioctls = UFFD_API_RANGE_IOCTLS; + + if (test_type == TEST_HUGETLB) + ioctls &= ~(1 << _UFFDIO_ZEROPAGE); + + if (!((mode & UFFDIO_REGISTER_MODE_WP) && uffd_wp_supported())) + ioctls &= ~(1 << _UFFDIO_WRITEPROTECT); + + if (!((mode & UFFDIO_REGISTER_MODE_MINOR) && uffd_minor_supported())) + ioctls &= ~(1 << _UFFDIO_CONTINUE); + + return ioctls; +} + +static inline void assert_expected_ioctls_present(uint64_t mode, uint64_t ioctls) +{ + uint64_t expected = get_expected_ioctls(mode); + uint64_t actual = ioctls & expected; + + if (actual != expected) + err("missing ioctl(s); expected: %"PRIx64" actual: %"PRIx64, + expected, actual); +} + static inline void munmap_area(void **area) { if (*area) @@ -1012,11 +1040,9 @@ static int __uffdio_zeropage(int ufd, unsigned long offset, bool retry) { struct uffdio_zeropage uffdio_zeropage; int ret; - unsigned long has_zeropage; + bool has_zeropage = get_expected_ioctls(0) & (1 << _UFFDIO_ZEROPAGE); __s64 res;
- has_zeropage = uffd_test_ops->expected_ioctls & (1 << _UFFDIO_ZEROPAGE); - if (offset >= nr_pages * page_size) err("unexpected offset %lu", offset); uffdio_zeropage.range.start = (unsigned long) area_dst + offset; @@ -1056,7 +1082,6 @@ static int uffdio_zeropage(int ufd, unsigned long offset) static int userfaultfd_zeropage_test(void) { struct uffdio_register uffdio_register; - unsigned long expected_ioctls;
printf("testing UFFDIO_ZEROPAGE: "); fflush(stdout); @@ -1071,9 +1096,8 @@ static int userfaultfd_zeropage_test(void) if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) err("register failure");
- expected_ioctls = uffd_test_ops->expected_ioctls; - if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) - err("unexpected missing ioctl for anon memory"); + assert_expected_ioctls_present( + uffdio_register.mode, uffdio_register.ioctls);
if (uffdio_zeropage(uffd, 0)) if (my_bcmp(area_dst, zeropage, page_size)) @@ -1086,7 +1110,6 @@ static int userfaultfd_zeropage_test(void) static int userfaultfd_events_test(void) { struct uffdio_register uffdio_register; - unsigned long expected_ioctls; pthread_t uffd_mon; int err; pid_t pid; @@ -1109,9 +1132,8 @@ static int userfaultfd_events_test(void) if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) err("register failure");
- expected_ioctls = uffd_test_ops->expected_ioctls; - if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) - err("unexpected missing ioctl for anon memory"); + assert_expected_ioctls_present( + uffdio_register.mode, uffdio_register.ioctls);
if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) err("uffd_poll_thread create"); @@ -1139,7 +1161,6 @@ static int userfaultfd_events_test(void) static int userfaultfd_sig_test(void) { struct uffdio_register uffdio_register; - unsigned long expected_ioctls; unsigned long userfaults; pthread_t uffd_mon; int err; @@ -1163,9 +1184,8 @@ static int userfaultfd_sig_test(void) if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) err("register failure");
- expected_ioctls = uffd_test_ops->expected_ioctls; - if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) - err("unexpected missing ioctl for anon memory"); + assert_expected_ioctls_present( + uffdio_register.mode, uffdio_register.ioctls);
if (faulting_process(1)) err("faulting process failed"); @@ -1200,30 +1220,25 @@ static int userfaultfd_sig_test(void) static int userfaultfd_minor_test(void) { struct uffdio_register uffdio_register; - unsigned long expected_ioctls; unsigned long p; pthread_t uffd_mon; uint8_t expected_byte; void *expected_page; char c; struct uffd_stats stats = { 0 }; - uint64_t features; + uint64_t features = uffd_minor_feature();
printf("testing minor faults: "); fflush(stdout);
- if (test_type == TEST_HUGETLB && map_shared) - features = UFFD_FEATURE_MINOR_HUGETLBFS; - else if (test_type == TEST_SHMEM) - features = UFFD_FEATURE_MINOR_SHMEM; - else { + if (!features) { printf("skipping test due to unsupported memory type\n"); + fflush(stdout); return 0; }
uffd_test_ctx_init(features); - /* If kernel reports required features aren't supported, skip test. */ - if ((uffd_features & features) != features) { + if (!uffd_minor_supported()) { printf("skipping test due to lack of feature support\n"); fflush(stdout); return 0; @@ -1235,10 +1250,8 @@ static int userfaultfd_minor_test(void) if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) err("register failure");
- expected_ioctls = uffd_test_ops->expected_ioctls; - expected_ioctls |= 1 << _UFFDIO_CONTINUE; - if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) - err("unexpected missing ioctl(s)"); + assert_expected_ioctls_present( + uffdio_register.mode, uffdio_register.ioctls);
/* * After registering with UFFD, populate the non-UFFD-registered side of @@ -1436,8 +1449,6 @@ static int userfaultfd_stress(void) pthread_attr_setstacksize(&attr, 16*1024*1024);
while (bounces--) { - unsigned long expected_ioctls; - printf("bounces: %d, mode:", bounces); if (bounces & BOUNCE_RANDOM) printf(" rnd"); @@ -1465,10 +1476,8 @@ static int userfaultfd_stress(void) uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) err("register failure"); - expected_ioctls = uffd_test_ops->expected_ioctls; - if ((uffdio_register.ioctls & expected_ioctls) != - expected_ioctls) - err("unexpected missing ioctl for anon memory"); + assert_expected_ioctls_present( + uffdio_register.mode, uffdio_register.ioctls);
if (area_dst_alias) { uffdio_register.range.start = (unsigned long)
Two arguments for doing this:
First, and maybe most importantly, the resulting code is significantly shorter / simpler.
Then, we avoid using GNU libc extensions. Why does this matter? It makes testing userfaultfd with the selftest easier e.g. on distros which use something other than glibc (e.g., Alpine, which uses musl); basically, it makes the test more portable.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- tools/testing/selftests/vm/userfaultfd.c | 26 ++++-------------------- 1 file changed, 4 insertions(+), 22 deletions(-)
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index aad5211f5012..9d9f60e71524 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -57,6 +57,7 @@ #include <assert.h> #include <inttypes.h> #include <stdint.h> +#include <sys/random.h>
#include "../kselftest.h"
@@ -528,22 +529,10 @@ static void continue_range(int ufd, __u64 start, __u64 len) static void *locking_thread(void *arg) { unsigned long cpu = (unsigned long) arg; - struct random_data rand; unsigned long page_nr = *(&(page_nr)); /* uninitialized warning */ - int32_t rand_nr; unsigned long long count; - char randstate[64]; - unsigned int seed;
- if (bounces & BOUNCE_RANDOM) { - seed = (unsigned int) time(NULL) - bounces; - if (!(bounces & BOUNCE_RACINGFAULTS)) - seed += cpu; - bzero(&rand, sizeof(rand)); - bzero(&randstate, sizeof(randstate)); - if (initstate_r(seed, randstate, sizeof(randstate), &rand)) - err("initstate_r failed"); - } else { + if (!(bounces & BOUNCE_RANDOM)) { page_nr = -bounces; if (!(bounces & BOUNCE_RACINGFAULTS)) page_nr += cpu * nr_pages_per_cpu; @@ -551,15 +540,8 @@ static void *locking_thread(void *arg)
while (!finished) { if (bounces & BOUNCE_RANDOM) { - if (random_r(&rand, &rand_nr)) - err("random_r failed"); - page_nr = rand_nr; - if (sizeof(page_nr) > sizeof(rand_nr)) { - if (random_r(&rand, &rand_nr)) - err("random_r failed"); - page_nr |= (((unsigned long) rand_nr) << 16) << - 16; - } + if (getrandom(&page_nr, sizeof(page_nr), 0) != sizeof(page_nr)) + err("getrandom failed"); } else page_nr += 1; page_nr %= nr_pages;
On Tue, Sep 21, 2021 at 09:33:23AM -0700, Axel Rasmussen wrote:
Two arguments for doing this:
First, and maybe most importantly, the resulting code is significantly shorter / simpler.
Then, we avoid using GNU libc extensions. Why does this matter? It makes testing userfaultfd with the selftest easier e.g. on distros which use something other than glibc (e.g., Alpine, which uses musl); basically, it makes the test more portable.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com
Reviewed-by: Peter Xu peterx@redhat.com
Hi, Axel,
On Tue, Sep 21, 2021 at 09:33:21AM -0700, Axel Rasmussen wrote:
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 10ab56c2484a..2366caf90435 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -79,10 +79,6 @@ static int test_type; #define ALARM_INTERVAL_SECS 10 static volatile bool test_uffdio_copy_eexist = true; static volatile bool test_uffdio_zeropage_eexist = true; -/* Whether to test uffd write-protection */ -static bool test_uffdio_wp = false; -/* Whether to test uffd minor faults */ -static bool test_uffdio_minor = false;
IMHO it's not a fault to have these variables; they're still the fastest way to do branching. It's just that in some cases we should set them to "false" rather than "true", am I right?
How about we just set them properly in set_test_type? Say, we can fetch the feature bits in set_test_type rather than assuming it's only related to the type of memory.
Thanks,
On Tue, Sep 21, 2021 at 10:44 AM Peter Xu peterx@redhat.com wrote:
Hi, Axel,
On Tue, Sep 21, 2021 at 09:33:21AM -0700, Axel Rasmussen wrote:
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 10ab56c2484a..2366caf90435 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -79,10 +79,6 @@ static int test_type; #define ALARM_INTERVAL_SECS 10 static volatile bool test_uffdio_copy_eexist = true; static volatile bool test_uffdio_zeropage_eexist = true; -/* Whether to test uffd write-protection */ -static bool test_uffdio_wp = false; -/* Whether to test uffd minor faults */ -static bool test_uffdio_minor = false;
IMHO it's not a fault to have these variables; they're still the fastest way to do branching. It's just that in some cases we should set them to "false" rather than "true", am I right?
How about we just set them properly in set_test_type? Say, we can fetch the feature bits in set_test_type rather than assuming it's only related to the type of memory.
We could do that, but it would require opening a userfaultfd, issuing a UFFDIO_API ioctl, and getting the feature bits in set_test_type. And then I guess just closing the UFFD again, as we aren't yet setting up for any particular test. To me, it seemed "messier" than this approach.
Another thing to consider is, for the next patch we don't just want to know "does this kernel support $FEATURE in general?" but also "is $FEATURE supported for this particular memory region I've registered?", and we can't have a single global answer to that. It seemed a bit cleaner to me to write the code as if I was dealing with that case, and then re-use the infrastructure I'd built for patch 2/3.
Basically, I didn't initially have a goal of getting rid of these variables, but it ended up being the cleanest way (IMHO).
Just trying to explain the thinking. :) In the end, I think it's a stylistic choice and don't feel super strongly about it, either way could work. So, I can change it if you or others do feel strongly.
Thanks,
-- Peter Xu
On Tue, Sep 21, 2021 at 11:26:14AM -0700, Axel Rasmussen wrote:
On Tue, Sep 21, 2021 at 10:44 AM Peter Xu peterx@redhat.com wrote:
Hi, Axel,
On Tue, Sep 21, 2021 at 09:33:21AM -0700, Axel Rasmussen wrote:
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 10ab56c2484a..2366caf90435 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -79,10 +79,6 @@ static int test_type; #define ALARM_INTERVAL_SECS 10 static volatile bool test_uffdio_copy_eexist = true; static volatile bool test_uffdio_zeropage_eexist = true; -/* Whether to test uffd write-protection */ -static bool test_uffdio_wp = false; -/* Whether to test uffd minor faults */ -static bool test_uffdio_minor = false;
IMHO it's not a fault to have these variables; they're still the fastest way to do branching. It's just that in some cases we should set them to "false" rather than "true", am I right?
How about we just set them properly in set_test_type? Say, we can fetch the feature bits in set_test_type rather than assuming it's only related to the type of memory.
We could do that, but it would require opening a userfaultfd, issuing a UFFDIO_API ioctl, and getting the feature bits in set_test_type. And then I guess just closing the UFFD again, as we aren't yet setting up for any particular test. To me, it seemed "messier" than this approach.
Another thing to consider is, for the next patch we don't just want to know "does this kernel support $FEATURE in general?" but also "is $FEATURE supported for this particular memory region I've registered?", and we can't have a single global answer to that.
Could I ask why? For each run, the memory type doesn't change, isn't it? Then I think the capability it should support is a constant?
Btw, note that "open an uffd, detect features, close uffd quickly" during setup phase is totally fine to me just for probing the capabilities, and instead of thinking it being messy I see it a very clean approach..
It seemed a bit cleaner to me to write the code as if I was dealing with that case, and then re-use the infrastructure I'd built for patch 2/3.
I didn't comment on patch 2, but I had the same confusion - aren't all these information constant after we settle the hardware, the kernel and the memory type to test?
Basically, I didn't initially have a goal of getting rid of these variables, but it ended up being the cleanest way (IMHO).
Just trying to explain the thinking. :) In the end, I think it's a stylistic choice and don't feel super strongly about it, either way could work. So, I can change it if you or others do feel strongly.
I have no strong opinion as long as the code works (which I trust you on :). We can keep it in Andrew's queue unless you do feel the other way is better.
Thanks,
On Tue, Sep 21, 2021 at 12:21 PM Peter Xu peterx@redhat.com wrote:
On Tue, Sep 21, 2021 at 11:26:14AM -0700, Axel Rasmussen wrote:
On Tue, Sep 21, 2021 at 10:44 AM Peter Xu peterx@redhat.com wrote:
Hi, Axel,
On Tue, Sep 21, 2021 at 09:33:21AM -0700, Axel Rasmussen wrote:
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 10ab56c2484a..2366caf90435 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -79,10 +79,6 @@ static int test_type; #define ALARM_INTERVAL_SECS 10 static volatile bool test_uffdio_copy_eexist = true; static volatile bool test_uffdio_zeropage_eexist = true; -/* Whether to test uffd write-protection */ -static bool test_uffdio_wp = false; -/* Whether to test uffd minor faults */ -static bool test_uffdio_minor = false;
IMHO it's not a fault to have these variables; they're still the fastest way to do branching. It's just that in some cases we should set them to "false" rather than "true", am I right?
How about we just set them properly in set_test_type? Say, we can fetch the feature bits in set_test_type rather than assuming it's only related to the type of memory.
We could do that, but it would require opening a userfaultfd, issuing a UFFDIO_API ioctl, and getting the feature bits in set_test_type. And then I guess just closing the UFFD again, as we aren't yet setting up for any particular test. To me, it seemed "messier" than this approach.
Another thing to consider is, for the next patch we don't just want to know "does this kernel support $FEATURE in general?" but also "is $FEATURE supported for this particular memory region I've registered?", and we can't have a single global answer to that.
Could I ask why? For each run, the memory type doesn't change, isn't it? Then I think the capability it should support is a constant?
Ah, it has to do with us asserting the list of expected ioctls. The kernel changes the list of ioctls it reports in response to a UFFDIO_REGISTER, depending on the particular kind of vma being registered, **as well as what mode(s) it is being registered with**.
So for example, consider the hugetlb_shared test. When registering, the kernel might set the UFFDIO_CONTINUE bit or not, depending on whether we registered with the MINOR mode bit set in particular. So it will be present in one test case, but not in another, and so the set of expected ioctls has to be computed at test time, rather than in set_test_type.
Btw, note that "open an uffd, detect features, close uffd quickly" during setup phase is totally fine to me just for probing the capabilities, and instead of thinking it being messy I see it a very clean approach..
It seemed a bit cleaner to me to write the code as if I was dealing with that case, and then re-use the infrastructure I'd built for patch 2/3.
I didn't comment on patch 2, but I had the same confusion - aren't all these information constant after we settle the hardware, the kernel and the memory type to test?
Basically, I didn't initially have a goal of getting rid of these variables, but it ended up being the cleanest way (IMHO).
Just trying to explain the thinking. :) In the end, I think it's a stylistic choice and don't feel super strongly about it, either way could work. So, I can change it if you or others do feel strongly.
I have no strong opinion as long as the code works (which I trust you on :). We can keep it in Andrew's queue unless you do feel the other way is better.
Thanks,
-- Peter Xu
On Tue, Sep 21, 2021 at 01:31:12PM -0700, Axel Rasmussen wrote:
Ah, it has to do with us asserting the list of expected ioctls. The kernel changes the list of ioctls it reports in response to a UFFDIO_REGISTER, depending on the particular kind of vma being registered, **as well as what mode(s) it is being registered with**.
So for example, consider the hugetlb_shared test. When registering, the kernel might set the UFFDIO_CONTINUE bit or not, depending on whether we registered with the MINOR mode bit set in particular.
I can understand your point, but the "capability set" of the kernel is still the same. In this case we should have UFFDIO_CONTINUE capability for hugetlb_shared test globally, as long as the kernel supports it, irrelevant of what test case we're going to have.
Then in the test, if we don't register with MINOR mode, IMHO we should just mask out the expected_ioctls with UFFDIO_CONTINUE because it does not make sense to request UFFDIO_CONTINUE if we will never use it in the test.
In other words, having a "uffd_features" global variable and having it changing all the time during tests is odd to me, but I agree it's not a big deal. :)
Thanks for discussing the design Peter. I have some ideas which might make for a nicer v2; I'll massage the code a bit and see what I can come up with.
On Tue, Sep 21, 2021 at 5:29 PM Peter Xu peterx@redhat.com wrote:
On Tue, Sep 21, 2021 at 01:31:12PM -0700, Axel Rasmussen wrote:
Ah, it has to do with us asserting the list of expected ioctls. The kernel changes the list of ioctls it reports in response to a UFFDIO_REGISTER, depending on the particular kind of vma being registered, **as well as what mode(s) it is being registered with**.
So for example, consider the hugetlb_shared test. When registering, the kernel might set the UFFDIO_CONTINUE bit or not, depending on whether we registered with the MINOR mode bit set in particular.
I can understand your point, but the "capability set" of the kernel is still the same. In this case we should have UFFDIO_CONTINUE capability for hugetlb_shared test globally, as long as the kernel supports it, irrelevant of what test case we're going to have.
Then in the test, if we don't register with MINOR mode, IMHO we should just mask out the expected_ioctls with UFFDIO_CONTINUE because it does not make sense to request UFFDIO_CONTINUE if we will never use it in the test.
Right, this is how it was before. I didn't love how the base set included everything, and then each test is responsible for removing the things it isn't testing. It seems reversed: why not just have each test compute the set of things it *is* testing?
In other words, having a "uffd_features" global variable and having it changing all the time during tests is odd to me, but I agree it's not a big deal. :)
100% agree with this. From my perspective this is tech debt since:
8ba6e86408 userfaultfd/selftests: reinitialize test context in each test
It used to be that we just had one global context (variables like uffd, count_verify, area_src, area_dst, etc). But this had the problem where some previous test can mutate the context, breaking or affecting following tests. After 8ba6e86408, we clear and reinitialize all these variables for each test, but they're still global. I think it would be cleaner if these instead were in a struct, which each test initialized and then destroyed within its own scope. If we were to do such a refactor, I would put uffd_features in that struct too - it should be private to a test, since it's a property we get from the uffd.
But, I wasn't sure it was worth the churn to do something like this.
-- Peter Xu
Hello, Axel,
On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
Thanks for discussing the design Peter. I have some ideas which might make for a nicer v2; I'll massage the code a bit and see what I can come up with.
Sure thing. Note again that as I don't have a strong opinion on that, feel free to keep it. However if you provide v2, I'll read.
[off-topic below]
Another thing I probably have forgot but need your confirmation is, when you worked on uffd minor mode, did you explicitly disable thp, or is it allowed?
When I'm reworking the uffd-wp series, I noticed that commit e1e267c7928f ("khugepaged: skip collapse if uffd-wp detected", 2020-04-07) was actually awkward and not efficient, as we can simply lookup the vma flags for detecting uffd-wp enablement. I'm preparing a patch for it to do it by checking vmas (and that patch will also pave the way for file-backed).
Then I noticed we need similar thing for minor mode?
I think the answer is yes, but I didn't see any code that explicitly handled thp for minor mode, do you remember?
To be explicit, what if in mcontinue_atomic_pte() we get a shmem_getpage() call with a thp returned? Will minor mode break?
I plan to post the khugepaged patch soon and I plan to cover minor mode too there, but I'm not sure whether that's enough, as the thp can be there from the 1st day I think, but I could have missed something.
On Wed, Sep 22, 2021 at 10:33 AM Peter Xu peterx@redhat.com wrote:
Hello, Axel,
On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
Thanks for discussing the design Peter. I have some ideas which might make for a nicer v2; I'll massage the code a bit and see what I can come up with.
Sure thing. Note again that as I don't have a strong opinion on that, feel free to keep it. However if you provide v2, I'll read.
[off-topic below]
Another thing I probably have forgot but need your confirmation is, when you worked on uffd minor mode, did you explicitly disable thp, or is it allowed?
I gave a more detailed answer in the other thread, but: currently it is allowed, but this was a bug / oversight on my part. :) THP collapse can break the guarantees minor fault registration is trying to provide.
I think your approach of checking the VMA flags *in retract_page_tables specifically* is correct, and a similar thing should be done for minor registered VMAs too.
When I'm reworking the uffd-wp series, I noticed that commit e1e267c7928f ("khugepaged: skip collapse if uffd-wp detected", 2020-04-07) was actually awkward and not efficient, as we can simply lookup the vma flags for detecting uffd-wp enablement. I'm preparing a patch for it to do it by checking vmas (and that patch will also pave the way for file-backed).
Then I noticed we need similar thing for minor mode?
I think the answer is yes, but I didn't see any code that explicitly handled thp for minor mode, do you remember?
To be explicit, what if in mcontinue_atomic_pte() we get a shmem_getpage() call with a thp returned? Will minor mode break?
Ah so this I am not quite as sure about.
The issue I was describing in the other thread was more about THP collapse racing with UFFDIO_CONTINUE. E.g., collapsing after registration has happened, but before faults have been resolved.
But there's another scenario: what if the collapse happened well before registration happened? I *think* the existing code deals with THPs correctly in that case, but then again I don't think our selftest really covers this case, and it's not something I've tested in production either (to work around the other bug, we currently MADV_NOHUGEPAGE the area until after VM demand paging completes, and the UFFD registration is removed), so I am not super confident this is the case.
I plan to post the khugepaged patch soon and I plan to cover minor mode too there, but I'm not sure whether that's enough, as the thp can be there from the 1st day I think, but I could have missed something.
-- Peter Xu
On Wed, Sep 22, 2021 at 01:54:53PM -0700, Axel Rasmussen wrote:
On Wed, Sep 22, 2021 at 10:33 AM Peter Xu peterx@redhat.com wrote:
Hello, Axel,
On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
Thanks for discussing the design Peter. I have some ideas which might make for a nicer v2; I'll massage the code a bit and see what I can come up with.
Sure thing. Note again that as I don't have a strong opinion on that, feel free to keep it. However if you provide v2, I'll read.
[off-topic below]
Another thing I probably have forgot but need your confirmation is, when you worked on uffd minor mode, did you explicitly disable thp, or is it allowed?
I gave a more detailed answer in the other thread, but: currently it is allowed, but this was a bug / oversight on my part. :) THP collapse can break the guarantees minor fault registration is trying to provide.
I've replied there:
https://lore.kernel.org/linux-mm/YUueOUfoamxOvEyO@t490s/
We can try to keep the discussion unified there regarding this.
But there's another scenario: what if the collapse happened well before registration happened?
Maybe yes, but my understanding of the current uffd-minor scenario tells me that this is fine too. Meanwhile I actually have another idea regarding minor mode, please continue reading.
Firstly, let me try to re-cap on how minor mode is used in your production systems: I believe there should have two processes A and B, if A is the main process, B could be the migration process. B migrates pages in the background, while A so far should have been stopped and never ran. When we want to start A, we should register A with uffd-minor upon the whole range (note: I think so far A does not have any pgtable mapped within uffd-minor range). Then any page access of A should kick B and asking "whether it is the latest page", if yes then UFFDIO_CONTINUE, if no then B modifies the page, plus UFFDIO_CONTINUE afterwards. Am I right above?
So if that's the case, then A should have no page table at all.
Then, is that a problem if the shmem file that A maps contains huge thps? I think no - because UFFDIO_CONTINUE will only install small pages.
Let me know if I'm understanding it right above; I'll be happy to be corrected.
Actually besides this scenario, I'm also thinking of another scenario of using minor fault in a single process - that's mostly what QEMU is doing right now, as QEMU has the vcpu threads and migration thread sharing a single mm/pgtable. So I think it'll be great to have a new madvise(MADV_ZAP) which will tear down all the file-backed memory pgtables of a specific range. I think it'll suite perfectly for the minor fault use case, and it can be used for other things too. Let me know what you think about this idea, and whether that'll help in your case too (e.g., if you worry a current process A mapped huge shmem thp somewhere, we can use madvise(MADV_ZAP) to drop it).
I *think* the existing code deals with THPs correctly in that case, but then again I don't think our selftest really covers this case, and it's not something I've tested in production either (to work around the other bug, we currently MADV_NOHUGEPAGE the area until after VM demand paging completes, and the UFFD registration is removed), so I am not super confident this is the case.
In all cases, enhancing the test program will always be welcomed.
Thanks,
On Wed, Sep 22, 2021 at 2:52 PM Peter Xu peterx@redhat.com wrote:
On Wed, Sep 22, 2021 at 01:54:53PM -0700, Axel Rasmussen wrote:
On Wed, Sep 22, 2021 at 10:33 AM Peter Xu peterx@redhat.com wrote:
Hello, Axel,
On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
Thanks for discussing the design Peter. I have some ideas which might make for a nicer v2; I'll massage the code a bit and see what I can come up with.
Sure thing. Note again that as I don't have a strong opinion on that, feel free to keep it. However if you provide v2, I'll read.
[off-topic below]
Another thing I probably have forgot but need your confirmation is, when you worked on uffd minor mode, did you explicitly disable thp, or is it allowed?
I gave a more detailed answer in the other thread, but: currently it is allowed, but this was a bug / oversight on my part. :) THP collapse can break the guarantees minor fault registration is trying to provide.
I've replied there:
https://lore.kernel.org/linux-mm/YUueOUfoamxOvEyO@t490s/
We can try to keep the discussion unified there regarding this.
But there's another scenario: what if the collapse happened well before registration happened?
Maybe yes, but my understanding of the current uffd-minor scenario tells me that this is fine too. Meanwhile I actually have another idea regarding minor mode, please continue reading.
Firstly, let me try to re-cap on how minor mode is used in your production systems: I believe there should have two processes A and B, if A is the main process, B could be the migration process. B migrates pages in the background, while A so far should have been stopped and never ran. When we want to start A, we should register A with uffd-minor upon the whole range (note: I think so far A does not have any pgtable mapped within uffd-minor range). Then any page access of A should kick B and asking "whether it is the latest page", if yes then UFFDIO_CONTINUE, if no then B modifies the page, plus UFFDIO_CONTINUE afterwards. Am I right above?
So if that's the case, then A should have no page table at all.
Then, is that a problem if the shmem file that A maps contains huge thps? I think no - because UFFDIO_CONTINUE will only install small pages.
Let me know if I'm understanding it right above; I'll be happy to be corrected.
Right, except that our use case is even more similar to QEMU: the code doing UFFDIO_CONTINUE / demand paging, and the code running the vCPUs, are in the same process (same mm) - just different threads.
Actually besides this scenario, I'm also thinking of another scenario of using minor fault in a single process - that's mostly what QEMU is doing right now, as QEMU has the vcpu threads and migration thread sharing a single mm/pgtable. So I think it'll be great to have a new madvise(MADV_ZAP) which will tear down all the file-backed memory pgtables of a specific range. I think it'll suite perfectly for the minor fault use case, and it can be used for other things too. Let me know what you think about this idea, and whether that'll help in your case too (e.g., if you worry a current process A mapped huge shmem thp somewhere, we can use madvise(MADV_ZAP) to drop it).
Yes, this would be convenient for our implementation too. :) There are workarounds if the feature doesn't exist, but it would be nice to have. It's also useful for memory poisoning, I think, if the host decides some page(s) are "bad" and wants to intercept any future guest accesses to those page(s).
I *think* the existing code deals with THPs correctly in that case, but then again I don't think our selftest really covers this case, and it's not something I've tested in production either (to work around the other bug, we currently MADV_NOHUGEPAGE the area until after VM demand paging completes, and the UFFD registration is removed), so I am not super confident this is the case.
In all cases, enhancing the test program will always be welcomed.
Thanks,
-- Peter Xu
On Wed, Sep 22, 2021 at 03:29:42PM -0700, Axel Rasmussen wrote:
On Wed, Sep 22, 2021 at 2:52 PM Peter Xu peterx@redhat.com wrote:
On Wed, Sep 22, 2021 at 01:54:53PM -0700, Axel Rasmussen wrote:
On Wed, Sep 22, 2021 at 10:33 AM Peter Xu peterx@redhat.com wrote:
Hello, Axel,
On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
Thanks for discussing the design Peter. I have some ideas which might make for a nicer v2; I'll massage the code a bit and see what I can come up with.
Sure thing. Note again that as I don't have a strong opinion on that, feel free to keep it. However if you provide v2, I'll read.
[off-topic below]
Another thing I probably have forgot but need your confirmation is, when you worked on uffd minor mode, did you explicitly disable thp, or is it allowed?
I gave a more detailed answer in the other thread, but: currently it is allowed, but this was a bug / oversight on my part. :) THP collapse can break the guarantees minor fault registration is trying to provide.
I've replied there:
https://lore.kernel.org/linux-mm/YUueOUfoamxOvEyO@t490s/
We can try to keep the discussion unified there regarding this.
But there's another scenario: what if the collapse happened well before registration happened?
Maybe yes, but my understanding of the current uffd-minor scenario tells me that this is fine too. Meanwhile I actually have another idea regarding minor mode, please continue reading.
Firstly, let me try to re-cap on how minor mode is used in your production systems: I believe there should have two processes A and B, if A is the main process, B could be the migration process. B migrates pages in the background, while A so far should have been stopped and never ran. When we want to start A, we should register A with uffd-minor upon the whole range (note: I think so far A does not have any pgtable mapped within uffd-minor range). Then any page access of A should kick B and asking "whether it is the latest page", if yes then UFFDIO_CONTINUE, if no then B modifies the page, plus UFFDIO_CONTINUE afterwards. Am I right above?
So if that's the case, then A should have no page table at all.
Then, is that a problem if the shmem file that A maps contains huge thps? I think no - because UFFDIO_CONTINUE will only install small pages.
Let me know if I'm understanding it right above; I'll be happy to be corrected.
Right, except that our use case is even more similar to QEMU: the code doing UFFDIO_CONTINUE / demand paging, and the code running the vCPUs, are in the same process (same mm) - just different threads.
I see.
Actually besides this scenario, I'm also thinking of another scenario of using minor fault in a single process - that's mostly what QEMU is doing right now, as QEMU has the vcpu threads and migration thread sharing a single mm/pgtable. So I think it'll be great to have a new madvise(MADV_ZAP) which will tear down all the file-backed memory pgtables of a specific range. I think it'll suite perfectly for the minor fault use case, and it can be used for other things too. Let me know what you think about this idea, and whether that'll help in your case too (e.g., if you worry a current process A mapped huge shmem thp somewhere, we can use madvise(MADV_ZAP) to drop it).
Yes, this would be convenient for our implementation too. :) There are workarounds if the feature doesn't exist, but it would be nice to have.
Could I know what's the workaround? Normally if the workaround works solidly, then there's less need to introduce a kernel interface for that. Otherwise I'm glad to look into such a formal proposal.
It's also useful for memory poisoning, I think, if the host decides some page(s) are "bad" and wants to intercept any future guest accesses to those page(s).
Curious: isn't hwpoison information come from MCEs; or say, host kernel side? Then I thought the host kernel will have full control of it already.
Or there's other way that the host can try to detect some pages are going to be rotten? So the userspace can do something before the kernel handles those exceptions?
On Wed, Sep 22, 2021 at 4:49 PM Peter Xu peterx@redhat.com wrote:
On Wed, Sep 22, 2021 at 03:29:42PM -0700, Axel Rasmussen wrote:
On Wed, Sep 22, 2021 at 2:52 PM Peter Xu peterx@redhat.com wrote:
On Wed, Sep 22, 2021 at 01:54:53PM -0700, Axel Rasmussen wrote:
On Wed, Sep 22, 2021 at 10:33 AM Peter Xu peterx@redhat.com wrote:
Hello, Axel,
On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
Thanks for discussing the design Peter. I have some ideas which might make for a nicer v2; I'll massage the code a bit and see what I can come up with.
Sure thing. Note again that as I don't have a strong opinion on that, feel free to keep it. However if you provide v2, I'll read.
[off-topic below]
Another thing I probably have forgot but need your confirmation is, when you worked on uffd minor mode, did you explicitly disable thp, or is it allowed?
I gave a more detailed answer in the other thread, but: currently it is allowed, but this was a bug / oversight on my part. :) THP collapse can break the guarantees minor fault registration is trying to provide.
I've replied there:
https://lore.kernel.org/linux-mm/YUueOUfoamxOvEyO@t490s/
We can try to keep the discussion unified there regarding this.
But there's another scenario: what if the collapse happened well before registration happened?
Maybe yes, but my understanding of the current uffd-minor scenario tells me that this is fine too. Meanwhile I actually have another idea regarding minor mode, please continue reading.
Firstly, let me try to re-cap on how minor mode is used in your production systems: I believe there should have two processes A and B, if A is the main process, B could be the migration process. B migrates pages in the background, while A so far should have been stopped and never ran. When we want to start A, we should register A with uffd-minor upon the whole range (note: I think so far A does not have any pgtable mapped within uffd-minor range). Then any page access of A should kick B and asking "whether it is the latest page", if yes then UFFDIO_CONTINUE, if no then B modifies the page, plus UFFDIO_CONTINUE afterwards. Am I right above?
So if that's the case, then A should have no page table at all.
Then, is that a problem if the shmem file that A maps contains huge thps? I think no - because UFFDIO_CONTINUE will only install small pages.
Let me know if I'm understanding it right above; I'll be happy to be corrected.
Right, except that our use case is even more similar to QEMU: the code doing UFFDIO_CONTINUE / demand paging, and the code running the vCPUs, are in the same process (same mm) - just different threads.
I see.
Actually besides this scenario, I'm also thinking of another scenario of using minor fault in a single process - that's mostly what QEMU is doing right now, as QEMU has the vcpu threads and migration thread sharing a single mm/pgtable. So I think it'll be great to have a new madvise(MADV_ZAP) which will tear down all the file-backed memory pgtables of a specific range. I think it'll suite perfectly for the minor fault use case, and it can be used for other things too. Let me know what you think about this idea, and whether that'll help in your case too (e.g., if you worry a current process A mapped huge shmem thp somewhere, we can use madvise(MADV_ZAP) to drop it).
Yes, this would be convenient for our implementation too. :) There are workarounds if the feature doesn't exist, but it would be nice to have.
Could I know what's the workaround? Normally if the workaround works solidly, then there's less need to introduce a kernel interface for that. Otherwise I'm glad to look into such a formal proposal.
The workaround is, for the region that you want to zap, run through this sequence of syscalls: mumap, mmap, and re-register with userfaultfd if it was registered before. If we're using tmpfs, we can use madvise(DONTNEED) instead, but this is kind of an abuse of the API. I don't think there's a guarantee that the PTEs will get zapped, but currently they will always get zapped if we're using tmpfs. I really like the idea of adding a new madvise() mode that is guaranteed to zap the PTEs.
It's also useful for memory poisoning, I think, if the host decides some page(s) are "bad" and wants to intercept any future guest accesses to those page(s).
Curious: isn't hwpoison information come from MCEs; or say, host kernel side? Then I thought the host kernel will have full control of it already.
Or there's other way that the host can try to detect some pages are going to be rotten? So the userspace can do something before the kernel handles those exceptions?
Here's a general idea of how we would like to use userfaultfd to support MPR:
If a guest accesses a poisoned page for the first time, we will get an MCE through the host kernel and send an MCE to the guest. The guest will now no longer be able to access this page, and we have to enforce this. After a live migration, the pages that were poisoned before probably won't still be poisoned (from the host's perspective), so we can't rely on the host kernel's MCE handling path. This is where userfaultfd and this new madvise mode come in: we can just madvise(MADV_ZAP) the poisoned page(s) on the target during a migration. Now all accesses will be routed to the VMM and we can inject an MCE. We don't *need* the new madvise mode, as we can also use fallocate(PUNCH_HOLE) (works for tmpfs and hugetlbfs), but it would be more convenient if we didn't have to use fallocate.
Jue Wang can provide more context here, so I've cc'd him. There may be some things I'm wrong about, so Jue feel free to correct me.
- James
-- Peter Xu
On Wed, Sep 22, 2021 at 9:18 PM James Houghton jthoughton@google.com wrote:
On Wed, Sep 22, 2021 at 4:49 PM Peter Xu peterx@redhat.com wrote:
On Wed, Sep 22, 2021 at 03:29:42PM -0700, Axel Rasmussen wrote:
On Wed, Sep 22, 2021 at 2:52 PM Peter Xu peterx@redhat.com wrote:
On Wed, Sep 22, 2021 at 01:54:53PM -0700, Axel Rasmussen wrote:
On Wed, Sep 22, 2021 at 10:33 AM Peter Xu peterx@redhat.com wrote:
Hello, Axel,
On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote: > Thanks for discussing the design Peter. I have some ideas which might > make for a nicer v2; I'll massage the code a bit and see what I can > come up with.
Sure thing. Note again that as I don't have a strong opinion on that, feel free to keep it. However if you provide v2, I'll read.
[off-topic below]
Another thing I probably have forgot but need your confirmation is, when you worked on uffd minor mode, did you explicitly disable thp, or is it allowed?
I gave a more detailed answer in the other thread, but: currently it is allowed, but this was a bug / oversight on my part. :) THP collapse can break the guarantees minor fault registration is trying to provide.
I've replied there:
https://lore.kernel.org/linux-mm/YUueOUfoamxOvEyO@t490s/
We can try to keep the discussion unified there regarding this.
But there's another scenario: what if the collapse happened well before registration happened?
Maybe yes, but my understanding of the current uffd-minor scenario tells me that this is fine too. Meanwhile I actually have another idea regarding minor mode, please continue reading.
Firstly, let me try to re-cap on how minor mode is used in your production systems: I believe there should have two processes A and B, if A is the main process, B could be the migration process. B migrates pages in the background, while A so far should have been stopped and never ran. When we want to start A, we should register A with uffd-minor upon the whole range (note: I think so far A does not have any pgtable mapped within uffd-minor range). Then any page access of A should kick B and asking "whether it is the latest page", if yes then UFFDIO_CONTINUE, if no then B modifies the page, plus UFFDIO_CONTINUE afterwards. Am I right above?
So if that's the case, then A should have no page table at all.
Then, is that a problem if the shmem file that A maps contains huge thps? I think no - because UFFDIO_CONTINUE will only install small pages.
Let me know if I'm understanding it right above; I'll be happy to be corrected.
Right, except that our use case is even more similar to QEMU: the code doing UFFDIO_CONTINUE / demand paging, and the code running the vCPUs, are in the same process (same mm) - just different threads.
I see.
Actually besides this scenario, I'm also thinking of another scenario of using minor fault in a single process - that's mostly what QEMU is doing right now, as QEMU has the vcpu threads and migration thread sharing a single mm/pgtable. So I think it'll be great to have a new madvise(MADV_ZAP) which will tear down all the file-backed memory pgtables of a specific range. I think it'll suite perfectly for the minor fault use case, and it can be used for other things too. Let me know what you think about this idea, and whether that'll help in your case too (e.g., if you worry a current process A mapped huge shmem thp somewhere, we can use madvise(MADV_ZAP) to drop it).
Yes, this would be convenient for our implementation too. :) There are workarounds if the feature doesn't exist, but it would be nice to have.
Could I know what's the workaround? Normally if the workaround works solidly, then there's less need to introduce a kernel interface for that. Otherwise I'm glad to look into such a formal proposal.
The workaround is, for the region that you want to zap, run through this sequence of syscalls: mumap, mmap, and re-register with userfaultfd if it was registered before. If we're using tmpfs, we can use madvise(DONTNEED) instead, but this is kind of an abuse of the API. I don't think there's a guarantee that the PTEs will get zapped, but currently they will always get zapped if we're using tmpfs. I really like the idea of adding a new madvise() mode that is guaranteed to zap the PTEs.
It's also useful for memory poisoning, I think, if the host decides some page(s) are "bad" and wants to intercept any future guest accesses to those page(s).
Curious: isn't hwpoison information come from MCEs; or say, host kernel side? Then I thought the host kernel will have full control of it already.
Or there's other way that the host can try to detect some pages are going to be rotten? So the userspace can do something before the kernel handles those exceptions?
Here's a general idea of how we would like to use userfaultfd to support MPR:
If a guest accesses a poisoned page for the first time, we will get an MCE through the host kernel and send an MCE to the guest. The guest will now no longer be able to access this page, and we have to enforce this. After a live migration, the pages that were poisoned before probably won't still be poisoned (from the host's perspective), so we can't rely on the host kernel's MCE handling path. This is where userfaultfd and this new madvise mode come in: we can just madvise(MADV_ZAP) the poisoned page(s) on the target during a migration. Now all accesses will be routed to the VMM and we can inject an MCE. We don't *need* the new madvise mode, as we can also use fallocate(PUNCH_HOLE) (works for tmpfs and hugetlbfs), but it would be more convenient if we didn't have to use fallocate.
Jue Wang can provide more context here, so I've cc'd him. There may be some things I'm wrong about, so Jue feel free to correct me.
James is right.
The page is marked PG_HWPoison in the source VM host's kernel. The need of intercepting guest accesses to it exist on the target VM host, where the same physical page is no longer poisoned.
On the target host, the hypervisor needs to intercept all guest accesses to pages poisoned from the source VM host.
- James
-- Peter Xu
On Wed, Sep 22, 2021 at 10:43:40PM -0700, Jue Wang wrote:
[...]
Could I know what's the workaround? Normally if the workaround works solidly, then there's less need to introduce a kernel interface for that. Otherwise I'm glad to look into such a formal proposal.
The workaround is, for the region that you want to zap, run through this sequence of syscalls: mumap, mmap, and re-register with userfaultfd if it was registered before. If we're using tmpfs, we can use madvise(DONTNEED) instead, but this is kind of an abuse of the API. I don't think there's a guarantee that the PTEs will get zapped, but currently they will always get zapped if we're using tmpfs. I really like the idea of adding a new madvise() mode that is guaranteed to zap the PTEs.
I see.
It's also useful for memory poisoning, I think, if the host decides some page(s) are "bad" and wants to intercept any future guest accesses to those page(s).
Curious: isn't hwpoison information come from MCEs; or say, host kernel side? Then I thought the host kernel will have full control of it already.
Or there's other way that the host can try to detect some pages are going to be rotten? So the userspace can do something before the kernel handles those exceptions?
Here's a general idea of how we would like to use userfaultfd to support MPR:
If a guest accesses a poisoned page for the first time, we will get an MCE through the host kernel and send an MCE to the guest. The guest will now no longer be able to access this page, and we have to enforce this. After a live migration, the pages that were poisoned before probably won't still be poisoned (from the host's perspective), so we can't rely on the host kernel's MCE handling path. This is where userfaultfd and this new madvise mode come in: we can just madvise(MADV_ZAP) the poisoned page(s) on the target during a migration. Now all accesses will be routed to the VMM and we can inject an MCE. We don't *need* the new madvise mode, as we can also use fallocate(PUNCH_HOLE) (works for tmpfs and hugetlbfs), but it would be more convenient if we didn't have to use fallocate.
Jue Wang can provide more context here, so I've cc'd him. There may be some things I'm wrong about, so Jue feel free to correct me.
James is right.
The page is marked PG_HWPoison in the source VM host's kernel. The need of intercepting guest accesses to it exist on the target VM host, where the same physical page is no longer poisoned.
On the target host, the hypervisor needs to intercept all guest accesses to pages poisoned from the source VM host.
Thanks for these information, James, Jue, Axel. I'm not familiar with memory failures yet, so please bare with me with a few naive questions.
So now I can undertand that hw-poisonsed pages on src host do not mean these pages will be hw-poisoned on dest host too, but I may have missed the reason on why dest host needs to trap it with pgtable removed.
AFAIU after pages got hw-poisoned on src, and after vmm injects MCEs into the guest, the guest shouldn't be accessing these pages any more, am I right? Then after migration completes, IIUC the guest shouldn't be accessing these pages too. My current understanding is, instead of trapping these pages on dest, we should just (somehow, which I have no real idea...) un-hw-poison these pages after migration because these pages are very possibly normal pages there. When there's real hw-poisoned pages reported on dst host, we should re-inject MCE errors to guest with another set of pages.
Could you tell me where did I miss?
Thanks,
On Fri, Sep 24, 2021 at 1:09 PM Peter Xu peterx@redhat.com wrote:
On Wed, Sep 22, 2021 at 10:43:40PM -0700, Jue Wang wrote:
[...]
Could I know what's the workaround? Normally if the workaround works solidly, then there's less need to introduce a kernel interface for that. Otherwise I'm glad to look into such a formal proposal.
The workaround is, for the region that you want to zap, run through this sequence of syscalls: mumap, mmap, and re-register with userfaultfd if it was registered before. If we're using tmpfs, we can use madvise(DONTNEED) instead, but this is kind of an abuse of the API. I don't think there's a guarantee that the PTEs will get zapped, but currently they will always get zapped if we're using tmpfs. I really like the idea of adding a new madvise() mode that is guaranteed to zap the PTEs.
I see.
It's also useful for memory poisoning, I think, if the host decides some page(s) are "bad" and wants to intercept any future guest accesses to those page(s).
Curious: isn't hwpoison information come from MCEs; or say, host kernel side? Then I thought the host kernel will have full control of it already.
Or there's other way that the host can try to detect some pages are going to be rotten? So the userspace can do something before the kernel handles those exceptions?
Here's a general idea of how we would like to use userfaultfd to support MPR:
If a guest accesses a poisoned page for the first time, we will get an MCE through the host kernel and send an MCE to the guest. The guest will now no longer be able to access this page, and we have to enforce this. After a live migration, the pages that were poisoned before probably won't still be poisoned (from the host's perspective), so we can't rely on the host kernel's MCE handling path. This is where userfaultfd and this new madvise mode come in: we can just madvise(MADV_ZAP) the poisoned page(s) on the target during a migration. Now all accesses will be routed to the VMM and we can inject an MCE. We don't *need* the new madvise mode, as we can also use fallocate(PUNCH_HOLE) (works for tmpfs and hugetlbfs), but it would be more convenient if we didn't have to use fallocate.
Jue Wang can provide more context here, so I've cc'd him. There may be some things I'm wrong about, so Jue feel free to correct me.
James is right.
The page is marked PG_HWPoison in the source VM host's kernel. The need of intercepting guest accesses to it exist on the target VM host, where the same physical page is no longer poisoned.
On the target host, the hypervisor needs to intercept all guest accesses to pages poisoned from the source VM host.
Thanks for these information, James, Jue, Axel. I'm not familiar with memory failures yet, so please bare with me with a few naive questions.
So now I can undertand that hw-poisonsed pages on src host do not mean these pages will be hw-poisoned on dest host too, but I may have missed the reason on why dest host needs to trap it with pgtable removed.
AFAIU after pages got hw-poisoned on src, and after vmm injects MCEs into the guest, the guest shouldn't be accessing these pages any more, am I right? Then
This is also our hope for the guest to behave but there is no guarantee on guest behavior.
The goal here is to have the hypervisor provide consistent behavior aligned to native hardware, i.e., a guest page with "memory errors" stay persistently in that state no matter on source or target.
after migration completes, IIUC the guest shouldn't be accessing these pages too. My current understanding is, instead of trapping these pages on dest, we should just (somehow, which I have no real idea...) un-hw-poison these pages after migration because these pages are very possibly normal pages there. When there's real hw-poisoned pages reported on dst host, we should re-inject MCE errors to guest with another set of pages.
Could you tell me where did I miss?
Thanks,
-- Peter Xu
linux-kselftest-mirror@lists.linaro.org