On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
Hello,
Thank you so much for the review. Do you have any thoughts on the build error on arc architecture? https://lore.kernel.org/all/e3c82373-256a-6297-bcb4-5e1179a2cbe2@collabora.c...
Maybe copy HPAGE_* defines from x86 and key on CONFIG_PGTABLE_LEVELS > 2? I don't know much about arc arch, though.
On 4/6/23 8:52 PM, Michał Mirosław wrote:
On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum usama.anjum@collabora.com wrote:>
[...]
+#define PM_SCAN_BITMAP(wt, file, present, swap) \
(wt | file << 1 | present << 2 | swap << 3)
Please parenthesize macro arguments ("(wt)", "(file)", etc.) to not have to worry about operator precedence when passed a complex expression.
Like this? #define PM_SCAN_BITMAP(wt, file, present, swap) \ ((wt) | (file << 1) | (present << 2) | (swap << 3))
The value would be: ( (wt) | ((file) << 1) | ... ) IOW, each parameter should have parentheses around its name.
[...]
cur->len += n_pages;
p->found_pages += n_pages;
if (p->max_pages && (p->found_pages == p->max_pages))
return PM_SCAN_FOUND_MAX_PAGES;
return 0;
}
if (!p->vec_index || ((p->vec_index + 1) < p->vec_len)) {
It looks that `if (p->vec_index < p->vec_len)` is enough here - if we have vec_len == 0 here, then we'd not fit the entry in the userspace buffer anyway. Am I missing something?
No. I'd explained it with diagram last time: https://lore.kernel.org/all/3c8d9ea0-1382-be0c-8dd2-d490eedd3b55@collabora.c...
I'll add a concise comment here.
So it seems, but I think the code changed a bit and maybe could be simplified now? Since p->vec_len == 0 is currently not valid, the field could count only the entries available in p->vec[] -- IOW: not include p->cur in the count.
BTW, `if (no space) return -ENOSPC` will avoid additional indentation for the non-merging case.
[...]
+static inline int pagemap_scan_deposit(struct pagemap_scan_private *p,
struct page_region __user *vec,
unsigned long *vec_index)
..._deposit() is used only in single place - please inline.
It is already inline.
Sorry. I mean: please paste the code in place of the single call.
[...]
/*
* Break huge page into small pages if the WP operation need to
* be performed is on a portion of the huge page or if max_pages
* pages limit would exceed.
BTW, could the `max_pages` limit be relaxed a bit (in that it would be possible to return more pages if they all merge into the last vector entry) so that it would not need to split otherwise-matching huge page? It would remove the need for this special handling in the kernel and splitting the page by this read-only-appearing ioctl?
No, this cannot be done. Otherwise we'll not be able to emulate Windows syscall GetWriteWatch() which specifies the exact number of pages. Usually in most of cases, either user will not use THP or not perform the operation on partial huge page. So this part is only there to keep things correct for those users who do use THP and partial pagemap_scan operations.
I see that `GetWriteWatch` returns a list of pages not ranges of pages. That makes sense (more or less). (BTW, It could be emulated in userspace by caching the last not-fully-consumed range.)
*/
if (is_written && PM_SCAN_OP_IS_WP(p) &&
((end - start < HPAGE_SIZE) ||
(p->max_pages &&
(p->max_pages - p->found_pages) < n_pages))) {
split_huge_pmd(vma, pmd, start);
goto process_smaller_pages;
}
if (p->max_pages &&
p->found_pages + n_pages > p->max_pages)
n_pages = p->max_pages - p->found_pages;
ret = pagemap_scan_output(is_written, is_file, is_present,
is_swap, p, start, n_pages);
if (ret < 0)
return ret;
So let's simplify this:
if (p->max_pages && n_pages > max_pages - found_pages) n_pages = max_pages - found_pages;
if (is_written && DO_WP && n_pages != HPAGE_SIZE / PAGE_SIZE) { split_thp(); goto process_smaller_pages; }
BTW, THP handling could be extracted to a function that would return -EAGAIN if it has split the page or it wasn't a THP -- and that would mean `goto process_smaller_pages`.
Why not propagate the error from uffd_wp_range()?
uffd_wp_range() returns status in long variable. We cannot return long in this function. So intead of type casting long to int and then return I've used -EINVAL. Would following be more suitable?
long ret2 = uffd_wp_range(vma, start, HPAGE_SIZE, true); if (ret2 < 0) return (int)ret2;
I think it's ok, since negative values are expected to be error codes. And since you can't overflow int with HPAGE_SIZE pages, then I wouldn't use `ret2` but cast the return and add a comment why it's safe.
[...]
start = (unsigned long)untagged_addr(arg.start);
vec = (struct page_region *)(unsigned long)untagged_addr(arg.vec);
Is the inner cast needed?
arg.vec remains 64-bit on 32-bit systems. So casting 64bit value directly to struct page_region pointer errors out. So I've added specific casting to unsigned long first before casting to pointers.
I see. So to convey the intention, the `arg.start` and `arg.vec` should be casted to unsigned long, not the `untagged_addr()` return values.
ret = pagemap_scan_args_valid(&arg, start, vec);
if (ret)
return ret;
end = start + arg.len;
p.max_pages = arg.max_pages;
p.found_pages = 0;
p.flags = arg.flags;
p.required_mask = arg.required_mask;
p.anyof_mask = arg.anyof_mask;
p.excluded_mask = arg.excluded_mask;
p.return_mask = arg.return_mask;
p.cur.len = 0;
p.cur.start = 0;
p.vec = NULL;
p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
Nit: parentheses are not needed here, please remove.
Will do.
/*
* Allocate smaller buffer to get output from inside the page walk
* functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
* we want to return output to user in compact form where no two
* consecutive regions should be continuous and have the same flags.
* So store the latest element in p.cur between different walks and
* store the p.cur at the end of the walk to the user buffer.
*/
p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
GFP_KERNEL);
if (!p.vec)
return -ENOMEM;
walk_start = walk_end = start;
while (walk_end < end && !ret) {
The loop will stop if a previous iteration returned ENOSPC (and the error will be lost) - is it intended?
It is intentional. -ENOSPC means that the user buffer is full even though there was more memory to walk over. We don't treat this error. So when buffer gets full, we stop walking over further as user buffer has gotten full and return as success.
Thanks. What's the difference between -ENOSPC and PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code flow).
[...]
--- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -210,6 +210,14 @@ extern bool userfaultfd_wp_async(struct vm_area_struct *vma);
#else /* CONFIG_USERFAULTFD */
+static inline long uffd_wp_range(struct mm_struct *dst_mm,
struct vm_area_struct *vma,
unsigned long start, unsigned long len,
bool enable_wp)
+{
return 0;
+}
[...]
Shouldn't this part be in the patch introducing uffd_wp_range()?
We have not added uffd_wp_range() in previous patches. We just need this stub for this patch for the case when CONFIG_USERFAULTFD isn't enabled.
I'd this as separate patch before this patch. Mike asked me to merge it with this patch in order not to break bisectability. [1] https://lore.kernel.org/all/ZBK+86eMMazwfhdx@kernel.org
I would understand the reply [1] to mean that the uffd_wp_range() stub should go in the same patch where uffd_wp_range() is implemented. But uffd_wp_range() is already in (since f369b07c86140) so I don't see how having the stub in a separate commit sequenced before this one could break bisect?
Best Regards Michał Mirosław