On Thu, Apr 06, 2023 at 09:53:37AM -0700, Stefan Roesch wrote:
> + case PR_SET_MEMORY_MERGE:
> + if (mmap_write_lock_killable(me->mm))
> + return -EINTR;
> +
> + if (arg2) {
> + int err = ksm_add_mm(me->mm);
> + if (err)
> + return err;
You'll return to userspace with the mutex held, no?
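To illustrate the concern: once mmap_write_lock_killable() has succeeded, every exit path has to drop the lock again, including the error return from ksm_add_mm(). Below is a minimal userspace sketch of that shape. It is not kernel code: toy_lock, fake_add_mm() and set_memory_merge() are hypothetical stand-ins made up for this example.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for the mmap lock: only records whether it is held. */
struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held); /* taking it twice would be the bug discussed above */
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Hypothetical stand-in for ksm_add_mm(): fails on request. */
static int fake_add_mm(int should_fail)
{
	return should_fail ? -ENOMEM : 0;
}

/* Single exit path: the lock is dropped whether or not the call failed. */
static int set_memory_merge(struct toy_lock *l, int should_fail)
{
	int err;

	toy_lock_acquire(l);
	err = fake_add_mm(should_fail);
	toy_lock_release(l);
	return err;
}
```

With an early "return err" between acquire and release, the second call to set_memory_merge() would trip the assertion in toy_lock_acquire().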
On 06.04.23 18:53, Stefan Roesch wrote:
> This adds the general_profit KSM sysfs knob and the process profit metric
> and process merge type knobs to ksm_stat.
>
> 1) expose general_profit metric
>
> The documentation mentions a general profit metric, however this
> metric is not calculated. In addition the formula depends on the size
> of internal structures, which makes it more difficult for an
> administrator to make the calculation. Adding the metric for a better
> user experience.
>
> 2) document general_profit sysfs knob
>
> 3) calculate ksm process profit metric
>
> The ksm documentation mentions the process profit metric and how to
> calculate it. This adds the calculation of the metric.
>
> 4) add ksm_merge_type() function
>
> This adds the ksm_merge_type function. The function returns the
> merge type for the process. For madvise it returns "madvise", for
> prctl it returns "process" and otherwise it returns "none".
I'm curious, why exactly is this change required in this context? It
might be sufficient to observe if the prctl is set for a process. If
not, the ksm stats can reveal whether KSM is still active for that
process -> madvise.
For your use case, I'd assume it's pretty unnecessary to expose that.
If there is no compelling reason, I'd suggest to drop this and limit
this patch to exposing the general/per-mm profit, which I can understand
why it's desirable when fine-tuning a workload.
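For reference, the per-process calculation from the patch can be sketched in plain C as below. This is a userspace sketch only: the page size and the rmap item size are assumed example values (the real ones depend on architecture and kernel configuration), and example_process_profit() is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>

/* Assumed values for illustration only. */
#define EXAMPLE_PAGE_SIZE      4096UL
#define EXAMPLE_RMAP_ITEM_SIZE 64UL

/*
 * Profit = bytes saved by merged pages - bytes spent on rmap items.
 * The result can be negative when the bookkeeping outweighs the savings.
 */
static long example_process_profit(unsigned long merging_pages,
				   unsigned long rmap_items)
{
	unsigned long saved, cost;

	/* Keep both products within the range of long. */
	assert(merging_pages <= LONG_MAX / EXAMPLE_PAGE_SIZE);
	assert(rmap_items <= LONG_MAX / EXAMPLE_RMAP_ITEM_SIZE);

	saved = merging_pages * EXAMPLE_PAGE_SIZE;
	cost = rmap_items * EXAMPLE_RMAP_ITEM_SIZE;

	/* Compare first so the sign never depends on unsigned wrap-around. */
	if (saved >= cost)
		return (long)(saved - cost);
	return -(long)(cost - saved);
}
```

With these assumed sizes, 100 merged pages and 1000 rmap items give a profit of 345600 bytes, while 1 merged page and 100 rmap items give -2304.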
[...]
> Signed-off-by: Stefan Roesch <shr(a)devkernel.io>
> Reviewed-by: Bagas Sanjaya <bagasdotme(a)gmail.com>
> Cc: David Hildenbrand <david(a)redhat.com>
> Cc: Johannes Weiner <hannes(a)cmpxchg.org>
> Cc: Michal Hocko <mhocko(a)suse.com>
> Cc: Rik van Riel <riel(a)surriel.com>
> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
> ---
> Documentation/ABI/testing/sysfs-kernel-mm-ksm | 8 +++++
> Documentation/admin-guide/mm/ksm.rst | 8 ++++-
> fs/proc/base.c | 5 +++
> include/linux/ksm.h | 5 +++
> mm/ksm.c | 32 +++++++++++++++++++
> 5 files changed, 57 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-ksm b/Documentation/ABI/testing/sysfs-kernel-mm-ksm
> index d244674a9480..7768e90f7a8f 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-ksm
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-ksm
> @@ -51,3 +51,11 @@ Description: Control merging pages across different NUMA nodes.
>
> When it is set to 0 only pages from the same node are merged,
> otherwise pages from all nodes can be merged together (default).
> +
> +What: /sys/kernel/mm/ksm/general_profit
> +Date: January 2023
^ No
> +KernelVersion: 6.1
^ Outdated
(kind of weird having to come up with the right numbers before getting
it merged)
[...]
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 07463ad4a70a..c74450318e05 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -96,6 +96,7 @@
> #include <linux/time_namespace.h>
> #include <linux/resctrl.h>
> #include <linux/cn_proc.h>
> +#include <linux/ksm.h>
> #include <trace/events/oom.h>
> #include "internal.h"
> #include "fd.h"
> @@ -3199,6 +3200,7 @@ static int proc_pid_ksm_merging_pages(struct seq_file *m, struct pid_namespace *
>
> return 0;
> }
> +
^ unrelated change
> static int proc_pid_ksm_stat(struct seq_file *m, struct pid_namespace *ns,
> struct pid *pid, struct task_struct *task)
> {
> @@ -3208,6 +3210,9 @@ static int proc_pid_ksm_stat(struct seq_file *m, struct pid_namespace *ns,
> if (mm) {
> seq_printf(m, "ksm_rmap_items %lu\n", mm->ksm_rmap_items);
> seq_printf(m, "zero_pages_sharing %lu\n", mm->ksm_zero_pages_sharing);
> + seq_printf(m, "ksm_merging_pages %lu\n", mm->ksm_merging_pages);
> + seq_printf(m, "ksm_merge_type %s\n", ksm_merge_type(mm));
> + seq_printf(m, "ksm_process_profit %ld\n", ksm_process_profit(mm));
> mmput(mm);
> }
>
> diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> index c65455bf124c..4c32f9bca723 100644
> --- a/include/linux/ksm.h
> +++ b/include/linux/ksm.h
> @@ -60,6 +60,11 @@ struct page *ksm_might_need_to_copy(struct page *page,
> void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc);
> void folio_migrate_ksm(struct folio *newfolio, struct folio *folio);
>
> +#ifdef CONFIG_PROC_FS
> +long ksm_process_profit(struct mm_struct *);
> +const char *ksm_merge_type(struct mm_struct *mm);
> +#endif /* CONFIG_PROC_FS */
> +
> #else /* !CONFIG_KSM */
>
> static inline int ksm_add_mm(struct mm_struct *mm)
> diff --git a/mm/ksm.c b/mm/ksm.c
> index ab95ae0f9def..76b10ff840ac 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -3042,6 +3042,25 @@ static void wait_while_offlining(void)
> }
> #endif /* CONFIG_MEMORY_HOTREMOVE */
>
> +#ifdef CONFIG_PROC_FS
> +long ksm_process_profit(struct mm_struct *mm)
> +{
> + return (long)mm->ksm_merging_pages * PAGE_SIZE -
Do we really need the cast to long? mm->ksm_merging_pages is defined as
"unsigned long". Just like "ksm_pages_sharing" below.
> + mm->ksm_rmap_items * sizeof(struct ksm_rmap_item);
> +}
> +
> +/* Return merge type name as string. */
> +const char *ksm_merge_type(struct mm_struct *mm)
> +{
> + if (test_bit(MMF_VM_MERGE_ANY, &mm->flags))
> + return "process";
> + else if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
> + return "madvise";
> + else
> + return "none";
> +}
> +#endif /* CONFIG_PROC_FS */
> +
Apart from these nits, LGTM (again, I don't see why the merge type
should belong in this patch, and why there is a real need to expose it
like that).
Acked-by: David Hildenbrand <david(a)redhat.com>
--
Thanks,
David / dhildenb
At present the kselftest header can't be used with nolibc since it makes
use of vprintf() which is not available in nolibc. Fortunately nolibc
has a vfprintf() so we can just wrap that in order to allow kselftests
to be built with nolibc and still use the standard kselftest headers
with a small change to prevent the inclusion of the standard libc
headers. This allows us to avoid open coding of KTAP output for
selftests that need to use nolibc in order to test interfaces that are
controlled by libc; we've got several open coded examples of this in the
tree already.
As an example of using this the existing za-fork test is converted to
use kselftest.h. The changes to kselftest and nolibc don't have any
interaction until they are used by a test so could be merged separately
if desired.
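The wrapping described above can be sketched as follows. This is an illustration against a regular libc rather than the actual nolibc change; sketch_vprintf() and sketch_print_msg() are hypothetical names chosen to avoid clashing with the real vprintf().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* vprintf() expressed in terms of vfprintf() on stdout. */
static int sketch_vprintf(const char *fmt, va_list args)
{
	return vfprintf(stdout, fmt, args);
}

/* printf-style helper of the kind a test header might build on top of it. */
static int sketch_print_msg(const char *fmt, ...)
{
	va_list args;
	int ret;

	assert(fmt != NULL);

	va_start(args, fmt);
	ret = sketch_vprintf(fmt, args);
	va_end(args);
	return ret;
}
```

The return value is the number of characters written, as with vfprintf() itself.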
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Turns out nolibc has a vfprintf() already which we can use so do that.
- Link to v1: https://lore.kernel.org/r/20230405-kselftest-nolibc-v1-0-63fbcd70b202@kerne…
---
Mark Brown (3):
tools/nolibc/stdio: Implement vprintf()
kselftest: Support nolibc
kselftest/arm64: Convert za-fork to use kselftest.h
tools/include/nolibc/stdio.h | 6 ++
tools/testing/selftests/arm64/fp/Makefile | 2 +-
tools/testing/selftests/arm64/fp/za-fork.c | 88 ++++++------------------------
tools/testing/selftests/kselftest.h | 2 +
4 files changed, 25 insertions(+), 73 deletions(-)
---
base-commit: e8d018dd0257f744ca50a729e3d042cf2ec9da65
change-id: 20230405-kselftest-nolibc-cb2ce0446d09
Best regards,
--
Mark Brown <broonie(a)kernel.org>
From: Feng Zhou <zhoufeng.zf(a)bytedance.com>
When a tracing program accesses a traced function argument of type u32 *, the
bpf verifier rejects it. Because u32 is a typedef, the modifiers need to be
skipped first. Skip modifiers in is_int_ptr() and add a selftest to check it.
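The idea can be sketched with a toy type graph. This is not the BTF API: the toy_* types and helpers below are hypothetical and only model a pointer whose target is a typedef of an integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_kind { TOY_INT, TOY_PTR, TOY_TYPEDEF, TOY_CONST, TOY_STRUCT };

struct toy_type {
	enum toy_kind kind;
	const struct toy_type *target; /* pointee or aliased type, NULL for leaves */
};

static bool toy_is_modifier(const struct toy_type *t)
{
	assert(t != NULL);
	return t->kind == TOY_TYPEDEF || t->kind == TOY_CONST;
}

/* Follow typedef/const links until a non-modifier type is reached. */
static const struct toy_type *toy_skip_modifiers(const struct toy_type *t)
{
	while (t && toy_is_modifier(t))
		t = t->target;
	return t;
}

/* True for "pointer to integer", looking through modifiers on both levels. */
static bool toy_is_int_ptr(const struct toy_type *t)
{
	t = toy_skip_modifiers(t);
	if (!t || t->kind != TOY_PTR)
		return false;
	t = toy_skip_modifiers(t->target);
	return t && t->kind == TOY_INT;
}

/* u32 * modelled as PTR -> TYPEDEF (u32) -> INT */
static const struct toy_type toy_int = { TOY_INT, NULL };
static const struct toy_type toy_u32 = { TOY_TYPEDEF, &toy_int };
static const struct toy_type toy_u32_ptr = { TOY_PTR, &toy_u32 };
static const struct toy_type toy_struct = { TOY_STRUCT, NULL };
static const struct toy_type toy_struct_ptr = { TOY_PTR, &toy_struct };
```

Without the second toy_skip_modifiers() call the check would stop at the typedef and report that toy_u32_ptr is not an int pointer.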
Feng Zhou (2):
bpf/btf: Fix is_int_ptr()
selftests/bpf: Add test to access u32 ptr argument in tracing program
Changelog:
v2->v3: Addressed comments from jirka
- Fix an issue that caused other test items to fail
Details in here:
https://lore.kernel.org/all/20230407084608.62296-1-zhoufeng.zf@bytedance.co…
v1->v2: Addressed comments from Martin KaFai Lau
- Add a selftest.
- use btf_type_skip_modifiers.
Some details in here:
https://lore.kernel.org/all/20221012125815.76120-1-zhouchengming@bytedance.…
kernel/bpf/btf.c | 8 ++------
net/bpf/test_run.c | 8 +++++++-
.../testing/selftests/bpf/verifier/btf_ctx_access.c | 13 +++++++++++++
3 files changed, 22 insertions(+), 7 deletions(-)
--
2.20.1
This is the basic functionality for iommufd to support
iommufd_device_replace() and IOMMU_HWPT_ALLOC for physical devices.
iommufd_device_replace() allows changing the HWPT associated with the
device to a new IOAS or HWPT. Replace does this in a way that failure leaves
things unchanged, and utilizes the iommu iommu_group_replace_domain() API
to allow the iommu driver to perform an optional non-disruptive change.
IOMMU_HWPT_ALLOC allows HWPTs to be explicitly allocated by the user and
used by attach or replace. At this point it isn't very useful since the
HWPT is the same as the automatically managed HWPT from the IOAS. However
a following series will allow userspace to customize the created HWPT.
The implementation is complicated because we have to introduce some
per-iommu_group memory in iommufd and redo how we think about multi-device
groups to be more explicit. This solves all the locking problems in the
prior attempts.
This series is infrastructure work for the following series which:
- Add replace for attach
- Expose replace through VFIO APIs
- Implement driver parameters for HWPT creation (nesting)
I'll update the linux-next branch with this version and keep it on a side
branch and accumulate the following series when they are ready so we can have
a stable base and make more incremental progress. When we have all the parts
together to get a full implementation it can go to Linus.
This is on github: https://github.com/jgunthorpe/linux/commits/iommufd_hwpt
v5:
- Go back to the v3 version of the code, keep the comment changes from
v4. Syzkaller says the group lock change in v4 didn't work.
- Adjust the fail_nth test to cover the path syzkaller found. We need to
have an ioas with a mapped page installed to inject a failure during
domain attachment.
v4: https://lore.kernel.org/r/0-v4-9cd79ad52ee8+13f5-iommufd_alloc_jgg@nvidia.c…
- Refine comments and commit messages
- Move the group lock into iommufd_hw_pagetable_attach()
- Fix error unwind in iommufd_device_do_replace()
v3: https://lore.kernel.org/r/0-v3-61d41fd9e13e+1f5-iommufd_alloc_jgg@nvidia.com
- Refine comments and commit messages
- Adjust the flow in iommufd_device_auto_get_domain() so pt_id is only
set on success
- Reject replace on non-attached devices
- Add missing __reserved check for IOMMU_HWPT_ALLOC
v2: https://lore.kernel.org/r/0-v2-51b9896e7862+8a8c-iommufd_alloc_jgg@nvidia.c…
- Use WARN_ON for the igroup->group test and move that logic to a
function iommufd_group_try_get()
- Change igroup->devices to igroup->device list
Replace will need to iterate over all attached idevs
- Rename to iommufd_group_setup_msi()
- New patch to export iommu_get_resv_regions()
- New patch to use per-device reserved regions instead of per-group
regions
- Split out the reorganizing of iommufd_device_change_pt() from the
replace patch
- Replace uses the per-dev reserved regions
- Use stdev_id in a few more places in the selftest
- Fix error handling in IOMMU_HWPT_ALLOC
- Clarify comments
- Rebase on v6.3-rc1
v1: https://lore.kernel.org/all/0-v1-7612f88c19f5+2f21-iommufd_alloc_jgg@nvidia…
Jason Gunthorpe (15):
iommufd: Move isolated msi enforcement to iommufd_device_bind()
iommufd: Add iommufd_group
iommufd: Replace the hwpt->devices list with iommufd_group
iommu: Export iommu_get_resv_regions()
iommufd: Keep track of each device's reserved regions instead of
groups
iommufd: Use the iommufd_group to avoid duplicate MSI setup
iommufd: Make sw_msi_start a group global
iommufd: Move putting a hwpt to a helper function
iommufd: Add enforced_cache_coherency to iommufd_hw_pagetable_alloc()
iommufd: Reorganize iommufd_device_attach into
iommufd_device_change_pt
iommufd: Add iommufd_device_replace()
iommufd: Make destroy_rwsem use a lock class per object type
iommufd: Add IOMMU_HWPT_ALLOC
iommufd/selftest: Return the real idev id from selftest mock_domain
iommufd/selftest: Add a selftest for IOMMU_HWPT_ALLOC
Nicolin Chen (2):
iommu: Introduce a new iommu_group_replace_domain() API
iommufd/selftest: Test iommufd_device_replace()
drivers/iommu/iommu-priv.h | 10 +
drivers/iommu/iommu.c | 41 +-
drivers/iommu/iommufd/device.c | 516 +++++++++++++-----
drivers/iommu/iommufd/hw_pagetable.c | 96 +++-
drivers/iommu/iommufd/io_pagetable.c | 27 +-
drivers/iommu/iommufd/iommufd_private.h | 51 +-
drivers/iommu/iommufd/iommufd_test.h | 6 +
drivers/iommu/iommufd/main.c | 17 +-
drivers/iommu/iommufd/selftest.c | 40 ++
include/linux/iommufd.h | 1 +
include/uapi/linux/iommufd.h | 26 +
tools/testing/selftests/iommu/iommufd.c | 67 ++-
.../selftests/iommu/iommufd_fail_nth.c | 67 ++-
tools/testing/selftests/iommu/iommufd_utils.h | 63 ++-
14 files changed, 825 insertions(+), 203 deletions(-)
create mode 100644 drivers/iommu/iommu-priv.h
base-commit: fd8c1a4aee973e87d890a5861e106625a33b2c4e
--
2.40.0
From: Benjamin Berg <benjamin.berg(a)intel.com>
The existing KUNIT_ARRAY_PARAM macro requires a separate function to
get the description. However, in a lot of cases the description can
just be copied directly from the array. Add a second macro that
avoids having to write a static function just for a single strscpy.
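A userspace sketch of how such a generator behaves is shown below. It is not the KUnit macro itself: the SKETCH_* names, struct sketch_case and sketch_count_params() are hypothetical, and snprintf() stands in for strscpy(), which is not available outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SKETCH_DESC_SIZE 128
#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Generator that copies the description straight from an array member. */
#define SKETCH_ARRAY_PARAM_DESC(name, array, desc_member) \
	static const void *name##_gen_params(const void *prev, char *desc) \
	{ \
		const __typeof__((array)[0]) *next = \
			prev ? ((const __typeof__((array)[0]) *)prev) + 1 : (array); \
		if ((size_t)(next - (array)) < SKETCH_ARRAY_SIZE(array)) { \
			snprintf(desc, SKETCH_DESC_SIZE, "%s", next->desc_member); \
			return next; \
		} \
		return NULL; \
	}

struct sketch_case {
	int input;
	int expected;
	const char *name;
};

static const struct sketch_case sketch_cases[] = {
	{ 1, 2, "one" },
	{ 2, 4, "two" },
};

SKETCH_ARRAY_PARAM_DESC(sketch, sketch_cases, name)

/* Walk the generator the way a parameterized test runner would. */
static size_t sketch_count_params(void)
{
	char desc[SKETCH_DESC_SIZE];
	const void *p = NULL;
	size_t n = 0;

	while ((p = sketch_gen_params(p, desc)) != NULL) {
		printf("param %zu: %s\n", n, desc);
		n++;
		assert(n <= SKETCH_ARRAY_SIZE(sketch_cases));
	}
	return n;
}
```

The only per-test input is the name of the member holding the description, which is the point of the new macro.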
Signed-off-by: Benjamin Berg <benjamin.berg(a)intel.com>
---
include/kunit/test.h | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/include/kunit/test.h b/include/kunit/test.h
index 08d3559dd703..519b90261c72 100644
--- a/include/kunit/test.h
+++ b/include/kunit/test.h
@@ -1414,6 +1414,25 @@ do { \
return NULL; \
}
+/**
+ * KUNIT_ARRAY_PARAM_DESC() - Define test parameter generator from an array.
+ * @name: prefix for the test parameter generator function.
+ * @array: array of test parameters.
+ * @desc_member: structure member from array element to use as description
+ *
+ * Define function @name_gen_params which uses @array to generate parameters.
+ */
+#define KUNIT_ARRAY_PARAM_DESC(name, array, desc_member) \
+ static const void *name##_gen_params(const void *prev, char *desc) \
+ { \
+ typeof((array)[0]) *__next = prev ? ((typeof(__next)) prev) + 1 : (array); \
+ if (__next - (array) < ARRAY_SIZE((array))) { \
+ strscpy(desc, __next->desc_member, KUNIT_PARAM_DESC_SIZE); \
+ return __next; \
+ } \
+ return NULL; \
+ }
+
// TODO(dlatypov(a)google.com): consider eventually migrating users to explicitly
// include resource.h themselves if they need it.
#include <kunit/resource.h>
--
2.39.2
No need to call maybe_mkwrite() to then wrprotect if the source PMD was not
writable.
It's worth noting that this now allows for PTEs to be writable even if
the source PMD was not writable, namely when vma->vm_page_prot includes
write permissions.
As documented in commit 931298e103c2 ("mm/userfaultfd: rely on
vma->vm_page_prot in uffd_wp_range()"), any mechanism that intends to
have pages wrprotected (COW, writenotify, mprotect, uffd-wp, softdirty,
...) has to properly adjust vma->vm_page_prot upfront, to not include
write permissions. If vma->vm_page_prot includes write permissions, the
PTE/PMD can be writable by default.
This now mimics the handling in mm/migrate.c:remove_migration_pte() and in
mm/huge_memory.c:remove_migration_pmd(), which has been in place for a
long time (except that 96a9c287e25d ("mm/migrate: fix wrongly apply write
bit after mkdirty on sparc64") temporarily changed it).
Signed-off-by: David Hildenbrand <david(a)redhat.com>
---
mm/huge_memory.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6f3af65435c8..8332e16ac97b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2235,11 +2235,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
entry = pte_swp_mkuffd_wp(entry);
} else {
entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
- entry = maybe_mkwrite(entry, vma);
+ if (write)
+ entry = maybe_mkwrite(entry, vma);
if (anon_exclusive)
SetPageAnonExclusive(page + i);
- if (!write)
- entry = pte_wrprotect(entry);
if (!young)
entry = pte_mkold(entry);
/* NOTE: this may set soft-dirty too on some archs */
--
2.39.2