Page Detective is a new kernel debugging tool that provides detailed information about the usage and mapping of physical memory pages.
It is often known that a particular page is corrupted, but it is hard to extract more information about such a page from a live system. Examples are:
- Checksum failure during live migration
- Filesystem journal failure
- dump_page warnings on the console log
- Unexpected segfaults
Page Detective helps extract more information from the kernel, so developers can use it to root-cause the associated problem.
It operates through the Linux debugfs interface, with two files: "virt" and "phys".
The "virt" file takes a virtual address and PID and outputs information about the corresponding page.
The "phys" file takes a physical address and outputs information about that page.
The output is presented via kernel log messages (can be accessed with dmesg), and includes information such as the page's reference count, mapping, flags, and memory cgroup. It also shows whether the page is mapped in the kernel page table, and if so, how many times.
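For scripting around these interfaces, the input strings can be prepared in user space. The sketch below only takes the debugfs paths and the "<pid> <virtual address>" / "<physical address>" input formats from the description above; the helper names themselves are invented for illustration:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of a user-space helper that formats the strings a tool would
 * write to /sys/kernel/debug/page_detective/{virt,phys}.  Only the input
 * formats come from the cover letter; pd_format_* are illustrative names.
 */

/* "virt" expects "<pid> <virtual address>" */
static int pd_format_virt(char *buf, size_t len, int pid, unsigned long vaddr)
{
	return snprintf(buf, len, "%d 0x%lx", pid, vaddr);
}

/* "phys" expects a single physical address */
static int pd_format_phys(char *buf, size_t len, unsigned long paddr)
{
	return snprintf(buf, len, "0x%lx", paddr);
}
```

A tool would then open the debugfs file (CAP_SYS_ADMIN is required), write the formatted string, and collect the resulting report from dmesg.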
Pasha Tatashin (6):
  mm: Make get_vma_name() function public
  pagewalk: Add a page table walker for init_mm page table
  mm: Add a dump_page variant that accept log level argument
  misc/page_detective: Introduce Page Detective
  misc/page_detective: enable loadable module
  selftests/page_detective: Introduce self tests for Page Detective
 Documentation/misc-devices/index.rst          |   1 +
 Documentation/misc-devices/page_detective.rst |  78 ++
 MAINTAINERS                                   |   8 +
 drivers/misc/Kconfig                          |  11 +
 drivers/misc/Makefile                         |   1 +
 drivers/misc/page_detective.c                 | 808 ++++++++++++++++++
 fs/inode.c                                    |  18 +-
 fs/kernfs/dir.c                               |   1 +
 fs/proc/task_mmu.c                            |  61 --
 include/linux/fs.h                            |   5 +-
 include/linux/mmdebug.h                       |   1 +
 include/linux/pagewalk.h                      |   2 +
 kernel/pid.c                                  |   1 +
 mm/debug.c                                    |  53 +-
 mm/memcontrol.c                               |   1 +
 mm/oom_kill.c                                 |   1 +
 mm/pagewalk.c                                 |  32 +
 mm/vma.c                                      |  60 ++
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/page_detective/.gitignore       |   1 +
 .../testing/selftests/page_detective/Makefile |   7 +
 tools/testing/selftests/page_detective/config |   4 +
 .../page_detective/page_detective_test.c      | 727 ++++++++++++++++
 23 files changed, 1787 insertions(+), 96 deletions(-)
 create mode 100644 Documentation/misc-devices/page_detective.rst
 create mode 100644 drivers/misc/page_detective.c
 create mode 100644 tools/testing/selftests/page_detective/.gitignore
 create mode 100644 tools/testing/selftests/page_detective/Makefile
 create mode 100644 tools/testing/selftests/page_detective/config
 create mode 100644 tools/testing/selftests/page_detective/page_detective_test.c
Page Detective will be using get_vma_name() that is currently used by fs/proc to show names of VMAs in /proc/<pid>/smaps for example.
Move this function to mm/vma.c, and make it accessible by modules.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 fs/proc/task_mmu.c | 61 ----------------------------------------------
 include/linux/fs.h |  3 +++
 mm/vma.c           | 60 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 63 insertions(+), 61 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e52bd96137a6..b28c42b7a591 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -240,67 +240,6 @@ static int do_maps_open(struct inode *inode, struct file *file,
 			sizeof(struct proc_maps_private));
 }
 
-static void get_vma_name(struct vm_area_struct *vma,
-			 const struct path **path,
-			 const char **name,
-			 const char **name_fmt)
-{
-	struct anon_vma_name *anon_name = vma->vm_mm ? anon_vma_name(vma) : NULL;
-
-	*name = NULL;
-	*path = NULL;
-	*name_fmt = NULL;
-
-	/*
-	 * Print the dentry name for named mappings, and a
-	 * special [heap] marker for the heap:
-	 */
-	if (vma->vm_file) {
-		/*
-		 * If user named this anon shared memory via
-		 * prctl(PR_SET_VMA ..., use the provided name.
-		 */
-		if (anon_name) {
-			*name_fmt = "[anon_shmem:%s]";
-			*name = anon_name->name;
-		} else {
-			*path = file_user_path(vma->vm_file);
-		}
-		return;
-	}
-
-	if (vma->vm_ops && vma->vm_ops->name) {
-		*name = vma->vm_ops->name(vma);
-		if (*name)
-			return;
-	}
-
-	*name = arch_vma_name(vma);
-	if (*name)
-		return;
-
-	if (!vma->vm_mm) {
-		*name = "[vdso]";
-		return;
-	}
-
-	if (vma_is_initial_heap(vma)) {
-		*name = "[heap]";
-		return;
-	}
-
-	if (vma_is_initial_stack(vma)) {
-		*name = "[stack]";
-		return;
-	}
-
-	if (anon_name) {
-		*name_fmt = "[anon:%s]";
-		*name = anon_name->name;
-		return;
-	}
-}
-
 static void show_vma_header_prefix(struct seq_file *m,
 				   unsigned long start, unsigned long end,
 				   vm_flags_t flags, unsigned long long pgoff,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3559446279c1..a25b72397af5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3474,6 +3474,9 @@ void setattr_copy(struct mnt_idmap *, struct inode *inode,
 
 extern int file_update_time(struct file *file);
 
+void get_vma_name(struct vm_area_struct *vma, const struct path **path,
+		  const char **name, const char **name_fmt);
+
 static inline bool vma_is_dax(const struct vm_area_struct *vma)
 {
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
diff --git a/mm/vma.c b/mm/vma.c
index 7621384d64cf..1bd589fbc3c7 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2069,3 +2069,63 @@ void mm_drop_all_locks(struct mm_struct *mm)
 
 	mutex_unlock(&mm_all_locks_mutex);
 }
+
+void get_vma_name(struct vm_area_struct *vma, const struct path **path,
+		  const char **name, const char **name_fmt)
+{
+	struct anon_vma_name *anon_name = vma->vm_mm ? anon_vma_name(vma) : NULL;
+
+	*name = NULL;
+	*path = NULL;
+	*name_fmt = NULL;
+
+	/*
+	 * Print the dentry name for named mappings, and a
+	 * special [heap] marker for the heap:
+	 */
+	if (vma->vm_file) {
+		/*
+		 * If user named this anon shared memory via
+		 * prctl(PR_SET_VMA ..., use the provided name.
+		 */
+		if (anon_name) {
+			*name_fmt = "[anon_shmem:%s]";
+			*name = anon_name->name;
+		} else {
+			*path = file_user_path(vma->vm_file);
+		}
+		return;
+	}
+
+	if (vma->vm_ops && vma->vm_ops->name) {
+		*name = vma->vm_ops->name(vma);
+		if (*name)
+			return;
+	}
+
+	*name = arch_vma_name(vma);
+	if (*name)
+		return;
+
+	if (!vma->vm_mm) {
+		*name = "[vdso]";
+		return;
+	}
+
+	if (vma_is_initial_heap(vma)) {
+		*name = "[heap]";
+		return;
+	}
+
+	if (vma_is_initial_stack(vma)) {
+		*name = "[stack]";
+		return;
+	}
+
+	if (anon_name) {
+		*name_fmt = "[anon:%s]";
+		*name = anon_name->name;
+		return;
+	}
+}
+EXPORT_SYMBOL_GPL(get_vma_name);
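The resolution order the patch moves can be illustrated without any kernel headers. The user-space model below mirrors get_vma_name()'s priority chain with stubbed-out VMA fields; every struct and field name here is invented for the sketch (it also glosses over details such as anon_vma_name() only being fetched when vm_mm is set):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct vm_area_struct, fields flattened to strings/flags. */
struct toy_vma {
	const char *file_path; /* path of vma->vm_file, if file-backed */
	const char *anon_name; /* name set via prctl(PR_SET_VMA_ANON_NAME) */
	const char *ops_name;  /* result of vma->vm_ops->name(vma) */
	const char *arch_name; /* result of arch_vma_name(vma) */
	int has_mm;            /* vma->vm_mm != NULL */
	int is_heap, is_stack;
};

/* Mirrors get_vma_name()'s priority order: file > vm_ops > arch > special markers. */
static void toy_get_vma_name(const struct toy_vma *vma, const char **path,
			     const char **name, const char **name_fmt)
{
	*path = NULL;
	*name = NULL;
	*name_fmt = NULL;

	if (vma->file_path) {
		/* named anon shared memory wins over the dentry path */
		if (vma->anon_name) {
			*name_fmt = "[anon_shmem:%s]";
			*name = vma->anon_name;
		} else {
			*path = vma->file_path;
		}
		return;
	}
	if (vma->ops_name) { *name = vma->ops_name; return; }
	if (vma->arch_name) { *name = vma->arch_name; return; }
	if (!vma->has_mm) { *name = "[vdso]"; return; }
	if (vma->is_heap) { *name = "[heap]"; return; }
	if (vma->is_stack) { *name = "[stack]"; return; }
	if (vma->anon_name) {
		*name_fmt = "[anon:%s]";
		*name = vma->anon_name;
	}
}
```

This is only a model of the decision logic, not a replacement for the kernel function.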
On Sat, Nov 16, 2024 at 05:59:17PM +0000, Pasha Tatashin wrote:
Page Detective will be using get_vma_name() that is currently used by fs/proc to show names of VMAs in /proc/<pid>/smaps for example.
Move this function to mm/vma.c, and make it accessible by modules.
This is incorrect.
mm/vma.c is for internal VMA implementation details, whose interface is explicitly mm/vma.h. This is so we can maintain the internal mechanism separately from the interface and, importantly, are able to unit test VMA functionality in userland.
I think this _should_ be in mm/vma.c, but if it were to be exported it would need to be via a wrapper function in mm/mmap.c or somewhere like this.
Also you broke the vma tests, go run make in tools/testing/vma/...
Your patch also does not apply against Andrew's tree and the mm-unstable branch (i.e. against 6.13), which is what new mm patches should be based upon.
Maybe I'll comment on the cover letter, but I don't agree you should be doing mm implementation details in a driver.
The core of this should be in mm rather than exporting a bunch of stuff and have a driver do it. You're exposing internal implementation details unnecessarily.
Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com
 fs/proc/task_mmu.c | 61 ----------------------------------------------
 include/linux/fs.h |  3 +++
 mm/vma.c           | 60 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 63 insertions(+), 61 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e52bd96137a6..b28c42b7a591 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -240,67 +240,6 @@ static int do_maps_open(struct inode *inode, struct file *file,
 			sizeof(struct proc_maps_private));
 }
-static void get_vma_name(struct vm_area_struct *vma,
-			 const struct path **path,
-			 const char **name,
-			 const char **name_fmt)
-{
-	struct anon_vma_name *anon_name = vma->vm_mm ? anon_vma_name(vma) : NULL;
-
-	*name = NULL;
-	*path = NULL;
-	*name_fmt = NULL;
-
-	/*
-	 * Print the dentry name for named mappings, and a
-	 * special [heap] marker for the heap:
-	 */
-	if (vma->vm_file) {
-		/*
-		 * If user named this anon shared memory via
-		 * prctl(PR_SET_VMA ..., use the provided name.
-		 */
-		if (anon_name) {
-			*name_fmt = "[anon_shmem:%s]";
-			*name = anon_name->name;
-		} else {
-			*path = file_user_path(vma->vm_file);
-		}
-		return;
-	}
-
-	if (vma->vm_ops && vma->vm_ops->name) {
-		*name = vma->vm_ops->name(vma);
-		if (*name)
-			return;
-	}
-
-	*name = arch_vma_name(vma);
-	if (*name)
-		return;
-
-	if (!vma->vm_mm) {
-		*name = "[vdso]";
-		return;
-	}
-
-	if (vma_is_initial_heap(vma)) {
-		*name = "[heap]";
-		return;
-	}
-
-	if (vma_is_initial_stack(vma)) {
-		*name = "[stack]";
-		return;
-	}
-
-	if (anon_name) {
-		*name_fmt = "[anon:%s]";
-		*name = anon_name->name;
-		return;
-	}
-}
-
 static void show_vma_header_prefix(struct seq_file *m,
 				   unsigned long start, unsigned long end,
 				   vm_flags_t flags, unsigned long long pgoff,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3559446279c1..a25b72397af5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3474,6 +3474,9 @@ void setattr_copy(struct mnt_idmap *, struct inode *inode,
extern int file_update_time(struct file *file);
+void get_vma_name(struct vm_area_struct *vma, const struct path **path,
+		  const char **name, const char **name_fmt);
+
You're putting something in an mm/ C-file and the header in fs.h? Eh?
 static inline bool vma_is_dax(const struct vm_area_struct *vma)
 {
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
diff --git a/mm/vma.c b/mm/vma.c
index 7621384d64cf..1bd589fbc3c7 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2069,3 +2069,63 @@ void mm_drop_all_locks(struct mm_struct *mm)
 	mutex_unlock(&mm_all_locks_mutex);
 }
+
+void get_vma_name(struct vm_area_struct *vma, const struct path **path,
+		  const char **name, const char **name_fmt)
+{
+	struct anon_vma_name *anon_name = vma->vm_mm ? anon_vma_name(vma) : NULL;
If we were to export this (I'm very dubious about that) I'd want to assert some lock state and that the vma exists too.
Because we're just assuming the VMA won't disappear from under us and now the driver will too, and also assuming we won't be passed NULL's...
But in general I'm not in favour of having this exported.
+	*name = NULL;
+	*path = NULL;
+	*name_fmt = NULL;
+
+	/*
+	 * Print the dentry name for named mappings, and a
+	 * special [heap] marker for the heap:
+	 */
+	if (vma->vm_file) {
+		/*
+		 * If user named this anon shared memory via
+		 * prctl(PR_SET_VMA ..., use the provided name.
+		 */
+		if (anon_name) {
+			*name_fmt = "[anon_shmem:%s]";
+			*name = anon_name->name;
+		} else {
+			*path = file_user_path(vma->vm_file);
+		}
+		return;
+	}
+
+	if (vma->vm_ops && vma->vm_ops->name) {
+		*name = vma->vm_ops->name(vma);
+		if (*name)
+			return;
+	}
+
+	*name = arch_vma_name(vma);
+	if (*name)
+		return;
+
+	if (!vma->vm_mm) {
+		*name = "[vdso]";
+		return;
+	}
+
+	if (vma_is_initial_heap(vma)) {
+		*name = "[heap]";
+		return;
+	}
+
+	if (vma_is_initial_stack(vma)) {
+		*name = "[stack]";
+		return;
+	}
+
+	if (anon_name) {
+		*name_fmt = "[anon:%s]";
+		*name = anon_name->name;
+		return;
+	}
+}
+EXPORT_SYMBOL_GPL(get_vma_name);
2.47.0.338.g60cca15819-goog
On Mon, Nov 18, 2024 at 5:27 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Sat, Nov 16, 2024 at 05:59:17PM +0000, Pasha Tatashin wrote:
Page Detective will be using get_vma_name() that is currently used by fs/proc to show names of VMAs in /proc/<pid>/smaps for example.
Move this function to mm/vma.c, and make it accessible by modules.
This is incorrect.
mm/vma.c is for internal VMA implementation details, whose interface is explicitly mm/vma.h. This is so we can maintain the internal mechanism separate from interfaces and, importantly, are able to userland unit test VMA functionality.
I think this _should_ be in mm/vma.c, but if it were to be exported it would need to be via a wrapper function in mm/mmap.c or somewhere like this.
Ok, I can do that in the next version.
Also you broke the vma tests, go run make in tools/testing/vma/...
Hm interesting, I will take a look, this is surprising, as this patch should not really change the behavior of anything. I guess it would be because of the out of kernel vma.c build?
Your patch also does not apply against Andrew's tree and the mm-unstable branch (i.e. against 6.13 in other words) which is what new mm patches should be based upon.
Maybe I'll comment on the cover letter, but I don't agree you should be doing mm implementation details in a driver.
The core of this should be in mm rather than exporting a bunch of stuff and have a driver do it. You're exposing internal implementation details unnecessarily.
This is not a problem, I will convert Page Detective to be in core mm.
@@ -3474,6 +3474,9 @@ void setattr_copy(struct mnt_idmap *, struct inode *inode,
extern int file_update_time(struct file *file);
+void get_vma_name(struct vm_area_struct *vma, const struct path **path,
const char **name, const char **name_fmt);
You're putting something in an mm/ C-file and the header in fs.h? Eh?
This is done so we do not have to include struct path into vma.h. fs.h already has some vma functions like: vma_is_dax() and vma_is_fsdax().
On Mon, Nov 18, 2024 at 03:40:57PM -0500, Pasha Tatashin wrote:
You're putting something in an mm/ C-file and the header in fs.h? Eh?
This is done so we do not have to include struct path into vma.h. fs.h already has some vma functions like: vma_is_dax() and vma_is_fsdax().
Yes, but DAX is a monumental layering violation, not something to be emulated.
On Mon, Nov 18, 2024 at 3:44 PM Matthew Wilcox willy@infradead.org wrote:
On Mon, Nov 18, 2024 at 03:40:57PM -0500, Pasha Tatashin wrote:
You're putting something in an mm/ C-file and the header in fs.h? Eh?
This is done so we do not have to include struct path into vma.h. fs.h already has some vma functions like: vma_is_dax() and vma_is_fsdax().
Yes, but DAX is a monumental layering violation, not something to be emulated.
Fair enough, but I do not like adding a "struct path" dependency to vma.h. Is there a better place to put it?
Pasha
Page Detective will use it to walk the kernel page table. Make this function accessible from modules and, while here, also make walk_page_range() accessible from modules, so that Page Detective can use it to walk user page tables.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 include/linux/pagewalk.h |  2 ++
 mm/pagewalk.c            | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index f5eb5a32aeed..ff25374470f0 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -124,6 +124,8 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
 			unsigned long end, const struct mm_walk_ops *ops,
 			void *private);
+int walk_page_range_kernel(unsigned long start, unsigned long end,
+			   const struct mm_walk_ops *ops, void *private);
 int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
 		  void *private);
 int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 5f9f01532e67..050790aeb15f 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -478,6 +478,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
 	} while (start = next, start < end);
 	return err;
 }
+EXPORT_SYMBOL_GPL(walk_page_range);
 
 /**
  * walk_page_range_novma - walk a range of pagetables not backed by a vma
@@ -541,6 +542,37 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 	return walk_pgd_range(start, end, &walk);
 }
 
+/**
+ * walk_page_range_kernel - walk a range of pagetables of kernel/init_mm
+ * @start: start address of the virtual address range
+ * @end: end address of the virtual address range
+ * @ops: operation to call during the walk
+ * @private: private data for callbacks' usage
+ *
+ * Similar to walk_page_range_novma() but specifically walks init_mm.pgd table.
+ *
+ * Note: This function takes two locks: get_online_mems(), and mmap_read, this
+ * is to prevent kernel page tables from being freed while walking.
+ */
+int walk_page_range_kernel(unsigned long start, unsigned long end,
+			   const struct mm_walk_ops *ops, void *private)
+{
+	get_online_mems();
+	if (mmap_read_lock_killable(&init_mm)) {
+		put_online_mems();
+		return -EAGAIN;
+	}
+
+	walk_page_range_novma(&init_mm, start, end, ops,
+			      init_mm.pgd, private);
+
+	mmap_read_unlock(&init_mm);
+	put_online_mems();
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(walk_page_range_kernel);
+
 int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
 			unsigned long end, const struct mm_walk_ops *ops,
 			void *private)
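The callback-driven style these walkers use can be shown with a toy two-level table in plain user-space C. Everything below is invented for the sketch; it only mirrors the mm_walk_ops pattern from the patch (a per-leaf callback invoked for each present entry, with a non-zero return aborting the walk):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_L1_ENTRIES 8	/* slots in the top-level table */
#define TOY_L2_ENTRIES 8	/* entries per second-level table */

/* Toy analogue of struct mm_walk_ops: one callback per present leaf entry. */
struct toy_walk_ops {
	int (*leaf_entry)(unsigned long index, unsigned long entry, void *private);
};

/* Ready-made callback that just counts present entries via *private. */
static int toy_count_leaf(unsigned long index, unsigned long entry, void *private)
{
	(void)index;
	(void)entry;
	(*(int *)private)++;
	return 0;
}

/*
 * Walk indexes [start, end) of a two-level table: the top level holds
 * pointers to second-level tables (NULL == whole subtree absent), and a
 * zero leaf entry means "not present", like an empty pte.
 */
static int toy_walk_range(unsigned long *l1[TOY_L1_ENTRIES],
			  unsigned long start, unsigned long end,
			  const struct toy_walk_ops *ops, void *private)
{
	unsigned long i;

	for (i = start; i < end; i++) {
		unsigned long *l2 = l1[i / TOY_L2_ENTRIES];
		int err;

		if (!l2)
			continue;	/* skip the whole missing subtree */
		if (!l2[i % TOY_L2_ENTRIES])
			continue;	/* entry not present */
		err = ops->leaf_entry(i, l2[i % TOY_L2_ENTRIES], private);
		if (err)
			return err;	/* non-zero aborts the walk */
	}
	return 0;
}
```

The real walkers additionally pin the tables (get_online_mems(), mmap_read_lock) so entries cannot be freed mid-walk; the toy omits all locking.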
On Sat, Nov 16, 2024 at 05:59:18PM +0000, Pasha Tatashin wrote:
 	} while (start = next, start < end);
 	return err;
 }
+EXPORT_SYMBOL_GPL(walk_page_range);
Umm, no. We really should not expose all these page table details to modules.
+EXPORT_SYMBOL_GPL(walk_page_range_kernel);
Even more so here.
On Sun, Nov 17, 2024 at 10:49:06PM -0800, Christoph Hellwig wrote:
On Sat, Nov 16, 2024 at 05:59:18PM +0000, Pasha Tatashin wrote:
 	} while (start = next, start < end);
 	return err;
 }
+EXPORT_SYMBOL_GPL(walk_page_range);
Umm, no. We really should not expose all these page table details to modules.
+EXPORT_SYMBOL_GPL(walk_page_range_kernel);
Even more so here.
Very much agree. You basically then have the ability for any (GPL) driver to come in and modify page tables at will which is VERY MUCH not a good idea.
The rules around page table manipulation are very subtle and constantly changing, this is not something for anything outside of mm to be fiddling with.
Again, I find it bizarre we're exporting mm internal implementation details to a driver to do stuff with rather than adding functionality to mm.
On Mon, Nov 18, 2024 at 1:49 AM Christoph Hellwig hch@infradead.org wrote:
On Sat, Nov 16, 2024 at 05:59:18PM +0000, Pasha Tatashin wrote:
 	} while (start = next, start < end);
 	return err;
 }
+EXPORT_SYMBOL_GPL(walk_page_range);
Umm, no. We really should not expose all these page table details to modules.
+EXPORT_SYMBOL_GPL(walk_page_range_kernel);
Even more so here.
I will remove these exports in the next version, as I am going to convert Page Detective to be part of core mm instead of misc device.
Thanks, Pasha
Page Detective uses the info log level, while dump_page() uses the warn level. Add a new function, dump_page_lvl(), that accepts a log level argument so pages can be dumped at a specific level. This also enables adding a module-specific prefix to the output of this function.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 fs/inode.c              | 18 +++++++-------
 include/linux/fs.h      |  2 +-
 include/linux/mmdebug.h |  1 +
 mm/debug.c              | 53 ++++++++++++++++++++++-------------------
 4 files changed, 39 insertions(+), 35 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 8dabb224f941..1114319d82b2 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -603,7 +603,7 @@ void __remove_inode_hash(struct inode *inode)
 }
 EXPORT_SYMBOL(__remove_inode_hash);
 
-void dump_mapping(const struct address_space *mapping)
+void dump_mapping(const char *loglvl, const struct address_space *mapping)
 {
 	struct inode *host;
 	const struct address_space_operations *a_ops;
@@ -619,31 +619,31 @@ void dump_mapping(const struct address_space *mapping)
 	 */
 	if (get_kernel_nofault(host, &mapping->host) ||
 	    get_kernel_nofault(a_ops, &mapping->a_ops)) {
-		pr_warn("invalid mapping:%px\n", mapping);
+		printk("%sinvalid mapping:%px\n", loglvl, mapping);
 		return;
 	}
 
 	if (!host) {
-		pr_warn("aops:%ps\n", a_ops);
+		printk("%saops:%ps\n", loglvl, a_ops);
 		return;
 	}
 
 	if (get_kernel_nofault(dentry_first, &host->i_dentry.first) ||
 	    get_kernel_nofault(ino, &host->i_ino)) {
-		pr_warn("aops:%ps invalid inode:%px\n", a_ops, host);
+		printk("%saops:%ps invalid inode:%px\n", loglvl, a_ops, host);
 		return;
 	}
 
 	if (!dentry_first) {
-		pr_warn("aops:%ps ino:%lx\n", a_ops, ino);
+		printk("%saops:%ps ino:%lx\n", loglvl, a_ops, ino);
 		return;
 	}
 
 	dentry_ptr = container_of(dentry_first, struct dentry, d_u.d_alias);
 	if (get_kernel_nofault(dentry, dentry_ptr) ||
 	    !dentry.d_parent || !dentry.d_name.name) {
-		pr_warn("aops:%ps ino:%lx invalid dentry:%px\n",
-				a_ops, ino, dentry_ptr);
+		printk("%saops:%ps ino:%lx invalid dentry:%px\n",
+				loglvl, a_ops, ino, dentry_ptr);
 		return;
 	}
 
@@ -653,8 +653,8 @@ void dump_mapping(const struct address_space *mapping)
 	 * Even if strncpy_from_kernel_nofault() succeeded,
 	 * the fname could be unreliable
 	 */
-	pr_warn("aops:%ps ino:%lx dentry name(?):\"%s\"\n",
-			a_ops, ino, fname);
+	printk("%saops:%ps ino:%lx dentry name(?):\"%s\"\n",
+			loglvl, a_ops, ino, fname);
 }
 
 void clear_inode(struct inode *inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a25b72397af5..fa2b04bed9d6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3137,7 +3137,7 @@ extern void unlock_new_inode(struct inode *);
 extern void discard_new_inode(struct inode *);
 extern unsigned int get_next_ino(void);
 extern void evict_inodes(struct super_block *sb);
-void dump_mapping(const struct address_space *);
+void dump_mapping(const char *loglvl, const struct address_space *);
 
 /*
  * Userspace may rely on the inode number being non-zero. For example, glibc
diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h
index 39a7714605a7..69849d457f4c 100644
--- a/include/linux/mmdebug.h
+++ b/include/linux/mmdebug.h
@@ -11,6 +11,7 @@ struct mm_struct;
 struct vma_iterator;
 
 void dump_page(const struct page *page, const char *reason);
+void dump_page_lvl(const char *loglvl, const struct page *page);
 void dump_vma(const struct vm_area_struct *vma);
 void dump_mm(const struct mm_struct *mm);
 void vma_iter_dump_tree(const struct vma_iterator *vmi);
diff --git a/mm/debug.c b/mm/debug.c
index aa57d3ffd4ed..0df242c77c7c 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -67,36 +67,38 @@ static const char *page_type_name(unsigned int page_type)
 	return page_type_names[i];
 }
 
-static void __dump_folio(struct folio *folio, struct page *page,
-		unsigned long pfn, unsigned long idx)
+static void __dump_folio(const char *loglvl, struct folio *folio,
+		struct page *page, unsigned long pfn,
+		unsigned long idx)
 {
 	struct address_space *mapping = folio_mapping(folio);
 	int mapcount = atomic_read(&page->_mapcount);
 	char *type = "";
 
 	mapcount = page_mapcount_is_type(mapcount) ? 0 : mapcount + 1;
-	pr_warn("page: refcount:%d mapcount:%d mapping:%p index:%#lx pfn:%#lx\n",
-			folio_ref_count(folio), mapcount, mapping,
-			folio->index + idx, pfn);
+	printk("%spage: refcount:%d mapcount:%d mapping:%p index:%#lx pfn:%#lx\n",
+			loglvl, folio_ref_count(folio), mapcount, mapping,
+			folio->index + idx, pfn);
 	if (folio_test_large(folio)) {
-		pr_warn("head: order:%u mapcount:%d entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n",
-				folio_order(folio),
-				folio_mapcount(folio),
-				folio_entire_mapcount(folio),
-				folio_nr_pages_mapped(folio),
-				atomic_read(&folio->_pincount));
+		printk("%shead: order:%u mapcount:%d entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n",
+				loglvl,
+				folio_order(folio),
+				folio_mapcount(folio),
+				folio_entire_mapcount(folio),
+				folio_nr_pages_mapped(folio),
+				atomic_read(&folio->_pincount));
 	}
 
 #ifdef CONFIG_MEMCG
 	if (folio->memcg_data)
-		pr_warn("memcg:%lx\n", folio->memcg_data);
+		printk("%smemcg:%lx\n", loglvl, folio->memcg_data);
 #endif
 	if (folio_test_ksm(folio))
 		type = "ksm ";
 	else if (folio_test_anon(folio))
 		type = "anon ";
 	else if (mapping)
-		dump_mapping(mapping);
+		dump_mapping(loglvl, mapping);
 	BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS + 1);
 
 	/*
@@ -105,22 +107,22 @@ static void __dump_folio(struct folio *folio, struct page *page,
 	 * state for debugging, it should be fine to accept a bit of
 	 * inaccuracy here due to racing.
 	 */
-	pr_warn("%sflags: %pGp%s\n", type, &folio->flags,
-		is_migrate_cma_folio(folio, pfn) ? " CMA" : "");
+	printk("%s%sflags: %pGp%s\n", loglvl, type, &folio->flags,
+		is_migrate_cma_folio(folio, pfn) ? " CMA" : "");
 	if (page_has_type(&folio->page))
 		pr_warn("page_type: %x(%s)\n", folio->page.page_type >> 24,
 				page_type_name(folio->page.page_type));
 
-	print_hex_dump(KERN_WARNING, "raw: ", DUMP_PREFIX_NONE, 32,
-			sizeof(unsigned long), page,
-			sizeof(struct page), false);
+	print_hex_dump(loglvl, "raw: ", DUMP_PREFIX_NONE, 32,
+			sizeof(unsigned long), page,
+			sizeof(struct page), false);
 	if (folio_test_large(folio))
-		print_hex_dump(KERN_WARNING, "head: ", DUMP_PREFIX_NONE, 32,
-			sizeof(unsigned long), folio,
-			2 * sizeof(struct page), false);
+		print_hex_dump(loglvl, "head: ", DUMP_PREFIX_NONE, 32,
+			sizeof(unsigned long), folio,
+			2 * sizeof(struct page), false);
 }
 
-static void __dump_page(const struct page *page)
+void dump_page_lvl(const char *loglvl, const struct page *page)
 {
 	struct folio *foliop, folio;
 	struct page precise;
@@ -149,22 +151,23 @@ static void __dump_page(const struct page *page)
 	if (idx > nr_pages) {
 		if (loops-- > 0)
 			goto again;
-		pr_warn("page does not match folio\n");
+		printk("%spage does not match folio\n", loglvl);
 		precise.compound_head &= ~1UL;
 		foliop = (struct folio *)&precise;
 		idx = 0;
 	}
 
 dump:
-	__dump_folio(foliop, &precise, pfn, idx);
+	__dump_folio(loglvl, foliop, &precise, pfn, idx);
 }
+EXPORT_SYMBOL_GPL(dump_page_lvl);
 
 void dump_page(const struct page *page, const char *reason)
 {
 	if (PagePoisoned(page))
 		pr_warn("page:%p is uninitialized and poisoned", page);
 	else
-		__dump_page(page);
+		dump_page_lvl(KERN_WARNING, page);
 	if (reason)
 		pr_warn("page dumped because: %s\n", reason);
 	dump_page_owner(page);
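The mechanical change in this patch, replacing hard-coded pr_warn() with a caller-supplied level prefix, can be modeled in plain user space. The snprintf-based stand-in below is only a sketch: toy_dump_page is an invented name, and the prefix string stands in for a KERN_* level combined with pr_fmt():

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * User-space model of the dump_page_lvl() idea: every line of output is
 * prefixed with a caller-chosen string instead of a hard-coded warn level.
 * Mirrors: printk("%spage: refcount:%d mapcount:%d ... pfn:%#lx\n", loglvl, ...).
 */
static int toy_dump_page(char *out, size_t len, const char *loglvl,
			 int refcount, int mapcount, unsigned long pfn)
{
	return snprintf(out, len, "%spage: refcount:%d mapcount:%d pfn:%#lx\n",
			loglvl, refcount, mapcount, pfn);
}
```

With this shape, one caller can pass a "Page Detective: "-style info prefix while dump_page() keeps passing its warn-level prefix, which is exactly the split the patch introduces.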
Page Detective is a kernel debugging tool that provides detailed information about the usage and mapping of physical memory pages.
It operates through the Linux debugfs interface, providing access to both virtual and physical address inquiries. The output, presented via kernel log messages (accessible with dmesg), will help administrators and developers understand how specific pages are utilized by the system.
This tool can be used to investigate various memory-related issues, such as checksum failures during live migration, filesystem journal failures, general segfaults, or other corruptions.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 Documentation/misc-devices/index.rst          |   1 +
 Documentation/misc-devices/page_detective.rst |  78 ++
 MAINTAINERS                                   |   7 +
 drivers/misc/Kconfig                          |  11 +
 drivers/misc/Makefile                         |   1 +
 drivers/misc/page_detective.c                 | 808 ++++++++++++++++++
 6 files changed, 906 insertions(+)
 create mode 100644 Documentation/misc-devices/page_detective.rst
 create mode 100644 drivers/misc/page_detective.c
diff --git a/Documentation/misc-devices/index.rst b/Documentation/misc-devices/index.rst
index 8c5b226d8313..d64723f20804 100644
--- a/Documentation/misc-devices/index.rst
+++ b/Documentation/misc-devices/index.rst
@@ -23,6 +23,7 @@ fit into other categories.
    max6875
    mrvl_cn10k_dpi
    oxsemi-tornado
+   page_detective
    pci-endpoint-test
    spear-pcie-gadget
    tps6594-pfsm
diff --git a/Documentation/misc-devices/page_detective.rst b/Documentation/misc-devices/page_detective.rst
new file mode 100644
index 000000000000..06f666d5b3a9
--- /dev/null
+++ b/Documentation/misc-devices/page_detective.rst
@@ -0,0 +1,78 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==============
+Page Detective
+==============
+
+Author:
+Pasha Tatashin <pasha.tatashin@soleen.com>
+
+Overview
+--------
+
+Page Detective is a kernel debugging tool designed to provide in-depth
+information about the usage and mapping of physical memory pages within the
+Linux kernel. By leveraging the debugfs interface, it enables administrators
+and developers to investigate the status and allocation of memory pages.
+
+This tool is valuable for diagnosing memory-related issues such as checksum
+errors during live migration, filesystem journal failures, segmentation faults,
+and other forms of corruption.
+
+Functionality
+-------------
+
+Page Detective operates by accepting input through its debugfs interface files
+located in the ``/sys/kernel/debug/page_detective`` directory:
+
+ * virt: Takes input in the format <pid> <virtual address>. It resolves the
+   provided virtual address within the specified process's address space and
+   outputs comprehensive information about the corresponding physical page's
+   mapping and usage.
+
+ * phys: Takes a raw physical address as input. It directly investigates the
+   usage of the specified physical page and outputs relevant information.
+
+The output generated by Page Detective is delivered through kernel log messages
+(accessible using dmesg).
+
+Usage
+-----
+
+- Enable Page Detective: Ensure the CONFIG_PAGE_DETECTIVE kernel configuration
+  option is enabled.
+
+- Access debugfs: Mount the debugfs filesystem (if not already mounted):
+  ``mount -t debugfs nodev /sys/kernel/debug``
+
+- Interact with Page Detective through one of two interfaces:
+  ``echo "<pid> <virtual address>" > /sys/kernel/debug/page_detective/virt``
+  ``echo "<physical address>" > /sys/kernel/debug/page_detective/phys``
+
+- The Page Detective file interface is accessible only to users with
+  CAP_SYS_ADMIN.
+
+Example
+-------
+
+```
+# echo 0x1078fb000 > /sys/kernel/debug/page_detective/phys
+Page Detective: Investigating physical[105bafc50] pfn[105baf]
+Page Detective: metadata for Small Page pfn[105baf] folio[ffffea000416ebc0] order [0]
+Page Detective: page: refcount:1 mapcount:1 mapping:0000000000000000 index:0x7fffffffb pfn:0x105baf
+Page Detective: memcg:ffff888106189000
+Page Detective: anon flags: 0x200000000020828(uptodate|lru|owner_2|swapbacked|node=0|zone=2)
+Page Detective: raw: 0200000000020828 ffffea000416ec08 ffffea000416e7c8 ffff888106382bc9
+Page Detective: raw: 00000007fffffffb 0000000000000000 0000000100000000 ffff888106189000
+Page Detective: memcg: [/system.slice/system-serial\x2dgetty.slice/serial-getty@ttyS0.service ] [/system.slice/system-serial\x2dgetty.slice ] [/system.slice ] [/ ]
+Page Detective: The page is direct mapped addr[ffff888105baf000] pmd entry[8000000105a001e3]
+Page Detective: The page is not mapped into kernel vmalloc area
+Page Detective: The page mapped into kernel page table: 1 times
+Page Detective: Scanned kernel page table in [0.003353799s]
+Page Detective: The page contains some data
+Page Detective: mapped by PID[377] cmd[page_detective_] mm[ffff888101778000] pgd[ffff888100894000] at addr[7ffea333b000] pte[8000000105baf067]
+Page Detective: vma[ffff888101701aa0] start[7ffea331e000] end[7ffea333f000] flags[0000000000100173] name: [stack]
+Page Detective: Scanned [16] user page tables in [0.000297744s]
+Page Detective: The page mapped into user page tables: 1 times
+Page Detective: Finished investigation of physical[105bafc50]
+```
diff --git a/MAINTAINERS b/MAINTAINERS
index 21fdaa19229a..654d4650670d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17450,6 +17450,13 @@ F:	mm/page-writeback.c
 F:	mm/readahead.c
 F:	mm/truncate.c
+PAGE DETECTIVE
+M:	Pasha Tatashin <pasha.tatashin@soleen.com>
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+F:	Documentation/misc-devices/page_detective.rst
+F:	drivers/misc/page_detective.c
+
 PAGE POOL
 M:	Jesper Dangaard Brouer <hawk@kernel.org>
 M:	Ilias Apalodimas <ilias.apalodimas@linaro.org>
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 3fe7e2a9bd29..2965c3c7cdef 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -492,6 +492,17 @@ config MISC_RTSX
 	tristate
 	default MISC_RTSX_PCI || MISC_RTSX_USB
 
+config PAGE_DETECTIVE
+	depends on PAGE_TABLE_CHECK
+	depends on MEMCG
+	bool "Page Detective"
+	help
+	  A debugging tool designed to provide detailed information about the
+	  usage and mapping of physical memory pages. This tool operates through
+	  the Linux debugfs interface, providing access to both virtual and
+	  physical address inquiries. The output is presented via kernel log
+	  messages.
+
 config HISI_HIKEY_USB
 	tristate "USB GPIO Hub on HiSilicon Hikey 960/970 Platform"
 	depends on (OF && GPIOLIB) || COMPILE_TEST
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index a9f94525e181..411f17fcde6b 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -56,6 +56,7 @@ obj-$(CONFIG_PCI_ENDPOINT_TEST) += pci_endpoint_test.o
 obj-$(CONFIG_OCXL)		+= ocxl/
 obj-$(CONFIG_BCM_VK)		+= bcm-vk/
 obj-y				+= cardreader/
+obj-$(CONFIG_PAGE_DETECTIVE)	+= page_detective.o
 obj-$(CONFIG_PVPANIC)		+= pvpanic/
 obj-$(CONFIG_UACCE)		+= uacce/
 obj-$(CONFIG_XILINX_SDFEC)	+= xilinx_sdfec.o
diff --git a/drivers/misc/page_detective.c b/drivers/misc/page_detective.c
new file mode 100644
index 000000000000..300064d83dd3
--- /dev/null
+++ b/drivers/misc/page_detective.c
@@ -0,0 +1,808 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+/*
+ * Copyright (c) 2024, Google LLC.
+ * Pasha Tatashin pasha.tatashin@soleen.com + */ +#include <linux/ctype.h> +#include <linux/debugfs.h> +#include <linux/module.h> +#include <linux/mm.h> +#include <linux/mm_inline.h> +#include <linux/slab.h> +#include <linux/sched/mm.h> +#include <linux/hugetlb.h> +#include <linux/pagewalk.h> +#include <linux/sched/clock.h> +#include <linux/oom.h> + +#undef pr_fmt +#define pr_fmt(fmt) "Page Detective: " fmt + +/* + * Walk 4T of VA space at a time, in order to periodically release the mmap + * lock + */ +#define PD_WALK_MAX_RANGE BIT(42) + +/* Synchronizes writes to virt and phys files */ +static DEFINE_MUTEX(page_detective_mutex); +static struct dentry *page_detective_debugfs_dir; + +static void page_detective_memcg(struct folio *folio) +{ + struct mem_cgroup *memcg; + + if (!folio_try_get(folio)) + return; + + memcg = get_mem_cgroup_from_folio(folio); + if (memcg) { + pr_info("memcg:"); + do { + pr_cont(" ["); + pr_cont_cgroup_path(memcg->css.cgroup); + pr_cont(" ]"); + } while ((memcg = parent_mem_cgroup(memcg))); + mem_cgroup_put(memcg); + pr_cont("\n"); + } + folio_put(folio); +} + +static void page_detective_metadata(unsigned long pfn) +{ + struct folio *folio = pfn_folio(pfn); + bool hugetlb, trans; + unsigned int order; + + if (!folio) { + pr_info("metadata for pfn[%lx] not found\n", pfn); + return; + } + + trans = folio_test_large(folio) && folio_test_large_rmappable(folio); + hugetlb = folio_test_hugetlb(folio); + order = folio_order(folio); + + pr_info("metadata for %s pfn[%lx] folio[%px] order [%u]\n", + (trans) ? "Transparent Huge Page" : (hugetlb) ? 
"HugeTLB" : + "Small Page", pfn, folio, order); + dump_page_lvl(KERN_INFO pr_fmt(""), &folio->page); + page_detective_memcg(folio); +} + +struct pd_private_kernel { + unsigned long pfn; + unsigned long direct_map_addr; + bool direct_map; + unsigned long vmalloc_maps; + long maps; +}; + +#define ENTRY_NAME(entry_page_size) ({ \ + unsigned long __entry_page_size = (entry_page_size); \ + \ + (__entry_page_size == PUD_SIZE) ? "pud" : \ + (__entry_page_size == PMD_SIZE) ? "pmd" : "pte"; \ +}) + +static void pd_print_entry_kernel(struct pd_private_kernel *pr, + unsigned long pfn_current, + unsigned long addr, + unsigned long entry_page_size, + unsigned long entry) +{ + unsigned long pfn = pr->pfn; + + if (pfn_current <= pfn && + pfn < (pfn_current + (entry_page_size >> PAGE_SHIFT))) { + bool v, d; + + addr += ((pfn << PAGE_SHIFT) & (entry_page_size - 1)); + v = (addr >= VMALLOC_START && addr < VMALLOC_END); + d = (pr->direct_map_addr == addr); + + if (v) { + pr_info("The page is mapped in vmalloc addr[%lx] %s entry[%lx]\n", + addr, ENTRY_NAME(entry_page_size), entry); + pr->vmalloc_maps++; + } else if (d) { + pr_info("The page is direct mapped addr[%lx] %s entry[%lx]\n", + addr, ENTRY_NAME(entry_page_size), entry); + pr->direct_map = true; + } else { + pr_info("The page is mapped into kernel addr[%lx] %s entry[%lx]\n", + addr, ENTRY_NAME(entry_page_size), entry); + } + + pr->maps++; + } +} + +static int pd_pud_entry_kernel(pud_t *pud, unsigned long addr, + unsigned long next, + struct mm_walk *walk) +{ + pud_t pudval = READ_ONCE(*pud); + + cond_resched(); + if (!pud_leaf(pudval)) + return 0; + + pd_print_entry_kernel(walk->private, pud_pfn(pudval), addr, + PUD_SIZE, pud_val(pudval)); + + return 0; +} + +static int pd_pmd_entry_kernel(pmd_t *pmd, unsigned long addr, + unsigned long next, + struct mm_walk *walk) +{ + pmd_t pmdval = READ_ONCE(*pmd); + + cond_resched(); + if (!pmd_leaf(pmdval)) + return 0; + + pd_print_entry_kernel(walk->private, pmd_pfn(pmdval), addr, + 
PMD_SIZE, pmd_val(pmdval)); + + return 0; +} + +static int pd_pte_entry_kernel(pte_t *pte, unsigned long addr, + unsigned long next, + struct mm_walk *walk) +{ + pte_t pteval = READ_ONCE(*pte); + + pd_print_entry_kernel(walk->private, pte_pfn(pteval), addr, + PAGE_SIZE, pte_val(pteval)); + + return 0; +} + +static const struct mm_walk_ops pd_kernel_ops = { + .pud_entry = pd_pud_entry_kernel, + .pmd_entry = pd_pmd_entry_kernel, + .pte_entry = pd_pte_entry_kernel, + .walk_lock = PGWALK_RDLOCK +}; + +/* + * Walk kernel page table, and print all mappings to this pfn, return 1 if + * pfn is mapped in direct map, return 0 if not mapped in direct map, and + * return -1 if operation canceled by user. + */ +static int page_detective_kernel_map_info(unsigned long pfn, + unsigned long direct_map_addr) +{ + struct pd_private_kernel pr = {0}; + unsigned long s, e; + + pr.direct_map_addr = direct_map_addr; + pr.pfn = pfn; + + for (s = PAGE_OFFSET; s != ~0ul; ) { + e = s + PD_WALK_MAX_RANGE; + if (e < s) + e = ~0ul; + + if (walk_page_range_kernel(s, e, &pd_kernel_ops, &pr)) { + pr_info("Received a cancel signal from user, while scanning kernel mappings\n"); + return -1; + } + cond_resched(); + s = e; + } + + if (!pr.vmalloc_maps) { + pr_info("The page is not mapped into kernel vmalloc area\n"); + } else if (pr.vmalloc_maps > 1) { + pr_info("The page is mapped into vmalloc area: %ld times\n", + pr.vmalloc_maps); + } + + if (!pr.direct_map) + pr_info("The page is not mapped into kernel direct map\n"); + + pr_info("The page mapped into kernel page table: %ld times\n", pr.maps); + + return pr.direct_map ? 
1 : 0; +} + +/* Print kernel information about the pfn, return -1 if canceled by user */ +static int page_detective_kernel(unsigned long pfn) +{ + unsigned long *mem = __va((pfn) << PAGE_SHIFT); + unsigned long sum = 0; + int direct_map; + u64 s, e; + int i; + + s = sched_clock(); + direct_map = page_detective_kernel_map_info(pfn, (unsigned long)mem); + e = sched_clock() - s; + pr_info("Scanned kernel page table in [%llu.%09llus]\n", + e / NSEC_PER_SEC, e % NSEC_PER_SEC); + + /* Canceled by user or no direct map */ + if (direct_map < 1) + return direct_map; + + for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) + sum |= mem[i]; + + if (sum == 0) + pr_info("The page contains only zeroes\n"); + else + pr_info("The page contains some data\n"); + + return 0; +} + +static char __vma_name[PATH_MAX]; +static const char *vma_name(struct vm_area_struct *vma) +{ + const struct path *path; + const char *name_fmt, *name; + + get_vma_name(vma, &path, &name, &name_fmt); + + if (path) { + name = d_path(path, __vma_name, PATH_MAX); + if (IS_ERR(name)) { + strscpy(__vma_name, "[???]", PATH_MAX); + goto out; + } + } else if (name || name_fmt) { + snprintf(__vma_name, PATH_MAX, name_fmt ?: "%s", name); + } else { + if (vma_is_anonymous(vma)) + strscpy(__vma_name, "[anon]", PATH_MAX); + else if (vma_is_fsdax(vma)) + strscpy(__vma_name, "[fsdax]", PATH_MAX); + else if (vma_is_dax(vma)) + strscpy(__vma_name, "[dax]", PATH_MAX); + else + strscpy(__vma_name, "[other]", PATH_MAX); + } + +out: + return __vma_name; +} + +static void pd_show_vma_info(struct mm_struct *mm, unsigned long addr) +{ + struct vm_area_struct *vma = find_vma(mm, addr); + + if (!vma) { + pr_info("vma not found for this mapping\n"); + return; + } + + pr_info("vma[%px] start[%lx] end[%lx] flags[%016lx] name: %s\n", + vma, vma->vm_start, vma->vm_end, vma->vm_flags, vma_name(vma)); +} + +static void pd_get_comm_pid(struct mm_struct *mm, char *comm, int *pid) +{ + struct task_struct *task; + + rcu_read_lock(); + task 
= rcu_dereference(mm->owner); + if (task) { + strscpy(comm, task->comm, TASK_COMM_LEN); + *pid = task->pid; + } else { + strscpy(comm, "__ exited __", TASK_COMM_LEN); + *pid = -1; + } + rcu_read_unlock(); +} + +struct pd_private_user { + struct mm_struct *mm; + unsigned long pfn; + long maps; +}; + +static void pd_print_entry_user(struct pd_private_user *pr, + unsigned long pfn_current, + unsigned long addr, + unsigned long entry_page_size, + unsigned long entry, + bool is_hugetlb) +{ + unsigned long pfn = pr->pfn; + + if (pfn_current <= pfn && + pfn < (pfn_current + (entry_page_size >> PAGE_SHIFT))) { + char comm[TASK_COMM_LEN]; + int pid; + + pd_get_comm_pid(pr->mm, comm, &pid); + addr += ((pfn << PAGE_SHIFT) & (entry_page_size - 1)); + pr_info("%smapped by PID[%d] cmd[%s] mm[%px] pgd[%px] at addr[%lx] %s[%lx]\n", + is_hugetlb ? "hugetlb " : "", + pid, comm, pr->mm, pr->mm->pgd, addr, + ENTRY_NAME(entry_page_size), entry); + pd_show_vma_info(pr->mm, addr); + pr->maps++; + } +} + +static int pd_pud_entry_user(pud_t *pud, unsigned long addr, unsigned long next, + struct mm_walk *walk) +{ + pud_t pudval = READ_ONCE(*pud); + + cond_resched(); + if (!pud_user_accessible_page(pudval)) + return 0; + + pd_print_entry_user(walk->private, pud_pfn(pudval), addr, PUD_SIZE, + pud_val(pudval), false); + walk->action = ACTION_CONTINUE; + + return 0; +} + +static int pd_pmd_entry_user(pmd_t *pmd, unsigned long addr, unsigned long next, + struct mm_walk *walk) +{ + pmd_t pmdval = READ_ONCE(*pmd); + + cond_resched(); + if (!pmd_user_accessible_page(pmdval)) + return 0; + + pd_print_entry_user(walk->private, pmd_pfn(pmdval), addr, PMD_SIZE, + pmd_val(pmdval), false); + walk->action = ACTION_CONTINUE; + + return 0; +} + +static int pd_pte_entry_user(pte_t *pte, unsigned long addr, unsigned long next, + struct mm_walk *walk) +{ + pte_t pteval = READ_ONCE(*pte); + + if (!pte_user_accessible_page(pteval)) + return 0; + + pd_print_entry_user(walk->private, pte_pfn(pteval), addr, 
PAGE_SIZE, + pte_val(pteval), false); + walk->action = ACTION_CONTINUE; + + return 0; +} + +static int pd_hugetlb_entry(pte_t *pte, unsigned long hmask, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + pte_t pteval = READ_ONCE(*pte); + + cond_resched(); + pd_print_entry_user(walk->private, pte_pfn(pteval), addr, next - addr, + pte_val(pteval), true); + walk->action = ACTION_CONTINUE; + + return 0; +} + +static const struct mm_walk_ops pd_user_ops = { + .pud_entry = pd_pud_entry_user, + .pmd_entry = pd_pmd_entry_user, + .pte_entry = pd_pte_entry_user, + .hugetlb_entry = pd_hugetlb_entry, + .walk_lock = PGWALK_RDLOCK +}; + +/* + * print information about mappings of pfn by mm, return -1 if canceled + * return number of mappings found. + */ +static long page_detective_user_mm_info(struct mm_struct *mm, unsigned long pfn) +{ + struct pd_private_user pr = {0}; + unsigned long s, e; + + pr.pfn = pfn; + pr.mm = mm; + + for (s = 0; s != TASK_SIZE; ) { + e = s + PD_WALK_MAX_RANGE; + if (e > TASK_SIZE || e < s) + e = TASK_SIZE; + + if (mmap_read_lock_killable(mm)) { + pr_info("Received a cancel signal from user, while scanning user mappings\n"); + return -1; + } + walk_page_range(mm, s, e, &pd_user_ops, &pr); + mmap_read_unlock(mm); + cond_resched(); + s = e; + } + return pr.maps; +} + +/* + * Report where/if PFN is mapped in user page tables, return -1 if canceled + * by user. 
+ */ +static int page_detective_usermaps(unsigned long pfn) +{ + struct task_struct *task, *t; + struct mm_struct **mm_table, *mm; + unsigned long proc_nr, mm_nr, i; + bool canceled_by_user; + long maps, ret; + u64 s, e; + + s = sched_clock(); + /* Get the number of processes currently running */ + proc_nr = 0; + rcu_read_lock(); + for_each_process(task) + proc_nr++; + rcu_read_unlock(); + + /* Allocate mm_table to fit mm from every running process */ + mm_table = kvmalloc_array(proc_nr, sizeof(struct mm_struct *), + GFP_KERNEL); + + if (!mm_table) { + pr_info("No memory to traverse though user mappings\n"); + return 0; + } + + /* get mm from every processes and copy its pointer into mm_table */ + mm_nr = 0; + rcu_read_lock(); + for_each_process(task) { + if (mm_nr == proc_nr) { + pr_info("Number of processes increased while scanning, some will be skipped\n"); + break; + } + + t = find_lock_task_mm(task); + if (!t) + continue; + + mm = task->mm; + if (!mm || !mmget_not_zero(mm)) { + task_unlock(t); + continue; + } + task_unlock(t); + + mm_table[mm_nr++] = mm; + } + rcu_read_unlock(); + + /* Walk through every user page table,release mm reference afterwards */ + canceled_by_user = false; + maps = 0; + for (i = 0; i < mm_nr; i++) { + if (!canceled_by_user) { + ret = page_detective_user_mm_info(mm_table[i], pfn); + if (ret == -1) + canceled_by_user = true; + else + maps += ret; + } + mmput(mm_table[i]); + cond_resched(); + } + + kvfree(mm_table); + + e = sched_clock() - s; + pr_info("Scanned [%ld] user page tables in [%llu.%09llus]\n", + mm_nr, e / NSEC_PER_SEC, e % NSEC_PER_SEC); + pr_info("The page mapped into user page tables: %ld times\n", maps); + + return canceled_by_user ? 
-1 : 0; +} + +static void page_detective_iommu(unsigned long pfn) +{ +} + +static void page_detective_tdp(unsigned long pfn) +{ +} + +static void page_detective(unsigned long pfn) +{ + if (!pfn_valid(pfn)) { + pr_info("pfn[%lx] is invalid\n", pfn); + return; + } + + if (pfn == 0) { + pr_info("Skipping look-up for pfn[0] mapped many times into kernel page table\n"); + return; + } + + /* Report metadata information */ + page_detective_metadata(pfn); + + /* + * Report information about kernel mappings, and basic content + * information: i.e. all zero or not. + */ + if (page_detective_kernel(pfn) < 0) + return; + + /* Report where/if PFN is mapped in user page tables */ + if (page_detective_usermaps(pfn) < 0) + return; + + /* Report where/if PFN is mapped in IOMMU page tables */ + page_detective_iommu(pfn); + + /* Report where/if PFN is mapped in 2 dimensional paging */ + page_detective_tdp(pfn); +} + +static u64 pid_virt_to_phys(unsigned int pid, unsigned long virt_addr) +{ + unsigned long phys_addr = -1; + struct task_struct *task; + struct mm_struct *mm; + pgd_t *pgd, pgdval; + p4d_t *p4d, p4dval; + pud_t *pud, pudval; + pmd_t *pmd, pmdval; + pte_t *pte, pteval; + + if (virt_addr >= TASK_SIZE) { + pr_err("%s: virt_addr[%lx] is above TASK_SIZE[%lx]\n", + __func__, virt_addr, TASK_SIZE); + return -1; + } + + /* Find the task_struct using the PID */ + task = find_get_task_by_vpid(pid); + if (!task) { + pr_err("%s: Task not found for PID %d\n", __func__, pid); + return -1; + } + + mm = get_task_mm(task); + put_task_struct(task); + if (!mm) { + pr_err("%s: PID %d, can't get mm reference\n", __func__, pid); + return -1; + } + + if (mmap_read_lock_killable(mm)) { + pr_info("Received a cancel signal from user, while convirting virt to phys\n"); + mmput(mm); + return -1; + } + + pgd = pgd_offset(mm, virt_addr); + pgdval = READ_ONCE(*pgd); + if (!pgd_present(pgdval) || unlikely(pgd_bad(pgdval))) { + pr_err("%s: pgd[%llx] present[%d] bad[%d]\n", __func__, + 
(u64)pgd_val(pgdval), pgd_present(pgdval), + pgd_bad(pgdval)); + goto putmm_exit; + } + + p4d = p4d_offset(pgd, virt_addr); + p4dval = READ_ONCE(*p4d); + if (!p4d_present(p4dval) || unlikely(p4d_bad(p4dval))) { + pr_err("%s: p4d[%llx] present[%d] bad[%d]\n", __func__, + (u64)p4d_val(p4dval), p4d_present(p4dval), + p4d_bad(p4dval)); + goto putmm_exit; + } + + pud = pud_offset(p4d, virt_addr); + pudval = READ_ONCE(*pud); + if (!pud_present(pudval)) { + pr_err("%s: pud[%llx] present[%d]\n", __func__, + (u64)pud_val(pudval), pud_present(pudval)); + goto putmm_exit; + } + + if (pud_leaf(pudval)) { + phys_addr = (pud_pfn(pudval) << PAGE_SHIFT) + | (virt_addr & ~PUD_MASK); + goto putmm_exit; + } + + pmd = pmd_offset(pud, virt_addr); + pmdval = READ_ONCE(*pmd); + if (!pmd_present(pmdval)) { + pr_err("%s: pmd[%llx] present[%d]\n", __func__, + (u64)pmd_val(pmdval), pmd_present(pmdval)); + goto putmm_exit; + } + + if (pmd_leaf(pmdval)) { + phys_addr = (pmd_pfn(pmdval) << PAGE_SHIFT) + | (virt_addr & ~PMD_MASK); + goto putmm_exit; + } + + pte = pte_offset_kernel(pmd, virt_addr); + pteval = READ_ONCE(*pte); + if (!pte_present(pteval)) { + pr_err("%s: pte[%llx] present[%d]\n", __func__, + (u64)pte_val(pteval), pte_present(pteval)); + goto putmm_exit; + } + + phys_addr = pte_pfn(*pte) << PAGE_SHIFT; + +putmm_exit: + mmap_read_unlock(mm); + mmput(mm); + return phys_addr; +} + +static ssize_t page_detective_virt_write(struct file *file, + const char __user *data, + size_t count, loff_t *ppos) +{ + char *input_str, *pid_str, *virt_str; + unsigned int pid, err, i; + unsigned long virt_addr; + u64 phys_addr; + + /* If canceled by user simply return without printing anything */ + err = mutex_lock_killable(&page_detective_mutex); + if (err) + return count; + + input_str = kzalloc(count + 1, GFP_KERNEL); + if (!input_str) { + pr_err("%s: Unable to allocate input_str buffer\n", + __func__); + mutex_unlock(&page_detective_mutex); + return -EAGAIN; + } + + if (copy_from_user(input_str, 
data, count)) { + kfree(input_str); + pr_err("%s: Unable to copy user input into virt file\n", + __func__); + mutex_unlock(&page_detective_mutex); + return -EFAULT; + } + + virt_str = NULL; + pid_str = input_str; + for (i = 0; i < count - 1; i++) { + if (isspace(input_str[i])) { + input_str[i] = '\0'; + virt_str = &input_str[i + 1]; + break; + } + } + + if (!virt_str) { + kfree(input_str); + pr_err("%s: Invalid virt file input, should be: '<pid> <virtual address>'\n", + __func__); + mutex_unlock(&page_detective_mutex); + return -EINVAL; + } + + err = kstrtouint(pid_str, 0, &pid); + if (err) { + kfree(input_str); + pr_err("%s: Failed to parse pid\n", __func__); + mutex_unlock(&page_detective_mutex); + return err; + } + + err = kstrtoul(virt_str, 0, &virt_addr); + if (err) { + kfree(input_str); + pr_err("%s: Failed to parse virtual address\n", __func__); + mutex_unlock(&page_detective_mutex); + return err; + } + + kfree(input_str); + + phys_addr = pid_virt_to_phys(pid, virt_addr); + if (phys_addr == -1) { + pr_err("%s: Can't translate virtual to physical address\n", + __func__); + mutex_unlock(&page_detective_mutex); + return -EINVAL; + } + + pr_info("Investigating pid[%u] virtual[%lx] physical[%llx] pfn[%lx]\n", + pid, virt_addr, phys_addr, PHYS_PFN(phys_addr)); + page_detective(PHYS_PFN(phys_addr)); + pr_info("Finished investigation of virtual[%lx]\n", virt_addr); + mutex_unlock(&page_detective_mutex); + + return count; +} + +static ssize_t page_detective_phys_write(struct file *file, + const char __user *data, + size_t count, loff_t *ppos) +{ + u64 phys_addr; + int err; + + /* If canceled by user simply return without printing anything */ + err = mutex_lock_killable(&page_detective_mutex); + if (err) + return count; + + err = kstrtou64_from_user(data, count, 0, &phys_addr); + + if (err) { + pr_err("%s: Failed to parse physical address\n", __func__); + mutex_unlock(&page_detective_mutex); + return err; + } + + pr_info("Investigating physical[%llx] pfn[%lx]\n", 
phys_addr, + PHYS_PFN(phys_addr)); + page_detective(PHYS_PFN(phys_addr)); + pr_info("Finished investigation of physical[%llx]\n", phys_addr); + mutex_unlock(&page_detective_mutex); + + return count; +} + +static int page_detective_open(struct inode *inode, struct file *file) +{ + /* Deny access if not CAP_SYS_ADMIN */ + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + return simple_open(inode, file); +} + +static const struct file_operations page_detective_virt_fops = { + .owner = THIS_MODULE, + .open = page_detective_open, + .write = page_detective_virt_write, +}; + +static const struct file_operations page_detective_phys_fops = { + .owner = THIS_MODULE, + .open = page_detective_open, + .write = page_detective_phys_write, +}; + +static int __init page_detective_init(void) +{ + page_detective_debugfs_dir = debugfs_create_dir("page_detective", NULL); + + debugfs_create_file("virt", 0200, page_detective_debugfs_dir, NULL, + &page_detective_virt_fops); + debugfs_create_file("phys", 0200, page_detective_debugfs_dir, NULL, + &page_detective_phys_fops); + + return 0; +} +module_init(page_detective_init); + +static void page_detective_exit(void) +{ + debugfs_remove_recursive(page_detective_debugfs_dir); +} +module_exit(page_detective_exit); + +MODULE_DESCRIPTION("Page Detective"); +MODULE_VERSION("1.0"); +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Pasha Tatashin pasha.tatashin@soleen.com");
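For comparison, the virtual-to-physical translation that pid_virt_to_phys() performs with an in-kernel page-table walk can be approximated from userspace via /proc/&lt;pid&gt;/pagemap. Below is a minimal sketch of the entry decoding only (bit layout per Documentation/admin-guide/mm/pagemap.rst: bits 0-54 hold the PFN, bit 63 is the present bit). The helper names are hypothetical and 4 KiB base pages are assumed; this is not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define PM_PAGE_SIZE 4096ULL            /* assumed base page size */
#define PM_PFN_MASK  ((1ULL << 55) - 1) /* bits 0-54: PFN */
#define PM_PRESENT   (1ULL << 63)       /* bit 63: page present */

/* Byte offset of the 8-byte pagemap entry describing 'virt':
 * one entry per page. */
static uint64_t pagemap_offset(uint64_t virt)
{
	return (virt / PM_PAGE_SIZE) * 8;
}

/* Physical address for 'virt' given its pagemap entry, or 0 if the
 * page is not present. (The PFN field reads as zero without
 * CAP_SYS_ADMIN on modern kernels.) */
static uint64_t pagemap_phys(uint64_t entry, uint64_t virt)
{
	if (!(entry & PM_PRESENT))
		return 0;
	return (entry & PM_PFN_MASK) * PM_PAGE_SIZE + virt % PM_PAGE_SIZE;
}
```

With the pfn and page offset from the example output (pfn 0x105baf, offset 0xc50), pagemap_phys() reproduces the physical address 105bafc50 reported in the log.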
Pasha Tatashin <pasha.tatashin@soleen.com> writes:
Page Detective is a kernel debugging tool that provides detailed information about the usage and mapping of physical memory pages.
It operates through the Linux debugfs interface, providing access to both virtual and physical address inquiries. The output, presented via kernel log messages (accessible with dmesg), will help administrators and developers understand how specific pages are utilized by the system.
This tool can be used to investigate various memory-related issues, such as checksum failures during live migration, filesystem journal failures, general segfaults, or other corruptions.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Documentation/misc-devices/index.rst | 1 + Documentation/misc-devices/page_detective.rst | 78 ++
This seems like a strange place to bury this document - who will look for it here? Even if it is truly implemented as a misc device (I didn't look), the documentation would belong either in the admin guide or with the MM docs, it seems to me...?
Thanks,
jon
On Sat, Nov 16, 2024 at 5:20 PM Jonathan Corbet corbet@lwn.net wrote:
Pasha Tatashin <pasha.tatashin@soleen.com> writes:
Page Detective is a kernel debugging tool that provides detailed information about the usage and mapping of physical memory pages.
It operates through the Linux debugfs interface, providing access to both virtual and physical address inquiries. The output, presented via kernel log messages (accessible with dmesg), will help administrators and developers understand how specific pages are utilized by the system.
This tool can be used to investigate various memory-related issues, such as checksum failures during live migration, filesystem journal failures, general segfaults, or other corruptions.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Documentation/misc-devices/index.rst | 1 + Documentation/misc-devices/page_detective.rst | 78 ++
This seems like a strange place to bury this document - who will look for it here? Even if it is truly implemented as a misc device (I didn't look), the documentation would belong either in the admin guide or with the MM docs, it seems to me...?
I will put it under MM docs in the next version, as I will also convert Page Detective to be part of core mm.
Thank you, Pasha
On Sat, Nov 16, 2024 at 05:59:20PM +0000, Pasha Tatashin wrote:
Page Detective is a kernel debugging tool that provides detailed information about the usage and mapping of physical memory pages.
It operates through the Linux debugfs interface, providing access to both virtual and physical address inquiries. The output, presented via kernel log messages (accessible with dmesg), will help administrators and developers understand how specific pages are utilized by the system.
This tool can be used to investigate various memory-related issues, such as checksum failures during live migration, filesystem journal failures, general segfaults, or other corruptions.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Documentation/misc-devices/index.rst | 1 + Documentation/misc-devices/page_detective.rst | 78 ++ MAINTAINERS | 7 + drivers/misc/Kconfig | 11 + drivers/misc/Makefile | 1 + drivers/misc/page_detective.c | 808 ++++++++++++++++++ 6 files changed, 906 insertions(+) create mode 100644 Documentation/misc-devices/page_detective.rst create mode 100644 drivers/misc/page_detective.c
diff --git a/Documentation/misc-devices/index.rst b/Documentation/misc-devices/index.rst index 8c5b226d8313..d64723f20804 100644 --- a/Documentation/misc-devices/index.rst +++ b/Documentation/misc-devices/index.rst @@ -23,6 +23,7 @@ fit into other categories. max6875 mrvl_cn10k_dpi oxsemi-tornado
- page_detective pci-endpoint-test spear-pcie-gadget tps6594-pfsm
diff --git a/Documentation/misc-devices/page_detective.rst b/Documentation/misc-devices/page_detective.rst new file mode 100644 index 000000000000..06f666d5b3a9 --- /dev/null +++ b/Documentation/misc-devices/page_detective.rst
This is _explicitly_ mm functionality. I find it odd that you are trying so hard to act as if it isn't.
@@ -0,0 +1,78 @@ +.. SPDX-License-Identifier: GPL-2.0+
==============
Page Detective
==============
Author:
Pasha Tatashin <pasha.tatashin@soleen.com>
Overview
--------
Page Detective is a kernel debugging tool designed to provide in-depth information about the usage and mapping of physical memory pages within the Linux kernel. By leveraging the debugfs interface, it enables administrators and developers to investigate the status and allocation of memory pages.

This tool is valuable for diagnosing memory-related issues such as checksum errors during live migration, filesystem journal failures, segmentation faults, and other forms of corruption.
Functionality
-------------
Page Detective operates by accepting input through its debugfs interface files, located in the ``/sys/kernel/debug/page_detective`` directory:
- ``virt``: Takes input in the format ``<pid> <virtual address>``. It resolves the provided virtual address within the specified process's address space and outputs comprehensive information about the corresponding physical page's mapping and usage.
- ``phys``: Takes a raw physical address as input. It directly investigates the usage of the specified physical page and outputs relevant information.

The output generated by Page Detective is delivered through kernel log messages (accessible using dmesg).
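Because the results arrive as free-form log lines rather than structured data, a consumer has to parse dmesg output. A purely illustrative helper (the prefix string matches the pr_fmt used by the driver, but this parser is not part of the patch):

```c
#include <stdio.h>
#include <string.h>

/* Pull the count out of a summary line such as
 * "Page Detective: The page mapped into user page tables: 1 times".
 * Returns -1 when the line is not a user-mapping summary. */
static long pd_user_map_count(const char *line)
{
	static const char prefix[] =
		"Page Detective: The page mapped into user page tables: ";
	long n;

	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return -1;
	if (sscanf(line + sizeof(prefix) - 1, "%ld", &n) != 1)
		return -1;
	return n;
}
```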
Everything is entirely racy, and anything you output might only be partially populated at any given time due to racing page faults. You definitely need to mention this.
Usage
-----
- Enable Page Detective: Ensure the CONFIG_PAGE_DETECTIVE kernel configuration option is enabled.
- Access debugfs: Mount the debugfs filesystem (if not already mounted): ``mount -t debugfs nodev /sys/kernel/debug``
- Interact with Page Detective through one of two interfaces:

  ``echo "<pid> <virtual address>" > /sys/kernel/debug/page_detective/virt``

  ``echo "<physical address>" > /sys/kernel/debug/page_detective/phys``

- The Page Detective file interface is accessible only to users with CAP_SYS_ADMIN.
Example
-------
```
# echo 0x1078fb000 > /sys/kernel/debug/page_detective/phys
Page Detective: Investigating physical[105bafc50] pfn[105baf]
Page Detective: metadata for Small Page pfn[105baf] folio[ffffea000416ebc0] order [0]
Page Detective: page: refcount:1 mapcount:1 mapping:0000000000000000 index:0x7fffffffb pfn:0x105baf
Page Detective: memcg:ffff888106189000
Page Detective: anon flags: 0x200000000020828(uptodate|lru|owner_2|swapbacked|node=0|zone=2)
Page Detective: raw: 0200000000020828 ffffea000416ec08 ffffea000416e7c8 ffff888106382bc9
Page Detective: raw: 00000007fffffffb 0000000000000000 0000000100000000 ffff888106189000
Page Detective: memcg: [/system.slice/system-serial\x2dgetty.slice/serial-getty@ttyS0.service ] [/system.slice/system-serial\x2dgetty.slice ] [/system.slice ] [/ ]
Page Detective: The page is direct mapped addr[ffff888105baf000] pmd entry[8000000105a001e3]
Page Detective: The page is not mapped into kernel vmalloc area
Page Detective: The page mapped into kernel page table: 1 times
Page Detective: Scanned kernel page table in [0.003353799s]
Page Detective: The page contains some data
Page Detective: mapped by PID[377] cmd[page_detective_] mm[ffff888101778000] pgd[ffff888100894000] at addr[7ffea333b000] pte[8000000105baf067]
Page Detective: vma[ffff888101701aa0] start[7ffea331e000] end[7ffea333f000] flags[0000000000100173] name: [stack]
Page Detective: Scanned [16] user page tables in [0.000297744s]
Page Detective: The page mapped into user page tables: 1 times
Page Detective: Finished investigation of physical[105bafc50]
```

diff --git a/MAINTAINERS b/MAINTAINERS
index 21fdaa19229a..654d4650670d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17450,6 +17450,13 @@ F: mm/page-writeback.c
 F: mm/readahead.c
 F: mm/truncate.c
+PAGE DETECTIVE +M: Pasha Tatashin pasha.tatashin@soleen.com +L: linux-kernel@vger.kernel.org +S: Maintained +F: Documentation/misc-devices/page_detective.rst +F: drivers/misc/page_detective.c
PAGE POOL M: Jesper Dangaard Brouer hawk@kernel.org M: Ilias Apalodimas ilias.apalodimas@linaro.org diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig index 3fe7e2a9bd29..2965c3c7cdef 100644 --- a/drivers/misc/Kconfig +++ b/drivers/misc/Kconfig @@ -492,6 +492,17 @@ config MISC_RTSX tristate default MISC_RTSX_PCI || MISC_RTSX_USB
```
config PAGE_DETECTIVE
	depends on PAGE_TABLE_CHECK
	depends on MEMCG
	bool "Page Detective"
	help
	  A debugging tool designed to provide detailed information about the
	  usage and mapping of physical memory pages. This tool operates through
	  the Linux debugfs interface, providing access to both virtual and
	  physical address inquiries. The output is presented via kernel log
	  messages.
```
config HISI_HIKEY_USB tristate "USB GPIO Hub on HiSilicon Hikey 960/970 Platform" depends on (OF && GPIOLIB) || COMPILE_TEST diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile index a9f94525e181..411f17fcde6b 100644 --- a/drivers/misc/Makefile +++ b/drivers/misc/Makefile @@ -56,6 +56,7 @@ obj-$(CONFIG_PCI_ENDPOINT_TEST) += pci_endpoint_test.o obj-$(CONFIG_OCXL) += ocxl/ obj-$(CONFIG_BCM_VK) += bcm-vk/ obj-y += cardreader/ +obj-$(CONFIG_PAGE_DETECTIVE) += page_detective.o obj-$(CONFIG_PVPANIC) += pvpanic/ obj-$(CONFIG_UACCE) += uacce/ obj-$(CONFIG_XILINX_SDFEC) += xilinx_sdfec.o diff --git a/drivers/misc/page_detective.c b/drivers/misc/page_detective.c new file mode 100644 index 000000000000..300064d83dd3 --- /dev/null +++ b/drivers/misc/page_detective.c @@ -0,0 +1,808 @@ +// SPDX-License-Identifier: GPL-2.0+
```c
/*
 * Copyright (c) 2024, Google LLC.
 * Pasha Tatashin <pasha.tatashin@soleen.com>
 */
#include <linux/ctype.h>
#include <linux/debugfs.h>
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/mm_inline.h>
#include <linux/slab.h>
#include <linux/sched/mm.h>
#include <linux/hugetlb.h>
#include <linux/pagewalk.h>
#include <linux/sched/clock.h>
#include <linux/oom.h>

#undef pr_fmt
#define pr_fmt(fmt) "Page Detective: " fmt

/*
 * Walk 4T of VA space at a time, in order to periodically release the mmap
 * lock
 */
#define PD_WALK_MAX_RANGE BIT(42)
```
Seems rather arbitrary?
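For reference, the loop this constant bounds steps through the address space in fixed chunks and clamps the endpoint to avoid unsigned wrap-around. That pattern, extracted into a standalone userspace sketch (hypothetical helper, not part of the patch):

```c
/* Visit [start, ~0ul] in 'chunk'-sized pieces, clamping the endpoint on
 * unsigned overflow; the same shape as the PD_WALK_MAX_RANGE loop below.
 * Returns how many chunks were visited. */
static unsigned long walk_in_chunks(unsigned long start, unsigned long chunk)
{
	unsigned long s, e, n = 0;

	for (s = start; s != ~0ul; ) {
		e = s + chunk;
		if (e < s)	/* overflowed: clamp to the end marker */
			e = ~0ul;
		/* ...process the range [s, e) and drop locks here... */
		n++;
		s = e;
	}
	return n;
}
```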
```c
/* Synchronizes writes to virt and phys files */
static DEFINE_MUTEX(page_detective_mutex);
static struct dentry *page_detective_debugfs_dir;

static void page_detective_memcg(struct folio *folio)
{
	struct mem_cgroup *memcg;

	if (!folio_try_get(folio))
		return;

	memcg = get_mem_cgroup_from_folio(folio);
	if (memcg) {
		pr_info("memcg:");
		do {
			pr_cont(" [");
			pr_cont_cgroup_path(memcg->css.cgroup);
			pr_cont(" ]");
		} while ((memcg = parent_mem_cgroup(memcg)));
		mem_cgroup_put(memcg);
		pr_cont("\n");
	}
	folio_put(folio);
}
```
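The do/while above prints the folio's cgroup followed by every ancestor up to the root, so the number of bracketed entries is fixed by the path depth. A throwaway sketch of that relationship (hypothetical helper, not part of the patch):

```c
/* Number of entries the memcg chain print emits for a cgroup path:
 * the cgroup itself, each ancestor, and the root "/". */
static int memcg_chain_len(const char *path)
{
	int n = 1;	/* the root is always printed last */
	const char *p;

	for (p = path; *p; p++) {
		/* each '/' that introduces a component adds one entry */
		if (*p == '/' && p[1] != '\0')
			n++;
	}
	return n;
}
```

For the serial-getty service path in the example output this yields 4, matching the four bracketed entries on the "memcg:" line.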
```c
static void page_detective_metadata(unsigned long pfn)
{
	struct folio *folio = pfn_folio(pfn);
	bool hugetlb, trans;
	unsigned int order;

	if (!folio) {
		pr_info("metadata for pfn[%lx] not found\n", pfn);
		return;
	}

	trans = folio_test_large(folio) && folio_test_large_rmappable(folio);
	hugetlb = folio_test_hugetlb(folio);
	order = folio_order(folio);

	pr_info("metadata for %s pfn[%lx] folio[%px] order [%u]\n",
		(trans) ? "Transparent Huge Page" : (hugetlb) ? "HugeTLB" :
		"Small Page", pfn, folio, order);
	dump_page_lvl(KERN_INFO pr_fmt(""), &folio->page);
	page_detective_memcg(folio);
}
```
```c
struct pd_private_kernel {
	unsigned long pfn;
	unsigned long direct_map_addr;
	bool direct_map;
	unsigned long vmalloc_maps;
	long maps;
};

#define ENTRY_NAME(entry_page_size) ({					\
	unsigned long __entry_page_size = (entry_page_size);		\
									\
	(__entry_page_size == PUD_SIZE) ? "pud" :			\
	(__entry_page_size == PMD_SIZE) ? "pmd" : "pte";		\
})
```
```c
static void pd_print_entry_kernel(struct pd_private_kernel *pr,
				  unsigned long pfn_current,
				  unsigned long addr,
				  unsigned long entry_page_size,
				  unsigned long entry)
{
	unsigned long pfn = pr->pfn;

	if (pfn_current <= pfn &&
	    pfn < (pfn_current + (entry_page_size >> PAGE_SHIFT))) {
		bool v, d;

		addr += ((pfn << PAGE_SHIFT) & (entry_page_size - 1));
		v = (addr >= VMALLOC_START && addr < VMALLOC_END);
		d = (pr->direct_map_addr == addr);

		if (v) {
			pr_info("The page is mapped in vmalloc addr[%lx] %s entry[%lx]\n",
				addr, ENTRY_NAME(entry_page_size), entry);
			pr->vmalloc_maps++;
		} else if (d) {
			pr_info("The page is direct mapped addr[%lx] %s entry[%lx]\n",
				addr, ENTRY_NAME(entry_page_size), entry);
			pr->direct_map = true;
		} else {
			pr_info("The page is mapped into kernel addr[%lx] %s entry[%lx]\n",
				addr, ENTRY_NAME(entry_page_size), entry);
		}

		pr->maps++;
	}
}
```
+static int pd_pud_entry_kernel(pud_t *pud, unsigned long addr,
unsigned long next,
struct mm_walk *walk)
+{
- pud_t pudval = READ_ONCE(*pud);
- cond_resched();
- if (!pud_leaf(pudval))
return 0;
- pd_print_entry_kernel(walk->private, pud_pfn(pudval), addr,
PUD_SIZE, pud_val(pudval));
- return 0;
+}
+static int pd_pmd_entry_kernel(pmd_t *pmd, unsigned long addr,
unsigned long next,
struct mm_walk *walk)
+{
- pmd_t pmdval = READ_ONCE(*pmd);
- cond_resched();
- if (!pmd_leaf(pmdval))
return 0;
- pd_print_entry_kernel(walk->private, pmd_pfn(pmdval), addr,
PMD_SIZE, pmd_val(pmdval));
- return 0;
+}
+static int pd_pte_entry_kernel(pte_t *pte, unsigned long addr,
unsigned long next,
struct mm_walk *walk)
+{
- pte_t pteval = READ_ONCE(*pte);
- pd_print_entry_kernel(walk->private, pte_pfn(pteval), addr,
PAGE_SIZE, pte_val(pteval));
- return 0;
+}
+static const struct mm_walk_ops pd_kernel_ops = {
- .pud_entry = pd_pud_entry_kernel,
- .pmd_entry = pd_pmd_entry_kernel,
- .pte_entry = pd_pte_entry_kernel,
- .walk_lock = PGWALK_RDLOCK
+};
+/*
+ * Walk the kernel page table and print all mappings of this pfn. Return 1 if
+ * the pfn is mapped in the direct map, 0 if it is not, and -1 if the
+ * operation was canceled by the user.
+ */
+static int page_detective_kernel_map_info(unsigned long pfn,
unsigned long direct_map_addr)
+{
- struct pd_private_kernel pr = {0};
- unsigned long s, e;
- pr.direct_map_addr = direct_map_addr;
- pr.pfn = pfn;
- for (s = PAGE_OFFSET; s != ~0ul; ) {
e = s + PD_WALK_MAX_RANGE;
if (e < s)
e = ~0ul;
if (walk_page_range_kernel(s, e, &pd_kernel_ops, &pr)) {
pr_info("Received a cancel signal from user while scanning kernel mappings\n");
return -1;
}
cond_resched();
s = e;
- }
- if (!pr.vmalloc_maps) {
pr_info("The page is not mapped into kernel vmalloc area\n");
- } else if (pr.vmalloc_maps > 1) {
pr_info("The page is mapped into vmalloc area: %ld times\n",
pr.vmalloc_maps);
- }
- if (!pr.direct_map)
pr_info("The page is not mapped into kernel direct map\n");
- pr_info("The page is mapped into the kernel page table: %ld times\n", pr.maps);
- return pr.direct_map ? 1 : 0;
+}
+/* Print kernel information about the pfn, return -1 if canceled by user */ +static int page_detective_kernel(unsigned long pfn) +{
- unsigned long *mem = __va((pfn) << PAGE_SHIFT);
- unsigned long sum = 0;
- int direct_map;
- u64 s, e;
- int i;
- s = sched_clock();
- direct_map = page_detective_kernel_map_info(pfn, (unsigned long)mem);
- e = sched_clock() - s;
- pr_info("Scanned kernel page table in [%llu.%09llus]\n",
e / NSEC_PER_SEC, e % NSEC_PER_SEC);
- /* Canceled by user or no direct map */
- if (direct_map < 1)
return direct_map;
- for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++)
sum |= mem[i];
- if (sum == 0)
pr_info("The page contains only zeroes\n");
- else
pr_info("The page contains some data\n");
- return 0;
+}
+static char __vma_name[PATH_MAX];
Having this as an arbitrary static variable at compilation unit scope here is kind of horrible.
+static const char *vma_name(struct vm_area_struct *vma) +{
- const struct path *path;
- const char *name_fmt, *name;
- get_vma_name(vma, &path, &name, &name_fmt);
- if (path) {
name = d_path(path, __vma_name, PATH_MAX);
if (IS_ERR(name)) {
strscpy(__vma_name, "[???]", PATH_MAX);
goto out;
}
- } else if (name || name_fmt) {
snprintf(__vma_name, PATH_MAX, name_fmt ?: "%s", name);
- } else {
if (vma_is_anonymous(vma))
strscpy(__vma_name, "[anon]", PATH_MAX);
else if (vma_is_fsdax(vma))
strscpy(__vma_name, "[fsdax]", PATH_MAX);
else if (vma_is_dax(vma))
strscpy(__vma_name, "[dax]", PATH_MAX);
else
strscpy(__vma_name, "[other]", PATH_MAX);
- }
+out:
- return __vma_name;
+}
Yeah this is sort of weird, you're establishing a new protocol as to what the 'VMA name' means vs. what we see in /proc/$pid/maps, making a Frankenstein out of that logic and your own.
I'd prefer we keep this in _one place_ and consistent.
+static void pd_show_vma_info(struct mm_struct *mm, unsigned long addr) +{
- struct vm_area_struct *vma = find_vma(mm, addr);
- if (!vma) {
pr_info("vma not found for this mapping\n");
return;
- }
- pr_info("vma[%px] start[%lx] end[%lx] flags[%016lx] name: %s\n",
vma, vma->vm_start, vma->vm_end, vma->vm_flags, vma_name(vma));
+}
+static void pd_get_comm_pid(struct mm_struct *mm, char *comm, int *pid) +{
- struct task_struct *task;
- rcu_read_lock();
- task = rcu_dereference(mm->owner);
- if (task) {
strscpy(comm, task->comm, TASK_COMM_LEN);
*pid = task->pid;
- } else {
strscpy(comm, "__ exited __", TASK_COMM_LEN);
*pid = -1;
- }
- rcu_read_unlock();
+}
+struct pd_private_user {
- struct mm_struct *mm;
- unsigned long pfn;
- long maps;
+};
+static void pd_print_entry_user(struct pd_private_user *pr,
unsigned long pfn_current,
unsigned long addr,
unsigned long entry_page_size,
unsigned long entry,
bool is_hugetlb)
+{
- unsigned long pfn = pr->pfn;
- if (pfn_current <= pfn &&
pfn < (pfn_current + (entry_page_size >> PAGE_SHIFT))) {
char comm[TASK_COMM_LEN];
int pid;
pd_get_comm_pid(pr->mm, comm, &pid);
addr += ((pfn << PAGE_SHIFT) & (entry_page_size - 1));
pr_info("%smapped by PID[%d] cmd[%s] mm[%px] pgd[%px] at addr[%lx] %s[%lx]\n",
is_hugetlb ? "hugetlb " : "",
pid, comm, pr->mm, pr->mm->pgd, addr,
ENTRY_NAME(entry_page_size), entry);
pd_show_vma_info(pr->mm, addr);
pr->maps++;
- }
+}
+static int pd_pud_entry_user(pud_t *pud, unsigned long addr, unsigned long next,
struct mm_walk *walk)
+{
- pud_t pudval = READ_ONCE(*pud);
This should be pudp_get().
- cond_resched();
- if (!pud_user_accessible_page(pudval))
return 0;
- pd_print_entry_user(walk->private, pud_pfn(pudval), addr, PUD_SIZE,
pud_val(pudval), false);
- walk->action = ACTION_CONTINUE;
- return 0;
+}
+static int pd_pmd_entry_user(pmd_t *pmd, unsigned long addr, unsigned long next,
struct mm_walk *walk)
+{
- pmd_t pmdval = READ_ONCE(*pmd);
This should be pmdp_get().
- cond_resched();
- if (!pmd_user_accessible_page(pmdval))
return 0;
- pd_print_entry_user(walk->private, pmd_pfn(pmdval), addr, PMD_SIZE,
pmd_val(pmdval), false);
- walk->action = ACTION_CONTINUE;
- return 0;
+}
+static int pd_pte_entry_user(pte_t *pte, unsigned long addr, unsigned long next,
struct mm_walk *walk)
+{
- pte_t pteval = READ_ONCE(*pte);
This should be ptep_get().
- if (!pte_user_accessible_page(pteval))
return 0;
- pd_print_entry_user(walk->private, pte_pfn(pteval), addr, PAGE_SIZE,
pte_val(pteval), false);
- walk->action = ACTION_CONTINUE;
- return 0;
+}
+static int pd_hugetlb_entry(pte_t *pte, unsigned long hmask, unsigned long addr,
unsigned long next, struct mm_walk *walk)
+{
- pte_t pteval = READ_ONCE(*pte);
This should be ptep_get().
- cond_resched();
Do we really want to cond_resched() with mmap lock held on possibly every single process in the system?
- pd_print_entry_user(walk->private, pte_pfn(pteval), addr, next - addr,
pte_val(pteval), true);
- walk->action = ACTION_CONTINUE;
- return 0;
+}
+static const struct mm_walk_ops pd_user_ops = {
- .pud_entry = pd_pud_entry_user,
- .pmd_entry = pd_pmd_entry_user,
- .pte_entry = pd_pte_entry_user,
- .hugetlb_entry = pd_hugetlb_entry,
- .walk_lock = PGWALK_RDLOCK
+};
+/*
+ * Print information about mappings of this pfn by mm. Return -1 if canceled,
+ * otherwise return the number of mappings found.
+ */
+static long page_detective_user_mm_info(struct mm_struct *mm, unsigned long pfn) +{
- struct pd_private_user pr = {0};
- unsigned long s, e;
These variable names are really terrible. I have no idea what 's' or 'e' are supposed to be.
- pr.pfn = pfn;
- pr.mm = mm;
- for (s = 0; s != TASK_SIZE; ) {
e = s + PD_WALK_MAX_RANGE;
if (e > TASK_SIZE || e < s)
e = TASK_SIZE;
if (mmap_read_lock_killable(mm)) {
pr_info("Received a cancel signal from user while scanning user mappings\n");
return -1;
}
walk_page_range(mm, s, e, &pd_user_ops, &pr);
mmap_read_unlock(mm);
cond_resched();
s = e;
- }
- return pr.maps;
+}
+/*
+ * Report where/if the PFN is mapped in user page tables. Return -1 if
+ * canceled by user.
+ */
+static int page_detective_usermaps(unsigned long pfn) +{
- struct task_struct *task, *t;
- struct mm_struct **mm_table, *mm;
- unsigned long proc_nr, mm_nr, i;
- bool canceled_by_user;
- long maps, ret;
- u64 s, e;
- s = sched_clock();
- /* Get the number of processes currently running */
- proc_nr = 0;
- rcu_read_lock();
- for_each_process(task)
proc_nr++;
- rcu_read_unlock();
I was going to ask whether this is racy, but I see you expect races below...
- /* Allocate mm_table to fit mm from every running process */
- mm_table = kvmalloc_array(proc_nr, sizeof(struct mm_struct *),
GFP_KERNEL);
- if (!mm_table) {
pr_info("No memory to traverse through user mappings\n");
return 0;
- }
- /* get mm from every processes and copy its pointer into mm_table */
Typo but also this seems a bit crazy...
- mm_nr = 0;
- rcu_read_lock();
- for_each_process(task) {
Including kernel threads?
if (mm_nr == proc_nr) {
pr_info("Number of processes increased while scanning, some will be skipped\n");
break;
}
Hmmm... is this even useful? Surely you'd want to try again or give up after a while?
t = find_lock_task_mm(task);
if (!t)
continue;
We just give up if this fails?
mm = task->mm;
if (!mm || !mmget_not_zero(mm)) {
task_unlock(t);
continue;
}
task_unlock(t);
mm_table[mm_nr++] = mm;
OK wait, so we get a reference on the mm of _every task_ in the system? What??
This seems pretty unwise...
- }
- rcu_read_unlock();
- /* Walk through every user page table, release mm references afterwards */
- canceled_by_user = false;
- maps = 0;
- for (i = 0; i < mm_nr; i++) {
if (!canceled_by_user) {
ret = page_detective_user_mm_info(mm_table[i], pfn);
if (ret == -1)
canceled_by_user = true;
else
maps += ret;
}
mmput(mm_table[i]);
cond_resched();
- }
- kvfree(mm_table);
- e = sched_clock() - s;
- pr_info("Scanned [%ld] user page tables in [%llu.%09llus]\n",
mm_nr, e / NSEC_PER_SEC, e % NSEC_PER_SEC);
- pr_info("The page is mapped into user page tables: %ld times\n", maps);
- return canceled_by_user ? -1 : 0;
+}
+static void page_detective_iommu(unsigned long pfn) +{ +}
+static void page_detective_tdp(unsigned long pfn) +{ +}
Not sure it's really meaningful to just have empty placeholders like this?
+static void page_detective(unsigned long pfn) +{
- if (!pfn_valid(pfn)) {
pr_info("pfn[%lx] is invalid\n", pfn);
return;
- }
- if (pfn == 0) {
pr_info("Skipping look-up for pfn[0]: it is mapped many times into the kernel page table\n");
return;
- }
- /* Report metadata information */
- page_detective_metadata(pfn);
- /*
* Report information about kernel mappings, and basic content
* information: i.e. all zero or not.
*/
- if (page_detective_kernel(pfn) < 0)
return;
- /* Report where/if PFN is mapped in user page tables */
- if (page_detective_usermaps(pfn) < 0)
return;
- /* Report where/if PFN is mapped in IOMMU page tables */
- page_detective_iommu(pfn);
- /* Report where/if PFN is mapped in 2 dimensional paging */
- page_detective_tdp(pfn);
+}
+static u64 pid_virt_to_phys(unsigned int pid, unsigned long virt_addr) +{
I mean no no no no. NO.
Not another page table walker. Please. We HAVE SO MANY ALREADY. Let alone one outside of mm.
This just feels like going to enormous lengths to put mm logic in a driver, for some reason.
- unsigned long phys_addr = -1;
- struct task_struct *task;
- struct mm_struct *mm;
- pgd_t *pgd, pgdval;
- p4d_t *p4d, p4dval;
- pud_t *pud, pudval;
- pmd_t *pmd, pmdval;
- pte_t *pte, pteval;
- if (virt_addr >= TASK_SIZE) {
pr_err("%s: virt_addr[%lx] is above TASK_SIZE[%lx]\n",
__func__, virt_addr, TASK_SIZE);
return -1;
- }
- /* Find the task_struct using the PID */
- task = find_get_task_by_vpid(pid);
- if (!task) {
pr_err("%s: Task not found for PID %d\n", __func__, pid);
return -1;
- }
- mm = get_task_mm(task);
- put_task_struct(task);
- if (!mm) {
pr_err("%s: PID %d, can't get mm reference\n", __func__, pid);
return -1;
- }
- if (mmap_read_lock_killable(mm)) {
pr_info("Received a cancel signal from user while converting virt to phys\n");
mmput(mm);
return -1;
- }
- pgd = pgd_offset(mm, virt_addr);
- pgdval = READ_ONCE(*pgd);
- if (!pgd_present(pgdval) || unlikely(pgd_bad(pgdval))) {
pr_err("%s: pgd[%llx] present[%d] bad[%d]\n", __func__,
(u64)pgd_val(pgdval), pgd_present(pgdval),
pgd_bad(pgdval));
goto putmm_exit;
- }
- p4d = p4d_offset(pgd, virt_addr);
- p4dval = READ_ONCE(*p4d);
- if (!p4d_present(p4dval) || unlikely(p4d_bad(p4dval))) {
pr_err("%s: p4d[%llx] present[%d] bad[%d]\n", __func__,
(u64)p4d_val(p4dval), p4d_present(p4dval),
p4d_bad(p4dval));
goto putmm_exit;
- }
- pud = pud_offset(p4d, virt_addr);
- pudval = READ_ONCE(*pud);
- if (!pud_present(pudval)) {
pr_err("%s: pud[%llx] present[%d]\n", __func__,
(u64)pud_val(pudval), pud_present(pudval));
goto putmm_exit;
- }
- if (pud_leaf(pudval)) {
phys_addr = (pud_pfn(pudval) << PAGE_SHIFT)
| (virt_addr & ~PUD_MASK);
goto putmm_exit;
- }
- pmd = pmd_offset(pud, virt_addr);
- pmdval = READ_ONCE(*pmd);
- if (!pmd_present(pmdval)) {
pr_err("%s: pmd[%llx] present[%d]\n", __func__,
(u64)pmd_val(pmdval), pmd_present(pmdval));
goto putmm_exit;
- }
- if (pmd_leaf(pmdval)) {
phys_addr = (pmd_pfn(pmdval) << PAGE_SHIFT)
| (virt_addr & ~PMD_MASK);
goto putmm_exit;
- }
- pte = pte_offset_kernel(pmd, virt_addr);
- pteval = READ_ONCE(*pte);
- if (!pte_present(pteval)) {
pr_err("%s: pte[%llx] present[%d]\n", __func__,
(u64)pte_val(pteval), pte_present(pteval));
goto putmm_exit;
- }
- phys_addr = (pte_pfn(pteval) << PAGE_SHIFT) | (virt_addr & ~PAGE_MASK);
+putmm_exit:
- mmap_read_unlock(mm);
- mmput(mm);
- return phys_addr;
+}
+static ssize_t page_detective_virt_write(struct file *file,
const char __user *data,
size_t count, loff_t *ppos)
+{
- char *input_str, *pid_str, *virt_str;
- unsigned int pid, i;
- unsigned long virt_addr;
- int err;
- u64 phys_addr;
- /* If canceled by user simply return without printing anything */
- err = mutex_lock_killable(&page_detective_mutex);
- if (err)
return count;
- input_str = kzalloc(count + 1, GFP_KERNEL);
- if (!input_str) {
pr_err("%s: Unable to allocate input_str buffer\n",
__func__);
mutex_unlock(&page_detective_mutex);
return -EAGAIN;
Feels like you could do with some good old-fashioned C goto error handling since you duplicate this mutex unlock repeatedly...
- }
- if (copy_from_user(input_str, data, count)) {
kfree(input_str);
pr_err("%s: Unable to copy user input into virt file\n",
__func__);
mutex_unlock(&page_detective_mutex);
return -EFAULT;
- }
- virt_str = NULL;
- pid_str = input_str;
- for (i = 0; i < count - 1; i++) {
if (isspace(input_str[i])) {
input_str[i] = '\0';
virt_str = &input_str[i + 1];
break;
}
- }
- if (!virt_str) {
kfree(input_str);
pr_err("%s: Invalid virt file input, should be: '<pid> <virtual address>'\n",
__func__);
mutex_unlock(&page_detective_mutex);
return -EINVAL;
- }
- err = kstrtouint(pid_str, 0, &pid);
- if (err) {
kfree(input_str);
pr_err("%s: Failed to parse pid\n", __func__);
mutex_unlock(&page_detective_mutex);
return err;
- }
- err = kstrtoul(virt_str, 0, &virt_addr);
- if (err) {
kfree(input_str);
pr_err("%s: Failed to parse virtual address\n", __func__);
mutex_unlock(&page_detective_mutex);
return err;
- }
- kfree(input_str);
- phys_addr = pid_virt_to_phys(pid, virt_addr);
- if (phys_addr == -1) {
pr_err("%s: Can't translate virtual to physical address\n",
__func__);
mutex_unlock(&page_detective_mutex);
return -EINVAL;
- }
- pr_info("Investigating pid[%u] virtual[%lx] physical[%llx] pfn[%lx]\n",
pid, virt_addr, phys_addr, PHYS_PFN(phys_addr));
- page_detective(PHYS_PFN(phys_addr));
- pr_info("Finished investigation of virtual[%lx]\n", virt_addr);
- mutex_unlock(&page_detective_mutex);
- return count;
+}
+static ssize_t page_detective_phys_write(struct file *file,
const char __user *data,
size_t count, loff_t *ppos)
+{
- u64 phys_addr;
- int err;
- /* If canceled by user simply return without printing anything */
- err = mutex_lock_killable(&page_detective_mutex);
- if (err)
return count;
- err = kstrtou64_from_user(data, count, 0, &phys_addr);
- if (err) {
pr_err("%s: Failed to parse physical address\n", __func__);
mutex_unlock(&page_detective_mutex);
return err;
- }
- pr_info("Investigating physical[%llx] pfn[%lx]\n", phys_addr,
PHYS_PFN(phys_addr));
- page_detective(PHYS_PFN(phys_addr));
- pr_info("Finished investigation of physical[%llx]\n", phys_addr);
- mutex_unlock(&page_detective_mutex);
- return count;
+}
+static int page_detective_open(struct inode *inode, struct file *file) +{
- /* Deny access if not CAP_SYS_ADMIN */
- if (!capable(CAP_SYS_ADMIN))
return -EPERM;
- return simple_open(inode, file);
+}
+static const struct file_operations page_detective_virt_fops = {
- .owner = THIS_MODULE,
- .open = page_detective_open,
- .write = page_detective_virt_write,
+};
+static const struct file_operations page_detective_phys_fops = {
- .owner = THIS_MODULE,
- .open = page_detective_open,
- .write = page_detective_phys_write,
+};
+static int __init page_detective_init(void) +{
- page_detective_debugfs_dir = debugfs_create_dir("page_detective", NULL);
- debugfs_create_file("virt", 0200, page_detective_debugfs_dir, NULL,
&page_detective_virt_fops);
- debugfs_create_file("phys", 0200, page_detective_debugfs_dir, NULL,
&page_detective_phys_fops);
- return 0;
+} +module_init(page_detective_init);
+static void page_detective_exit(void) +{
- debugfs_remove_recursive(page_detective_debugfs_dir);
+} +module_exit(page_detective_exit);
+MODULE_DESCRIPTION("Page Detective"); +MODULE_VERSION("1.0"); +MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Pasha Tatashin pasha.tatashin@soleen.com");
2.47.0.338.g60cca15819-goog
On Sat, Nov 16, 2024 at 6:59 PM Pasha Tatashin pasha.tatashin@soleen.com wrote:
Page Detective is a kernel debugging tool that provides detailed information about the usage and mapping of physical memory pages.
It operates through the Linux debugfs interface, providing access to both virtual and physical address inquiries. The output, presented via kernel log messages (accessible with dmesg), will help administrators and developers understand how specific pages are utilized by the system.
This tool can be used to investigate various memory-related issues, such as checksum failures during live migration, filesystem journal failures, general segfaults, or other corruptions.
[...]
+/*
+ * Walk the kernel page table and print all mappings of this pfn. Return 1 if
+ * the pfn is mapped in the direct map, 0 if it is not, and -1 if the
+ * operation was canceled by the user.
+ */
+static int page_detective_kernel_map_info(unsigned long pfn,
unsigned long direct_map_addr)
+{
struct pd_private_kernel pr = {0};
unsigned long s, e;
pr.direct_map_addr = direct_map_addr;
pr.pfn = pfn;
for (s = PAGE_OFFSET; s != ~0ul; ) {
e = s + PD_WALK_MAX_RANGE;
if (e < s)
e = ~0ul;
if (walk_page_range_kernel(s, e, &pd_kernel_ops, &pr)) {
I think which parts of the kernel virtual address range you can safely pagewalk is somewhat architecture-specific; for example, X86 can run under Xen PV, in which case I think part of the page tables may not be walkable because they're owned by the hypervisor for its own use? Notably the x86 version of ptdump_walk_pgd_level_core starts walking at GUARD_HOLE_END_ADDR instead.
See also https://kernel.org/doc/html/latest/arch/x86/x86_64/mm.html for an ASCII table reference on address space regions.
pr_info("Received a cancel signal from user while scanning kernel mappings\n");
return -1;
}
cond_resched();
s = e;
}
if (!pr.vmalloc_maps) {
pr_info("The page is not mapped into kernel vmalloc area\n");
} else if (pr.vmalloc_maps > 1) {
pr_info("The page is mapped into vmalloc area: %ld times\n",
pr.vmalloc_maps);
}
if (!pr.direct_map)
pr_info("The page is not mapped into kernel direct map\n");
pr_info("The page is mapped into the kernel page table: %ld times\n", pr.maps);
return pr.direct_map ? 1 : 0;
+}
+/* Print kernel information about the pfn, return -1 if canceled by user */ +static int page_detective_kernel(unsigned long pfn) +{
unsigned long *mem = __va((pfn) << PAGE_SHIFT);
unsigned long sum = 0;
int direct_map;
u64 s, e;
int i;
s = sched_clock();
direct_map = page_detective_kernel_map_info(pfn, (unsigned long)mem);
e = sched_clock() - s;
pr_info("Scanned kernel page table in [%llu.%09llus]\n",
e / NSEC_PER_SEC, e % NSEC_PER_SEC);
/* Canceled by user or no direct map */
if (direct_map < 1)
return direct_map;
for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++)
sum |= mem[i];
If the purpose of this interface is to inspect pages in weird states, I wonder if it would make sense to use something like copy_mc_to_kernel() in case that helps avoid kernel crashes due to uncorrectable 2-bit ECC errors or such. But maybe that's not the kind of error you're concerned about here? And I also don't have any idea if copy_mc_to_kernel() actually does anything sensible for ECC errors. So don't treat this as a fix suggestion, more as a random idea that should probably be ignored unless someone who understands ECC errors says it makes sense.
But I think you should at least be using READ_ONCE(), since you're reading from memory that can change concurrently.
if (sum == 0)
pr_info("The page contains only zeroes\n");
else
pr_info("The page contains some data\n");
return 0;
+}
[...]
+/*
+ * Print information about mappings of this pfn by mm. Return -1 if canceled,
+ * otherwise return the number of mappings found.
+ */
+static long page_detective_user_mm_info(struct mm_struct *mm, unsigned long pfn) +{
struct pd_private_user pr = {0};
unsigned long s, e;
pr.pfn = pfn;
pr.mm = mm;
for (s = 0; s != TASK_SIZE; ) {
TASK_SIZE does not make sense when inspecting another task, because TASK_SIZE depends on the virtual address space size of the current task (whether you are a 32-bit or 64-bit process). Please use TASK_SIZE_MAX for remote process access.
e = s + PD_WALK_MAX_RANGE;
if (e > TASK_SIZE || e < s)
e = TASK_SIZE;
if (mmap_read_lock_killable(mm)) {
pr_info("Received a cancel signal from user while scanning user mappings\n");
return -1;
}
walk_page_range(mm, s, e, &pd_user_ops, &pr);
mmap_read_unlock(mm);
cond_resched();
s = e;
}
return pr.maps;
+}
Export the missing symbols, and allow page_detective to be built as a loadable module. This makes it available in the field, where Page Detective is loaded only when it is needed.
Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com --- drivers/misc/Kconfig | 2 +- fs/kernfs/dir.c | 1 + kernel/pid.c | 1 + mm/memcontrol.c | 1 + mm/oom_kill.c | 1 + 5 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig index 2965c3c7cdef..b58b4f9567ff 100644 --- a/drivers/misc/Kconfig +++ b/drivers/misc/Kconfig @@ -495,7 +495,7 @@ config MISC_RTSX config PAGE_DETECTIVE depends on PAGE_TABLE_CHECK depends on MEMCG - bool "Page Detective" + tristate "Page Detective" help A debugging tool designed to provide detailed information about the usage and mapping of physical memory pages. This tool operates through diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 458519e416fe..84ad163a4281 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -279,6 +279,7 @@ void pr_cont_kernfs_path(struct kernfs_node *kn) out: spin_unlock_irqrestore(&kernfs_pr_cont_lock, flags); } +EXPORT_SYMBOL_GPL(pr_cont_kernfs_path);
/** * kernfs_get_parent - determine the parent node and pin it diff --git a/kernel/pid.c b/kernel/pid.c index 2715afb77eab..89454dc9535e 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -447,6 +447,7 @@ struct task_struct *find_get_task_by_vpid(pid_t nr)
return task; } +EXPORT_SYMBOL_GPL(find_get_task_by_vpid);
struct pid *get_task_pid(struct task_struct *task, enum pid_type type) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 53db98d2c4a1..389aeec06a04 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -967,6 +967,7 @@ struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio) rcu_read_unlock(); return memcg; } +EXPORT_SYMBOL_GPL(get_mem_cgroup_from_folio);
/** * mem_cgroup_iter - iterate over memory cgroup hierarchy diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 4d7a0004df2c..df230a091dcc 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -149,6 +149,7 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
return t; } +EXPORT_SYMBOL_GPL(find_lock_task_mm);
/* * order == -1 means the oom kill is required by sysrq, otherwise only
Add self tests for Page Detective; they cover several memory types as well as some negative/bad-input tests.
Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com --- MAINTAINERS | 1 + tools/testing/selftests/Makefile | 1 + .../selftests/page_detective/.gitignore | 1 + .../testing/selftests/page_detective/Makefile | 7 + tools/testing/selftests/page_detective/config | 4 + .../page_detective/page_detective_test.c | 727 ++++++++++++++++++ 6 files changed, 741 insertions(+) create mode 100644 tools/testing/selftests/page_detective/.gitignore create mode 100644 tools/testing/selftests/page_detective/Makefile create mode 100644 tools/testing/selftests/page_detective/config create mode 100644 tools/testing/selftests/page_detective/page_detective_test.c
diff --git a/MAINTAINERS b/MAINTAINERS index 654d4650670d..ec09b28776b0 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -17456,6 +17456,7 @@ L: linux-kernel@vger.kernel.org S: Maintained F: Documentation/misc-devices/page_detective.rst F: drivers/misc/page_detective.c +F: tools/testing/selftests/page_detective/
PAGE POOL M: Jesper Dangaard Brouer hawk@kernel.org diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 363d031a16f7..9c828025fdfa 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -72,6 +72,7 @@ TARGETS += net/packetdrill TARGETS += net/rds TARGETS += net/tcp_ao TARGETS += nsfs +TARGETS += page_detective TARGETS += perf_events TARGETS += pidfd TARGETS += pid_namespace diff --git a/tools/testing/selftests/page_detective/.gitignore b/tools/testing/selftests/page_detective/.gitignore new file mode 100644 index 000000000000..21a78bee7b4a --- /dev/null +++ b/tools/testing/selftests/page_detective/.gitignore @@ -0,0 +1 @@ +page_detective_test diff --git a/tools/testing/selftests/page_detective/Makefile b/tools/testing/selftests/page_detective/Makefile new file mode 100644 index 000000000000..43c4dccb6a13 --- /dev/null +++ b/tools/testing/selftests/page_detective/Makefile @@ -0,0 +1,7 @@ +# SPDX-License-Identifier: GPL-2.0-only +CFLAGS += -g -Wall + +TEST_GEN_PROGS := page_detective_test + +include ../lib.mk + diff --git a/tools/testing/selftests/page_detective/config b/tools/testing/selftests/page_detective/config new file mode 100644 index 000000000000..ddfeed4ddf13 --- /dev/null +++ b/tools/testing/selftests/page_detective/config @@ -0,0 +1,4 @@ +CONFIG_PAGE_TABLE_CHECK=y +CONFIG_MEMCG=y +CONFIG_TRANSPARENT_HUGEPAGE=y +CONFIG_PAGE_DETECTIVE=y diff --git a/tools/testing/selftests/page_detective/page_detective_test.c b/tools/testing/selftests/page_detective/page_detective_test.c new file mode 100644 index 000000000000..f86cf0fdd8fc --- /dev/null +++ b/tools/testing/selftests/page_detective/page_detective_test.c @@ -0,0 +1,727 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2024, Google LLC. 
+ * Pasha Tatashin pasha.tatashin@soleen.com + */ + +#define _GNU_SOURCE +#include <err.h> +#include <errno.h> +#include <fcntl.h> +#include <unistd.h> +#include <stdlib.h> +#include <stdio.h> +#include <string.h> +#include <stdint.h> +#include <time.h> +#include <sys/mman.h> +#include <alloca.h> +#include "../kselftest.h" + +#define OPT_STR "hpvaAsnHtbBS" +#define HELP_STR \ +"Usage: %s [-h] [-p] [-v] [-a] [-A] [-s] [-n] [-H] [-t] [-b] [-S]\n" \ +"-h\tshow this help\n" \ +"Interfaces:\n" \ +"-p\tphysical address page detective interface\n" \ +"-v\tvirtual address page detective interface\n" \ +"Tests:\n" \ +"-a\ttest anonymous page\n" \ +"-A\ttest anonymous huge page\n" \ +"-s\ttest anonymous shared page\n" \ +"-n\ttest named shared page\n" \ +"-H\ttest HugeTLB shared page\n" \ +"-t\ttest tmpfs page\n" \ +"-b\ttest bad/fail input cases\n" \ +"-S\ttest stack page\n" \ +"If no arguments specified all tests are performed\n" \ + +#define FIRST_LINE_PREFIX "Page Detective: Investigating " +#define LAST_LINE_PREFIX "Page Detective: Finished investigation of " + +#define TMP_FILE "/tmp/page_detective_test.out" + +#define NR_HUGEPAGES "/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages" + +#define NANO 1000000000ul +#define MICRO 1000000ul +#define MILLI 1000ul +#define BIT(nr) (UL(1) << (nr)) + +#define ARG_INTERFACE_PHYS BIT(0) +#define ARG_INTERFACE_VIRT BIT(1) + +#define ARG_TEST_ANON BIT(2) +#define ARG_TEST_ANON_HUGE BIT(3) +#define ARG_TEST_ANON_SHARED BIT(4) +#define ARG_TEST_NAMED_SHARED BIT(5) +#define ARG_TEST_HUGETLB_SHARED BIT(6) +#define ARG_TEST_TMPFS BIT(7) +#define ARG_TEST_FAIL_CASES BIT(8) +#define ARG_TEST_STACK BIT(9) + +#define ARG_DEFAULT (~0) /* Run verything by default */ + +#define ARG_INTERFACE_MASK (ARG_INTERFACE_PHYS | ARG_INTERFACE_VIRT) +#define ARG_TEST_MASK (~ARG_INTERFACE_MASK) + +#define INTERFACE_NAME(in) (((in) == ARG_INTERFACE_PHYS) ? 
\
+		"/sys/kernel/debug/page_detective/phys" : \
+		"/sys/kernel/debug/page_detective/virt")
+
+#define PAGE_SIZE	((unsigned long)sysconf(_SC_PAGESIZE))
+#define HUGE_PAGE_SIZE	(PAGE_SIZE * PAGE_SIZE / sizeof(uint64_t))
+
+#ifndef MAP_HUGE_2MB
+#define MAP_HUGE_2MB	(21 << MAP_HUGE_SHIFT)
+#endif
+
+#ifndef MFD_HUGEPAGE
+#define MFD_HUGEPAGE	(MFD_GOOGLE_SPECIFIC_BASE << 0)
+#endif
+
+#ifndef MFD_GOOGLE_SPECIFIC_BASE
+#define MFD_GOOGLE_SPECIFIC_BASE	0x0200U
+#endif
+
+static int old_dmesg;
+
+static uint64_t virt_to_phys(uint64_t virt, uint64_t *physp)
+{
+	uint64_t tbloff, offset, tblen, pfn;
+	int fd, nr;
+
+	fd = open("/proc/self/pagemap", O_RDONLY);
+	if (fd < 0) {
+		ksft_print_msg("%s open(/proc/self/pagemap): %s\n", __func__,
+			       strerror(errno));
+		return 1;
+	}
+
+	/* see: Documentation/admin-guide/mm/pagemap.rst */
+	tbloff = virt / PAGE_SIZE * sizeof(uint64_t);
+	offset = lseek(fd, tbloff, SEEK_SET);
+	if (offset == (off_t)-1) {
+		ksft_print_msg("%s lseek: %s\n", __func__, strerror(errno));
+		return 1;
+	}
+
+	if (offset != tbloff) {
+		ksft_print_msg("%s: cannot find virtual address\n", __func__);
+		return 1;
+	}
+
+	nr = read(fd, &tblen, sizeof(uint64_t));
+	if (nr != sizeof(uint64_t)) {
+		ksft_print_msg("%s: read\n", __func__);
+		return 1;
+	}
+	close(fd);
+
+	if (!(tblen & (1ul << 63))) {
+		ksft_print_msg("%s: present bit is not set\n", __func__);
+		return 1;
+	}
+
+	/* Bits 0-54 page frame number (PFN) if present */
+	pfn = tblen & 0x7fffffffffffffULL;
+
+	*physp = PAGE_SIZE * pfn | (virt & (PAGE_SIZE - 1));
+
+	return 0;
+}
+
+static int __phys_test(unsigned long long phys)
+{
+	char phys_str[128];
+	int fd, nr;
+
+	fd = open("/sys/kernel/debug/page_detective/phys", O_WRONLY);
+	if (fd < 0) {
+		ksft_print_msg("%s open: %s\n", __func__, strerror(errno));
+		return 4;
+	}
+
+	sprintf(phys_str, "%llu", phys);
+
+	nr = write(fd, phys_str, strlen(phys_str));
+	if (nr != strlen(phys_str)) {
+		ksft_print_msg("%s write failed\n", __func__);
+		return 1;
+	}
+	close(fd);
+
+	return 0;
+}
+
+static int phys_test(char *mem)
+{
+	uint64_t phys;
+
+	if (virt_to_phys((uint64_t)mem, &phys))
+		return 1;
+
+	return __phys_test(phys);
+}
+
+static int __virt_test(int pid, unsigned long virt)
+{
+	char virt_str[128];
+	int fd, nr;
+
+	fd = open("/sys/kernel/debug/page_detective/virt", O_WRONLY);
+	if (fd < 0) {
+		ksft_print_msg("%s open: %s\n", __func__, strerror(errno));
+		return 4;
+	}
+
+	sprintf(virt_str, "%d %#lx", pid, virt);
+	nr = write(fd, virt_str, strlen(virt_str));
+	if (nr != strlen(virt_str)) {
+		ksft_print_msg("%s: write(%s): %s\n", __func__, virt_str,
+			       strerror(errno));
+		close(fd);
+
+		return 1;
+	}
+	close(fd);
+
+	return 0;
+}
+
+static int virt_test(char *mem)
+{
+	return __virt_test(getpid(), (unsigned long)mem);
+}
+
+static int test_anon(int in)
+{
+	char *mem;
+	int rv;
+
+	mem = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
+		   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	if (mem == MAP_FAILED) {
+		ksft_print_msg("%s mmap: %s\n", __func__, strerror(errno));
+		return 1;
+	}
+
+	if (madvise(mem, PAGE_SIZE, MADV_NOHUGEPAGE)) {
+		ksft_print_msg("%s madvise: %s\n", __func__, strerror(errno));
+		return 1;
+	}
+
+	mem[0] = 0;
+
+	if (in == ARG_INTERFACE_PHYS)
+		rv = phys_test(mem);
+	else
+		rv = virt_test(mem);
+
+	munmap(mem, PAGE_SIZE);
+
+	return rv;
+}
+
+static int test_anon_huge(int in)
+{
+	uint64_t i;
+	char *mem;
+	int rv;
+
+	mem = mmap(NULL, HUGE_PAGE_SIZE * 8, PROT_READ | PROT_WRITE,
+		   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	if (mem == MAP_FAILED) {
+		ksft_print_msg("%s mmap: %s\n", __func__, strerror(errno));
+		return 1;
+	}
+
+	if (madvise(mem, HUGE_PAGE_SIZE * 8, MADV_HUGEPAGE)) {
+		ksft_print_msg("%s madvise: %s\n", __func__, strerror(errno));
+		return 1;
+	}
+
+	/* Fault huge pages */
+	for (i = 0; i < HUGE_PAGE_SIZE * 8; i += HUGE_PAGE_SIZE)
+		mem[i] = 0;
+
+	/* In case huge pages were not used for some reason */
+	mem[HUGE_PAGE_SIZE * 7 + 101 * PAGE_SIZE] = 0;
+
+	if (in == ARG_INTERFACE_PHYS)
+		rv = phys_test(mem + HUGE_PAGE_SIZE * 7 + 101 * PAGE_SIZE);
+	else
+		rv = virt_test(mem + HUGE_PAGE_SIZE * 7 + 101 * PAGE_SIZE);
+
+	munmap(mem, HUGE_PAGE_SIZE * 8);
+
+	return rv;
+}
+
+static int test_anon_shared(int in)
+{
+	char *mem;
+	int rv;
+
+	mem = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
+		   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+	if (mem == MAP_FAILED) {
+		ksft_print_msg("%s mmap: %s\n", __func__, strerror(errno));
+		return 1;
+	}
+
+	if (madvise(mem, PAGE_SIZE, MADV_NOHUGEPAGE)) {
+		ksft_print_msg("%s madvise: %s\n", __func__, strerror(errno));
+		return 1;
+	}
+
+	mem[0] = 0;
+
+	if (in == ARG_INTERFACE_PHYS)
+		rv = phys_test(mem);
+	else
+		rv = virt_test(mem);
+
+	munmap(mem, PAGE_SIZE);
+
+	return rv;
+}
+
+static int test_named_shared(int in)
+{
+	char *mem;
+	int rv;
+
+	mem = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
+		   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+	if (mem == MAP_FAILED) {
+		ksft_print_msg("%s mmap: %s\n", __func__, strerror(errno));
+		return 1;
+	}
+
+	if (madvise(mem, PAGE_SIZE, MADV_NOHUGEPAGE)) {
+		ksft_print_msg("%s madvise: %s\n", __func__, strerror(errno));
+		return 1;
+	}
+
+	mem[0] = 0;
+
+	if (in == ARG_INTERFACE_PHYS)
+		rv = phys_test(mem);
+	else
+		rv = virt_test(mem);
+
+	munmap(mem, PAGE_SIZE);
+
+	return rv;
+}
+
+static int test_hugetlb_shared(int in)
+{
+	char hugepg_add_cmd[256], hugepg_rm_cmd[256];
+	unsigned long nr_hugepages;
+	char *mem;
+	FILE *f;
+	int rv;
+
+	f = fopen(NR_HUGEPAGES, "r");
+	if (!f) {
+		ksft_print_msg("%s fopen: %s\n", __func__, strerror(errno));
+		return 4;
+	}
+
+	fscanf(f, "%lu", &nr_hugepages);
+	fclose(f);
+	sprintf(hugepg_add_cmd, "echo %lu > " NR_HUGEPAGES, nr_hugepages + 1);
+	sprintf(hugepg_rm_cmd, "echo %lu > " NR_HUGEPAGES, nr_hugepages);
+
+	if (system(hugepg_add_cmd)) {
+		ksft_print_msg("%s system(hugepg_add_cmd): %s\n", __func__,
+			       strerror(errno));
+		return 4;
+	}
+
+	mem = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
+		   MAP_HUGETLB | MAP_HUGE_2MB | MAP_ANONYMOUS | MAP_SHARED,
+		   -1, 0);
+	if (mem == MAP_FAILED) {
+		ksft_print_msg("%s mmap: %s\n", __func__, strerror(errno));
+		return 1;
+	}
+
+	mem[0] = 0;
+
+	if (in == ARG_INTERFACE_PHYS)
+		rv = phys_test(mem);
+	else
+		rv = virt_test(mem);
+
+	munmap(mem, HUGE_PAGE_SIZE);
+
+	if (system(hugepg_rm_cmd)) {
+		ksft_print_msg("%s system(hugepg_rm_cmd): %s\n", __func__,
+			       strerror(errno));
+		return 1;
+	}
+
+	return rv;
+}
+
+static int test_tmpfs(int in)
+{
+	char *mem;
+	int fd;
+	int rv;
+
+	fd = memfd_create("tmpfs_page",
+			  MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	if (fd < 0) {
+		ksft_print_msg("%s memfd_create: %s\n", __func__,
+			       strerror(errno));
+		return 1;
+	}
+
+	if (ftruncate(fd, PAGE_SIZE) == -1) {
+		ksft_print_msg("%s ftruncate: %s\n", __func__, strerror(errno));
+		return 1;
+	}
+
+	mem = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
+		   MAP_SHARED, fd, 0);
+	if (mem == MAP_FAILED) {
+		ksft_print_msg("%s mmap: %s\n", __func__, strerror(errno));
+		return 1;
+	}
+
+	mem[0] = 0;
+
+	if (in == ARG_INTERFACE_PHYS)
+		rv = phys_test(mem);
+	else
+		rv = virt_test(mem);
+
+	munmap(mem, PAGE_SIZE);
+	close(fd);
+
+	return rv;
+}
+
+static int test_stack(int in)
+{
+	char *mem = alloca(PAGE_SIZE);
+	int rv;
+
+	mem[0] = 0;
+	if (in == ARG_INTERFACE_PHYS)
+		rv = phys_test(mem);
+	else
+		rv = virt_test(mem);
+
+	return rv;
+}
+
+static int test_bad_phys(void)
+{
+	int rv;
+
+	rv = __phys_test(0);
+	if (rv)
+		return rv;
+
+	rv = __phys_test(1);
+	if (rv)
+		return rv;
+
+	rv = __phys_test(~0);
+
+	return rv;
+}
+
+static int test_bad_virt(void)
+{
+	int rv;
+
+	rv = __virt_test(0, 0);
+	if (rv == 4)
+		return rv;
+
+	if (!rv) {
+		ksft_print_msg("%s: write(0, 0) did not fail\n", __func__);
+		return 1;
+	}
+
+	if (!__virt_test(0, -1)) {
+		ksft_print_msg("%s: write(0, -1) did not fail\n", __func__);
+		return 1;
+	}
+
+	if (!__virt_test(0, -1)) {
+		ksft_print_msg("%s: write(0, -1) did not fail\n", __func__);
+		return 1;
+	}
+
+	if (!__virt_test(-1, 0)) {
+		ksft_print_msg("%s: write(-1, 0) did not fail\n",
+			       __func__);
+		return 1;
+	}
+
+	return 0;
+}
+
+static int test_fail_cases(int in)
+{
+	if (in == ARG_INTERFACE_VIRT)
+		return test_bad_virt();
+	else
+		return test_bad_phys();
+}
+
+/* Return time in nanoseconds since epoch */
+static uint64_t gethrtime(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_REALTIME, &ts);
+
+	return (ts.tv_sec * NANO) + ts.tv_nsec;
+}
+
+static int sanity_check_fail_cases(int in, int test_type, FILE *f)
+{
+	char *line;
+	size_t len;
+	ssize_t nr;
+
+	line = NULL;
+	len = 0;
+	while ((nr = getline(&line, &len, f)) != -1) {
+		char *l = strchr(line, ']') + 2; /* skip time stamp */
+
+		if (l == (char *)2)
+			continue;
+
+		ksft_print_msg("%s", l);
+	}
+
+	free(line);
+
+	return 0;
+}
+
+static int sanity_check_test_result(int in, int test_type, uint64_t start)
+{
+	uint64_t end = gethrtime() + NANO;
+	char dmesg[256], *line;
+	int first_line, last_line;
+	size_t len;
+	ssize_t nr;
+	FILE *f;
+
+	sprintf(dmesg, "dmesg --since=@%ld.%06ld --until=@%ld.%06ld > "
+		TMP_FILE, start / NANO, (start / MILLI) % MICRO, end / NANO,
+		(end / MILLI) % MICRO);
+
+	if (old_dmesg)
+		system("dmesg > " TMP_FILE);
+	else
+		system(dmesg);
+	f = fopen(TMP_FILE, "r");
+	if (!f) {
+		ksft_print_msg("%s: %s", __func__, strerror(errno));
+		return 1;
+	}
+
+	if (test_type == ARG_TEST_FAIL_CASES) {
+		sanity_check_fail_cases(in, test_type, f);
+		fclose(f);
+		unlink(TMP_FILE);
+		return 0;
+	}
+
+	line = NULL;
+	len = 0;
+	first_line = 0;
+	last_line = 0;
+	while ((nr = getline(&line, &len, f)) != -1) {
+		char *l = strchr(line, ']') + 2; /* skip time stamp */
+
+		if (l == (char *)2)
+			continue;
+
+		if (!first_line) {
+			first_line = !strncmp(l, FIRST_LINE_PREFIX,
+					      strlen(FIRST_LINE_PREFIX));
+		} else if (!last_line) {
+			last_line = !strncmp(l, LAST_LINE_PREFIX,
+					     strlen(LAST_LINE_PREFIX));
+			if (last_line)
+				ksft_print_msg("%s", l);
+		}
+
+		/*
+		 * Output everything between the first and last line of page
+		 * detective output
+		 */
+		if (first_line && !last_line)
+			ksft_print_msg("%s", l);
+	}
+	ksft_print_msg("\n");
+
+	free(line);
+	fclose(f);
+	unlink(TMP_FILE);
+
+	if (!first_line) {
+		ksft_print_msg("Test failed, bad first line\n");
+		return 1;
+	}
+	if (!last_line) {
+		ksft_print_msg("Test failed, bad last line\n");
+		return 1;
+	}
+
+	return 0;
+}
+
+/*
+ * interface	ARG_INTERFACE_VIRT or ARG_INTERFACE_PHYS
+ * test_type	one of the ARG_TEST_*
+ * test_f	function for this test
+ * arg		the provided user arguments.
+ *
+ * Run the test via the given interface if that interface is requested in arg.
+ */
+#define TEST(interface, test_type, test_f, arg)				\
+	do {								\
+		int __in = (interface);					\
+		int __tt = (test_type);					\
+		int __rv;						\
+									\
+		if ((arg) & __tt) {					\
+			uint64_t start = gethrtime();			\
+									\
+			if (old_dmesg)					\
+				system("dmesg -C");			\
+			else						\
+				usleep(100000);				\
+									\
+			__rv = test_f(__in);				\
+			if (__rv == 4) {				\
+				ksft_test_result_skip(#test_f " via [%s]\n", \
+						      INTERFACE_NAME(__in)); \
+				break;					\
+			}						\
+									\
+			if (__rv) {					\
+				ksft_test_result_fail(#test_f " via [%s]\n", \
+						      INTERFACE_NAME(__in)); \
+				break;					\
+			}						\
+									\
+			if (sanity_check_test_result(__in, __tt,	\
+						     start)) {		\
+				ksft_test_result_fail(#test_f " via [%s]\n", \
+						      INTERFACE_NAME(__in)); \
+				break;					\
+			}						\
+			ksft_test_result_pass(#test_f " via [%s]\n",	\
+					      INTERFACE_NAME(__in));	\
+		}							\
+	} while (0)
+
+static void run_tests(int in, int arg)
+{
+	if (in & arg) {
+		TEST(in, ARG_TEST_ANON, test_anon, arg);
+		TEST(in, ARG_TEST_ANON_HUGE, test_anon_huge, arg);
+		TEST(in, ARG_TEST_ANON_SHARED, test_anon_shared, arg);
+		TEST(in, ARG_TEST_NAMED_SHARED, test_named_shared, arg);
+		TEST(in, ARG_TEST_HUGETLB_SHARED, test_hugetlb_shared, arg);
+		TEST(in, ARG_TEST_TMPFS, test_tmpfs, arg);
+		TEST(in, ARG_TEST_STACK, test_stack, arg);
+		TEST(in, ARG_TEST_FAIL_CASES, test_fail_cases, arg);
+	}
+}
+
+static int count_tests(int in, int arg)
+{
+	int tests = 0;
+
+	if (in & arg) {
+		tests += (arg & ARG_TEST_ANON) ? 1 : 0;
+		tests += (arg & ARG_TEST_ANON_HUGE) ? 1 : 0;
+		tests += (arg & ARG_TEST_ANON_SHARED) ? 1 : 0;
+		tests += (arg & ARG_TEST_NAMED_SHARED) ? 1 : 0;
+		tests += (arg & ARG_TEST_HUGETLB_SHARED) ? 1 : 0;
+		tests += (arg & ARG_TEST_TMPFS) ? 1 : 0;
+		tests += (arg & ARG_TEST_STACK) ? 1 : 0;
+		tests += (arg & ARG_TEST_FAIL_CASES) ? 1 : 0;
+	}
+
+	return tests;
+}
+
+int main(int argc, char **argv)
+{
+	int arg = 0;
+	int opt;
+
+	while ((opt = getopt(argc, argv, OPT_STR)) != -1) {
+		switch (opt) {
+		case 'h':
+			printf(HELP_STR, argv[0]);
+			exit(EXIT_SUCCESS);
+		case 'v':
+			arg |= ARG_INTERFACE_VIRT;
+			break;
+		case 'p':
+			arg |= ARG_INTERFACE_PHYS;
+			break;
+		case 'a':
+			arg |= ARG_TEST_ANON;
+			break;
+		case 'A':
+			arg |= ARG_TEST_ANON_HUGE;
+			break;
+		case 's':
+			arg |= ARG_TEST_ANON_SHARED;
+			break;
+		case 'n':
+			arg |= ARG_TEST_NAMED_SHARED;
+			break;
+		case 'H':
+			arg |= ARG_TEST_HUGETLB_SHARED;
+			break;
+		case 't':
+			arg |= ARG_TEST_TMPFS;
+			break;
+		case 'S':
+			arg |= ARG_TEST_STACK;
+			break;
+		case 'b':
+			arg |= ARG_TEST_FAIL_CASES;
+			break;
+		default:
+			errx(EXIT_FAILURE, HELP_STR, argv[0]);
+		}
+	}
+
+	if (arg == 0)
+		arg = ARG_DEFAULT;
+	if (!(arg & ARG_INTERFACE_MASK))
+		errx(EXIT_FAILURE, "No page detective interface specified");
+
+	if (!(arg & ARG_TEST_MASK))
+		errx(EXIT_FAILURE, "No tests specified");
+
+	/* Returns nonzero when dmesg lacks the --since and --until options */
+	old_dmesg = system("dmesg --since @0 > /dev/null 2>&1");
+
+	ksft_print_header();
+	ksft_set_plan(count_tests(ARG_INTERFACE_VIRT, arg) +
+		      count_tests(ARG_INTERFACE_PHYS, arg));
+
+	run_tests(ARG_INTERFACE_VIRT, arg);
+	run_tests(ARG_INTERFACE_PHYS, arg);
+	ksft_finished();
+}
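For reference, the pagemap entry decoding that virt_to_phys() in the patch above performs (per Documentation/admin-guide/mm/pagemap.rst: bit 63 is the present bit, bits 0-54 are the PFN) can be sketched in a few lines; the sample entry value in the usage comment is made up:

```python
PM_PRESENT = 1 << 63           # bit 63: page present
PM_PFN_MASK = (1 << 55) - 1    # bits 0-54: page frame number (PFN)

def decode_pagemap_entry(entry, virt, page_size=4096):
    """Return the physical address for a /proc/<pid>/pagemap entry,
    or None when the page is not present."""
    if not (entry & PM_PRESENT):
        return None
    pfn = entry & PM_PFN_MASK
    # Physical address = PFN * page size, plus the offset within the page.
    return pfn * page_size + (virt & (page_size - 1))

# Hypothetical entry: present, PFN 0x1234; virt offset within page is 0x1a.
# decode_pagemap_entry((1 << 63) | 0x1234, 0x1a) -> 0x123401a
```

This mirrors the `tblen & 0x7fffffffffffffULL` mask in the patch, which is exactly bits 0-54.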
On 11/16/24 10:59 PM, Pasha Tatashin wrote:
Add self tests for Page Detective. They cover several memory types and also include some negative/bad-input tests.
Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com
 MAINTAINERS                                        | 1 +
 tools/testing/selftests/Makefile                   | 1 +
 .../selftests/page_detective/.gitignore            | 1 +
No need to add a new directory. Please just add the tests in selftests/mm/ directory.
 .../testing/selftests/page_detective/Makefile      |   7 +
 tools/testing/selftests/page_detective/config      |   4 +
 .../page_detective/page_detective_test.c           | 727 ++++++++++++++++++
 6 files changed, 741 insertions(+)
 create mode 100644 tools/testing/selftests/page_detective/.gitignore
 create mode 100644 tools/testing/selftests/page_detective/Makefile
 create mode 100644 tools/testing/selftests/page_detective/config
 create mode 100644 tools/testing/selftests/page_detective/page_detective_test.c
On Sun, Nov 17, 2024 at 1:25 AM Muhammad Usama Anjum Usama.Anjum@collabora.com wrote:
On 11/16/24 10:59 PM, Pasha Tatashin wrote:
Add self tests for Page Detective. They cover several memory types and also include some negative/bad-input tests.
Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com
 MAINTAINERS                                        | 1 +
 tools/testing/selftests/Makefile                   | 1 +
 .../selftests/page_detective/.gitignore            | 1 +
No need to add a new directory. Please just add the tests in selftests/mm/ directory.
Thanks, I will move this to selftests/mm/ directory in the next version.
 .../testing/selftests/page_detective/Makefile      |   7 +
 tools/testing/selftests/page_detective/config      |   4 +
 .../page_detective/page_detective_test.c           | 727 ++++++++++++++++++
 6 files changed, 741 insertions(+)
 create mode 100644 tools/testing/selftests/page_detective/.gitignore
 create mode 100644 tools/testing/selftests/page_detective/Makefile
 create mode 100644 tools/testing/selftests/page_detective/config
 create mode 100644 tools/testing/selftests/page_detective/page_detective_test.c
-- BR, Muhammad Usama Anjum
On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote:
Page Detective is a new kernel debugging tool that provides detailed information about the usage and mapping of physical memory pages.
It is often known that a particular page is corrupted, but it is hard to extract more information about such a page from a live system. Examples are:
- Checksum failure during live migration
- Filesystem journal failure
- dump_page warnings on the console log
- Unexpected segfaults
Page Detective helps to extract more information from the kernel, so it can be used by developers to root cause the associated problem.
I like the _concept_ of providing more information like this.
But you've bizarrely gone to great lengths to expose mm internal implementation details to drivers in order to implement this as a driver.
This is _very clearly_ an mm thing, and _very clearly_ subject to change depending on how mm changes. It should live under mm/ and not be a loadable driver.
I am also very very much not in favour of re-implementing yet another page table walker, this time in driver code (!). Please no.
So NACK in its current form. This has to be implemented within mm if we are to take it.
I'm also concerned about its scalability and impact on the system, as it takes every single mm lock in the system at once, which seems kinda unwise or at least problematic, and not something we want happening outside of mm, at any rate.
It operates through the Linux debugfs interface, with two files: "virt" and "phys".
The "virt" file takes a virtual address and PID and outputs information about the corresponding page.
The "phys" file takes a physical address and outputs information about that page.
The output is presented via kernel log messages (can be accessed with dmesg), and includes information such as the page's reference count, mapping, flags, and memory cgroup. It also shows whether the page is mapped in the kernel page table, and if so, how many times.
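For concreteness, exercising the two files described above looks roughly like the following sketch; the addresses are placeholders, and it assumes root, debugfs mounted at the usual location, and Page Detective present:

```shell
#!/bin/sh
# Hypothetical usage sketch; the debugfs files only exist when
# Page Detective is loaded, so check before writing.
PD=/sys/kernel/debug/page_detective
if [ -w "$PD/virt" ]; then
	# "virt" takes "<pid> <virtual address>"
	echo "$$ 0x559e1c2d4000" > "$PD/virt"
	# "phys" takes a physical address
	echo 0x1234000 > "$PD/phys"
	# The results land in the kernel log
	dmesg | tail -n 20
else
	echo "page_detective interface not available"
fi
```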
I mean, even though I'm not a huge fan of kernel pointer hashing etc. this is obviously leaking as much information as you might want about kernel internal state to the point of maybe making the whole kernel pointer hashing thing moot.
I know this requires CAP_SYS_ADMIN, but there are things that also require that which _still_ obscure kernel pointers.
And you're outputting it all to dmesg.
So yeah, a security person (Jann?) would be better placed to comment on this than me, but are we sure we want to do this when not in a CONFIG_DEBUG_VM* kernel?
Pasha Tatashin (6):
  mm: Make get_vma_name() function public
  pagewalk: Add a page table walker for init_mm page table
  mm: Add a dump_page variant that accept log level argument
  misc/page_detective: Introduce Page Detective
  misc/page_detective: enable loadable module
  selftests/page_detective: Introduce self tests for Page Detective
 Documentation/misc-devices/index.rst               |   1 +
 Documentation/misc-devices/page_detective.rst      |  78 ++
 MAINTAINERS                                        |   8 +
 drivers/misc/Kconfig                               |  11 +
 drivers/misc/Makefile                              |   1 +
 drivers/misc/page_detective.c                      | 808 ++++++++++++++++++
 fs/inode.c                                         |  18 +-
 fs/kernfs/dir.c                                    |   1 +
 fs/proc/task_mmu.c                                 |  61 --
 include/linux/fs.h                                 |   5 +-
 include/linux/mmdebug.h                            |   1 +
 include/linux/pagewalk.h                           |   2 +
 kernel/pid.c                                       |   1 +
 mm/debug.c                                         |  53 +-
 mm/memcontrol.c                                    |   1 +
 mm/oom_kill.c                                      |   1 +
 mm/pagewalk.c                                      |  32 +
 mm/vma.c                                           |  60 ++
 tools/testing/selftests/Makefile                   |   1 +
 .../selftests/page_detective/.gitignore            |   1 +
 .../testing/selftests/page_detective/Makefile      |   7 +
 tools/testing/selftests/page_detective/config      |   4 +
 .../page_detective/page_detective_test.c           | 727 ++++++++++++++++
 23 files changed, 1787 insertions(+), 96 deletions(-)
 create mode 100644 Documentation/misc-devices/page_detective.rst
 create mode 100644 drivers/misc/page_detective.c
 create mode 100644 tools/testing/selftests/page_detective/.gitignore
 create mode 100644 tools/testing/selftests/page_detective/Makefile
 create mode 100644 tools/testing/selftests/page_detective/config
 create mode 100644 tools/testing/selftests/page_detective/page_detective_test.c
-- 2.47.0.338.g60cca15819-goog
On Mon, Nov 18, 2024 at 12:17 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote:
It operates through the Linux debugfs interface, with two files: "virt" and "phys".
The "virt" file takes a virtual address and PID and outputs information about the corresponding page.
The "phys" file takes a physical address and outputs information about that page.
The output is presented via kernel log messages (can be accessed with dmesg), and includes information such as the page's reference count, mapping, flags, and memory cgroup. It also shows whether the page is mapped in the kernel page table, and if so, how many times.
I mean, even though I'm not a huge fan of kernel pointer hashing etc. this is obviously leaking as much information as you might want about kernel internal state to the point of maybe making the whole kernel pointer hashing thing moot.
I know this requires CAP_SYS_ADMIN, but there are things that also require that which _still_ obscure kernel pointers.
And you're outputting it all to dmesg.
So yeah, a security person (Jann?) would be better placed to comment on this than me, but are we sure we want to do this when not in a CONFIG_DEBUG_VM* kernel?
I guess there are two parts to this - what root is allowed to do, and what information we're fine with exposing to dmesg.
If the lockdown LSM is not set to LOCKDOWN_CONFIDENTIALITY_MAX, the kernel allows root to read kernel memory through some interfaces - in particular, BPF allows reading arbitrary kernel memory, and perf allows reading at least some stuff (like kernel register states). With lockdown in the most restrictive mode, the kernel tries to prevent root from reading arbitrary kernel memory, but we don't really change how much information goes into dmesg. (And I imagine you could probably still get kernel pointers out of BPF somehow even in the most restrictive lockdown mode, but that's probably not relevant.)
The main issue with dmesg is that some systems make its contents available to code that is not running with root privileges; and I think it is also sometimes stored persistently in unencrypted form (like in EFI pstore) even when everything else on the system is encrypted. So on one hand, we definitely shouldn't print the contents of random chunks of memory into dmesg without a good reason; on the other hand, for example we do already print kernel register state on WARN() (which often includes kernel pointers and could theoretically include more sensitive data too).
So I think showing page metadata to root when requested is probably okay as a tradeoff? And dumping that data into dmesg is maybe not great, but acceptable as long as only root can actually trigger this?
I don't really have a strong opinion on this...
To me, a bigger issue is that dump_page() looks like it might be racy, which is maybe not terrible in debugging code that only runs when something has already gone wrong, but bad if it is in code that root can trigger on demand? __dump_page() copies the given page with memcpy(), which I don't think guarantees enough atomicity with concurrent updates of page->mapping or such, so dump_mapping() could probably run on a bogus pointer. Even without torn pointers, I think there could be a UAF if the page's mapping is destroyed while we're going through dump_page(), since the page might not be locked. And in dump_mapping(), the strncpy_from_kernel_nofault() also doesn't guard against concurrent renaming of the dentry, which I think again would probably result in UAF. So I think dump_page() in its current form is not something we should expose to a userspace-reachable API.
On Mon, Nov 18, 2024 at 7:54 AM Jann Horn jannh@google.com wrote:
On Mon, Nov 18, 2024 at 12:17 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote:
It operates through the Linux debugfs interface, with two files: "virt" and "phys".
The "virt" file takes a virtual address and PID and outputs information about the corresponding page.
The "phys" file takes a physical address and outputs information about that page.
The output is presented via kernel log messages (can be accessed with dmesg), and includes information such as the page's reference count, mapping, flags, and memory cgroup. It also shows whether the page is mapped in the kernel page table, and if so, how many times.
I mean, even though I'm not a huge fan of kernel pointer hashing etc. this is obviously leaking as much information as you might want about kernel internal state to the point of maybe making the whole kernel pointer hashing thing moot.
I know this requires CAP_SYS_ADMIN, but there are things that also require that which _still_ obscure kernel pointers.
And you're outputting it all to dmesg.
So yeah, a security person (Jann?) would be better placed to comment on this than me, but are we sure we want to do this when not in a CONFIG_DEBUG_VM* kernel?
I guess there are two parts to this - what root is allowed to do, and what information we're fine with exposing to dmesg.
If the lockdown LSM is not set to LOCKDOWN_CONFIDENTIALITY_MAX, the kernel allows root to read kernel memory through some interfaces - in particular, BPF allows reading arbitrary kernel memory, and perf allows reading at least some stuff (like kernel register states). With lockdown in the most restrictive mode, the kernel tries to prevent root from reading arbitrary kernel memory, but we don't really change how much information goes into dmesg. (And I imagine you could probably still get kernel pointers out of BPF somehow even in the most restrictive lockdown mode, but that's probably not relevant.)
The main issue with dmesg is that some systems make its contents available to code that is not running with root privileges; and I think it is also sometimes stored persistently in unencrypted form (like in EFI pstore) even when everything else on the system is encrypted. So on one hand, we definitely shouldn't print the contents of random chunks of memory into dmesg without a good reason; on the other hand, for example we do already print kernel register state on WARN() (which often includes kernel pointers and could theoretically include more sensitive data too).
So I think showing page metadata to root when requested is probably okay as a tradeoff? And dumping that data into dmesg is maybe not great, but acceptable as long as only root can actually trigger this?
I don't really have a strong opinion on this...
To me, a bigger issue is that dump_page() looks like it might be racy, which is maybe not terrible in debugging code that only runs when something has already gone wrong, but bad if it is in code that root can trigger on demand?
Hi Jann, thank you for reviewing this proposal.
Presumably, the interface should be used only when something has gone wrong but has not been noticed by the kernel. That something is usually a checksum failure detected outside of the kernel: e.g. during live migration, snapshotting, filesystem journaling, etc. We already have interfaces that provide data from the live kernel that could be racy, e.g. the crash utility.
__dump_page() copies the given page with memcpy(), which I don't think guarantees enough atomicity with concurrent updates of page->mapping or such, so dump_mapping() could probably run on a bogus pointer. Even without torn pointers, I think there could be a UAF if the page's mapping is destroyed while we're going through dump_page(), since the page might not be locked. And in dump_mapping(), the strncpy_from_kernel_nofault() also doesn't guard against concurrent renaming of the dentry, which I think again would probably result in UAF.
Since we are holding a reference on the page at the time of dump_page(), the identity of the page should not really change, but the dentry can be renamed.
So I think dump_page() in its current form is not something we should expose to a userspace-reachable API.
We use dump_page() all over WARN_ONs in MM code where pages might not be locked, but this is a good point: while even the existing usage might be racy, providing a user-reachable API potentially makes it worse. I will see if I can add some locking before dump_page(), or make a dump_page() variant that does not do dump_mapping().
On Mon, Nov 18, 2024 at 11:24 PM Pasha Tatashin pasha.tatashin@soleen.com wrote:
On Mon, Nov 18, 2024 at 7:54 AM Jann Horn jannh@google.com wrote:
On Mon, Nov 18, 2024 at 12:17 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote:
It operates through the Linux debugfs interface, with two files: "virt" and "phys".
The "virt" file takes a virtual address and PID and outputs information about the corresponding page.
The "phys" file takes a physical address and outputs information about that page.
The output is presented via kernel log messages (can be accessed with dmesg), and includes information such as the page's reference count, mapping, flags, and memory cgroup. It also shows whether the page is mapped in the kernel page table, and if so, how many times.
I mean, even though I'm not a huge fan of kernel pointer hashing etc. this is obviously leaking as much information as you might want about kernel internal state to the point of maybe making the whole kernel pointer hashing thing moot.
I know this requires CAP_SYS_ADMIN, but there are things that also require that which _still_ obscure kernel pointers.
And you're outputting it all to dmesg.
So yeah, a security person (Jann?) would be better placed to comment on this than me, but are we sure we want to do this when not in a CONFIG_DEBUG_VM* kernel?
I guess there are two parts to this - what root is allowed to do, and what information we're fine with exposing to dmesg.
If the lockdown LSM is not set to LOCKDOWN_CONFIDENTIALITY_MAX, the kernel allows root to read kernel memory through some interfaces - in particular, BPF allows reading arbitrary kernel memory, and perf allows reading at least some stuff (like kernel register states). With lockdown in the most restrictive mode, the kernel tries to prevent root from reading arbitrary kernel memory, but we don't really change how much information goes into dmesg. (And I imagine you could probably still get kernel pointers out of BPF somehow even in the most restrictive lockdown mode, but that's probably not relevant.)
The main issue with dmesg is that some systems make its contents available to code that is not running with root privileges; and I think it is also sometimes stored persistently in unencrypted form (like in EFI pstore) even when everything else on the system is encrypted. So on one hand, we definitely shouldn't print the contents of random chunks of memory into dmesg without a good reason; on the other hand, for example we do already print kernel register state on WARN() (which often includes kernel pointers and could theoretically include more sensitive data too).
So I think showing page metadata to root when requested is probably okay as a tradeoff? And dumping that data into dmesg is maybe not great, but acceptable as long as only root can actually trigger this?
I don't really have a strong opinion on this...
To me, a bigger issue is that dump_page() looks like it might be racy, which is maybe not terrible in debugging code that only runs when something has already gone wrong, but bad if it is in code that root can trigger on demand?
Hi Jann, thank you for reviewing this proposal.
Presumably, the interface should be used only when something has gone wrong but has not been noticed by the kernel. That something is usually checksums failures that are outside of the kernel: i.e. during live migration, snapshotting, filesystem journaling, etc. We already have interfaces that provide data from the live kernel that could be racy, i.e. crash utility.
Ah, yes, I'm drawing a distinction here between "something has gone wrong internally in the kernel and the kernel does some kinda-broken best-effort self-diagnostics" and "userspace thinks something is broken and asks the kernel".
__dump_page() copies the given page with memcpy(), which I don't think guarantees enough atomicity with concurrent updates of page->mapping or such, so dump_mapping() could probably run on a bogus pointer. Even without torn pointers, I think there could be a UAF if the page's mapping is destroyed while we're going through dump_page(), since the page might not be locked. And in dump_mapping(), the strncpy_from_kernel_nofault() also doesn't guard against concurrent renaming of the dentry, which I think again would probably result in UAF.
Since we are holding a reference on the page at the time of dump_page(), the identity of the page should not really change, but dentry can be renamed.
Can you point me to where a refcounted reference to the page comes from when page_detective_metadata() calls dump_page_lvl()?
So I think dump_page() in its current form is not something we should expose to a userspace-reachable API.
We use dump_page() all over WARN_ONs in MM code where pages might not be locked, but this is a good point, that while even the existing usage might be racy, providing a user-reachable API potentially makes it worse. I will see if I could add some locking before dump_page(), or make a dump_page variant that does not do dump_mapping().
To be clear, I am not that strongly opposed to racily reading data such that the data may not be internally consistent or such; but this is a case of racy use-after-free reads that might end up dumping entirely unrelated memory contents into dmesg. I think we should properly protect against that in an API that userspace can invoke. Otherwise, if we race, we might end up writing random memory contents into dmesg; and if we are particularly unlucky, those random memory contents could be PII or authentication tokens or such.
I'm not entirely sure what the right approach is here; I guess it makes sense that when the kernel internally detects corruption, dump_page doesn't take references on pages it accesses to avoid corrupting things further. If you are looking at a page based on a userspace request, I guess you could access the page with the necessary locking to access its properties under the normal locking rules?
(If anyone else has opinions either way on this line I'm trying to draw between kernel-internal debug paths and userspace-triggerable debugging, feel free to share; I hope my mental model makes sense but I could imagine other folks having a different model of this?)
Can you point me to where a refcounted reference to the page comes from when page_detective_metadata() calls dump_page_lvl()?
I am sorry, I remembered incorrectly; we take the reference right after dump_page_lvl(), in page_detective_memcg() -> folio_try_get(). I will move the folio_try_get() to before dump_page_lvl().
So I think dump_page() in its current form is not something we should expose to a userspace-reachable API.
We use dump_page() all over WARN_ONs in MM code where pages might not be locked, but this is a good point, that while even the existing usage might be racy, providing a user-reachable API potentially makes it worse. I will see if I could add some locking before dump_page(), or make a dump_page variant that does not do dump_mapping().
To be clear, I am not that strongly opposed to racily reading data such that the data may not be internally consistent or such; but this is a case of racy use-after-free reads that might end up dumping entirely unrelated memory contents into dmesg. I think we should properly protect against that in an API that userspace can invoke. Otherwise, if we race, we might end up writing random memory contents into dmesg; and if we are particularly unlucky, those random memory contents could be PII or authentication tokens or such.
I'm not entirely sure what the right approach is here; I guess it makes sense that when the kernel internally detects corruption, dump_page doesn't take references on pages it accesses to avoid corrupting things further. If you are looking at a page based on a userspace request, I guess you could access the page with the necessary locking to access its properties under the normal locking rules?
I will take a reference, as we already do that for memcg purposes, but that did not cover dump_page().
Thank you, Pasha
On Tue, Nov 19, 2024 at 2:30 AM Pasha Tatashin pasha.tatashin@soleen.com wrote:
Note that taking a reference on the page does not make all of dump_page() fine; in particular, my understanding is that folio_mapping() requires that the page is locked in order to return a stable pointer, and some of the code in dump_mapping() would probably also require some other locks - probably at least on the inode and maybe also on the dentry, I think? Otherwise the inode's dentry list can probably change concurrently, and the dentry's name pointer can change too.
On Tue, Nov 19, 2024 at 7:52 AM Jann Horn jannh@google.com wrote:
Agreed: once a reference is taken, the page identity cannot change (i.e. if it is a named page it will stay a named page), but the dentry can be renamed. I will look into what can be done to guarantee consistency in the next version. There is also a fallback if the locking cannot be reliably resolved (e.g. for performance reasons): we can make dump_mapping() optionally disabled from dump_page_lvl() with a new argument flag.
Thank you, Pasha
On Tue, Nov 19, 2024 at 4:14 PM Pasha Tatashin pasha.tatashin@soleen.com wrote:
Yeah, I think if you don't need the details that dump_mapping() shows, skipping that for user-requested dumps might be a reasonable option.
On Tue, Nov 19, 2024 at 01:52:00PM +0100, Jann Horn wrote:
First important thing is that we snapshot the page. So while we may have a torn snapshot of the page, it can't change under us any more, so we don't have to worry about it being swizzled one way and then swizzled back.
Second thing is that I think using folio_mapping() is actually wrong. We don't want the swap mapping if it's an anon page that's in the swapcache. We'd be fine just doing mapping = folio->mapping (we'd need to add a check for movable, but I think that's fine). Anyway, we know the folio isn't ksm or anon at the point that we call dump_mapping() because there's a chain of "else" statements. So I think we're fine because we can't switch between anon & file while holding a refcount.
Having a refcount on the folio will prevent the folio from being allocated to anything else again. It will not protect the mapping from being torn down (the folio can be truncated from the mapping, then the mapping can be freed, and the memory reused). As you say, the dentry can be renamed as well.
This patch series makes me nervous. I'd rather see it done as a bpf script or drgn script, but if it is going to be done in C, I'd really like to see more auditing of the safety here. It feels like the kind of hack that one deploys internally to debug a hard-to-hit condition, rather than the kind of code that we like to ship upstream.
On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote:
Page Detective is a new kernel debugging tool that provides detailed information about the usage and mapping of physical memory pages.
It is often known that a particular page is corrupted, but it is hard to extract more information about such a page from a live system. Examples are:
- Checksum failure during live migration
- Filesystem journal failure
- dump_page warnings on the console log
- Unexpected segfaults
Page Detective helps to extract more information from the kernel, so it can be used by developers to root cause the associated problem.
It operates through the Linux debugfs interface, with two files: "virt" and "phys".
The "virt" file takes a virtual address and PID and outputs information about the corresponding page.
The "phys" file takes a physical address and outputs information about that page.
The output is presented via kernel log messages (can be accessed with dmesg), and includes information such as the page's reference count, mapping, flags, and memory cgroup. It also shows whether the page is mapped in the kernel page table, and if so, how many times.
This looks questionable both from the security and convenience points of view. Given the request-response nature of the interface, the output can be provided using a "normal" seq-based pseudo-file.
But I have a more generic question: doesn't it make sense to implement it as a set of drgn scripts instead of kernel code? This provides more flexibility, is safer (even if it's buggy, you won't crash the host) and should be at least in theory equally powerful.
Thanks!
On Mon, Nov 18, 2024 at 2:11 PM Roman Gushchin roman.gushchin@linux.dev wrote:
We opted to use dmesg for output because it's the standard method for capturing kernel information and is commonly included in bug reports. Introducing a new file would require modifying existing data collection scripts used for reporting, so this approach minimizes disruption to existing workflows.
Regarding your suggestion, our plan is to perform reverse lookups in all page tables: kernel, user, IOMMU, and KVM. Currently, we only traverse the kernel and user page tables, but we intend to extend this functionality to IOMMU and KVM tables in future updates. I am not sure drgn can provide this level of detail within a reasonable amount of time.
This approach will be helpful for debugging memory corruption scenarios. Often, external mechanisms detect corruption but require kernel-level information for root cause analysis. In our experience, invalid mappings persist in page tables for a period after corruption, providing a window to identify other users of the corrupted page via timely reverse lookup.
Additionally, using crash/drgn is not feasible for us at this time: it requires keeping external tools on our hosts, and it requires approval and a security review for each script before deployment in our fleet.
Thanks, Pasha
On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote:
So it's ok to add a totally insecure kernel feature to your fleet instead? You might want to reconsider that policy decision :)
good luck!
greg k-h
On Mon, Nov 18, 2024 at 8:09 PM Greg KH gregkh@linuxfoundation.org wrote:
Hi Greg,
While some risk is inherent, we believe the potential for abuse here is limited, especially given the existing CAP_SYS_ADMIN requirement. But even if root access is compromised, this tool presents a smaller attack surface than alternatives like crash/drgn. It exposes less sensitive information, unlike crash/drgn, which could potentially allow reading all of kernel memory.
Pasha
On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote:
The problem here is with using dmesg for output. No security-sensitive information should go there. Even exposing raw kernel pointers is not considered safe.
I'm also not sure about what presents a bigger attack surface. Yes, drgn allows reading more, but it's using /proc/kcore, so the in-kernel code is much simpler. But I don't think it's a relevant discussion; if a malicious user has root access, there are better options than both drgn and page detective.
On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin roman.gushchin@linux.dev wrote:
I am OK with writing the output to a debugfs file in the next version; my only concern is that this implies dump_page() would need to be basically duplicated, as it currently outputs everything via printks.
On Tue, Nov 19, 2024 at 11:30 AM Pasha Tatashin pasha.tatashin@soleen.com wrote:
Perhaps you can refactor the code in dump_page() to use a seq_buf, then have dump_page() printk that seq_buf using seq_buf_do_printk(), and have page detective output that seq_buf to the debugfs file?
We do something very similar with memory_stat_format(). We use the same function to generate the memcg stats in a seq_buf, then we use that seq_buf to output the stats to memory.stat as well as the OOM log.
On Tue, Nov 19, 2024 at 11:35:47AM -0800, Yosry Ahmed wrote:
+1
Thanks!
On Tue, Nov 19, 2024 at 2:36 PM Yosry Ahmed yosryahmed@google.com wrote:
Good idea, I will look into modifying it this way.
We do something very similar with memory_stat_format(). We use the
void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
{
	/* Use static buffer, for the caller is holding oom_lock. */
	static char buf[PAGE_SIZE];
	....
	seq_buf_init(&s, buf, sizeof(buf));
	memory_stat_format(memcg, &s);
	seq_buf_do_printk(&s, KERN_INFO);
}
This is a colossal stack allocation, given that our fleet only has 8K stacks. :-)
same function to generate the memcg stats in a seq_buf, then we use that seq_buf to output the stats to memory.stat as well as the OOM log.
On Wed, Nov 20, 2024 at 8:14 AM Pasha Tatashin pasha.tatashin@soleen.com wrote:
That's a static allocation though :)
Ah right, I did not notice it was static (and ignored the comment).
Pasha
Pasha Tatashin pasha.tatashin@soleen.com writes:
The output is presented via kernel log messages (can be accessed with dmesg), and includes information such as the page's reference count, mapping, flags, and memory cgroup. It also shows whether the page is mapped in the kernel page table, and if so, how many times.
A lot of that is already covered in /proc/kpage{flags,cgroup,count}. Also we already have /proc/pid/pagemap to resolve virtual addresses.
At a minimum you need to discuss why these existing mechanisms are not suitable for you and how your new one is better.
If something particular is missing perhaps the existing mechanisms can be extended?
Outputting in the dmesg seems rather clumsy for a production mechanism.
I personally would just use live crash or live gdb on /proc/kcore to get extra information, although I can see that might have races.
-Andi
On Wed, Nov 20, 2024 at 10:29 AM Andi Kleen ak@linux.intel.com wrote:
Hi Andi,
Thanks for your feedback! I will extend the cover letter in the next version to address your comment about comparing with the existing methods.
We periodically receive rare reports of page corruptions detected through various methods (journaling, live migrations, crashes, etc.) from userland. To effectively root cause these corruptions, we need to automatically and quickly gather comprehensive data about the affected pages from the kernel.
This includes:
- Obtain all metadata associated with a page.
- Quickly identify all user processes mapping a given page.
- Determine if and where the kernel maps the page, which is also important given the opportunity to remove guest memory from the kernel direct map (as discussed at LPC'24).
We also plan to extend this functionality to include KVM and IOMMU page tables in the future. <pagemap> provides an interface to traversing through user page tables, but the other information cannot be extracted using the existing interfaces.
To ensure data integrity, even when dealing with potential memory corruptions, Page Detective minimizes reliance on kernel data structures. Instead, it leverages direct access to hardware structures like page tables, providing a more reliable view of page mappings.
If something particular is missing perhaps the existing mechanisms can be extended? Outputting in the dmesg seems rather clumsy for a production mechanism.
I am going to change the output to a file in the next version.
I personally would just use live crash or live gdb on /proc/kcore to get extra information, although I can see that might have races.
For security reasons crash is currently not available on our production fleet machines as it potentially provides access to all kernel memory.
Thank you, Pasha
- Quickly identify all user processes mapping a given page.
Can be done with /proc/*/pagemap today. Maybe it's not "quick" because it won't use the rmap chains, but is that a serious issue?
- Determine if and where the kernel maps the page, which is also important given the opportunity to remove guest memory from the kernel direct map (as discussed at LPC'24).
At least x86 already has a kernel page table dumper in debugfs that can be used for this. The value of a second redundant one seems low.
We also plan to extend this functionality to include KVM and IOMMU page tables in the future.
Yes dumpers for those would likely be useful.
(at least for the case when one hand is tied behind your back by security policies forbidding /proc/kcore access)
pagemap provides an interface for traversing user page tables, but the other information cannot be extracted using the existing interfaces.
Like what? You mean the reference counts?
/proc/k* doesn't have any reference counts, and no space for full counts, but I suspect usually all you need to know is a few states like (>1, 1, 0, maybe negative) which could be mapped to a few spare kpageflags bits.
That said, I thought Willy wanted to move a lot of these elsewhere anyway with the folio revolution, so it might be a short-lived interface.
-Andi