On Sat, Nov 16, 2024 at 6:59 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
Page Detective is a kernel debugging tool that provides detailed information about the usage and mapping of physical memory pages.
It operates through the Linux debugfs interface and accepts queries for both virtual and physical addresses. The output, presented via kernel log messages (accessible with dmesg), helps administrators and developers understand how specific pages are used by the system.
This tool can be used to investigate various memory-related issues, such as checksum failures during live migration, filesystem journal failures, general segfaults, or other corruptions.
[...]
+/*
+ * Walk the kernel page table and print all mappings of this pfn. Return 1 if
+ * the pfn is mapped in the direct map, 0 if it is not mapped in the direct
+ * map, and -1 if the operation was canceled by the user.
+ */
+static int page_detective_kernel_map_info(unsigned long pfn,
+                                          unsigned long direct_map_addr)
+{
+        struct pd_private_kernel pr = {0};
+        unsigned long s, e;
+
+        pr.direct_map_addr = direct_map_addr;
+        pr.pfn = pfn;
+
+        for (s = PAGE_OFFSET; s != ~0ul; ) {
+                e = s + PD_WALK_MAX_RANGE;
+                if (e < s)
+                        e = ~0ul;
+
+                if (walk_page_range_kernel(s, e, &pd_kernel_ops, &pr)) {
I think which parts of the kernel virtual address range you can safely pagewalk is somewhat architecture-specific; for example, X86 can run under Xen PV, in which case I think part of the page tables may not be walkable because they're owned by the hypervisor for its own use? Notably the x86 version of ptdump_walk_pgd_level_core starts walking at GUARD_HOLE_END_ADDR instead.
See also https://kernel.org/doc/html/latest/arch/x86/x86_64/mm.html for an ASCII table reference on address space regions.
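As a rough, untested sketch of what I mean (GUARD_HOLE_END_ADDR is x86-64-only, so this would presumably want to become an arch-provided start address rather than an #ifdef):

#ifdef CONFIG_X86_64
	/* skip the Xen PV guard hole, like ptdump_walk_pgd_level_core() does */
	unsigned long start = GUARD_HOLE_END_ADDR;
#else
	unsigned long start = PAGE_OFFSET;
#endif

	for (s = start; s != ~0ul; ) {
		...
	}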
pr_info("Received a cancel signal from user, while scanning kernel mappings\n");
return -1;
}
cond_resched();
s = e;
}
+        if (!pr.vmalloc_maps) {
+                pr_info("The page is not mapped into kernel vmalloc area\n");
+        } else if (pr.vmalloc_maps > 1) {
+                pr_info("The page is mapped into vmalloc area: %ld times\n",
+                        pr.vmalloc_maps);
+        }
+
+        if (!pr.direct_map)
+                pr_info("The page is not mapped into kernel direct map\n");
+
+        pr_info("The page mapped into kernel page table: %ld times\n", pr.maps);
+
+        return pr.direct_map ? 1 : 0;
+}
+/* Print kernel information about the pfn, return -1 if canceled by user */
+static int page_detective_kernel(unsigned long pfn)
+{
+        unsigned long *mem = __va((pfn) << PAGE_SHIFT);
+        unsigned long sum = 0;
+        int direct_map;
+        u64 s, e;
+        int i;
+
+        s = sched_clock();
+        direct_map = page_detective_kernel_map_info(pfn, (unsigned long)mem);
+        e = sched_clock() - s;
+        pr_info("Scanned kernel page table in [%llu.%09llus]\n",
+                e / NSEC_PER_SEC, e % NSEC_PER_SEC);
+
+        /* Canceled by user or no direct map */
+        if (direct_map < 1)
+                return direct_map;
+
+        for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++)
+                sum |= mem[i];
If the purpose of this interface is to inspect pages in weird states, I wonder if it would make sense to use something like copy_mc_to_kernel() in case that helps avoid kernel crashes due to uncorrectable 2-bit ECC errors or such. But maybe that's not the kind of error you're concerned about here? And I also don't have any idea if copy_mc_to_kernel() actually does anything sensible for ECC errors. So don't treat this as a fix suggestion, more as a random idea that should probably be ignored unless someone who understands ECC errors says it makes sense.
But I think you should at least be using READ_ONCE(), since you're reading from memory that can change concurrently.
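I.e., for the loop quoted above, something like (untested):

	for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++)
		sum |= READ_ONCE(mem[i]);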
+        if (sum == 0)
+                pr_info("The page contains only zeroes\n");
+        else
+                pr_info("The page contains some data\n");
+
+        return 0;
+}
[...]
+/*
+ * Print information about mappings of the pfn by this mm. Return -1 if
+ * canceled by the user, otherwise return the number of mappings found.
+ */
+static long page_detective_user_mm_info(struct mm_struct *mm, unsigned long pfn)
+{
+        struct pd_private_user pr = {0};
+        unsigned long s, e;
+
+        pr.pfn = pfn;
+        pr.mm = mm;
+
+        for (s = 0; s != TASK_SIZE; ) {
TASK_SIZE does not make sense when inspecting another task, because TASK_SIZE depends on the virtual address space size of the current task (whether you are a 32-bit or 64-bit process). Please use TASK_SIZE_MAX for remote process access.
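So the loop bounds would look something like this (untested sketch):

	for (s = 0; s != TASK_SIZE_MAX; ) {
		e = s + PD_WALK_MAX_RANGE;
		if (e > TASK_SIZE_MAX || e < s)
			e = TASK_SIZE_MAX;
		...
	}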
+                e = s + PD_WALK_MAX_RANGE;
+                if (e > TASK_SIZE || e < s)
+                        e = TASK_SIZE;
+
+                if (mmap_read_lock_killable(mm)) {
+                        pr_info("Received a cancel signal from user, while scanning user mappings\n");
+                        return -1;
+                }
+                walk_page_range(mm, s, e, &pd_user_ops, &pr);
+                mmap_read_unlock(mm);
+
+                cond_resched();
+                s = e;
+        }
+
+        return pr.maps;
+}