While running the LTP mm test suite on i386 (bare metal or qemu_i386), this kernel warning has been noticed on the stable 5.4 through 5.7 branches, mainline 5.8.0-rc4, and linux-next.
metadata:
  git branch: master
  git repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  git commit: dcb7fd82c75ee2d6e6f9d8cc71c52519ed52e258
  kernel-config: https://builds.tuxbuild.com/zFkvGKyEbfYdZOMzwrbl-w/kernel.config
  vmlinux: https://builds.tuxbuild.com/zFkvGKyEbfYdZOMzwrbl-w/vmlinux.xz
  system.map: https://builds.tuxbuild.com/zFkvGKyEbfYdZOMzwrbl-w/System.map
steps to reproduce:
- boot i386 kernel or qemu_i386 device
- cd /opt/ltp
- ./runltp -f mm
code snippet, file: mm/mremap.c

#ifdef CONFIG_HAVE_MOVE_PMD
static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
...
	/*
	 * The destination pmd shouldn't be established, free_pgtables()
	 * should have release it.
	 */
	if (WARN_ON(!pmd_none(*new_pmd)))
		return false;
------------[ cut here ]------------
[ 720.642524] WARNING: CPU: 3 PID: 27091 at mm/mremap.c:211 move_page_tables+0x6ef/0x720
[ 720.650437] Modules linked in: x86_pkg_temp_thermal
[ 720.655314] CPU: 3 PID: 27091 Comm: true Not tainted 5.8.0-rc4 #1
[ 720.661408] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.0b 07/27/2017
[ 720.668886] EIP: move_page_tables+0x6ef/0x720
[ 720.673245] Code: 85 0c ff ff ff 8b 45 08 8b 4d d4 05 00 00 40 00 25 00 00 c0 ff 2b 45 08 39 c1 0f 46 c1 89 45 d4 01 f0 89 45 cc e9 d6 fa ff ff <0f> 0b 80 7d be 00 0f 84 80 fd ff ff e9 02 fe ff ff a8 80 0f 84 4d
[ 720.691990] EAX: f3250bf8 EBX: f3250bfc ECX: 8c46a067 EDX: 000002fe
[ 720.698249] ESI: bfc00000 EDI: f3250bf8 EBP: eea31dd4 ESP: eea31d74
[ 720.704514] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
[ 720.711353] CR0: 80050033 CR2: b7d9e3b0 CR3: 33250000 CR4: 003406d0
[ 720.717634] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 720.723952] DR6: fffe0ff0 DR7: 00000400
[ 720.727843] Call Trace:
[ 720.730325]  setup_arg_pages+0x22b/0x310
[ 720.734277]  ? get_random_u32+0x49/0x70
[ 720.738167]  ? get_random_u32+0x49/0x70
[ 720.742006]  load_elf_binary+0x30e/0x10e0
[ 720.746019]  __do_execve_file+0x4c3/0x9f0
[ 720.750031]  __ia32_sys_execve+0x25/0x30
[ 720.753959]  do_syscall_32_irqs_on+0x3d/0x250
[ 720.758317]  ? do_user_addr_fault+0x1a0/0x3c0
[ 720.762701]  ? __prepare_exit_to_usermode+0x50/0x1a0
[ 720.767703]  ? prepare_exit_to_usermode+0x8/0x20
[ 720.772349]  do_fast_syscall_32+0x39/0xb0
[ 720.776361]  do_SYSENTER_32+0x15/0x20
[ 720.780028]  entry_SYSENTER_32+0x9f/0xf2
[ 720.783951] EIP: 0xb7f1d549
[ 720.786741] Code: Bad RIP value.
[ 720.789965] EAX: ffffffda EBX: bfbbc7c0 ECX: 08067420 EDX: bfbbc9f4
[ 720.796225] ESI: 08058a14 EDI: bfbbc7c9 EBP: bfbbc868 ESP: bfbbc798
[ 720.802489] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[ 720.809276] ---[ end trace bc662e76cba2d081 ]---
[ 724.934902] ------------[ cut here ]------------
------------[ cut here ]------------
[ 738.099993] WARNING: CPU: 1 PID: 29339 at mm/mremap.c:211 move_page_tables+0x6ef/0x720
[ 738.107928] Modules linked in: x86_pkg_temp_thermal
[ 738.112860] CPU: 1 PID: 29339 Comm: true Tainted: G W 5.8.0-rc4 #1
[ 738.120374] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.0b 07/27/2017
[ 738.127904] EIP: move_page_tables+0x6ef/0x720
[ 738.132316] Code: 85 0c ff ff ff 8b 45 08 8b 4d d4 05 00 00 40 00 25 00 00 c0 ff 2b 45 08 39 c1 0f 46 c1 89 45 d4 01 f0 89 45 cc e9 d6 fa ff ff <0f> 0b 80 7d be 00 0f 84 80 fd ff ff e9 02 fe ff ff a8 80 0f 84 4d
[ 738.151112] EAX: e0f3bbf8 EBX: e0f3bbfc ECX: 8c4b8067 EDX: 000002fe
[ 738.157422] ESI: bfc00000 EDI: e0f3bbf8 EBP: f3809dd4 ESP: f3809d74
[ 738.163756] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
[ 738.170587] CR0: 80050033 CR2: b7d9e3b0 CR3: 20f3b000 CR4: 003406d0
[ 738.176895] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 738.183179] DR6: fffe0ff0 DR7: 00000400
[ 738.187019] Call Trace:
[ 738.189489]  setup_arg_pages+0x22b/0x310
[ 738.193423]  ? get_random_u32+0x49/0x70
[ 738.197280]  ? get_random_u32+0x49/0x70
[ 738.201119]  load_elf_binary+0x30e/0x10e0
[ 738.205159]  __do_execve_file+0x4c3/0x9f0
[ 738.209198]  __ia32_sys_execve+0x25/0x30
[ 738.213122]  do_syscall_32_irqs_on+0x3d/0x250
[ 738.217484]  ? do_user_addr_fault+0x1a0/0x3c0
[ 738.221841]  ? __prepare_exit_to_usermode+0x50/0x1a0
[ 738.226808]  ? prepare_exit_to_usermode+0x8/0x20
[ 738.231427]  do_fast_syscall_32+0x39/0xb0
[ 738.235438]  do_SYSENTER_32+0x15/0x20
[ 738.239107]  entry_SYSENTER_32+0x9f/0xf2
[ 738.243030] EIP: 0xb7f1d549
[ 738.245830] Code: Bad RIP value.
[ 738.249067] EAX: ffffffda EBX: bfbbc7c0 ECX: 08067420 EDX: bfbbc9f4
[ 738.255331] ESI: 08058a14 EDI: bfbbc7c9 EBP: bfbbc868 ESP: bfbbc798
[ 738.261595] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[ 738.268382] ---[ end trace bc662e76cba2d083 ]---
[ 770.832465] true (1218) used greatest stack depth: 5252 bytes left
[ 781.451530] ------------[ cut here ]------------
------------[ cut here ]------------
[ 784.798244] WARNING: CPU: 3 PID: 3053 at mm/mremap.c:211 move_page_tables+0x6ef/0x720
[ 784.806090] Modules linked in: x86_pkg_temp_thermal
[ 784.811007] CPU: 3 PID: 3053 Comm: true Tainted: G W 5.8.0-rc4 #1
[ 784.818449] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.0b 07/27/2017
[ 784.825928] EIP: move_page_tables+0x6ef/0x720
[ 784.830341] Code: 85 0c ff ff ff 8b 45 08 8b 4d d4 05 00 00 40 00 25 00 00 c0 ff 2b 45 08 39 c1 0f 46 c1 89 45 d4 01 f0 89 45 cc e9 d6 fa ff ff <0f> 0b 80 7d be 00 0f 84 80 fd ff ff e9 02 fe ff ff a8 80 0f 84 4d
[ 784.849138] EAX: e0663bf8 EBX: e0663bfc ECX: 8c463067 EDX: 000002fe
[ 784.855456] ESI: bfc00000 EDI: e0663bf8 EBP: f3e39dd4 ESP: f3e39d74
[ 784.861750] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
[ 784.868532] CR0: 80050033 CR2: b7d9e3b0 CR3: 20663000 CR4: 003406d0
[ 784.874843] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 784.881160] DR6: fffe0ff0 DR7: 00000400
[ 784.885053] Call Trace:
[ 784.887522]  setup_arg_pages+0x22b/0x310
[ 784.891483]  ? get_random_u32+0x49/0x70
[ 784.895347]  ? get_random_u32+0x49/0x70
[ 784.899232]  load_elf_binary+0x30e/0x10e0
[ 784.903295]  __do_execve_file+0x4c3/0x9f0
[ 784.907359]  __ia32_sys_execve+0x25/0x30
[ 784.911339]  do_syscall_32_irqs_on+0x3d/0x250
[ 784.915750]  ? do_user_addr_fault+0x1a0/0x3c0
[ 784.920161]  ? __prepare_exit_to_usermode+0x50/0x1a0
[ 784.925180]  ? prepare_exit_to_usermode+0x8/0x20
[ 784.929799]  do_fast_syscall_32+0x39/0xb0
[ 784.933862]  do_SYSENTER_32+0x15/0x20
[ 784.937580]  entry_SYSENTER_32+0x9f/0xf2
[ 784.941557] EIP: 0xb7f1d549
[ 784.944399] Code: Bad RIP value.
[ 784.947624] EAX: ffffffda EBX: bfbbc7c0 ECX: 08067420 EDX: bfbbc9f4
[ 784.953944] ESI: 08058a14 EDI: bfbbc7c9 EBP: bfbbc868 ESP: bfbbc798
[ 784.960261] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[ 784.967097] ---[ end trace bc662e76cba2d085 ]---
[ 792.443236] ------------[ cut here ]------------
full test log, https://lkft.validation.linaro.org/scheduler/job/1541788#L9308
-- Linaro LKFT https://lkft.linaro.org
On Thu, Jul 9, 2020 at 7:28 AM Naresh Kamboju naresh.kamboju@linaro.org wrote:
While running LTP mm test suite on i386 or qemu_i386 this kernel warning has been noticed from stable 5.4 to stable 5.7 branches and mainline 5.8.0-rc4 and linux next.
Are you able to correlate this with any particular test case in LTP, or does it happen for random processes?
In the log you linked to, it happens once for ksm05.c and multiple times for thp01.c, sources here:
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/k... https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/t...
Is it always these tests that trigger the warning, or sometimes others?
When you say it happens with linux-5.4 stable, does that mean you don't see it with older versions? What is the last known working version?
I also see that you give the virtual machine 16GB of RAM, but as you are running a 32-bit kernel without PAE, only 2.3GB end up being available, and some other LTP tests in the log run out of memory.
You could check if the behavior changes if you give the kernel less memory, e.g. 768MB (lowmem only), or enable CONFIG_X86_PAE to let it use the entire 16GB.
full test log, https://lkft.validation.linaro.org/scheduler/job/1541788#L9308
Arnd
On Thu, 9 Jul 2020 at 13:55, Arnd Bergmann arnd@arndb.de wrote:
On Thu, Jul 9, 2020 at 7:28 AM Naresh Kamboju naresh.kamboju@linaro.org wrote:
While running LTP mm test suite on i386 or qemu_i386 this kernel warning has been noticed from stable 5.4 to stable 5.7 branches and mainline 5.8.0-rc4 and linux next.
Are you able to correlate this with any particular test case in LTP, or does it happen for random processes?
In the log you linked to, it happens once for ksm05.c and multiple times for thp01.c, sources here:
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/k... https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/t...
Is it always these tests that trigger the warning, or sometimes others?
These two test cases are causing this warning multiple times on i386.
When you say it happens with linux-5.4 stable, does that mean you don't see it with older versions? What is the last known working version?
I do not see it on stable-4.19 and earlier versions. Sorry, I did not capture the last known working commit id or version.
It started happening with the first stable-rc 5.0 release; I have evidence [1] showing it on 5.0.1.
I also see that you give the virtual machine 16GB of RAM, but as you are running a 32-bit kernel without PAE, only 2.3GB end up being available, and some other LTP tests in the log run out of memory.
You could check if the behavior changes if you give the kernel less memory, e.g. 768MB (lowmem only), or enable CONFIG_X86_PAE to let it use the entire 16GB.
The warning is still happening after enabling the PAE config, but the oom-killer messages are gone. Thank you.
CONFIG_HIGHMEM=y
CONFIG_X86_PAE=y
Full test log (oom-killer messages gone, kernel warning still there): https://lkft.validation.linaro.org/scheduler/job/1552606#L10357
build location: https://builds.tuxbuild.com/puilcMcGVwzFMN5fDUhY4g/
[1] https://qa-reports.linaro.org/lkft/linux-stable-rc-5.0-oe/build/v5.0.1/testr...
---
[ 775.646689] WARNING: CPU: 3 PID: 10858 at /srv/oe/build/tmp-lkft-glibc/work-shared/intel-core2-32/kernel-source/mm/mremap.c:211 move_page_tables+0x553/0x570
[ 775.647006] Modules linked in: fuse
[ 775.647006] CPU: 3 PID: 10858 Comm: true Not tainted 5.0.1 #1
[ 775.647006] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 775.647006] EIP: move_page_tables+0x553/0x570
- Naresh
On Wed, Jul 8, 2020 at 10:28 PM Naresh Kamboju naresh.kamboju@linaro.org wrote:
While running LTP mm test suite on i386 or qemu_i386 this kernel warning has been noticed from stable 5.4 to stable 5.7 branches and mainline 5.8.0-rc4 and linux next.
Hmm
If this is repeatable, would you mind making the warning also print out the old range and new addresses and pmd value?
Something like the attached (UNTESTED!) patch.
Linus
On Fri, 10 Jul 2020 at 00:42, Linus Torvalds torvalds@linux-foundation.org wrote:
On Wed, Jul 8, 2020 at 10:28 PM Naresh Kamboju naresh.kamboju@linaro.org wrote:
While running LTP mm test suite on i386 or qemu_i386 this kernel warning has been noticed from stable 5.4 to stable 5.7 branches and mainline 5.8.0-rc4 and linux next.
Hmm
If this is repeatable, would you mind making the warning also print out the old range and new addresses and pmd value?
Your patch applied and re-tested; the warning triggered 10 times.
old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
Here is the crash output log:

thp01.c:98: PASS: system didn't crash.
[ 741.507000] ------------[ cut here ]------------
[ 741.511684] WARNING: CPU: 1 PID: 15173 at mm/mremap.c:211 move_page_tables.cold+0x0/0x2b
[ 741.519812] Modules linked in: x86_pkg_temp_thermal fuse
[ 741.525163] CPU: 1 PID: 15173 Comm: true Not tainted 5.8.0-rc4 #1
[ 741.531313] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018
[ 741.538760] EIP: move_page_tables.cold+0x0/0x2b
[ 741.543337] Code: b1 a0 03 00 00 81 c1 cc 04 00 00 bb ea ff ff ff 51 68 e0 bc 68 d8 c6 05 dc 29 97 d8 01 e8 13 26 e9 ff 83 c4 0c e9 70 ea ff ff <0f> 0b 52 50 ff 75 08 ff 75 b4 ff 75 d4 68 3c bd 68 d8 e8 f4 25 e9
[ 741.562140] EAX: 7d530067 EBX: e9c90ff8 ECX: 00000000 EDX: 00000000
[ 741.568456] ESI: 00000000 EDI: 7d5ba007 EBP: cef67dd0 ESP: cef67d28
[ 741.574776] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010202
[ 741.581623] CR0: 80050033 CR2: b7d53f50 CR3: 107da000 CR4: 003406f0
[ 741.587941] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 741.594259] DR6: fffe0ff0 DR7: 00000400
[ 741.598159] Call Trace:
[ 741.600694]  setup_arg_pages+0x22b/0x310
[ 741.604654]  ? _raw_spin_unlock_irqrestore+0x45/0x50
[ 741.609677]  ? trace_hardirqs_on+0x4b/0x110
[ 741.613930]  ? get_random_u32+0x4e/0x80
[ 741.617809]  ? get_random_u32+0x4e/0x80
[ 741.621687]  load_elf_binary+0x31e/0x10f0
[ 741.625714]  ? __do_execve_file+0x5b4/0xbf0
[ 741.629917]  ? find_held_lock+0x24/0x80
[ 741.633839]  __do_execve_file+0x5a8/0xbf0
[ 741.637893]  __ia32_sys_execve+0x2a/0x40
[ 741.641875]  do_syscall_32_irqs_on+0x3d/0x2c0
[ 741.646246]  ? find_held_lock+0x24/0x80
[ 741.650105]  ? lock_release+0x8a/0x260
[ 741.653890]  ? __might_fault+0x41/0x80
[ 741.657660]  do_fast_syscall_32+0x60/0xf0
[ 741.661691]  do_SYSENTER_32+0x15/0x20
[ 741.665373]  entry_SYSENTER_32+0x9f/0xf2
[ 741.669328] EIP: 0xb7f38549
[ 741.672140] Code: Bad RIP value.
[ 741.675430] EAX: ffffffda EBX: bfe19bf0 ECX: 08067420 EDX: bfe19e24
[ 741.681708] ESI: 08058a14 EDI: bfe19bf9 EBP: bfe19c98 ESP: bfe19bc8
[ 741.687991] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[ 741.694804] irq event stamp: 23911
[ 741.698253] hardirqs last enabled at (23929): [<d756f075>] console_unlock+0x4a5/0x610
[ 741.706181] hardirqs last disabled at (23946): [<d756ec5a>] console_unlock+0x8a/0x610
[ 741.714041] softirqs last enabled at (23962): [<d82b975c>] __do_softirq+0x2dc/0x3da
[ 741.721849] softirqs last disabled at (23973): [<d74a8275>] call_on_stack+0x45/0x50
[ 741.729513] ---[ end trace 170f646c1b6225e0 ]---
[ 741.734151] old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
Build link: https://builds.tuxbuild.com/1cwiUvFIB4M0hPyB1eA3cA/
vmlinux: https://builds.tuxbuild.com/1cwiUvFIB4M0hPyB1eA3cA/vmlinux.xz
system.map: https://builds.tuxbuild.com/1cwiUvFIB4M0hPyB1eA3cA/System.map
full test log, https://lkft.validation.linaro.org/scheduler/job/1554181#L10557
- Naresh
Something like the attached (UNTESTED!) patch.
Linus
On Thu, Jul 9, 2020 at 9:29 PM Naresh Kamboju naresh.kamboju@linaro.org wrote:
Your patch applied and re-tested. warning triggered 10 times.
old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
Hmm.. It's not even the overlapping case, it's literally just "move exactly 2MB of page tables exactly one pmd down". Which should be the nice efficient case where we can do it without modifying the lower page tables at all, we just move the PMD entry.
There shouldn't be anything in the new address space from bfa00000-bfdfffff.
That PMD value obviously says differently, but it looks like a nice normal PMD value, nothing bad there.
I'm starting to think that the issue might be that this is because the stack segment is special. Not only does it have the growsdown flag, but that whole thing has the magic guard page logic.
So I wonder if we have installed a guard page _just_ below the old stack, so that we have populated that pmd because of that.
We used to have an _actual_ guard page and then play nasty games with vm_start logic. We've gotten rid of that, though, and now we have that "stack_guard_gap" logic that _should_ mean that vm_start is always exact and proper (and that pgtables_free() should have emptied it, but maybe we have some case we forgot about).
[ 741.511684] WARNING: CPU: 1 PID: 15173 at mm/mremap.c:211 move_page_tables.cold+0x0/0x2b
[ 741.598159] Call Trace:
[ 741.600694]  setup_arg_pages+0x22b/0x310
[ 741.621687]  load_elf_binary+0x31e/0x10f0
[ 741.633839]  __do_execve_file+0x5a8/0xbf0
[ 741.637893]  __ia32_sys_execve+0x2a/0x40
[ 741.641875]  do_syscall_32_irqs_on+0x3d/0x2c0
[ 741.657660]  do_fast_syscall_32+0x60/0xf0
[ 741.661691]  do_SYSENTER_32+0x15/0x20
[ 741.665373]  entry_SYSENTER_32+0x9f/0xf2
[ 741.734151] old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
Nothing looks bad, and the ELF loading phase memory map should be really quite simple.
The only half-way unusual thing is that you have basically exactly 2MB of stack at execve time (easy enough to tune by just setting argv/env right), and it's moved down by exactly 2MB.
And that latter thing is just due to randomization, see arch_align_stack() in arch/x86/kernel/process.c.
So that would explain why it doesn't happen every time.
What happens if you apply the attached patch to *always* force the 2MB shift (rather than moving the stack by a random amount), and then run the other program (t.c -> compiled to "a.out").
The comment should be obvious. But it's untested, I might have gotten the math wrong. I don't run in a 32-bit environment.
Linus
On Fri, 10 Jul 2020 at 10:55, Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Jul 9, 2020 at 9:29 PM Naresh Kamboju naresh.kamboju@linaro.org wrote:
Your patch applied and re-tested. warning triggered 10 times.
old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
Hmm.. It's not even the overlapping case, it's literally just "move exactly 2MB of page tables exactly one pmd down". Which should be the nice efficient case where we can do it without modifying the lower page tables at all, we just move the PMD entry.
There shouldn't be anything in the new address space from bfa00000-bfdfffff.
That PMD value obviously says differently, but it looks like a nice normal PMD value, nothing bad there.
I'm starting to think that the issue might be that this is because the stack segment is special. Not only does it have the growsdown flag, but that whole thing has the magic guard page logic.
So I wonder if we have installed a guard page _just_ below the old stack, so that we have populated that pmd because of that.
We used to have an _actual_ guard page and then play nasty games with vm_start logic. We've gotten rid of that, though, and now we have that "stack_guard_gap" logic that _should_ mean that vm_start is always exact and proper (and that pgtbales_free() should have emptied it, but maybe we have some case we forgot about.
[ 741.511684] WARNING: CPU: 1 PID: 15173 at mm/mremap.c:211 move_page_tables.cold+0x0/0x2b
[ 741.598159] Call Trace:
[ 741.600694]  setup_arg_pages+0x22b/0x310
[ 741.621687]  load_elf_binary+0x31e/0x10f0
[ 741.633839]  __do_execve_file+0x5a8/0xbf0
[ 741.637893]  __ia32_sys_execve+0x2a/0x40
[ 741.641875]  do_syscall_32_irqs_on+0x3d/0x2c0
[ 741.657660]  do_fast_syscall_32+0x60/0xf0
[ 741.661691]  do_SYSENTER_32+0x15/0x20
[ 741.665373]  entry_SYSENTER_32+0x9f/0xf2
[ 741.734151] old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
Nothing looks bad, and the ELF loading phase memory map should be really quite simple.
The only half-way unusual thing is that you have basically exactly 2MB of stack at execve time (easy enough to tune by just setting argv/env right), and it's moved down by exactly 2MB.
And that latter thing is just due to randomization, see arch_align_stack() in arch/x86/kernel/process.c.
So that would explain why it doesn't happen every time.
What happens if you apply the attached patch to *always* force the 2MB shift (rather than moving the stack by a random amount), and then run the other program (t.c -> compiled to "a.out").
I applied your patch and started the test in a loop of a million iterations, but it only ran 35 times; the test seems to have hit a timeout after 1 hour. Kernel messages printed while testing a.out:
a.out (480) used greatest stack depth: 4872 bytes left
On another device: kworker/dying (172) used greatest stack depth: 5044 bytes left
Re-running the test with a longer timeout (4 hours) and will share findings.
ref: https://lkft.validation.linaro.org/scheduler/job/1555132#L1515
- Naresh
On Fri, Jul 10, 2020 at 10:48 AM Naresh Kamboju naresh.kamboju@linaro.org wrote:
I have applied your patch and test started in a loop for a million times but the test ran for 35 times. Seems like the test got a timeout after 1 hour.
That just means that my test-case was wrong (in the sense that what it was testing wasn't what was triggering things).
It might be wrong because I got the stack usage calculations wrong, but it might be wrong simply because the bug you've triggered with LTP needs something else to trigger.
Re-running test with long timeouts 4 hours and will share findings.
That test was more intended to trigger the issue reliably, if it was just a "stack is special, needs that exact 2MB aligned case to be problematic".
So re-running my "t.c" test for a long time with that kernel patch that removes the stack randomization isn't going to matter. Either it triggers the case, or not.
I don't actually see any case where we play with vm_start the way we used to (it was basically removed with commit 1be7107fbe18: "mm: larger stack guard gap, between vmas"). So it was kind of a long shot anyway.
Linus
On Sat, 11 Jul 2020 at 01:35, Linus Torvalds torvalds@linux-foundation.org wrote:
On Fri, Jul 10, 2020 at 10:48 AM Naresh Kamboju naresh.kamboju@linaro.org wrote:
I have started bisecting this problem and found the first bad commit:
commit 9f132f7e145506efc0744426cb338b18a54afc3b
Author: Joel Fernandes (Google) <joel@joelfernandes.org>
Date:   Thu Jan 3 15:28:41 2019 -0800
mm: select HAVE_MOVE_PMD on x86 for faster mremap
Moving page-tables at the PMD-level on x86 is known to be safe. Enable this option so that we can do fast mremap when possible.
Link: http://lkml.kernel.org/r/20181108181201.88826-4-joelaf@google.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e260460210e1..6185d4f33296 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -172,6 +172,7 @@ config X86
 	select HAVE_MEMBLOCK_NODE_MAP
 	select HAVE_MIXED_BREAKPOINTS_REGS
 	select HAVE_MOD_ARCH_SPECIFIC
+	select HAVE_MOVE_PMD
 	select HAVE_NMI
 	select HAVE_OPROFILE
 	select HAVE_OPTPROBES
After reverting the above patch, the reported kernel warning is gone on the Linus mainline tree at 5.8.0-rc4.
On Sat, Jul 11, 2020 at 10:27 AM Naresh Kamboju naresh.kamboju@linaro.org wrote:
I have started bisecting this problem and found the first bad commit
Thanks for the effort. Bisection is often a great tool to figure out what's wrong.
Sadly, in this case:
commit 9f132f7e145506efc0744426cb338b18a54afc3b
Author: Joel Fernandes (Google) <joel@joelfernandes.org>
Date:   Thu Jan 3 15:28:41 2019 -0800
mm: select HAVE_MOVE_PMD on x86 for faster mremap
Yeah, that's just the commit that enables the code, not the commit that introduces the fundamental problem.
That said, this is a prime example of why I absolutely detest patch series that do this kind of thing: several patches that create new functionality, followed by one patch to enable it.
If you can't get things working incrementally, maybe you shouldn't do them at all. Doing a big series of "hidden work" and then enabling it later is wrong.
In this case, though, the real patch that did the code isn't that kind of "big series of hidden work" patch series, it's just the (single) previous commit 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions").
So your bisection is useful, it's just that it really points to that previous commit, and it's where this code was introduced.
It's also worth noting that that commit doesn't really *break* anything, since it just falls back to the old behavior when it warns.
So to "fix" your test-case, we could just remove the WARN_ON.
But the WARN_ON() is still worrisome, because the thing it tests for really _should_ be true.
Now, we actually have a known bug in this area that is talked about elsewhere: the way unmap does the pgtable_free() is
	/* Detach vmas from rbtree */
	detach_vmas_to_be_unmapped(mm, vma, prev, end);

	if (downgrade)
		mmap_write_downgrade(mm);

	unmap_region(mm, vma, prev, start, end);
(and unmap_region() is what does the pgtable_free() that should have cleared the PMD).
And the problem with the "downgrade" is that another thread can change the beginning of the next vma when it's a grow-down region (or the end of the prev one if it's a grow-up).
See commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") for the source of that.
But that requires an _actual_ "unmap()" system call (the others set "downgrade" to false - I'm not entirely sure why), and it requires another thread to be attached to that VM in order to actually do that racy concurrent stack size change.
And neither of those two cases will be true for the execve() path. It's a new VM, with just one thread attached, so no threaded munmap() going on there.
The fact that it seems to happen with
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/t...
makes me think it's somehow related to THP mappings, but I don't see why those would matter. All the same pmd freeing should still have happened, afaik.
And the printout I asked for a few days back for when it triggered clearly showed a normal non-huge pmd ("val: 7d530067" is just "accessed, dirty, rw, user and present", which is a perfectly normal page directory entry for 4kB pages, and we could move the whole thing and move 2MB (or 4MB) of aligned virtual memory in one go).
Some race with THP splitting and pgtable_free()? I can't see how anything would race in execve(), or how anything would have touched that address below the stack in the first place..
Kirill, Oleg, any reaction to this? Neither of you was on the original email, I think; it's this one:
https://lore.kernel.org/lkml/CA+G9fYt+6OeibZMD0fP=O3nqFbcN3O4xcLkjq0mpQbZJ2u...
and I really think it is harmless in that when the warning triggers, we just go back to the page-by-page code, but while I think the WARN_ON() should probably be downgraded to a WARN_ON_ONCE(), I do think it's showing _something_.
I just can't see how this would trigger for execve(). That's just about the _simplest_ case for us: single-threaded, mappings set up purely by load_elf_binary() etc.
I'm clearly missing something.
Linus
On Sat, Jul 11, 2020 at 11:12 AM Linus Torvalds torvalds@linux-foundation.org wrote:
The fact that it seems to happen with
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/thp/thp01.c
makes me think it's somehow related to THP mappings, but I don't see why those would matter. All the same pmd freeing should still have happened, afaik.
No, I think the only reason it happens with that test-case is simply because that test-case tests many different argument sizes, and as part of that it ends up then hitting the "stack is exactly one pmd in size" case when the stack alignment is right too.
So the THP part is almost certainly a red herring.
Maybe. Considering that I don't see what's wrong, anything is possible.
Linus
On Sat, Jul 11, 2020 at 11:12:58AM -0700, Linus Torvalds wrote:
On Sat, Jul 11, 2020 at 10:27 AM Naresh Kamboju naresh.kamboju@linaro.org wrote:
I have started bisecting this problem and found the first bad commit
Thanks for the effort. Bisection is often a great tool to figure out what's wrong.
Sadly, in this case:
commit 9f132f7e145506efc0744426cb338b18a54afc3b
Author: Joel Fernandes (Google) <joel@joelfernandes.org>
Date:   Thu Jan 3 15:28:41 2019 -0800
mm: select HAVE_MOVE_PMD on x86 for faster mremap
Yeah, that's just the commit that enables the code, not the commit that introduces the fundamental problem.
That said, this is a prime example of why I absolutely detest patch series that do this kind of thing, and are several patches that create new functionality, followed by one patch to enable it.
If you can't get things working incrementally, maybe you shouldn't do them at all. Doing a big series of "hidden work" and then enabling it later is wrong.
In this case, though, the real patch that did the code isn't that kind of "big series of hidden work" patch series, it's just the (single) previous commit 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions").
So your bisection is useful, it's just that it really points to that previous commit, and it's where this code was introduced.
Right, I think I should have squashed the enabling of the config and the introduction of the feature into the same patch, but as you pointed out, that probably would not have made a difference for this bisect since it is a single patch.
It's also worth noting that that commit doesn't really *break* anything, since it just falls back to the old behavior when it warns.
Agreed, I am also of the opinion that the patch likely surfaces an existing issue rather than introducing a new one.
So to "fix" your test-case, we could just remove the WARN_ON.
But the WARN_ON() is still worrisome, because the thing it tests for really _should_ be true.
I'll get some tracing in an emulated i386 environment going and try to figure out exactly what is going on before the warning triggers. thanks for the other debug hints in this thread!
thanks,
- Joel
On Sat, Jul 11, 2020 at 11:12:58AM -0700, Linus Torvalds wrote:
Yeah, that's just the commit that enables the code, not the commit that introduces the fundamental problem.
That said, this is a prime example of why I absolutely detest patch series that do this kind of thing, and are several patches that create new functionality, followed by one patch to enable it.
If you can't get things working incrementally, maybe you shouldn't do them at all. Doing a big series of "hidden work" and then enabling it later is wrong.
I'm struggling with exactly this for my current THP-in-pagecache patches. There are about fifty patches, each fixing something that won't work if the page is a THP. And then there's the one patch which actually starts creating THPs, and that's the one patch any bisect will point to.
But I don't see any other way to do it. It's not like I can put THPs in the page cache before fixing the things that won't work.
This probably wasn't the kind of thing you had in mind when you wrote the above, but if you had some advice for my situation, I'd welcome it.
On Sun, Jul 12, 2020 at 10:31 AM Matthew Wilcox willy@infradead.org wrote:
But I don't see any other way to do it. It's not like I can put THPs in the page cache before fixing the things that won't work.
I agree that sometimes there are bootstrapping issues. Incremental and explanatory commits are still better than one big commit that introduces a whole new feature and enables it.
But if at all possible, at least limit the scope of the new feature first, enabling the simplest possible cases as they become possible so that there's some incremental testing, and so that bisection can say "ok, that baseline worked, but then when XYZ happened, things went sideways".
And even when it's a new feature - if it needs cleanup patches to other things first, please do that. In fact, please do that as a completely independent series that goes into a previous kernel release entirely, so that the cleanup and preparatory patches can be independently verified by a lot of people who run that _previous_ kernel, so that the baseline of that cleanup phase is something as stable as possible.
Linus
On Thu, Jul 09, 2020 at 10:22:21PM -0700, Linus Torvalds wrote:
On Thu, Jul 9, 2020 at 9:29 PM Naresh Kamboju naresh.kamboju@linaro.org wrote:
Your patch applied and re-tested. warning triggered 10 times.
old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
Hmm.. It's not even the overlapping case, it's literally just "move exactly 2MB of page tables exactly one pmd down". Which should be the nice efficient case where we can do it without modifying the lower page tables at all, we just move the PMD entry.
Hi Linus,
I reproduced Naresh's issue on a 32-bit x86 machine and the below patch fixes it. The issue is solely within execve() itself and the way it allocates/copies the temporary stack.
It is actually indeed an overlapping case because the length of the stack is big enough to cause overlap. The VMA grows quite a bit because of all the page faults that happen due to the copy of the args/env. Then, during the move of the overlapped region, it finds that a PMD is already allocated.
The below patch fixes it and is not warning anymore in 30 minutes of testing so far.
Naresh, could you also test the below patch on your setup?
thanks,
- Joel
---8<-----------------------
From: Joel Fernandes joelaf@google.com
Subject: [PATCH] fs/exec: Fix stack overlap issue during stack moving in i386
When running LTP's thp01 test, it is observed that a warning fires in move_page_tables() because a PMD is already allocated.
This happens because there is an address space overlap between the temporary stack created and the range it is being moved to when the move_page_tables() is requested. During the move_page_tables() loop, it picks the same valid PMD that was already allocated for the temporary stack. This loop requires the PMD to be new or it warns. Making sure the new location of the stack is non-overlapping with the old location makes the warning go away.
Fixes: b6a2fea39318e ("mm: variable length argument support")
Reported-by: Naresh Kamboju naresh.kamboju@linaro.org
Signed-off-by: Joel Fernandes (Google) joel@joelfernandes.org
---
 fs/exec.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/fs/exec.c b/fs/exec.c
index e6e8a9a703278..a270205228a1a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -755,6 +755,10 @@ int setup_arg_pages(struct linux_binprm *bprm,
 
 	stack_shift = vma->vm_end - stack_top;
 
+	/* Ensure the temporary stack is shifted by at least its size */
+	if (stack_shift < (vma->vm_end - vma->vm_start))
+		stack_shift = (vma->vm_end - vma->vm_start);
+
 	bprm->p -= stack_shift;
 	mm->arg_start = bprm->p;
 #endif
On Sun, Jul 12, 2020 at 2:50 PM Joel Fernandes joel@joelfernandes.org wrote:
I reproduced Naresh's issue on a 32-bit x86 machine and the below patch fixes it. The issue is solely within execve() itself and the way it allocates/copies the temporary stack.
It is actually indeed an overlapping case because the length of the stack is big enough to cause overlap. The VMA grows quite a bit because of all the page faults that happen due to the copy of the args/env. Then, during the move of the overlapped region, it finds that a PMD is already allocated.
Oh, ok, I think I see.
So the issue is that while move_normal_pmd() _itself_ will be clearing the old pmd entries when it copies them, the 4kB copies that happened _before_ this time will not have cleared the old pmd that they were in.
So we've had a deeper stack, and we've already copied some of it one page at a time, and we're done with those 4kB entries, but now we hit a 2MB-aligned case and want to move that down. But when it does that, it hits the (by now hopefully empty) pmd that used to contain the 4kB entries we copied earlier.
So we've cleared the page table, but we've simply never called pgtable_clear() here, because the page table got cleared in move_ptes() when it did that
pte = ptep_get_and_clear(mm, old_addr, old_pte);
on the previous old pte entries.
That makes sense to me. Does that match what you see? Because when I said it wasn't overlapped, I was really talking about that one single pmd move itself not being overlapped.
The below patch fixes it and is not warning anymore in 30 minutes of testing so far.
So I'm not hugely happy with the patch, I have to admit.
Why?
Because:
/* Ensure the temporary stack is shifted by at least its size */
if (stack_shift < (vma->vm_end - vma->vm_start))
stack_shift = (vma->vm_end - vma->vm_start);
This basically overrides the random stack shift done by arch_align_stack().
Yeah, yeah, arch_align_stack() is kind of misnamed. It _used_ to do what the name implies it does, which on x86 is to just give the minimum ABI-specified 16-byte stack alignment.
But these days, what it really does is say "align the stack pointer, but also shift it down by a random amount within 8kB so that the start of the stack isn't some repeatable thing that an attacker can trivially control with the right argv/envp size.."
I don't think it works right anyway because we then PAGE_ALIGN() it, but whatever.
But it looks like what you're doing means that now the size of the stack will override that shift, and that means that now the randomization is entirely undone. No?
Plus you do it after we've also checked that we haven't grown the stack down to below mmap_min_addr().
But maybe I misread that patch.
But I do feel like you figured out why the bug happened, now we're just discussing whether the patch is the right thing to do.
Maybe saying "doing the pmd copies for the initial stack isn't important, so let's just note this as a special case and get rid of the WARN_ON()" might be an alternative solution.
The reason I worried was that I felt like we didn't understand why the WARN_ON() happened. Now that I do, I think we could just say "ok, don't warn, we know that this can happen, and it's harmless".
Anybody else have any opinions?
Linus
Hi Linus,
On Sun, Jul 12, 2020 at 03:58:06PM -0700, Linus Torvalds wrote:
On Sun, Jul 12, 2020 at 2:50 PM Joel Fernandes joel@joelfernandes.org wrote:
I reproduced Naresh's issue on a 32-bit x86 machine and the below patch fixes it. The issue is solely within execve() itself and the way it allocates/copies the temporary stack.
It is actually indeed an overlapping case because the length of the stack is big enough to cause overlap. The VMA grows quite a bit because of all the page faults that happen due to the copy of the args/env. Then, during the move of the overlapped region, it finds that a PMD is already allocated.
Oh, ok, I think I see.
So the issue is that while move_normal_pmd() _itself_ will be clearing the old pmd entries when it copies them, the 4kB copies that happened _before_ this time will not have cleared the old pmd that they were in.
So we've had a deeper stack, and we've already copied some of it one page at a time, and we're done with those 4kB entries, but now we hit a 2MB-aligned case and want to move that down. But when it does that, it hits the (by now hopefully empty) pmd that used to contain the 4kB entries we copied earlier.
So we've cleared the page table, but we've simply never called pgtable_clear() here, because the page table got cleared in move_ptes() when it did that
pte = ptep_get_and_clear(mm, old_addr, old_pte);
on the previous old pte entries.
That makes sense to me. Does that match what you see? Because when I said it wasn't overlapped, I was really talking about that one single pmd move itself not being overlapped.
This matches exactly what you mentioned.
Here is some analysis with specific numbers:
Some time during execve(), the copy_strings() causes page faults. During this, the VMA is growing and a new PMD is allocated during the page fault:

copy_strings: Copying args/env old process's memory 8067420 to new stack bfb0eec6 (len 4096)
handle_mm_fault: vma: bfb0d000 c0000000
handle_mm_fault: pmd_alloc: ad: bfb0dec6 ptr: f46d0bf8

Here we see that the pmd entry's (pmd_t *) pointer value is f46d0bf8 and the fault was on address bfb0dec6.
Similarly, I note the pattern of other copy_strings() faulting and do the following mapping:
address space            pmd entry pointer
0xbfe00000-0xc0000000    f4698ff8
0xbfc00000-0xbfe00000    f4698ff0
0xbfa00000-0xbfc00000    f4698fe8
This is all from tracing the copy_strings().
Then later, the kernel decides to move the VMA down. The VMA's total size is 5MB. The stack is initially at a 1MB-aligned address: 0xbfb00000. exec requests move_page_tables() to move it down by 4MB, that is, to 0xbf700000, which is also 1MB aligned. This is an overlapping move.
move_page_tables() starts, and I see the following prints before the warning fires.
The plan of attack is: first copy 1MB using move_ptes(), like you said. Then it hits a 2MB-aligned boundary and starts move_normal_pmd(). For both move_ptes() and move_normal_pmd(), a pmd_alloc() happens first, which is printed below:
move_page_tables: move_page_tables old=bfb00000 (len:5242880) new=bf700000
move_page_tables: pmd_alloc: ad: bf700000 ptr: f4698fd8
move_page_tables: move_ptes old=bfb00000->bfc00000 new=bf700000
move_page_tables: pmd_alloc: ad: bf800000 ptr: f4698fe0
move_page_tables: move_normal_pmd: old: bfc00000-c0000000 new: bf800000 (val: 0)
move_page_tables: pmd_alloc: ad: bfa00000 ptr: f4698fe8
move_page_tables: move_normal_pmd: old: bfe00000-c0000000 new: bfa00000 (val: 34164067)
So basically, it did 1MB worth of move_ptes(), and then twice 2MB worth of move_normal_pmd(). Since the shift was only 4MB, it hit an already allocated PMD during the second move_normal_pmd(). The warning fires because that move_normal_pmd() sees that the PMD covering 0xbfa00000 was already allocated before.
As you said, this is harmless.
One thing I observed by code reading is that move_page_tables() is (in the case of mremap) only called for non-overlapping moves (through mremap_to() or move_vma(), as the case may be). It makes me nervous that we are doing an overlapping move and causing some bug inadvertently.
Could you let me know if there is any reason why we should not make the new stack's location non-overlapping, just to keep things simple? (Assuming we fix the randomization-override issues you mentioned below.)
The below patch fixes it and is not warning anymore in 30 minutes of testing so far.
So I'm not hugely happy with the patch, I have to admit.
Why?
Because:
/* Ensure the temporary stack is shifted by at least its size */
if (stack_shift < (vma->vm_end - vma->vm_start))
stack_shift = (vma->vm_end - vma->vm_start);
This basically overrides the random stack shift done by arch_align_stack().
Yeah, yeah, arch_align_stack() is kind of misnamed. It _used_ to do what the name implies it does, which on x86 is to just give the minimum ABI-specified 16-byte stack alignment. But these days, what it really does is say "align the stack pointer, but also shift it down by a random amount within 8kB so that the start of the stack isn't some repeatable thing that an attacker can trivially control with the right argv/envp size.."
Got it, thanks I will work on improving the patch along these lines.
I don't think it works right anyway because we then PAGE_ALIGN() it, but whatever.
:)
But it looks like what you're doing means that now the size of the stack will override that shift, and that means that now the randomization is entirely undone. No?
Yes, true. I will work on fixing it.
Plus you do it after we've also checked that we haven't grown the stack down to below mmap_min_addr().
But maybe I misread that patch.
But I do feel like you figured out why the bug happened, now we're just discussing whether the patch is the right thing to do.
Yes.
Maybe saying "doing the pmd copies for the initial stack isn't important, so let's just note this as a special case and get rid of the WARN_ON()" might be an alternative solution.
Personally, I feel it is better to keep the warning just so in the future we can detect any bugs.
thanks,
- Joel
On Sun, Jul 12, 2020 at 7:53 PM Joel Fernandes joel@joelfernandes.org wrote:
But I do feel like you figured out why the bug happened, now we're just discussing whether the patch is the right thing to do.
Yes.
Maybe saying "doing the pmd copies for the initial stack isn't important, so let's just note this as a special case and get rid of the WARN_ON()" might be an alternative solution.
Personally, I feel it is better to keep the warning just so in the future we can detect any bugs.
I don't disagree; the warning didn't happen to find a bug now, but it did find a case where we might be able to do better.
So now that I feel we understand the issue, and it's not a horrible problem, just a (very hard to trigger) warning, I don't think there's any huge hurry.
I think I will - for now - change the WARN_ON() to WARN_ON_ONCE() (so that it doesn't flood the logs if somebody triggers this odd special case maliciously), and add a note about how this happens to the code for posterity.
And if/when you figure out a better way to fix it, we can update the note.
Ok?
Linus
On Sun, Jul 12, 2020 at 08:51:26PM -0700, Linus Torvalds wrote:
Maybe saying "doing the pmd copies for the initial stack isn't important, so let's just note this as a special case and get rid of the WARN_ON()" might be an alternative solution.
Personally, I feel it is better to keep the warning just so in the future we can detect any bugs.
I don't disagree; the warning didn't happen to find a bug now, but it did find a case where we might be able to do better.
So now that I feel we understand the issue, and it's not a horrible problem, just a (very hard to trigger) warning, I don't think there's any huge hurry.
I think I will - for now - change the WARN_ON() to WARN_ON_ONCE() (so that it doesn't flood the logs if somebody triggers this odd special case maliciously), and add a note about how this happens to the code for posterity.
And if/when you figure out a better way to fix it, we can update the note.
Ok?
Yes, that sounds great to me.
thanks,
- Joel
On Sun, Jul 12, 2020 at 03:58:06PM -0700, Linus Torvalds wrote:
Anybody else have any opinions?
Maybe we just shouldn't allow move_normal_pmd() if ranges overlap?
Other option: pass 'overlaps' down to move_normal_pmd() and only WARN() if we see an established PMD without 'overlaps' being true.
Untested patch:
diff --git a/mm/mremap.c b/mm/mremap.c
index 5dd572d57ca9..e33fcee541fe 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -245,6 +245,18 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	unsigned long extent, next, old_end;
 	struct mmu_notifier_range range;
 	pmd_t *old_pmd, *new_pmd;
+	bool overlaps;
+
+	/*
+	 * shift_arg_pages() can call move_page_tables() on overlapping ranges.
+	 * In this case we cannot use move_normal_pmd() because destination pmd
+	 * might be established page table: move_ptes() doesn't free page
+	 * table.
+	 */
+	if (old_addr > new_addr)
+		overlaps = old_addr - new_addr < len;
+	else
+		overlaps = new_addr - old_addr < len;
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
@@ -282,7 +294,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			split_huge_pmd(vma, old_pmd, old_addr);
 			if (pmd_trans_unstable(old_pmd))
 				continue;
-		} else if (extent == PMD_SIZE) {
+		} else if (!overlaps && extent == PMD_SIZE) {
 #ifdef CONFIG_HAVE_MOVE_PMD
 			/*
 			 * If the extent is PMD-sized, try to speed the move by
On Tue, 14 Jul 2020 at 13:03, Kirill A. Shutemov kirill@shutemov.name wrote:
On Sun, Jul 12, 2020 at 03:58:06PM -0700, Linus Torvalds wrote:
Anybody else have any opinions?
Maybe we just shouldn't allow move_normal_pmd() if ranges overlap?
Other option: pass 'overlaps' down to move_normal_pmd() and only WARN() if we see an established PMD without 'overlaps' being true.
Untested patch:
This patch was applied on top of Linus's mainline tree and tested on i386. The reported warning did not happen while testing LTP mm [1].
diff --git a/mm/mremap.c b/mm/mremap.c
index 5dd572d57ca9..e33fcee541fe 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -245,6 +245,18 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	unsigned long extent, next, old_end;
 	struct mmu_notifier_range range;
 	pmd_t *old_pmd, *new_pmd;
+	bool overlaps;
+
+	/*
+	 * shift_arg_pages() can call move_page_tables() on overlapping ranges.
+	 * In this case we cannot use move_normal_pmd() because destination pmd
+	 * might be established page table: move_ptes() doesn't free page
+	 * table.
+	 */
+	if (old_addr > new_addr)
+		overlaps = old_addr - new_addr < len;
+	else
+		overlaps = new_addr - old_addr < len;
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
@@ -282,7 +294,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			split_huge_pmd(vma, old_pmd, old_addr);
 			if (pmd_trans_unstable(old_pmd))
 				continue;
-		} else if (extent == PMD_SIZE) {
+		} else if (!overlaps && extent == PMD_SIZE) {
 #ifdef CONFIG_HAVE_MOVE_PMD
 			/*
 			 * If the extent is PMD-sized, try to speed the move by
--
Kirill A. Shutemov
ref: https://lkft.validation.linaro.org/scheduler/job/1561734#L1598
- Naresh
Hi Kirill,
On Tue, Jul 14, 2020 at 10:33:06AM +0300, Kirill A. Shutemov wrote:
On Sun, Jul 12, 2020 at 03:58:06PM -0700, Linus Torvalds wrote:
Anybody else have any opinions?
Maybe we just shouldn't allow move_normal_pmd() if ranges overlap?
Other option: pass 'overlaps' down to move_normal_pmd() and only WARN() if we see an established PMD without 'overlaps' being true.
I was thinking we should not call move_page_tables() with overlapping ranges at all, just to keep things simple. I am concerned about other issues, such as: if you move a range forward, you could end up overwriting part of the source range.
Allow me some time to develop a proper patch, I have it on my list. I will try to get to it this week.
I think we can also add a patch to detect the overlap as you did and warn in such situation.
Thoughts?
thanks,
- Joel
Untested patch:
diff --git a/mm/mremap.c b/mm/mremap.c
index 5dd572d57ca9..e33fcee541fe 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -245,6 +245,18 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	unsigned long extent, next, old_end;
 	struct mmu_notifier_range range;
 	pmd_t *old_pmd, *new_pmd;
+	bool overlaps;
+
+	/*
+	 * shift_arg_pages() can call move_page_tables() on overlapping ranges.
+	 * In this case we cannot use move_normal_pmd() because destination pmd
+	 * might be established page table: move_ptes() doesn't free page
+	 * table.
+	 */
+	if (old_addr > new_addr)
+		overlaps = old_addr - new_addr < len;
+	else
+		overlaps = new_addr - old_addr < len;
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
@@ -282,7 +294,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			split_huge_pmd(vma, old_pmd, old_addr);
 			if (pmd_trans_unstable(old_pmd))
 				continue;
-		} else if (extent == PMD_SIZE) {
+		} else if (!overlaps && extent == PMD_SIZE) {
 #ifdef CONFIG_HAVE_MOVE_PMD
 			/*
 			 * If the extent is PMD-sized, try to speed the move by
--
Kirill A. Shutemov
On Tue, Jul 14, 2020 at 9:08 AM Joel Fernandes joel@joelfernandes.org wrote:
I was thinking we should not call move_page_tables() with overlapping ranges at all, just to keep things simple.
No, we're not breaking the existing stack movement code just to keep things simple.
The rule is "make it as simple as possible, but no simpler".
And "as possible" in the case of Linux means "no breaking of old interfaces". The stack randomization movement most certainly counts.
Linus
On Tue, Jul 14, 2020 at 12:11 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, Jul 14, 2020 at 9:08 AM Joel Fernandes joel@joelfernandes.org wrote:
I was thinking we should not call move_page_tables() with overlapping ranges at all, just to keep things simple.
No, we're not breaking the existing stack movement code just to keep things simple.
The rule is "make it as simple as possible, but no simpler".
And "as possible" in the case of Linux means "no breaking of old interfaces". The stack randomization movement most certainly counts.
Hi Linus, I think you misunderstood me. I was not advocating breaking the stack movement code or breaking stack randomization; I was going to try to see if I could keep that working while not having to do an overlapping move. That is what I would do in the future patch (until then, as you mentioned, you would switch the WARN_ON to WARN_ON_ONCE).
I agree with you that we keep things as simple as possible while not breaking functionality.
Cheers,
- Joel
On Tue, Jul 14, 2020 at 11:12 AM Joel Fernandes joel@joelfernandes.org wrote:
I think you misunderstood me. I was not advocating breaking the stack movement code or breaking stack randomization, I was going to try to see if I could keep that working while not having to do an overlapping move.
I'm not really seeing how you'd do that with a big stack that gets close to the stack ulimit.
Except by avoiding randomization.
But the existing randomization may be so bad that it doesn't much matter. And I do think we limit the execve stack to a reasonably small fraction of the whole ulimit. So worth exploring, I guess.
The current code with "align_stack" doing randomization could also do with a lot of clarifications. The code is odd.
Linus
linux-stable-mirror@lists.linaro.org