The patch titled
Subject: memcontrol: ensure memcg acquired by id is properly set up
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
memcontrol-ensure-memcg-acquired-by-id-is-properly-set-up.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Johannes Weiner <hannes(a)cmpxchg.org>
Subject: memcontrol: ensure memcg acquired by id is properly set up
Date: Wed, 23 Aug 2023 15:54:30 -0700
In the eviction recency check, we attempt to retrieve the memcg to which
the folio belonged when it was evicted, by the memcg id stored in the
shadow entry. However, there is a chance that the retrieved memcg is not
the original memcg that has been killed, but a new one which happens to
have the same id.
This is a somewhat unfortunate, but acceptable and rare inaccuracy in the
heuristics. However, if we retrieve this new memcg between its allocation
and when it is properly attached to the memcg hierarchy, we could run into
the following NULL pointer exception during the memcg hierarchy traversal
done in mem_cgroup_get_nr_swap_pages():
[ 155757.793456] BUG: kernel NULL pointer dereference, address: 00000000000000c0
[ 155757.807568] #PF: supervisor read access in kernel mode
[ 155757.818024] #PF: error_code(0x0000) - not-present page
[ 155757.828482] PGD 401f77067 P4D 401f77067 PUD 401f76067 PMD 0
[ 155757.839985] Oops: 0000 [#1] SMP
[ 155757.887870] RIP: 0010:mem_cgroup_get_nr_swap_pages+0x3d/0xb0
[ 155757.899377] Code: 29 19 4a 02 48 39 f9 74 63 48 8b 97 c0 00 00 00 48 8b b7 58 02 00 00 48 2b b7 c0 01 00 00 48 39 f0 48 0f 4d c6 48 39 d1 74 42 <48> 8b b2 c0 00 00 00 48 8b ba 58 02 00 00 48 2b ba c0 01 00 00 48
[ 155757.937125] RSP: 0018:ffffc9002ecdfbc8 EFLAGS: 00010286
[ 155757.947755] RAX: 00000000003a3b1c RBX: 000007ffffffffff RCX: ffff888280183000
[ 155757.962202] RDX: 0000000000000000 RSI: 0007ffffffffffff RDI: ffff888bbc2d1000
[ 155757.976648] RBP: 0000000000000001 R08: 000000000000000b R09: ffff888ad9cedba0
[ 155757.991094] R10: ffffea0039c07900 R11: 0000000000000010 R12: ffff888b23a7b000
[ 155758.005540] R13: 0000000000000000 R14: ffff888bbc2d1000 R15: 000007ffffc71354
[ 155758.019991] FS: 00007f6234c68640(0000) GS:ffff88903f9c0000(0000) knlGS:0000000000000000
[ 155758.036356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 155758.048023] CR2: 00000000000000c0 CR3: 0000000a83eb8004 CR4: 00000000007706e0
[ 155758.062473] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 155758.076924] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 155758.091376] PKRU: 55555554
[ 155758.096957] Call Trace:
[ 155758.102016] <TASK>
[ 155758.106502] ? __die+0x78/0xc0
[ 155758.112793] ? page_fault_oops+0x286/0x380
[ 155758.121175] ? exc_page_fault+0x5d/0x110
[ 155758.129209] ? asm_exc_page_fault+0x22/0x30
[ 155758.137763] ? mem_cgroup_get_nr_swap_pages+0x3d/0xb0
[ 155758.148060] workingset_test_recent+0xda/0x1b0
[ 155758.157133] workingset_refault+0xca/0x1e0
[ 155758.165508] filemap_add_folio+0x4d/0x70
[ 155758.173538] page_cache_ra_unbounded+0xed/0x190
[ 155758.182919] page_cache_sync_ra+0xd6/0x1e0
[ 155758.191738] filemap_read+0x68d/0xdf0
[ 155758.199495] ? mlx5e_napi_poll+0x123/0x940
[ 155758.207981] ? __napi_schedule+0x55/0x90
[ 155758.216095] __x64_sys_pread64+0x1d6/0x2c0
[ 155758.224601] do_syscall_64+0x3d/0x80
[ 155758.232058] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 155758.242473] RIP: 0033:0x7f62c29153b5
[ 155758.249938] Code: e8 48 89 75 f0 89 7d f8 48 89 4d e0 e8 b4 e6 f7 ff 41 89 c0 4c 8b 55 e0 48 8b 55 e8 48 8b 75 f0 8b 7d f8 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 45 f8 e8 e7 e6 f7 ff 48 8b
[ 155758.288005] RSP: 002b:00007f6234c5ffd0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
[ 155758.303474] RAX: ffffffffffffffda RBX: 00007f628c4e70c0 RCX: 00007f62c29153b5
[ 155758.318075] RDX: 000000000003c041 RSI: 00007f61d2986000 RDI: 0000000000000076
[ 155758.332678] RBP: 00007f6234c5fff0 R08: 0000000000000000 R09: 0000000064d5230c
[ 155758.347452] R10: 000000000027d450 R11: 0000000000000293 R12: 000000000003c041
[ 155758.362044] R13: 00007f61d2986000 R14: 00007f629e11b060 R15: 000000000027d450
[ 155758.376661] </TASK>
This patch fixes the issue by moving the memcg's id publication from the
alloc stage to online stage, ensuring that any memcg acquired via id must
be connected to the memcg tree.
Link: https://lkml.kernel.org/r/20230823225430.166925-1-nphamcs@gmail.com
Fixes: f78dfc7b77d5 ("workingset: fix confusion around eviction vs refault container")
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
Co-developed-by: Nhat Pham <nphamcs(a)gmail.com>
Signed-off-by: Nhat Pham <nphamcs(a)gmail.com>
Cc: Yosry Ahmed <yosryahmed(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
--- a/mm/memcontrol.c~memcontrol-ensure-memcg-acquired-by-id-is-properly-set-up
+++ a/mm/memcontrol.c
@@ -5339,7 +5339,6 @@ static struct mem_cgroup *mem_cgroup_all
INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue);
memcg->deferred_split_queue.split_queue_len = 0;
#endif
- idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
lru_gen_init_memcg(memcg);
return memcg;
fail:
@@ -5411,14 +5410,27 @@ static int mem_cgroup_css_online(struct
if (alloc_shrinker_info(memcg))
goto offline_kmem;
- /* Online state pins memcg ID, memcg ID pins CSS */
- refcount_set(&memcg->id.ref, 1);
- css_get(css);
-
if (unlikely(mem_cgroup_is_root(memcg)))
queue_delayed_work(system_unbound_wq, &stats_flush_dwork,
FLUSH_TIME);
lru_gen_online_memcg(memcg);
+
+ /* Online state pins memcg ID, memcg ID pins CSS */
+ refcount_set(&memcg->id.ref, 1);
+ css_get(css);
+
+ /*
+ * Ensure mem_cgroup_from_id() works once we're fully online.
+ *
+ * We could do this earlier and require callers to filter with
+ * css_tryget_online(). But right now there are no users that
+ * need earlier access, and the workingset code relies on the
+ * cgroup tree linkage (mem_cgroup_get_nr_swap_pages()). So
+ * publish it here at the end of onlining. This matches the
+ * regular ID destruction during offlining.
+ */
+ idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
+
return 0;
offline_kmem:
memcg_offline_kmem(memcg);
_
Patches currently in -mm which might be from hannes(a)cmpxchg.org are
memcontrol-ensure-memcg-acquired-by-id-is-properly-set-up.patch
mm-page_alloc-remove-stale-cma-guard-code.patch
On 8/25/23 01:33, Conor Dooley wrote:
> On Fri, Aug 25, 2023 at 01:30:36AM +0800, Mingzheng Xing wrote:
>> On 8/24/23 19:32, Mingzheng Xing wrote:
>>> On 8/23/23 21:31, Conor Dooley wrote:
>>>> On Wed, Aug 23, 2023 at 12:51:13PM +0100, Conor Dooley wrote:
>>>>> On Thu, Aug 17, 2023 at 03:20:24PM +0000,
>>>>> patchwork-bot+linux-riscv(a)kernel.org wrote:
>>>>>> Hello:
>>>>>>
>>>>>> This patch was applied to riscv/linux.git (fixes)
>>>>>> by Palmer Dabbelt <palmer(a)rivosinc.com>:
>>>>>>
>>>>>> On Thu, 10 Aug 2023 00:56:48 +0800 you wrote:
>>>>>>> Binutils-2.38 and GCC-12.1.0 bumped[0][1] the default
>>>>>>> ISA spec to the newer
>>>>>>> 20191213 version which moves some instructions from the
>>>>>>> I extension to the
>>>>>>> Zicsr and Zifencei extensions. So if one of the binutils
>>>>>>> and GCC exceeds
>>>>>>> that version, we should explicitly specifying Zicsr and
>>>>>>> Zifencei via -march
>>>>>>> to cope with the new changes. but this only occurs when
>>>>>>> binutils >= 2.36
>>>>>>> and GCC >= 11.1.0. It's a different story when binutils < 2.36.
>>>>>>>
>>>>>>> [...]
>>>>>> Here is the summary with links:
>>>>>> - [v5] riscv: Handle zicsr/zifencei issue between gcc and binutils
>>>>>> https://git.kernel.org/riscv/c/ca09f772ccca
>>>>> *sigh* so this breaks the build for gcc-11 & binutils 2.37 w/
>>>>> Assembler messages:
>>>>> Error: cannot find default versions of the ISA extension `zicsr'
>>>>> Error: cannot find default versions of the ISA extension `zifencei'
>>>>>
>>>>> I'll have a poke later.
>>>> So uh, are we sure that this should not be:
>>>> - depends on (CC_IS_CLANG && CLANG_VERSION < 170000) ||
>>>> (CC_IS_GCC && GCC_VERSION < 110100)
>>>> + depends on (CC_IS_CLANG && CLANG_VERSION < 170000) ||
>>>> (CC_IS_GCC && GCC_VERSION <= 110100)
>>>>
>>>> My gcc-11.1 + binutils 2.37 toolchain built from riscv-gnu-toolchain
>>>> doesn't have the default versions & the above diff fixes the build.
>>> I reproduced the error, the combination of gcc-11.1 and
>>> binutils 2.37 does cause errors. What a surprise, since binutils
>>> 2.36 and 2.38 are fine.
>>>
>>> I used git bisect to locate this commit[1] for binutils.
>>> I'll test this diff in more detail later. Thanks!
>>>
>>> [1] https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=f0bae2552db1dd4f1…
>>>
>> Hi, Conor.
>> The above error does originate from link[1] mentioned above, which was
>> resolved in gcc-12.1.0[2], and gcc-11.3.0 made the backport[3].
>> So gcc-11.2.0 combined with binutils 2.37 produces the same error.
>> I think we should do the following diff to fix it:
>> - depends on (CC_IS_CLANG && CLANG_VERSION < 170000) || (CC_IS_GCC &&
>> GCC_VERSION < 110100)
>> + depends on (CC_IS_CLANG && CLANG_VERSION < 170000) || (CC_IS_GCC &&
>> GCC_VERSION < 110300)
> That sounds good to me. Can you make that a real patch please?
Okey.
Just a question, I see that the previous patch has been merged
into 6.5-rc7, and now a new fix patch should be sent out based
on that, right?
Best Regards,
Mingzheng.
>
> Thanks for working on it :)
On 8/23/23 21:31, Conor Dooley wrote:
> On Wed, Aug 23, 2023 at 12:51:13PM +0100, Conor Dooley wrote:
>> On Thu, Aug 17, 2023 at 03:20:24PM +0000, patchwork-bot+linux-riscv(a)kernel.org wrote:
>>> Hello:
>>>
>>> This patch was applied to riscv/linux.git (fixes)
>>> by Palmer Dabbelt <palmer(a)rivosinc.com>:
>>>
>>> On Thu, 10 Aug 2023 00:56:48 +0800 you wrote:
>>>> Binutils-2.38 and GCC-12.1.0 bumped[0][1] the default ISA spec to the newer
>>>> 20191213 version which moves some instructions from the I extension to the
>>>> Zicsr and Zifencei extensions. So if one of the binutils and GCC exceeds
>>>> that version, we should explicitly specifying Zicsr and Zifencei via -march
>>>> to cope with the new changes. but this only occurs when binutils >= 2.36
>>>> and GCC >= 11.1.0. It's a different story when binutils < 2.36.
>>>>
>>>> [...]
>>> Here is the summary with links:
>>> - [v5] riscv: Handle zicsr/zifencei issue between gcc and binutils
>>> https://git.kernel.org/riscv/c/ca09f772ccca
>> *sigh* so this breaks the build for gcc-11 & binutils 2.37 w/
>> Assembler messages:
>> Error: cannot find default versions of the ISA extension `zicsr'
>> Error: cannot find default versions of the ISA extension `zifencei'
>>
>> I'll have a poke later.
> So uh, are we sure that this should not be:
> - depends on (CC_IS_CLANG && CLANG_VERSION < 170000) || (CC_IS_GCC && GCC_VERSION < 110100)
> + depends on (CC_IS_CLANG && CLANG_VERSION < 170000) || (CC_IS_GCC && GCC_VERSION <= 110100)
>
> My gcc-11.1 + binutils 2.37 toolchain built from riscv-gnu-toolchain
> doesn't have the default versions & the above diff fixes the build.
I reproduced the error, the combination of gcc-11.1 and
binutils 2.37 does cause errors. What a surprise, since binutils
2.36 and 2.38 are fine.
I used git bisect to locate this commit[1] for binutils.
I'll test this diff in more detail later. Thanks!
[1]
https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=f0bae2552db1dd4f1…
Best Regards,
Mingzheng.
>
> Thanks,
> Conor.
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 1f69383b203e28cf8a4ca9570e572da1699f76cd
Gitweb: https://git.kernel.org/tip/1f69383b203e28cf8a4ca9570e572da1699f76cd
Author: Rick Edgecombe <rick.p.edgecombe(a)intel.com>
AuthorDate: Fri, 18 Aug 2023 10:03:05 -07:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Thu, 24 Aug 2023 11:01:45 +02:00
x86/fpu: Invalidate FPU state correctly on exec()
The thread flag TIF_NEED_FPU_LOAD indicates that the FPU saved state is
valid and should be reloaded when returning to userspace. However, the
kernel will skip doing this if the FPU registers are already valid as
determined by fpregs_state_valid(). The logic embedded there considers
the state valid if two cases are both true:
1: fpu_fpregs_owner_ctx points to the current tasks FPU state
2: the last CPU the registers were live in was the current CPU.
This is usually correct logic. A CPU’s fpu_fpregs_owner_ctx is set to
the current FPU during the fpregs_restore_userregs() operation, so it
indicates that the registers have been restored on this CPU. But this
alone doesn’t preclude that the task hasn’t been rescheduled to a
different CPU, where the registers were modified, and then back to the
current CPU. To verify that this was not the case the logic relies on the
second condition. So the assumption is that if the registers have been
restored, AND they haven’t had the chance to be modified (by being
loaded on another CPU), then they MUST be valid on the current CPU.
Besides the lazy FPU optimizations, the other cases where the FPU
registers might not be valid are when the kernel modifies the FPU register
state or the FPU saved buffer. In this case the operation modifying the
FPU state needs to let the kernel know the correspondence has been
broken. The comment in “arch/x86/kernel/fpu/context.h” has:
/*
...
* If the FPU register state is valid, the kernel can skip restoring the
* FPU state from memory.
*
* Any code that clobbers the FPU registers or updates the in-memory
* FPU state for a task MUST let the rest of the kernel know that the
* FPU registers are no longer valid for this task.
*
* Either one of these invalidation functions is enough. Invalidate
* a resource you control: CPU if using the CPU for something else
* (with preemption disabled), FPU for the current task, or a task that
* is prevented from running by the current task.
*/
However, this is not completely true. When the kernel modifies the
registers or saved FPU state, it can only rely on
__fpu_invalidate_fpregs_state(), which wipes the FPU’s last_cpu
tracking. The exec path instead relies on fpregs_deactivate(), which sets
the CPU’s FPU context to NULL. This was observed to fail to restore the
reset FPU state to the registers when returning to userspace in the
following scenario:
1. A task is executing in userspace on CPU0
- CPU0’s FPU context points to tasks
- fpu->last_cpu=CPU0
2. The task exec()’s
3. While in the kernel the task is preempted
- CPU0 gets a thread executing in the kernel (such that no other
FPU context is activated)
- Scheduler sets task’s fpu->last_cpu=CPU0 when scheduling out
4. Task is migrated to CPU1
5. Continuing the exec(), the task gets to
fpu_flush_thread()->fpu_reset_fpregs()
- Sets CPU1’s fpu context to NULL
- Copies the init state to the task’s FPU buffer
- Sets TIF_NEED_FPU_LOAD on the task
6. The task reschedules back to CPU0 before completing the exec() and
returning to userspace
- During the reschedule, scheduler finds TIF_NEED_FPU_LOAD is set
- Skips saving the registers and updating task’s fpu→last_cpu,
because TIF_NEED_FPU_LOAD is the canonical source.
7. Now CPU0’s FPU context is still pointing to the task’s, and
fpu->last_cpu is still CPU0. So fpregs_state_valid() returns true even
though the reset FPU state has not been restored.
So the root cause is that exec() is doing the wrong kind of invalidate. It
should reset fpu->last_cpu via __fpu_invalidate_fpregs_state(). Further,
fpu__drop() doesn't really seem appropriate as the task (and FPU) are not
going away, they are just getting reset as part of an exec. So switch to
__fpu_invalidate_fpregs_state().
Also, delete the misleading comment that says that either kind of
invalidate will be enough, because it’s not always the case.
Fixes: 33344368cb08 ("x86/fpu: Clean up the fpu__clear() variants")
Reported-by: Lei Wang <lei4.wang(a)intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe(a)intel.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Lijun Pan <lijun.pan(a)intel.com>
Reviewed-by: Sohil Mehta <sohil.mehta(a)intel.com>
Acked-by: Lijun Pan <lijun.pan(a)intel.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20230818170305.502891-1-rick.p.edgecombe@intel.com
---
arch/x86/kernel/fpu/context.h | 3 +--
arch/x86/kernel/fpu/core.c | 2 +-
2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/fpu/context.h b/arch/x86/kernel/fpu/context.h
index af5cbdd..f6d856b 100644
--- a/arch/x86/kernel/fpu/context.h
+++ b/arch/x86/kernel/fpu/context.h
@@ -19,8 +19,7 @@
* FPU state for a task MUST let the rest of the kernel know that the
* FPU registers are no longer valid for this task.
*
- * Either one of these invalidation functions is enough. Invalidate
- * a resource you control: CPU if using the CPU for something else
+ * Invalidate a resource you control: CPU if using the CPU for something else
* (with preemption disabled), FPU for the current task, or a task that
* is prevented from running by the current task.
*/
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 1015af1..98e507c 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -679,7 +679,7 @@ static void fpu_reset_fpregs(void)
struct fpu *fpu = ¤t->thread.fpu;
fpregs_lock();
- fpu__drop(fpu);
+ __fpu_invalidate_fpregs_state(fpu);
/*
* This does not change the actual hardware registers. It just
* resets the memory image and sets TIF_NEED_FPU_LOAD so a
From: Baoquan He <bhe(a)redhat.com>
[ Upstream commit b1e213a9e31c20206f111ec664afcf31cbfe0dbb ]
On s390 systems (aka mainframes), it has classic channel devices for
networking and permanent storage that are currently even more common
than PCI devices. Hence it could have a fully functional s390 kernel
with CONFIG_PCI=n, then the relevant iomem mapping functions
[including ioremap(), devm_ioremap(), etc.] are not available.
Here let FSL_EDMA and INTEL_IDMA64 depend on HAS_IOMEM so that it
won't be built to cause below compiling error if PCI is unset.
--------
ERROR: modpost: "devm_platform_ioremap_resource" [drivers/dma/fsl-edma.ko] undefined!
ERROR: modpost: "devm_platform_ioremap_resource" [drivers/dma/idma64.ko] undefined!
--------
Reported-by: kernel test robot <lkp(a)intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202306211329.ticOJCSv-lkp@intel.com/
Signed-off-by: Baoquan He <bhe(a)redhat.com>
Cc: Vinod Koul <vkoul(a)kernel.org>
Cc: dmaengine(a)vger.kernel.org
Link: https://lore.kernel.org/r/20230707135852.24292-2-bhe@redhat.com
Signed-off-by: Vinod Koul <vkoul(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/dma/Kconfig | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index f5f422f9b8507..b6221b4432fd3 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -211,6 +211,7 @@ config FSL_DMA
config FSL_EDMA
tristate "Freescale eDMA engine support"
depends on OF
+ depends on HAS_IOMEM
select DMA_ENGINE
select DMA_VIRTUAL_CHANNELS
help
@@ -280,6 +281,7 @@ config IMX_SDMA
config INTEL_IDMA64
tristate "Intel integrated DMA 64-bit support"
+ depends on HAS_IOMEM
select DMA_ENGINE
select DMA_VIRTUAL_CHANNELS
help
--
2.40.1