When migrating a THP, concurrent access to the PMD migration entry
during a deferred split scan can lead to a page fault, as illustrated
below. To prevent this page fault, check for a PMD migration entry and
return early. In this context, there is no need to use
pmd_to_swp_entry() and pfn_swap_entry_to_page() to verify that the
entry maps the target folio: migration holds the folio lock, and the
caller already holds the lock on the target folio, so the folio behind
the migration entry cannot be the target.
BUG: unable to handle page fault for address: ffffea60001db008
CPU: 0 UID: 0 PID: 2199114 Comm: tee Not tainted 6.14.0+ #4 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:split_huge_pmd_locked+0x3b5/0x2b60
Call Trace:
<TASK>
try_to_migrate_one+0x28c/0x3730
rmap_walk_anon+0x4f6/0x770
unmap_folio+0x196/0x1f0
split_huge_page_to_list_to_order+0x9f6/0x1560
deferred_split_scan+0xac5/0x12a0
shrinker_debugfs_scan_write+0x376/0x470
full_proxy_write+0x15c/0x220
vfs_write+0x2fc/0xcb0
ksys_write+0x146/0x250
do_syscall_64+0x6a/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The bug was found by syzkaller on an internal kernel and then confirmed
on upstream.
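For reference, a check via the migration entry would have looked roughly
like the sketch below (hypothetical helper, not part of the patch; needs
<linux/swapops.h> and <linux/mm.h>). The folio-lock argument above is what
makes it unnecessary:
static bool pmd_migration_maps_folio(pmd_t pmdval, struct folio *folio)
{
	swp_entry_t entry;

	if (!is_pmd_migration_entry(pmdval))
		return false;

	entry = pmd_to_swp_entry(pmdval);
	return page_folio(pfn_swap_entry_to_page(entry)) == folio;
}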
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Cc: stable(a)vger.kernel.org
Signed-off-by: Gavin Guo <gavinguo(a)igalia.com>
---
mm/huge_memory.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab7..0cb9547dcff2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3075,6 +3075,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd, bool freeze, struct folio *folio)
{
+ bool pmd_migration = is_pmd_migration_entry(*pmd);
+
VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
@@ -3085,10 +3087,18 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
* require a folio to check the PMD against. Otherwise, there
* is a risk of replacing the wrong folio.
*/
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
- is_pmd_migration_entry(*pmd)) {
- if (folio && folio != pmd_folio(*pmd))
- return;
+ if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration) {
+ if (folio) {
+ /*
+ * Do not apply pmd_folio() to a migration entry; and
+ * folio lock guarantees that it must be of the wrong
+ * folio anyway.
+ */
+ if (pmd_migration)
+ return;
+ if (folio != pmd_folio(*pmd))
+ return;
+ }
__split_huge_pmd_locked(vma, pmd, address, freeze);
}
}
base-commit: a24588245776dafc227243a01bfbeb8a59bafba9
--
2.43.0
is_vram() checks the current placement. However, for exported VRAM with
dynamic dma-buf, the xe driver can asynchronously evict the memory and
notify the importer, yet the importer does not have to call
unmap_attachment() immediately, but rather just "as soon as possible",
such as when the dma-resv idles. Following from this we would pipeline
the move, attach the fence to the manager, and then update the current
placement. But when unmap_attachment() runs at some later point we might
see that is_vram() is now false, and take the completely wrong path when
dma-unmapping the sg, leading to explosions.
To fix this, check whether the sgl was mapping a struct page.
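For context, the reason sg_page() distinguishes the two cases: a VRAM
export is typically built with DMA addresses only and no backing struct
page, roughly as in this sketch (assumed pattern following the amdgpu/xe
VRAM managers, not the actual exporter code):
static void example_fill_vram_sg(struct scatterlist *sg, dma_addr_t addr,
				 unsigned int size)
{
	sg_set_page(sg, NULL, size, 0);	/* no struct page backs VRAM */
	sg_dma_address(sg) = addr;	/* only a bus address is recorded */
	sg_dma_len(sg) = size;
}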
v2:
- The attachment can be mapped multiple times, it seems, so we can't
really rely on encoding something in attachment->priv. Instead,
see if the page_link has an encoded struct page; for VRAM we expect
this to be NULL.
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4563
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Auld <matthew.auld(a)intel.com>
Cc: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Cc: Matthew Brost <matthew.brost(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.8+
Acked-by: Christian König <christian.koenig(a)amd.com>
---
drivers/gpu/drm/xe/xe_dma_buf.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_dma_buf.c b/drivers/gpu/drm/xe/xe_dma_buf.c
index f67803e15a0e..f7a20264ea33 100644
--- a/drivers/gpu/drm/xe/xe_dma_buf.c
+++ b/drivers/gpu/drm/xe/xe_dma_buf.c
@@ -145,10 +145,7 @@ static void xe_dma_buf_unmap(struct dma_buf_attachment *attach,
struct sg_table *sgt,
enum dma_data_direction dir)
{
- struct dma_buf *dma_buf = attach->dmabuf;
- struct xe_bo *bo = gem_to_xe_bo(dma_buf->priv);
-
- if (!xe_bo_is_vram(bo)) {
+ if (sg_page(sgt->sgl)) {
dma_unmap_sgtable(attach->dev, sgt, dir, 0);
sg_free_table(sgt);
kfree(sgt);
--
2.49.0
Henry's bug [1] and fix [2] prompted some further inspection by
Jean.
This series provides fixes for the remaining issues Jean identified, as
well as reworking the channel paths to reduce the cleanup required in
error paths. It is based on the tree at [3].
Lightly tested on an AST2600 EVB. Further testing on platforms
designed around the snoop device would be appreciated.
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=219934
[2]: https://lore.kernel.org/all/20250401074647.21300-1-bsdhenrymartin@gmail.com/
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/arj/bmc.git/log/?h=aspeed/d…
Signed-off-by: Andrew Jeffery <andrew(a)codeconstruct.com.au>
---
Andrew Jeffery (7):
soc: aspeed: lpc-snoop: Cleanup resources in stack-order
soc: aspeed: lpc-snoop: Don't disable channels that aren't enabled
soc: aspeed: lpc-snoop: Ensure model_data is valid
soc: aspeed: lpc-snoop: Constrain parameters in channel paths
soc: aspeed: lpc-snoop: Rename 'channel' to 'index' in channel paths
soc: aspeed: lpc-snoop: Rearrange channel paths
soc: aspeed: lpc-snoop: Lift channel config to const structs
drivers/soc/aspeed/aspeed-lpc-snoop.c | 149 ++++++++++++++++++++--------------
1 file changed, 88 insertions(+), 61 deletions(-)
---
base-commit: f3089a4fc24777ea2fccdf4ffc84732b1da65bdc
change-id: 20250401-aspeed-lpc-snoop-fixes-e5d2883da3a3
Best regards,
--
Andrew Jeffery <andrew(a)codeconstruct.com.au>
The patch titled
Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
has been added to the -mm mm-new branch. Its filename is
mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Tianyang Zhang <zhangtianyang(a)loongson.cn>
Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
Date: Wed, 16 Apr 2025 16:24:05 +0800
__alloc_pages_slowpath() has no change detection for ac->nodemask in the
retry path, while cpuset can modify it in parallel. For processes whose
mempolicy is MPOL_BIND, this results in ac->nodemask changing under us:
should_reclaim_retry() then judges based on the latest nodemask and jumps
to retry, while get_page_from_freelist() only traverses the zonelist from
ac->preferred_zoneref, which was selected by the expired nodemask. This
may cause infinite retries in some cases.
cpu 64:
__alloc_pages_slowpath {
        /* ..... */
retry:
        /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
        if (alloc_flags & ALLOC_KSWAPD)
                wake_all_kswapds(order, gfp_mask, ac);

        /* cpu 1:
         cpuset_write_resmask
           update_nodemask
             update_nodemasks_hier
               update_tasks_nodemask
                 mpol_rebind_task
                   mpol_rebind_policy
                     mpol_rebind_nodemask
         // mempolicy->nodes has been modified,
         // which ac->nodemask points to
        */

        /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                 did_some_progress > 0, &no_progress_loops))
                goto retry;
}
Simultaneously starting multiple cpuset01 tests from LTP can quickly
reproduce this issue on a multi-node server when the maximum memory
pressure is reached and swap is enabled.
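The fix relies on the existing cookie/restart machinery; a simplified
sketch of that pattern (not the exact mm/page_alloc.c code) looks like:
static struct page *example_slowpath(gfp_t gfp_mask, unsigned int order,
				     nodemask_t *nodemask)
{
	unsigned int cpuset_mems_cookie;
	struct page *page = NULL;

restart:
	/* Snapshot the cpuset generation before picking the preferred zone. */
	cpuset_mems_cookie = read_mems_allowed_begin();
	/* ... recompute the preferred zoneref from the current nodemask ... */

retry:
	/* If cpuset changed mems_allowed meanwhile, restart rather than retry. */
	if (read_mems_allowed_retry(cpuset_mems_cookie))
		goto restart;

	/* ... reclaim/compaction; on progress, goto retry ... */
	return page;
}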
Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
Signed-off-by: Tianyang Zhang <zhangtianyang(a)loongson.cn>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Brendan Jackman <jackmanb(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
+++ a/mm/page_alloc.c
@@ -4555,6 +4555,14 @@ restart:
}
retry:
+ /*
+ * Deal with possible cpuset update races or zonelist updates to avoid
+ * infinite retries.
+ */
+ if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+ check_retry_zonelist(zonelist_iter_cookie))
+ goto restart;
+
/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
if (alloc_flags & ALLOC_KSWAPD)
wake_all_kswapds(order, gfp_mask, ac);
_
Patches currently in -mm which might be from zhangtianyang(a)loongson.cn are
mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race.patch
In vegam_populate_smc_boot_level(), phm_find_boot_level() returns 0 on
success or a negative error code when it cannot find the boot level.
Returning that error immediately skips setting the voltages later
below, and returning early may break working devices.
Add error handling that resets the boot level to 0 when
phm_find_boot_level() fails, instead of bailing out. A proper
implementation can be found in tonga_populate_smc_boot_level().
Fixes: 4a1132782200 ("drm/amd/powerplay: return errno code to caller when error occur")
Cc: stable(a)vger.kernel.org # v5.6+
Signed-off-by: Wentao Liang <vulab(a)iscas.ac.cn>
---
.../drm/amd/pm/powerplay/smumgr/vegam_smumgr.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/pm/powerplay/smumgr/vegam_smumgr.c b/drivers/gpu/drm/amd/pm/powerplay/smumgr/vegam_smumgr.c
index 34c9f59b889a..c68dd12b858a 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/smumgr/vegam_smumgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/smumgr/vegam_smumgr.c
@@ -1374,15 +1374,20 @@ static int vegam_populate_smc_boot_level(struct pp_hwmgr *hwmgr,
result = phm_find_boot_level(&(data->dpm_table.sclk_table),
data->vbios_boot_state.sclk_bootup_value,
(uint32_t *)&(table->GraphicsBootLevel));
- if (result)
- return result;
+ if (result != 0) {
+ table->GraphicsBootLevel = 0;
+ pr_err("VBIOS did not find boot engine clock value in dependency table. Using Graphics DPM level 0!\n");
+ result = 0;
+ }
result = phm_find_boot_level(&(data->dpm_table.mclk_table),
data->vbios_boot_state.mclk_bootup_value,
(uint32_t *)&(table->MemoryBootLevel));
-
- if (result)
- return result;
+ if (result != 0) {
+ table->MemoryBootLevel = 0;
+ pr_err("VBIOS did not find boot engine clock value in dependency table. Using Memory DPM level 0!\n");
+ result = 0;
+ }
table->BootVddc = data->vbios_boot_state.vddc_bootup_value *
VOLTAGE_SCALE;
--
2.42.0.windows.2
The patch titled
Subject: mm/mempolicy: fix memory leaks in weighted interleave sysfs
has been added to the -mm mm-new branch. Its filename is
mm-mempolicy-fix-memory-leaks-in-weighted-interleave-sysfs.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Rakie Kim <rakie.kim(a)sk.com>
Subject: mm/mempolicy: fix memory leaks in weighted interleave sysfs
Date: Wed, 16 Apr 2025 20:31:19 +0900
Patch series "Enhance sysfs handling for memory hotplug in weighted
interleave", v8.
The following patch series enhances the weighted interleave policy in the
memory management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug support.
This patch (of 3):
Memory leaks occurred when removing sysfs attributes for weighted
interleave. Improper kobject deallocation led to unreleased memory when
initialization failed or when nodes were removed.
This patch resolves the issue by replacing unnecessary `kfree()` calls
with proper `kobject_del()` and `kobject_put()` sequences, ensuring
correct teardown and preventing memory leaks.
By explicitly calling `kobject_del()` before `kobject_put()`, the release
function is now invoked safely, and internal sysfs state is correctly
cleaned up. This guarantees that the memory associated with the kobject
is fully released and avoids resource leaks, thereby improving system
stability.
Additionally, sysfs_remove_file() is no longer called from the release
function to avoid accessing invalid sysfs state after kobject_del(). All
attribute removals are now done before kobject_del(), preventing WARN_ON()
in kernfs and ensuring safe and consistent cleanup of sysfs entries.
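As a generic illustration of the ordering this patch adopts (illustrative
names, not the mempolicy code), creation and teardown of a dynamically
allocated kobject look roughly like:
static void example_release(struct kobject *kobj)
{
	kfree(kobj);			/* release() is the only place that frees */
}

static const struct kobj_type example_ktype = {
	.sysfs_ops = &kobj_sysfs_ops,
	.release = example_release,
};

static int example_create(struct kobject *kobj, struct kobject *parent,
			  const struct attribute *attr)
{
	int err;

	err = kobject_init_and_add(kobj, &example_ktype, parent, "example");
	if (err)
		goto err_put;		/* never kfree() after init: put instead */

	err = sysfs_create_file(kobj, attr);
	if (err)
		goto err_del;
	return 0;

err_del:
	kobject_del(kobj);
err_put:
	kobject_put(kobj);		/* drops the ref; triggers example_release() */
	return err;
}

static void example_teardown(struct kobject *kobj, const struct attribute *attr)
{
	sysfs_remove_file(kobj, attr);	/* attributes go before kobject_del() */
	kobject_del(kobj);		/* unlink from sysfs/kernfs */
	kobject_put(kobj);		/* final put frees via example_release() */
}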
Link: https://lkml.kernel.org/r/20250416113123.629-1-rakie.kim@sk.com
Link: https://lkml.kernel.org/r/20250416113123.629-2-rakie.kim@sk.com
Fixes: dce41f5ae253 ("mm/mempolicy: implement the sysfs-based weighted_interleave interface")
Signed-off-by: Rakie Kim <rakie.kim(a)sk.com>
Reviewed-by: Gregory Price <gourry(a)gourry.net>
Reviewed-by: Joshua Hahn <joshua.hahnjy(a)gmail.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Honggyu Kim <honggyu.kim(a)sk.com>
Cc: "Huang, Ying" <ying.huang(a)linux.alibaba.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Yunjeong Mun <yunjeong.mun(a)sk.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mempolicy.c | 111 +++++++++++++++++++++++++----------------------
1 file changed, 61 insertions(+), 50 deletions(-)
--- a/mm/mempolicy.c~mm-mempolicy-fix-memory-leaks-in-weighted-interleave-sysfs
+++ a/mm/mempolicy.c
@@ -3463,8 +3463,8 @@ static ssize_t node_store(struct kobject
static struct iw_node_attr **node_attrs;
-static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
- struct kobject *parent)
+static void sysfs_wi_node_delete(struct iw_node_attr *node_attr,
+ struct kobject *parent)
{
if (!node_attr)
return;
@@ -3473,18 +3473,41 @@ static void sysfs_wi_node_release(struct
kfree(node_attr);
}
-static void sysfs_wi_release(struct kobject *wi_kobj)
+static void sysfs_wi_node_delete_all(struct kobject *wi_kobj)
{
- int i;
+ int nid;
- for (i = 0; i < nr_node_ids; i++)
- sysfs_wi_node_release(node_attrs[i], wi_kobj);
- kobject_put(wi_kobj);
+ for (nid = 0; nid < nr_node_ids; nid++)
+ sysfs_wi_node_delete(node_attrs[nid], wi_kobj);
+}
+
+static void iw_table_free(void)
+{
+ u8 *old;
+
+ mutex_lock(&iw_table_lock);
+ old = rcu_dereference_protected(iw_table,
+ lockdep_is_held(&iw_table_lock));
+ if (old) {
+ rcu_assign_pointer(iw_table, NULL);
+ mutex_unlock(&iw_table_lock);
+
+ synchronize_rcu();
+ kfree(old);
+ } else
+ mutex_unlock(&iw_table_lock);
+}
+
+static void wi_kobj_release(struct kobject *wi_kobj)
+{
+ iw_table_free();
+ kfree(node_attrs);
+ kfree(wi_kobj);
}
static const struct kobj_type wi_ktype = {
.sysfs_ops = &kobj_sysfs_ops,
- .release = sysfs_wi_release,
+ .release = wi_kobj_release,
};
static int add_weight_node(int nid, struct kobject *wi_kobj)
@@ -3525,41 +3548,42 @@ static int add_weighted_interleave_group
struct kobject *wi_kobj;
int nid, err;
+ node_attrs = kcalloc(nr_node_ids, sizeof(struct iw_node_attr *),
+ GFP_KERNEL);
+ if (!node_attrs)
+ return -ENOMEM;
+
wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
- if (!wi_kobj)
+ if (!wi_kobj) {
+ kfree(node_attrs);
return -ENOMEM;
+ }
err = kobject_init_and_add(wi_kobj, &wi_ktype, root_kobj,
"weighted_interleave");
- if (err) {
- kfree(wi_kobj);
- return err;
- }
+ if (err)
+ goto err_put_kobj;
for_each_node_state(nid, N_POSSIBLE) {
err = add_weight_node(nid, wi_kobj);
if (err) {
pr_err("failed to add sysfs [node%d]\n", nid);
- break;
+ goto err_cleanup_kobj;
}
}
- if (err)
- kobject_put(wi_kobj);
+
return 0;
+
+err_cleanup_kobj:
+ sysfs_wi_node_delete_all(wi_kobj);
+ kobject_del(wi_kobj);
+err_put_kobj:
+ kobject_put(wi_kobj);
+ return err;
}
static void mempolicy_kobj_release(struct kobject *kobj)
{
- u8 *old;
-
- mutex_lock(&iw_table_lock);
- old = rcu_dereference_protected(iw_table,
- lockdep_is_held(&iw_table_lock));
- rcu_assign_pointer(iw_table, NULL);
- mutex_unlock(&iw_table_lock);
- synchronize_rcu();
- kfree(old);
- kfree(node_attrs);
kfree(kobj);
}
@@ -3573,37 +3597,24 @@ static int __init mempolicy_sysfs_init(v
static struct kobject *mempolicy_kobj;
mempolicy_kobj = kzalloc(sizeof(*mempolicy_kobj), GFP_KERNEL);
- if (!mempolicy_kobj) {
- err = -ENOMEM;
- goto err_out;
- }
-
- node_attrs = kcalloc(nr_node_ids, sizeof(struct iw_node_attr *),
- GFP_KERNEL);
- if (!node_attrs) {
- err = -ENOMEM;
- goto mempol_out;
- }
+ if (!mempolicy_kobj)
+ return -ENOMEM;
err = kobject_init_and_add(mempolicy_kobj, &mempolicy_ktype, mm_kobj,
"mempolicy");
if (err)
- goto node_out;
+ goto err_put_kobj;
err = add_weighted_interleave_group(mempolicy_kobj);
- if (err) {
- pr_err("mempolicy sysfs structure failed to initialize\n");
- kobject_put(mempolicy_kobj);
- return err;
- }
+ if (err)
+ goto err_del_kobj;
- return err;
-node_out:
- kfree(node_attrs);
-mempol_out:
- kfree(mempolicy_kobj);
-err_out:
- pr_err("failed to add mempolicy kobject to the system\n");
+ return 0;
+
+err_del_kobj:
+ kobject_del(mempolicy_kobj);
+err_put_kobj:
+ kobject_put(mempolicy_kobj);
return err;
}
_
Patches currently in -mm which might be from rakie.kim(a)sk.com are
mm-mempolicy-fix-memory-leaks-in-weighted-interleave-sysfs.patch
mm-mempolicy-prepare-weighted-interleave-sysfs-for-memory-hotplug.patch
mm-mempolicy-support-memory-hotplug-in-weighted-interleave.patch
Changes since v1 [1]:
* Fix the fact that devmem_is_allowed() == 2 does not prevent
mmap access (Kees)
* Rather than teach devmem_is_allowed() == 2 to map zero pages in the
mmap case, just fail (Nikolay)
[1]: http://lore.kernel.org/67f5b75c37143_71fe2949b@dwillia2-xfh.jf.intel.com.no…
---
The story starts with Nikolay reporting an SEPT violation due to
mismatched encrypted/non-encrypted mappings of the BIOS data space [2].
An initial suggestion to just make sure that the BIOS data space is
mapped consistently [3] ran into another issue that TDX and SEV-SNP
disagree about when that space can be mapped as encrypted.
Then, in response to a partial patch to allow SEV-SNP to block BIOS data
space for other reasons [4], Dave asked why not just give up on /dev/mem
access entirely in the confidential VM case [5].
Enter this series to:
1/ Close a subtle hole whereby /dev/mem that is supposed to return zeros
in lieu of access only enforces that for read()/write() (sketched below
the list)
2/ Use that new closed hole to reliably disable all /dev/mem access for
confidential x86 VMs
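A simplified sketch of the hole being closed (not the actual
drivers/char/mem.c code): devmem_is_allowed() can return 2, meaning
"don't fault, but show zeros". The read()/write() paths honour that,
while mmap() historically only distinguished allowed from not-allowed,
so the zero-filling promise was not kept for mapped access:
static bool example_range_is_allowed(unsigned long pfn, unsigned long nr,
				     bool for_mmap)
{
	unsigned long i;

	for (i = 0; i < nr; i++) {
		int allowed = devmem_is_allowed(pfn + i);

		if (!allowed)
			return false;
		if (allowed == 2 && for_mmap)
			return false;	/* can't fake zeros through a mapping */
	}
	return true;
}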
[2]: http://lore.kernel.org/20250318113604.297726-1-nik.borisov@suse.com
[3]: http://lore.kernel.org/174346288005.2166708.14425674491111625620.stgit@dwil…
[4]: http://lore.kernel.org/20250403120228.2344377-1-naveen@kernel.org
[5]: http://lore.kernel.org/fd683daa-d953-48ca-8c5d-6f4688ad442c@intel.com
---
Dan Williams (3):
x86/devmem: Remove duplicate range_is_allowed() definition
devmem: Block mmap access when read/write access is restricted
x86/devmem: Restrict /dev/mem access for potentially unaccepted memory by default
arch/x86/Kconfig | 2 ++
arch/x86/include/asm/x86_init.h | 2 ++
arch/x86/kernel/x86_init.c | 6 ++++++
arch/x86/mm/init.c | 23 +++++++++++++++++------
arch/x86/mm/pat/memtype.c | 31 ++++---------------------------
drivers/char/mem.c | 18 ------------------
include/linux/io.h | 26 ++++++++++++++++++++++++++
7 files changed, 57 insertions(+), 51 deletions(-)
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
The following commit has been merged into the sched/core branch of tip:
Commit-ID: c70fc32f44431bb30f9025ce753ba8be25acbba3
Gitweb: https://git.kernel.org/tip/c70fc32f44431bb30f9025ce753ba8be25acbba3
Author: Peter Zijlstra <peterz(a)infradead.org>
AuthorDate: Tue, 28 Jan 2025 15:39:49 +01:00
Committer: Peter Zijlstra <peterz(a)infradead.org>
CommitterDate: Wed, 16 Apr 2025 21:09:12 +02:00
sched/fair: Adhere to place_entity() constraints
Mike reports that commit 6d71a9c61604 ("sched/fair: Fix EEVDF entity
placement bug causing scheduling lag") relies on commit 4423af84b297
("sched/fair: optimize the PLACE_LAG when se->vlag is zero") to not
trip a WARN in place_entity().
What happens is that the lag of the very last entity is 0 per
definition -- the average of one element matches the value of that
element. Therefore place_entity() will match the condition skipping
the lag adjustment:
if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
Without the 'se->vlag' condition -- it will attempt to adjust the zero
lag even though we're inserting into an empty tree.
Notably, we should have failed the 'cfs_rq->nr_queued' condition, but
don't, because nr_queued wasn't updated across the dequeue/requeue in
reweight_entity().
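To spell out the "average of one element" point (notation mine, following
the weighted-average lag definition):

  V = \sum_i w_i v_i / \sum_i w_i,        lag_i = V - v_i

With a single queued entity, V = (w_1 * v_1) / w_1 = v_1, hence lag_1 = 0 --
exactly the zero lag place_entity() would otherwise try to adjust against
an empty tree.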
Additionally, move update_load_add() after place_entity(), as is
consistent with other place_entity() users -- this change is
non-functional, since place_entity() does not use cfs_rq->load.
Fixes: 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Reported-by: Mike Galbraith <efault(a)gmx.de>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz(a)infradead.org>
Signed-off-by: Mike Galbraith <efault(a)gmx.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/c216eb4ef0e0e0029c600aefc69d56681cee5581.camel@gm…
---
kernel/sched/fair.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5e1bd9e..eb5a257 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3795,6 +3795,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
update_entity_lag(cfs_rq, se);
se->deadline -= se->vruntime;
se->rel_deadline = 1;
+ cfs_rq->nr_queued--;
if (!curr)
__dequeue_entity(cfs_rq, se);
update_load_sub(&cfs_rq->load, se->load.weight);
@@ -3821,10 +3822,11 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
enqueue_load_avg(cfs_rq, se);
if (se->on_rq) {
- update_load_add(&cfs_rq->load, se->load.weight);
place_entity(cfs_rq, se, 0);
+ update_load_add(&cfs_rq->load, se->load.weight);
if (!curr)
__enqueue_entity(cfs_rq, se);
+ cfs_rq->nr_queued++;
/*
* The entity's vruntime has been adjusted, so let's check