[ Upstream commit 7c2fd76048e95dd267055b5f5e0a48e6e7c81fd9 ]
On an NVMe namespace that does not support metadata, it is possible to
send an IO command with metadata through io-passthru. This allows issues
like [1] to trigger in the completion code path.
nvme_map_user_request() doesn't check if the namespace supports metadata
before sending it forward. It also allows admin commands with metadata
to be processed as it ignores metadata when bdev == NULL and may report
success.
Reject an IO command with metadata when the NVMe namespace doesn't
support it and reject an admin command if it has metadata.
[1] https://lore.kernel.org/all/mb61pcylvnym8.fsf@amazon.com/
Suggested-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Sagi Grimberg <sagi(a)grimberg.me>
Reviewed-by: Anuj Gupta <anuj20.g(a)samsung.com>
Signed-off-by: Keith Busch <kbusch(a)kernel.org>
[ Move the changes from nvme_map_user_request() to nvme_submit_user_cmd()
to make it work on 5.10 ]
Signed-off-by: Puranjay Mohan <pjy(a)amazon.com>
---
drivers/nvme/host/core.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 30a642c8f5374..bee55902fe6ce 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1121,11 +1121,16 @@ static int nvme_submit_user_cmd(struct request_queue *q,
bool write = nvme_is_write(cmd);
struct nvme_ns *ns = q->queuedata;
struct gendisk *disk = ns ? ns->disk : NULL;
+ bool supports_metadata = disk && blk_get_integrity(disk);
+ bool has_metadata = meta_buffer && meta_len;
struct request *req;
struct bio *bio = NULL;
void *meta = NULL;
int ret;
+ if (has_metadata && !supports_metadata)
+ return -EINVAL;
+
req = nvme_alloc_request(q, cmd, 0);
if (IS_ERR(req))
return PTR_ERR(req);
@@ -1141,7 +1146,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
goto out;
bio = req->bio;
bio->bi_disk = disk;
- if (disk && meta_buffer && meta_len) {
+ if (has_metadata) {
meta = nvme_add_user_metadata(bio, meta_buffer, meta_len,
meta_seed, write);
if (IS_ERR(meta)) {
--
2.40.1
From: Aurabindo Pillai <aurabindo.pillai(a)amd.com>
[Why&How]
vblank immediate disable currently does not work for all asics. On
DCN401, the vblank interrupts never stop coming, and hence we never
get a chance to trigger idle optimizations.
Add a workaround to enable immediate disable only on APUs for now. This
adds a 2-frame delay for triggering idle optimization, which is a
negligible overhead.
Fixes: db11e20a1144 ("drm/amd/display: use a more lax vblank enable policy for older ASICs")
Fixes: 6dfb3a42a914 ("drm/amd/display: use a more lax vblank enable policy for DCN35+")
Cc: Mario Limonciello <mario.limonciello(a)amd.com>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
Reviewed-by: Harry Wentland <harry.wentland(a)amd.com>
Reviewed-by: Rodrigo Siqueira <rodrigo.siqueira(a)amd.com>
Signed-off-by: Aurabindo Pillai <aurabindo.pillai(a)amd.com>
Signed-off-by: Wayne Lin <wayne.lin(a)amd.com>
---
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index a4882b16ace2..6ea54eb5d68d 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -8379,7 +8379,8 @@ static void manage_dm_interrupts(struct amdgpu_device *adev,
if (amdgpu_ip_version(adev, DCE_HWIP, 0) <
IP_VERSION(3, 5, 0) ||
acrtc_state->stream->link->psr_settings.psr_version <
- DC_PSR_VERSION_UNSUPPORTED) {
+ DC_PSR_VERSION_UNSUPPORTED ||
+ !(adev->flags & AMD_IS_APU)) {
timing = &acrtc_state->stream->timing;
/* at least 2 frames */
--
2.37.3
From: Matt Fleming <mfleming(a)cloudflare.com>
Under memory pressure it's possible for GFP_ATOMIC order-0 allocations
to fail even though free pages are available in the highatomic reserves.
GFP_ATOMIC allocations cannot trigger unreserve_highatomic_pageblock()
since it's only run from reclaim.
Given that such allocations will pass the watermarks in
__zone_watermark_unusable_free(), it makes sense to fallback to
highatomic reserves the same way that ALLOC_OOM can.
This fixes order-0 page allocation failures observed on Cloudflare's
fleet when handling network packets:
kswapd1: page allocation failure: order:0, mode:0x820(GFP_ATOMIC),
nodemask=(null),cpuset=/,mems_allowed=0-7
CPU: 10 PID: 696 Comm: kswapd1 Kdump: loaded Tainted: G O 6.6.43-CUSTOM #1
Hardware name: MACHINE
Call Trace:
<IRQ>
dump_stack_lvl+0x3c/0x50
warn_alloc+0x13a/0x1c0
__alloc_pages_slowpath.constprop.0+0xc9d/0xd10
__alloc_pages+0x327/0x340
__napi_alloc_skb+0x16d/0x1f0
bnxt_rx_page_skb+0x96/0x1b0 [bnxt_en]
bnxt_rx_pkt+0x201/0x15e0 [bnxt_en]
__bnxt_poll_work+0x156/0x2b0 [bnxt_en]
bnxt_poll+0xd9/0x1c0 [bnxt_en]
__napi_poll+0x2b/0x1b0
bpf_trampoline_6442524138+0x7d/0x1000
__napi_poll+0x5/0x1b0
net_rx_action+0x342/0x740
handle_softirqs+0xcf/0x2b0
irq_exit_rcu+0x6c/0x90
sysvec_apic_timer_interrupt+0x72/0x90
</IRQ>
Fixes: 1d91df85f399 ("mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs")
Suggested-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: linux-mm(a)kvack.org
Cc: stable(a)vger.kernel.org # v5.9+
Link: https://lore.kernel.org/all/CAGis_TWzSu=P7QJmjD58WWiu3zjMTVKSzdOwWE8ORaGytz…
Signed-off-by: Matt Fleming <mfleming(a)cloudflare.com>
---
mm/page_alloc.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
v2: Update comment and add Fixes, Reviewed-by, and Cc: stable tags.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8afab64814dc..94a2ffe28008 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2893,12 +2893,12 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
page = __rmqueue(zone, order, migratetype, alloc_flags);
/*
- * If the allocation fails, allow OOM handling access
- * to HIGHATOMIC reserves as failing now is worse than
- * failing a high-order atomic allocation in the
- * future.
+ * If the allocation fails, allow OOM handling and
+ * order-0 (atomic) allocs access to HIGHATOMIC
+ * reserves as failing now is worse than failing a
+ * high-order atomic allocation in the future.
*/
- if (!page && (alloc_flags & ALLOC_OOM))
+ if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_NON_BLOCK)))
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
if (!page) {
--
2.34.1
Hi Greg,
How are you?
What is the process to backport Pedro's recent mseal fixes to 6.10 ?
Specifically those 5 commits:
67203f3f2a63d429272f0c80451e5fcc469fdb46
selftests/mm: add mseal test for no-discard madvise
4d1b3416659be70a2251b494e85e25978de06519
mm: move can_modify_vma to mm/vma.h
4a2dd02b09160ee43f96c759fafa7b56dfc33816
mm/mprotect: replace can_modify_mm with can_modify_vma
23c57d1fa2b9530e38f7964b4e457fed5a7a0ae8
mseal: replace can_modify_mm_madv with a vma variant
f28bdd1b17ec187eaa34845814afaaff99832762
selftests/mm: add more mseal traversal tests
There will be merge conflicts, I can backport them to 5.10 and test
to help the backporting process.
Those 5 fixes are needed for two reasons: maintain the consistency of
mseal's semantics across releases, and for ease of backporting future
fixes.
PS: There are also three other commits for munmap and remap (see below),
they have dependency on Michael Ellerman's arch_unmap() patch [1] and maybe
uprobe change [2]. If Michael and Oleg are OK with backporting their
patches, then great !
Otherwise, since those commits below don't change mseal's semantics, I
think it is OK to just backport above 5 patches.
df2a7df9a9aa32c3df227de346693e6e802c8591
mm/munmap: replace can_modify_mm with can_modify_vma
38075679b5f157eeacd46c900e9cfc684bdbc167
mm/mremap: replace can_modify_mm with can_modify_vma
5b3db2b812a1f86dfab587324d198a5d10c7a5cf
mm: remove can_modify_mm()
[1] https://lore.kernel.org/all/20240812082605.743814-1-mpe@ellerman.id.au/
[2] https://lore.kernel.org/all/20240911131320.GA3448@redhat.com/
Thanks!
-Jeff
From: Aurabindo Pillai <aurabindo.pillai(a)amd.com>
[Why&How]
Disabling P-State support on full updates for DCN401 results in
introducing additional communication with SMU. A UCLK hard min message
to SMU takes 4 seconds to go through, which was due to DCN not allowing
pstate switch, which was caused by incorrect value for TTU watermark
before blanking the HUBP prior to DPG on for servicing the test request.
Fix the issue temporarily by disallowing pstate changes for compliance
test while test request handler is reworked for a proper fix.
Fixes: 67ea53a4bd9d ("drm/amd/display: Disable DCN401 UCLK P-State support on full updates")
Cc: Mario Limonciello <mario.limonciello(a)amd.com>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
Reviewed-by: Dillon Varone <dillon.varone(a)amd.com>
Signed-off-by: Aurabindo Pillai <aurabindo.pillai(a)amd.com>
Signed-off-by: Wayne Lin <wayne.lin(a)amd.com>
---
.../drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
index 8eaf292bc4eb..b0fea0856866 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
@@ -46,6 +46,7 @@
#include "dm_helpers.h"
#include "ddc_service_types.h"
+#include "clk_mgr.h"
static u32 edid_extract_panel_id(struct edid *edid)
{
@@ -1181,6 +1182,8 @@ bool dm_helpers_dp_handle_test_pattern_request(
struct pipe_ctx *pipe_ctx = NULL;
struct amdgpu_dm_connector *aconnector = link->priv;
struct drm_device *dev = aconnector->base.dev;
+ struct dc_state *dc_state = ctx->dc->current_state;
+ struct clk_mgr *clk_mgr = ctx->dc->clk_mgr;
int i;
for (i = 0; i < MAX_PIPES; i++) {
@@ -1281,6 +1284,16 @@ bool dm_helpers_dp_handle_test_pattern_request(
pipe_ctx->stream->test_pattern.type = test_pattern;
pipe_ctx->stream->test_pattern.color_space = test_pattern_color_space;
+ /* Temp W/A for compliance test failure */
+ dc_state->bw_ctx.bw.dcn.clk.p_state_change_support = false;
+ dc_state->bw_ctx.bw.dcn.clk.dramclk_khz = clk_mgr->dc_mode_softmax_enabled ?
+ clk_mgr->bw_params->dc_mode_softmax_memclk : clk_mgr->bw_params->max_memclk_mhz;
+ dc_state->bw_ctx.bw.dcn.clk.idle_dramclk_khz = dc_state->bw_ctx.bw.dcn.clk.dramclk_khz;
+ ctx->dc->clk_mgr->funcs->update_clocks(
+ ctx->dc->clk_mgr,
+ dc_state,
+ false);
+
dc_link_dp_set_test_pattern(
(struct dc_link *) link,
test_pattern,
--
2.37.3
During fuzz testing, the following issue was discovered:
BUG: KMSAN: kernel-infoleak in _copy_to_iter+0x598/0x2a30
_copy_to_iter+0x598/0x2a30
__skb_datagram_iter+0x168/0x1060
skb_copy_datagram_iter+0x5b/0x220
netlink_recvmsg+0x362/0x1700
sock_recvmsg+0x2dc/0x390
__sys_recvfrom+0x381/0x6d0
__x64_sys_recvfrom+0x130/0x200
x64_sys_call+0x32c8/0x3cc0
do_syscall_64+0xd8/0x1c0
entry_SYSCALL_64_after_hwframe+0x79/0x81
Uninit was stored to memory at:
copy_to_user_state_extra+0xcc1/0x1e00
dump_one_state+0x28c/0x5f0
xfrm_state_walk+0x548/0x11e0
xfrm_dump_sa+0x1e0/0x840
netlink_dump+0x943/0x1c40
__netlink_dump_start+0x746/0xdb0
xfrm_user_rcv_msg+0x429/0xc00
netlink_rcv_skb+0x613/0x780
xfrm_netlink_rcv+0x77/0xc0
netlink_unicast+0xe90/0x1280
netlink_sendmsg+0x126d/0x1490
__sock_sendmsg+0x332/0x3d0
____sys_sendmsg+0x863/0xc30
___sys_sendmsg+0x285/0x3e0
__x64_sys_sendmsg+0x2d6/0x560
x64_sys_call+0x1316/0x3cc0
do_syscall_64+0xd8/0x1c0
entry_SYSCALL_64_after_hwframe+0x79/0x81
Uninit was created at:
__kmalloc+0x571/0xd30
attach_auth+0x106/0x3e0
xfrm_add_sa+0x2aa0/0x4230
xfrm_user_rcv_msg+0x832/0xc00
netlink_rcv_skb+0x613/0x780
xfrm_netlink_rcv+0x77/0xc0
netlink_unicast+0xe90/0x1280
netlink_sendmsg+0x126d/0x1490
__sock_sendmsg+0x332/0x3d0
____sys_sendmsg+0x863/0xc30
___sys_sendmsg+0x285/0x3e0
__x64_sys_sendmsg+0x2d6/0x560
x64_sys_call+0x1316/0x3cc0
do_syscall_64+0xd8/0x1c0
entry_SYSCALL_64_after_hwframe+0x79/0x81
Bytes 328-379 of 732 are uninitialized
Memory access of size 732 starts at ffff88800e18e000
Data copied to user address 00007ff30f48aff0
CPU: 2 PID: 18167 Comm: syz-executor.0 Not tainted 6.8.11 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Fixes copying of xfrm algorithms where some random
data of the structure fields can end up in userspace.
Padding in structures may be filled with random (possibly sensitve)
data and should never be given directly to user-space.
A similar issue was resolved in the commit
8222d5910dae ("xfrm: Zero padding when dumping algos and encap")
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: c7a5899eb26e ("xfrm: redact SA secret with lockdown confidentiality")
Cc: stable(a)vger.kernel.org
Co-developed-by: Boris Tonofa <b.tonofa(a)ideco.ru>
Signed-off-by: Boris Tonofa <b.tonofa(a)ideco.ru>
Signed-off-by: Petr Vaganov <p.vaganov(a)ideco.ru>
---
v3: Corrected commit description "This patch fixes copying..." to
"Fixes copying..." according to accepted rules of Linux kernel commits,
as suggested by Markus Elfring <elfring(a)users.sourceforge.net>.
v2: Fixed typo in sizeof(sizeof(ap->alg_name) expression.
The third argument for the strscpy_pad macro was chosen by
analogy with those in other functions of this file - as did
commit 8222d5910dae ("xfrm: Zero padding when dumping algos and encap").
I still think it would be better to leave the strscpy_pad() macro with
three arguments in this patch, mainly so that it would be consistent
with the existing similar code in this module.
Regarding strncpy() above, we don't think it is
a real issue since the uncopied parts should be already padded with zeroes
by nla_reserve().
net/xfrm/xfrm_user.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 55f039ec3d59..0083faabe8be 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1098,7 +1098,9 @@ static int copy_to_user_auth(struct xfrm_algo_auth *auth, struct sk_buff *skb)
if (!nla)
return -EMSGSIZE;
ap = nla_data(nla);
- memcpy(ap, auth, sizeof(struct xfrm_algo_auth));
+ strscpy_pad(ap->alg_name, auth->alg_name, sizeof(ap->alg_name));
+ ap->alg_key_len = auth->alg_key_len;
+ ap->alg_trunc_len = auth->alg_trunc_len;
if (redact_secret && auth->alg_key_len)
memset(ap->alg_key, 0, (auth->alg_key_len + 7) / 8);
else
--
2.46.2
New processors have become pickier about the local APIC timer state
before entering low power modes. These low power modes are used (for
example) when you close your laptop lid and suspend. If you put your
laptop in a bag in this unnecessarily-high-power state, it is likely
to get quite toasty while it quickly sucks the battery dry.
The problem boils down to some CPUs' inability to power down until the
kernel fully disables the local APIC timer. The current kernel code
works in one-shot and periodic modes but does not work for deadline
mode. Deadline mode has been the supported and preferred mode on
Intel CPUs for over a decade and uses an MSR to drive the timer
instead of an APIC register.
Disable the TSC Deadline timer in lapic_timer_shutdown() by writing to
MSR_IA32_TSC_DEADLINE when in TSC-deadline mode. Also avoid writing
to the initial-count register (APIC_TMICT) which is ignored in
TSC-deadline mode.
Note: The APIC_LVTT|=APIC_LVT_MASKED operation should theoretically be
enough to tell the hardware that the timer will not fire in any of the
timer modes. But mitigating AMD erratum 411[1] also requires clearing
out APIC_TMICT. Solely setting APIC_LVT_MASKED is also ineffective in
practice on Intel Lunar Lake systems, which is the motivation for this
change.
1. 411 Processor May Exit Message-Triggered C1E State Without an Interrupt if Local APIC Timer Reaches Zero - https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/revisio…
Cc: stable(a)vger.kernel.org
Fixes: 279f1461432c ("x86: apic: Use tsc deadline for oneshot when available")
Suggested-by: Dave Hansen <dave.hansen(a)intel.com>
Signed-off-by: Zhang Rui <rui.zhang(a)intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada(a)linux.intel.com>
Tested-by: Todd Brandt <todd.e.brandt(a)intel.com>
---
V2
- Improve changelog
V3
- Subject and changelog rewrite
- Check LAPIC Timer mode using APIC_LVTT value instead of extra CPU feature flag check
- Avoid APIC_TMICT write which is ignored in TSC-deadline mode
V4
- Add back Fixes tag and stable tag which was missing in V3
- Update patch recipients
---
arch/x86/kernel/apic/apic.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 6513c53c9459..5436a4083065 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -440,7 +440,19 @@ static int lapic_timer_shutdown(struct clock_event_device *evt)
v = apic_read(APIC_LVTT);
v |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR);
apic_write(APIC_LVTT, v);
- apic_write(APIC_TMICT, 0);
+
+ /*
+ * Setting APIC_LVT_MASKED should be enough to tell the
+ * hardware that this timer will never fire. But AMD
+ * erratum 411 and some Intel CPU behavior circa 2024
+ * say otherwise. Time for belt and suspenders programming,
+ * mask the timer and zero the counter registers:
+ */
+ if (v & APIC_LVT_TIMER_TSCDEADLINE)
+ wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
+ else
+ apic_write(APIC_TMICT, 0);
+
return 0;
}
--
2.34.1
The patch titled
Subject: mm/swapfile: skip HugeTLB pages for unuse_vma
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-swapfile-skip-hugetlb-pages-for-unuse_vma.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Liu Shixin <liushixin2(a)huawei.com>
Subject: mm/swapfile: skip HugeTLB pages for unuse_vma
Date: Tue, 15 Oct 2024 09:45:21 +0800
I got a bad pud error and lost a 1GB HugeTLB when calling swapoff. The
problem can be reproduced by the following steps:
1. Allocate an anonymous 1GB HugeTLB and some other anonymous memory.
2. Swapout the above anonymous memory.
3. run swapoff and we will get a bad pud error in kernel message:
mm/pgtable-generic.c:42: bad pud 00000000743d215d(84000001400000e7)
We can tell that pud_clear_bad is called by pud_none_or_clear_bad in
unuse_pud_range() by ftrace. And therefore the HugeTLB pages will never
be freed because we lost it from page table. We can skip HugeTLB pages
for unuse_vma to fix it.
Link: https://lkml.kernel.org/r/20241015014521.570237-1-liushixin2@huawei.com
Fixes: 0fe6e20b9c4c ("hugetlb, rmap: add reverse mapping for hugepage")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Acked-by: Muchun Song <muchun.song(a)linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/swapfile.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/swapfile.c~mm-swapfile-skip-hugetlb-pages-for-unuse_vma
+++ a/mm/swapfile.c
@@ -2313,7 +2313,7 @@ static int unuse_mm(struct mm_struct *mm
mmap_read_lock(mm);
for_each_vma(vmi, vma) {
- if (vma->anon_vma) {
+ if (vma->anon_vma && !is_vm_hugetlb_page(vma)) {
ret = unuse_vma(vma, type);
if (ret)
break;
_
Patches currently in -mm which might be from liushixin2(a)huawei.com are
mm-swapfile-skip-hugetlb-pages-for-unuse_vma.patch
I got a bad pud error and lost a 1GB HugeTLB when calling swapoff.
The problem can be reproduced by the following steps:
1. Allocate an anonymous 1GB HugeTLB and some other anonymous memory.
2. Swapout the above anonymous memory.
3. run swapoff and we will get a bad pud error in kernel message:
mm/pgtable-generic.c:42: bad pud 00000000743d215d(84000001400000e7)
We can tell that pud_clear_bad is called by pud_none_or_clear_bad
in unuse_pud_range() by ftrace. And therefore the HugeTLB pages will
never be freed because we lost it from page table. We can skip
HugeTLB pages for unuse_vma to fix it.
Fixes: 0fe6e20b9c4c ("hugetlb, rmap: add reverse mapping for hugepage")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
---
mm/swapfile.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0cded32414a1..f4ef91513fc9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2312,7 +2312,7 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
mmap_read_lock(mm);
for_each_vma(vmi, vma) {
- if (vma->anon_vma) {
+ if (vma->anon_vma && !is_vm_hugetlb_page(vma)) {
ret = unuse_vma(vma, type);
if (ret)
break;
--
2.34.1