This is a backport of the series that fixes the way deadline bandwidth
restoration is done which is causing noticeable delay on resume path. It also
converts the cpuset lock back into a mutex which some users on Android too.
I lack the details but AFAIU the read/write semaphore was slower on high
contention.
Compile tested against some randconfig for different archs and tested against
android14-5.15 GKI kernel, which already contains a version of this backport.
My testing is limited to resume path only; and general phone usage to make sure
nothing falls apart. Would be good to have some deadline specific testing done
too.
Based on v5.15.127
Original series:
https://lore.kernel.org/lkml/20230508075854.17215-1-juri.lelli@redhat.com/
Thanks!
--
Qais Yousef
Dietmar Eggemann (2):
sched/deadline: Create DL BW alloc, free & check overflow interface
cgroup/cpuset: Free DL BW in case can_attach() fails
Juri Lelli (4):
cgroup/cpuset: Rename functions dealing with DEADLINE accounting
sched/cpuset: Bring back cpuset_mutex
sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets
cgroup/cpuset: Iterate only if DEADLINE tasks are present
include/linux/cpuset.h | 12 ++-
include/linux/sched.h | 4 +-
kernel/cgroup/cgroup.c | 4 +
kernel/cgroup/cpuset.c | 232 ++++++++++++++++++++++++++--------------
kernel/sched/core.c | 41 ++++---
kernel/sched/deadline.c | 66 +++++++++---
kernel/sched/sched.h | 2 +-
7 files changed, 238 insertions(+), 123 deletions(-)
--
2.34.1
Attn:
I'm an Investment Consultant in the United Kingdom, I specialize
in searching for potential investments opportunities for high
net-worth clients worldwide.
Should this be of interest to you, please do not hesitate to
email me for further information.
Kind regards,
David Brennan
eMail:davbrennanb@gmail.com
19 sierpnia 2023 r.
Cześć,
Mam propozycję biznesową, którą chcę się z tobą podzielić. Odpowiedz w
języku angielskim, aby uzyskać więcej informacji.
Pozdrowienia
Pani Wiktorii Cleland
____________________
Sekretarz: Moradmand Celine
Good day.
I emailed you a week ago, but I’m not sure if you received it. Please
confirm if you receive it or not so I can be sure Thank you.
Yours Sincerely,
Mr.Mohammed
Hi,
This is Carrie. We are bag manufacturer directly from China.
Below are some bags we did for your reference. All bags can be customized according to your request.
We are happy to send you quotation and share our catalog with you.
Please contact directly if there is any project that we can support.
Wish you have a nice day!
Kind regards,
Carrie
+86 18906051620
If you don’t need our products, please kindly reply “unsubscribe”. Thanks for your time.
These are the women of Maragua What are catastrophic effects? A three degree centigrade climate change rise that will result in 50 percent species extinction It starts with something we are all familiar with -- waves However, while we often think of stretching a muscle like stretching a rubber band, muscles are actually comprised of various tissue types, which interact to make a complex material And it may surprise you to know that "Where in the World is Carmen Sandiego?" continues to be the last substantial giant hit in the entertainment business, despite the fact that it was 1987, which is such an incredibly long time ago, and I'm only 36, so you can do the math
From: Sven Eckelmann <sven(a)narfation.org>
If an interface changes the MTU, it is expected that an NETDEV_PRECHANGEMTU
and NETDEV_CHANGEMTU notification events is triggered. This worked fine for
.ndo_change_mtu based changes because core networking code took care of it.
But for auto-adjustments after hard-interfaces changes, these events were
simply missing.
Due to this problem, non-batman-adv components weren't aware of MTU changes
and thus couldn't perform their own tasks correctly.
Fixes: c6c8fea29769 ("net: Add batman-adv meshing protocol")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sven Eckelmann <sven(a)narfation.org>
Signed-off-by: Simon Wunderlich <sw(a)simonwunderlich.de>
---
net/batman-adv/hard-interface.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/batman-adv/hard-interface.c b/net/batman-adv/hard-interface.c
index 41c1ad33d009..ae5762af0146 100644
--- a/net/batman-adv/hard-interface.c
+++ b/net/batman-adv/hard-interface.c
@@ -630,7 +630,7 @@ int batadv_hardif_min_mtu(struct net_device *soft_iface)
*/
void batadv_update_min_mtu(struct net_device *soft_iface)
{
- soft_iface->mtu = batadv_hardif_min_mtu(soft_iface);
+ dev_set_mtu(soft_iface, batadv_hardif_min_mtu(soft_iface));
/* Check if the local translate table should be cleaned up to match a
* new (and smaller) MTU.
--
2.39.2
We hit softlocup with following call trace:
? asm_sysvec_apic_timer_interrupt+0x16/0x20
xa_erase+0x21/0xb0
? sgx_free_epc_page+0x20/0x50
sgx_vepc_release+0x75/0x220
__fput+0x89/0x250
task_work_run+0x59/0x90
do_exit+0x337/0x9a0
Similar like commit 8795359e35bc ("x86/sgx: Silence softlockup detection
when releasing large enclaves"). The test system has 64GB of enclave memory,
and all assigned to a single VM. Release vepc take longer time and triggers
the softlockup warning.
Add cond_resched() to give other tasks a chance to run and placate
the softlockup detector.
Cc: Jarkko Sakkinen <jarkko(a)kernel.org>
Cc: Haitao Huang <haitao.huang(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
Fixes: 540745ddbc70 ("x86/sgx: Introduce virtual EPC for use by KVM guests")
Reported-by: Yu Zhang <yu.zhang(a)ionos.com>
Tested-by: Yu Zhang <yu.zhang(a)ionos.com>
Acked-by: Haitao Huang <haitao.huang(a)linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko(a)kernel.org>
Signed-off-by: Jack Wang <jinpu.wang(a)ionos.com>
---
v3:
* improve commit message as suggested.
* Add cond_resched() to the 3rd loop too.
arch/x86/kernel/cpu/sgx/virt.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
index c3e37eaec8ec..7aaa3652e31d 100644
--- a/arch/x86/kernel/cpu/sgx/virt.c
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -204,6 +204,7 @@ static int sgx_vepc_release(struct inode *inode, struct file *file)
continue;
xa_erase(&vepc->page_array, index);
+ cond_resched();
}
/*
@@ -222,6 +223,7 @@ static int sgx_vepc_release(struct inode *inode, struct file *file)
list_add_tail(&epc_page->list, &secs_pages);
xa_erase(&vepc->page_array, index);
+ cond_resched();
}
/*
@@ -243,6 +245,7 @@ static int sgx_vepc_release(struct inode *inode, struct file *file)
if (sgx_vepc_free_page(epc_page))
list_add_tail(&epc_page->list, &secs_pages);
+ cond_resched();
}
if (!list_empty(&secs_pages))
--
2.34.1
The patch titled
Subject: nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
nilfs2-fix-warning-in-mark_buffer_dirty-due-to-discarded-buffer-reuse.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Subject: nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse
Date: Fri, 18 Aug 2023 22:18:04 +0900
A syzbot stress test using a corrupted disk image reported that
mark_buffer_dirty() called from __nilfs_mark_inode_dirty() or
nilfs_palloc_commit_alloc_entry() may output a kernel warning, and can
panic if the kernel is booted with panic_on_warn.
This is because nilfs2 keeps buffer pointers in local structures for some
metadata and reuses them, but such buffers may be forcibly discarded by
nilfs_clear_dirty_page() in some critical situations.
This issue is reported to appear after commit 28a65b49eb53 ("nilfs2: do
not write dirty data after degenerating to read-only"), but the issue has
potentially existed before.
Fix this issue by checking the uptodate flag when attempting to reuse an
internally held buffer, and reloading the metadata instead of reusing the
buffer if the flag was lost.
Link: https://lkml.kernel.org/r/20230818131804.7758-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Reported-by: syzbot+cdfcae656bac88ba0e2d(a)syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/0000000000003da75f05fdeffd12@google.com
Fixes: 8c26c4e2694a ("nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption")
Tested-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Cc: <stable(a)vger.kernel.org> # 3.10+
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/nilfs2/alloc.c | 3 ++-
fs/nilfs2/inode.c | 7 +++++--
2 files changed, 7 insertions(+), 3 deletions(-)
--- a/fs/nilfs2/alloc.c~nilfs2-fix-warning-in-mark_buffer_dirty-due-to-discarded-buffer-reuse
+++ a/fs/nilfs2/alloc.c
@@ -205,7 +205,8 @@ static int nilfs_palloc_get_block(struct
int ret;
spin_lock(lock);
- if (prev->bh && blkoff == prev->blkoff) {
+ if (prev->bh && blkoff == prev->blkoff &&
+ likely(buffer_uptodate(prev->bh))) {
get_bh(prev->bh);
*bhp = prev->bh;
spin_unlock(lock);
--- a/fs/nilfs2/inode.c~nilfs2-fix-warning-in-mark_buffer_dirty-due-to-discarded-buffer-reuse
+++ a/fs/nilfs2/inode.c
@@ -1025,7 +1025,7 @@ int nilfs_load_inode_block(struct inode
int err;
spin_lock(&nilfs->ns_inode_lock);
- if (ii->i_bh == NULL) {
+ if (ii->i_bh == NULL || unlikely(!buffer_uptodate(ii->i_bh))) {
spin_unlock(&nilfs->ns_inode_lock);
err = nilfs_ifile_get_inode_block(ii->i_root->ifile,
inode->i_ino, pbh);
@@ -1034,7 +1034,10 @@ int nilfs_load_inode_block(struct inode
spin_lock(&nilfs->ns_inode_lock);
if (ii->i_bh == NULL)
ii->i_bh = *pbh;
- else {
+ else if (unlikely(!buffer_uptodate(ii->i_bh))) {
+ __brelse(ii->i_bh);
+ ii->i_bh = *pbh;
+ } else {
brelse(*pbh);
*pbh = ii->i_bh;
}
_
Patches currently in -mm which might be from konishi.ryusuke(a)gmail.com are
nilfs2-fix-general-protection-fault-in-nilfs_lookup_dirty_data_buffers.patch
nilfs2-fix-warning-in-mark_buffer_dirty-due-to-discarded-buffer-reuse.patch
commit 9d26d3a8f1b0 ("PCI: Put PCIe ports into D3 during suspend")
changed pci_bridge_d3_possible() so that any vendor's PCIe ports
from modern machines (>=2015) are allowed to be put into D3.
Iain reports that USB devices can't be used to wake a Lenovo Z13
from suspend. This is because the PCIe root port has been put
into D3 and AMD's platform can't handle USB devices waking in this
case.
This behavior is only reported on Linux. Comparing the behavior
on Windows and Linux, Windows doesn't put the root ports into D3.
To fix the issue without regressing existing Intel systems,
limit the >=2015 check to only apply to Intel PCIe ports.
Cc: stable(a)vger.kernel.org
Fixes: 9d26d3a8f1b0 ("PCI: Put PCIe ports into D3 during suspend")
Reported-by: Iain Lane <iain(a)orangesquash.org.uk>
Closes: https://forums.lenovo.com/t5/Ubuntu/Z13-can-t-resume-from-suspend-with-exte…
Signed-off-by: Mario Limonciello <mario.limonciello(a)amd.com>
---
v12->v13:
* New patch
---
drivers/pci/pci.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 60230da957e0c..051e88ee64c63 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3037,10 +3037,11 @@ bool pci_bridge_d3_possible(struct pci_dev *bridge)
return false;
/*
- * It should be safe to put PCIe ports from 2015 or newer
+ * It is safe to put Intel PCIe ports from 2015 or newer
* to D3.
*/
- if (dmi_get_bios_year() >= 2015)
+ if (bridge->vendor == PCI_VENDOR_ID_INTEL &&
+ dmi_get_bios_year() >= 2015)
return true;
break;
}
--
2.34.1
The following commit has been merged into the ras/core branch of tip:
Commit-ID: 4240e2ebe67941ce2c4f5c866c3af4b5ac7a0c67
Gitweb: https://git.kernel.org/tip/4240e2ebe67941ce2c4f5c866c3af4b5ac7a0c67
Author: Yazen Ghannam <yazen.ghannam(a)amd.com>
AuthorDate: Mon, 14 Aug 2023 15:08:53 -05:00
Committer: Borislav Petkov (AMD) <bp(a)alien8.de>
CommitterDate: Fri, 18 Aug 2023 13:05:52 +02:00
x86/MCE: Always save CS register on AMD Zen IF Poison errors
The Instruction Fetch (IF) units on current AMD Zen-based systems do not
guarantee a synchronous #MC is delivered for poison consumption errors.
Therefore, MCG_STATUS[EIPV|RIPV] will not be set. However, the
microarchitecture does guarantee that the exception is delivered within
the same context. In other words, the exact rIP is not known, but the
context is known to not have changed.
There is no architecturally-defined method to determine this behavior.
The Code Segment (CS) register is always valid on such IF unit poison
errors regardless of the value of MCG_STATUS[EIPV|RIPV].
Add a quirk to save the CS register for poison consumption from the IF
unit banks.
This is needed to properly determine the context of the error.
Otherwise, the severity grading function will assume the context is
IN_KERNEL due to the m->cs value being 0 (the initialized value). This
leads to unnecessary kernel panics on data poison errors due to the
kernel believing the poison consumption occurred in kernel context.
Signed-off-by: Yazen Ghannam <yazen.ghannam(a)amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20230814200853.29258-1-yazen.ghannam@amd.com
---
arch/x86/kernel/cpu/mce/core.c | 26 ++++++++++++++++++++++++++
arch/x86/kernel/cpu/mce/internal.h | 5 ++++-
2 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index b8ad5a5..6f35f72 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -843,6 +843,26 @@ static noinstr bool quirk_skylake_repmov(void)
}
/*
+ * Some Zen-based Instruction Fetch Units set EIPV=RIPV=0 on poison consumption
+ * errors. This means mce_gather_info() will not save the "ip" and "cs" registers.
+ *
+ * However, the context is still valid, so save the "cs" register for later use.
+ *
+ * The "ip" register is truly unknown, so don't save it or fixup EIPV/RIPV.
+ *
+ * The Instruction Fetch Unit is at MCA bank 1 for all affected systems.
+ */
+static __always_inline void quirk_zen_ifu(int bank, struct mce *m, struct pt_regs *regs)
+{
+ if (bank != 1)
+ return;
+ if (!(m->status & MCI_STATUS_POISON))
+ return;
+
+ m->cs = regs->cs;
+}
+
+/*
* Do a quick check if any of the events requires a panic.
* This decides if we keep the events around or clear them.
*/
@@ -861,6 +881,9 @@ static __always_inline int mce_no_way_out(struct mce *m, char **msg, unsigned lo
if (mce_flags.snb_ifu_quirk)
quirk_sandybridge_ifu(i, m, regs);
+ if (mce_flags.zen_ifu_quirk)
+ quirk_zen_ifu(i, m, regs);
+
m->bank = i;
if (mce_severity(m, regs, &tmp, true) >= MCE_PANIC_SEVERITY) {
mce_read_aux(m, i);
@@ -1849,6 +1872,9 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
if (c->x86 == 0x15 && c->x86_model <= 0xf)
mce_flags.overflow_recov = 1;
+ if (c->x86 >= 0x17 && c->x86 <= 0x1A)
+ mce_flags.zen_ifu_quirk = 1;
+
}
if (c->x86_vendor == X86_VENDOR_INTEL) {
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index ed4a71c..bcf1b3c 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -157,6 +157,9 @@ struct mce_vendor_flags {
*/
smca : 1,
+ /* Zen IFU quirk */
+ zen_ifu_quirk : 1,
+
/* AMD-style error thresholding banks present. */
amd_threshold : 1,
@@ -172,7 +175,7 @@ struct mce_vendor_flags {
/* Skylake, Cascade Lake, Cooper Lake REP;MOVS* quirk */
skx_repmov_quirk : 1,
- __reserved_0 : 56;
+ __reserved_0 : 55;
};
extern struct mce_vendor_flags mce_flags;
I am Mr.Patrick Joseph, The Director in charge of Head of Operations
section of Africa Development Bank Burkina Faso. I need your urgent
business assistance in transferring to your bank account an abandoned
sum of $26.5 million dollars belonging to our deceased customer who
died with his entire family in 2006, leaving nobody for the claim, I
ask you, can we work together? I will be pleased to work with you as a
trusted person and see that the fund is transferred out of my Bank
into another Bank Account, Your share is 35% while 65% for me. Contact
me immediately If you're interested, so I will let you know the next
steps to follow. The transaction is 100% risky free.
Thanks,
Mr.Patrick Joseph.
The current implementation of append may cause duplicate data and/or
incorrect ranges to be returned to a reader during an update. Although
this has not been reported or seen, disable the append write operation
while the tree is in rcu mode out of an abundance of caution.
During the analysis of the mas_next_slot() the following was
artificially created by separating the writer and reader code:
Writer: reader:
mas_wr_append
set end pivot
updates end metata
Detects write to last slot
last slot write is to start of slot
store current contents in slot
overwrite old end pivot
mas_next_slot():
read end metadata
read old end pivot
return with incorrect range
store new value
Alternatively:
Writer: reader:
mas_wr_append
set end pivot
updates end metata
Detects write to last slot
last lost write to end of slot
store value
mas_next_slot():
read end metadata
read old end pivot
read new end pivot
return with incorrect range
set old end pivot
There may be other accesses that are not safe since we are now updating
both metadata and pointers, so disabling append if there could be rcu
readers is the safest action.
Cc: stable(a)vger.kernel.org
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
---
lib/maple_tree.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index ffb9d15bd815..05d5db255c39 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -4107,6 +4107,10 @@ static inline unsigned char mas_wr_new_end(struct ma_wr_state *wr_mas)
* mas_wr_append: Attempt to append
* @wr_mas: the maple write state
*
+ * This is currently unsafe in rcu mode since the end of the node may be cached
+ * by readers while the node contents may be updated which could result in
+ * inaccurate information.
+ *
* Return: True if appended, false otherwise
*/
static inline bool mas_wr_append(struct ma_wr_state *wr_mas,
@@ -4116,6 +4120,9 @@ static inline bool mas_wr_append(struct ma_wr_state *wr_mas,
struct ma_state *mas = wr_mas->mas;
unsigned char node_pivots = mt_pivots[wr_mas->type];
+ if (mt_in_rcu(mas->tree))
+ return false;
+
if (mas->offset != wr_mas->node_end)
return false;
--
2.39.2
(Before starting, i want to say i am not the developer of the patch, i am
just an end user of a distribution kernel, which has just faced some
problem, i don't even know if kernel team accepts patch requests from non
dev, and thi mail is mostly shot in dark)
Respected Sir/s/Ma'am/s
I want to request that patch
https://lore.kernel.org/all/20230627062442.54008-1-mika.westerberg@linux.in…
be accepted in to the stable 6.1 branch, it is already present in the
current branch, and i request for it to be backported. This fixes some
Intel CPUs (according to other users claims, ranging back to 6th gen
for some) not going to deeper c states with kernels 5.16+, and was
fixed by this patch in 6.4.4+, since many institutions, and
distributions use 6.1, it would be great if this is included (from
what i understand from requirements
(https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html)
i guess it fits all)
I was facing this issue for roughly a year, and then reported in
debian(my distribution) forums, and we found nothing earlier
(https://forums.debian.net/viewtopic.php?t=154875), then i posted on
some other forums, and then i was asked to check if there was an
active bug linux kernel tracker at that time, and i could not find
one, so i decided to raise an issue
(https://bugzilla.kernel.org/show_bug.cgi?id=217616), but just a few
days prior to me posting someone had submiteed the patch, that i did
not know of, since then i am using the newest kernel(in 6.4 branch)
and am not facing issues.
I apologize for all the foolishness of mine, and just making an humble
request.
Thank You
When we use NT_ARM_SSVE to either enable streaming mode or change the
vector length for a process we do not currently do anything to ensure that
there is storage allocated for the SME specific register state. If the
task had not previously used SME or we changed the vector length then
the task will not have had TIF_SME set or backing storage for ZA/ZT
allocated, resulting in inconsistent register sizes when saving state
and spurious traps which flush the newly set register state.
We should set TIF_SME to disable traps and ensure that storage is
allocated for ZA and ZT if it is not already allocated. This requires
modifying sme_alloc() to make the flush of any existing register state
optional so we don't disturb existing state for ZA and ZT.
Fixes: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers")
Reported-by: David Spickett <David.Spickett(a)arm.com>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Cc: stable(a)vger.kernel.org
---
arch/arm64/include/asm/fpsimd.h | 4 ++--
arch/arm64/kernel/fpsimd.c | 6 +++---
arch/arm64/kernel/ptrace.c | 12 ++++++++++--
arch/arm64/kernel/signal.c | 2 +-
4 files changed, 16 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
index 67f2fb781f59..8df46f186c64 100644
--- a/arch/arm64/include/asm/fpsimd.h
+++ b/arch/arm64/include/asm/fpsimd.h
@@ -356,7 +356,7 @@ static inline int sme_max_virtualisable_vl(void)
return vec_max_virtualisable_vl(ARM64_VEC_SME);
}
-extern void sme_alloc(struct task_struct *task);
+extern void sme_alloc(struct task_struct *task, bool flush);
extern unsigned int sme_get_vl(void);
extern int sme_set_current_vl(unsigned long arg);
extern int sme_get_current_vl(void);
@@ -388,7 +388,7 @@ static inline void sme_smstart_sm(void) { }
static inline void sme_smstop_sm(void) { }
static inline void sme_smstop(void) { }
-static inline void sme_alloc(struct task_struct *task) { }
+static inline void sme_alloc(struct task_struct *task, bool flush) { }
static inline void sme_setup(void) { }
static inline unsigned int sme_get_vl(void) { return 0; }
static inline int sme_max_vl(void) { return 0; }
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 75c37b1c55aa..087c05aa960e 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -1285,9 +1285,9 @@ void fpsimd_release_task(struct task_struct *dead_task)
* the interest of testability and predictability, the architecture
* guarantees that when ZA is enabled it will be zeroed.
*/
-void sme_alloc(struct task_struct *task)
+void sme_alloc(struct task_struct *task, bool flush)
{
- if (task->thread.sme_state) {
+ if (task->thread.sme_state && flush) {
memset(task->thread.sme_state, 0, sme_state_size(task));
return;
}
@@ -1515,7 +1515,7 @@ void do_sme_acc(unsigned long esr, struct pt_regs *regs)
}
sve_alloc(current, false);
- sme_alloc(current);
+ sme_alloc(current, true);
if (!current->thread.sve_state || !current->thread.sme_state) {
force_sig(SIGKILL);
return;
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 5b9b4305248b..95568e865ae1 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -881,6 +881,14 @@ static int sve_set_common(struct task_struct *target,
break;
case ARM64_VEC_SME:
target->thread.svcr |= SVCR_SM_MASK;
+
+ /*
+ * Disable tramsp and ensure there is SME
+ * storage but preserve any currently set
+ * values in ZA/ZT.
+ */
+ sme_alloc(target, false);
+ set_tsk_thread_flag(target, TIF_SME);
break;
default:
WARN_ON_ONCE(1);
@@ -1100,7 +1108,7 @@ static int za_set(struct task_struct *target,
}
/* Allocate/reinit ZA storage */
- sme_alloc(target);
+ sme_alloc(target, true);
if (!target->thread.sme_state) {
ret = -ENOMEM;
goto out;
@@ -1171,7 +1179,7 @@ static int zt_set(struct task_struct *target,
return -EINVAL;
if (!thread_za_enabled(&target->thread)) {
- sme_alloc(target);
+ sme_alloc(target, true);
if (!target->thread.sme_state)
return -ENOMEM;
}
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index e304f7ebec2a..c7ebe744c64e 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -475,7 +475,7 @@ static int restore_za_context(struct user_ctxs *user)
fpsimd_flush_task_state(current);
/* From now, fpsimd_thread_switch() won't touch thread.sve_state */
- sme_alloc(current);
+ sme_alloc(current, true);
if (!current->thread.sme_state) {
current->thread.svcr &= ~SVCR_ZA_MASK;
clear_thread_flag(TIF_SME);
---
base-commit: 52a93d39b17dc7eb98b6aa3edb93943248e03b2f
change-id: 20230809-arm64-fix-ptrace-race-db8552fb985b
Best regards,
--
Mark Brown <broonie(a)kernel.org>
From: Hugo Villeneuve <hvilleneuve(a)dimonoff.com>
Some variants in this series of UART controllers have GPIO pins that
are shared between GPIO and modem control lines.
The pin mux mode (GPIO or modem control lines) can be set for each
ports (channels) supported by the variant.
This adds a property to the device tree to set the GPIO pin mux to
modem control lines on selected ports if needed.
Cc: <stable(a)vger.kernel.org> # 6.1.x
Signed-off-by: Hugo Villeneuve <hvilleneuve(a)dimonoff.com>
Acked-by: Conor Dooley <conor.dooley(a)microchip.com>
Reviewed-by: Lech Perczak <lech.perczak(a)camlingroup.com>
---
.../bindings/serial/nxp,sc16is7xx.txt | 46 +++++++++++++++++++
1 file changed, 46 insertions(+)
diff --git a/Documentation/devicetree/bindings/serial/nxp,sc16is7xx.txt b/Documentation/devicetree/bindings/serial/nxp,sc16is7xx.txt
index 0fa8e3e43bf8..1a7e4bff0456 100644
--- a/Documentation/devicetree/bindings/serial/nxp,sc16is7xx.txt
+++ b/Documentation/devicetree/bindings/serial/nxp,sc16is7xx.txt
@@ -23,6 +23,9 @@ Optional properties:
1 = active low.
- irda-mode-ports: An array that lists the indices of the port that
should operate in IrDA mode.
+- nxp,modem-control-line-ports: An array that lists the indices of the port that
+ should have shared GPIO lines configured as
+ modem control lines.
Example:
sc16is750: sc16is750@51 {
@@ -35,6 +38,26 @@ Example:
#gpio-cells = <2>;
};
+ sc16is752: sc16is752@53 {
+ compatible = "nxp,sc16is752";
+ reg = <0x53>;
+ clocks = <&clk20m>;
+ interrupt-parent = <&gpio3>;
+ interrupts = <7 IRQ_TYPE_EDGE_FALLING>;
+ nxp,modem-control-line-ports = <1>; /* Port 1 as modem control lines */
+ gpio-controller; /* Port 0 as GPIOs */
+ #gpio-cells = <2>;
+ };
+
+ sc16is752: sc16is752@54 {
+ compatible = "nxp,sc16is752";
+ reg = <0x54>;
+ clocks = <&clk20m>;
+ interrupt-parent = <&gpio3>;
+ interrupts = <7 IRQ_TYPE_EDGE_FALLING>;
+ nxp,modem-control-line-ports = <0 1>; /* Ports 0 and 1 as modem control lines */
+ };
+
* spi as bus
Required properties:
@@ -59,6 +82,9 @@ Optional properties:
1 = active low.
- irda-mode-ports: An array that lists the indices of the port that
should operate in IrDA mode.
+- nxp,modem-control-line-ports: An array that lists the indices of the port that
+ should have shared GPIO lines configured as
+ modem control lines.
Example:
sc16is750: sc16is750@0 {
@@ -70,3 +96,23 @@ Example:
gpio-controller;
#gpio-cells = <2>;
};
+
+ sc16is752: sc16is752@1 {
+ compatible = "nxp,sc16is752";
+ reg = <1>;
+ clocks = <&clk20m>;
+ interrupt-parent = <&gpio3>;
+ interrupts = <7 IRQ_TYPE_EDGE_FALLING>;
+ nxp,modem-control-line-ports = <1>; /* Port 1 as modem control lines */
+ gpio-controller; /* Port 0 as GPIOs */
+ #gpio-cells = <2>;
+ };
+
+ sc16is752: sc16is752@2 {
+ compatible = "nxp,sc16is752";
+ reg = <2>;
+ clocks = <&clk20m>;
+ interrupt-parent = <&gpio3>;
+ interrupts = <7 IRQ_TYPE_EDGE_FALLING>;
+ nxp,modem-control-line-ports = <0 1>; /* Ports 0 and 1 as modem control lines */
+ };
--
2.30.2
This patch series enables support of i2c bus for Intel Alder Lake PCH-P and PCH-M
on kernel version 5.10. These patches add ID's of Alder lake platform in these
drivers: i801, intel-lpss, pinctrl. ID's were taken from linux kernel version 5.15.
Alexander Ofitserov (3):
i2c: i801: Add support for Intel Alder Lake PCH
mfd: intel-lpss: Add Alder Lake's PCI devices IDs
pinctrl: tigerlake: Add Alder Lake-P ACPI ID
drivers/i2c/busses/i2c-i801.c | 8 +++++
drivers/mfd/intel-lpss-pci.c | 41 +++++++++++++++++++++++
drivers/pinctrl/intel/pinctrl-tigerlake.c | 1 +
3 files changed, 50 insertions(+)
--
2.33.8
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
folio_mapcount() is used to check whether the folio is shared. But it's
not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
This patchset will fix the cases:
User space application call madvise() with MADV_FREE, MADV_COLD and
MADV_PAGEOUT for specific address range. There are THP mapped to the
range. Without the patchset, the THP is skipped. With the patch, the
THP will be split and handled accordingly.
David reported the cow self test skip some cases because of
MADV_PAGEOUT skip THP:
https://lore.kernel.org/linux-mm/9e92e42d-488f-47db-ac9d-75b24cd0d037@intel…
and I confirmed this patchset make it work again.
Changelog from v1:
- Avoid two Fixes tags make backport harder. Thank Andrew for pointing
this out.
- Add note section to mention this is a temporary fix which is fine
to reduce user-visble effects. For long term fix, we should wait for
David's solution. Thank Ryan and David for pointing this out.
- Spell user-visible effects out. Then people could decide whether
these patches are necessary for stable branch. Thank Andrew for
pointing this out.
V1:
https://lore.kernel.org/linux-mm/9e92e42d-488f-47db-ac9d-75b24cd0d037@intel…
Yin Fengwei (3):
madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount()
against large folio for sharing check
madvise:madvise_free_huge_pmd(): don't use mapcount() against large
folio for sharing check
madvise:madvise_free_pte_range(): don't use mapcount() against large
folio for sharing check
mm/huge_memory.c | 2 +-
mm/madvise.c | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)
--
2.39.2
We hit softlocup with following call trace:
? asm_sysvec_apic_timer_interrupt+0x16/0x20
xa_erase+0x21/0xb0
? sgx_free_epc_page+0x20/0x50
sgx_vepc_release+0x75/0x220
__fput+0x89/0x250
task_work_run+0x59/0x90
do_exit+0x337/0x9a0
Similar like commit
8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves")
The test system has 64GB of enclave memory, and all assigned to a single
VM. Release vepc take longer time and triggers the softlockup warning.
Add a cond_resched() to give other tasks a chance to run and placate
the softlockup detector.
Cc: Jarkko Sakkinen <jarkko(a)kernel.org>
Cc: Haitao Huang <haitao.huang(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
Fixes: 540745ddbc70 ("x86/sgx: Introduce virtual EPC for use by KVM guests")
Reported-by: Yu Zhang <yu.zhang(a)ionos.com>
Tested-by: Yu Zhang <yu.zhang(a)ionos.com>
Acked-by: Haitao Huang <haitao.huang(a)linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko(a)kernel.org>
Signed-off-by: Jack Wang <jinpu.wang(a)ionos.com>
---
v2:
* collects review and test by.
* add fixes tag
* trim call trace.
arch/x86/kernel/cpu/sgx/virt.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
index c3e37eaec8ec..01d2795792cc 100644
--- a/arch/x86/kernel/cpu/sgx/virt.c
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -204,6 +204,7 @@ static int sgx_vepc_release(struct inode *inode, struct file *file)
continue;
xa_erase(&vepc->page_array, index);
+ cond_resched();
}
/*
@@ -222,6 +223,7 @@ static int sgx_vepc_release(struct inode *inode, struct file *file)
list_add_tail(&epc_page->list, &secs_pages);
xa_erase(&vepc->page_array, index);
+ cond_resched();
}
/*
--
2.34.1
Your attention please!
My efforts to reaching you many times always not through.
Please may you kindly let me know if you are still using this email
address as my previous messages to you were not responded to.
I await hearing from you once more if my previous messages were not received.
Reach me via my email: privateemail01112(a)gmail.com
My regards,
Kein Briggs.
Because scsi_finish_command() subtracts the residual from the buffer
length, residual overflows must not be reported. Reflect this in the
SCSI documentation. See also commit 9237f04e12cc ("scsi: core: Fix
scsi_get/set_resid() interface")
Cc: Damien Le Moal <dlemoal(a)kernel.org>
Cc: Hannes Reinecke <hare(a)suse.de>
Cc: Douglas Gilbert <dgilbert(a)interlog.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Bart Van Assche <bvanassche(a)acm.org>
---
Documentation/scsi/scsi_mid_low_api.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/scsi/scsi_mid_low_api.rst b/Documentation/scsi/scsi_mid_low_api.rst
index 6fa3a6279501..022198c51350 100644
--- a/Documentation/scsi/scsi_mid_low_api.rst
+++ b/Documentation/scsi/scsi_mid_low_api.rst
@@ -1190,11 +1190,11 @@ Members of interest:
- pointer to scsi_device object that this command is
associated with.
resid
- - an LLD should set this signed integer to the requested
+ - an LLD should set this unsigned integer to the requested
transfer length (i.e. 'request_bufflen') less the number
of bytes that are actually transferred. 'resid' is
preset to 0 so an LLD can ignore it if it cannot detect
- underruns (overruns should be rare). If possible an LLD
+ underruns (overruns should not be reported). An LLD
should set 'resid' prior to invoking 'done'. The most
interesting case is data transfers from a SCSI target
device (e.g. READs) that underrun.
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 32c877191e022b55fe3a374f3d7e9fb5741c514d
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023081206-asleep-slain-4704@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
32c877191e02 ("hugetlb: do not clear hugetlb dtor until allocating vmemmap")
d6ef19e25df2 ("mm/hugetlb: convert update_and_free_page() to folios")
cfd5082b5147 ("mm/hugetlb: convert remove_hugetlb_page() to folios")
1a7cdab59b22 ("mm/hugetlb: convert dissolve_free_huge_page() to folios")
911565b82853 ("mm/hugetlb: convert destroy_compound_gigantic_page() to folios")
cb67f4282bf9 ("mm,thp,rmap: simplify compound page mapcount handling")
dad6a5eb5556 ("mm,hugetlb: use folio fields in second tail page")
0356c4b96f68 ("mm/hugetlb: convert free_huge_page to folios")
de656ed376c4 ("mm/hugetlb_cgroup: convert set_hugetlb_cgroup*() to folios")
f074732d599e ("mm/hugetlb_cgroup: convert hugetlb_cgroup_from_page() to folios")
a098c977722c ("mm/hugetlb_cgroup: convert __set_hugetlb_cgroup() to folios")
4781593d5dba ("mm/hugetlb: unify clearing of RestoreReserve for private pages")
149562f75094 ("mm/hugetlb: add hugetlb_folio_subpool() helpers")
d340625f4849 ("mm: add private field of first tail to struct page and struct folio")
71e2d666ef85 ("mm/huge_memory: do not clobber swp_entry_t during THP split")
8346d69d8bcb ("mm/hugetlb: add available_huge_pages() func")
2b21624fc232 ("hugetlb: freeze allocated pages before creating hugetlb pages")
14455eabd840 ("mm: use nth_page instead of mem_map_offset mem_map_next")
379708ffde1b ("mm: add the first tail page to struct folio")
6d751329e761 ("Merge branch 'mm-hotfixes-stable' into mm-stable")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 32c877191e022b55fe3a374f3d7e9fb5741c514d Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Date: Tue, 11 Jul 2023 15:09:41 -0700
Subject: [PATCH] hugetlb: do not clear hugetlb dtor until allocating vmemmap
Patch series "Fix hugetlb free path race with memory errors".
In the discussion of Jiaqi Yan's series "Improve hugetlbfs read on
HWPOISON hugepages" the race window was discovered.
https://lore.kernel.org/linux-mm/20230616233447.GB7371@monkey/
Freeing a hugetlb page back to low level memory allocators is performed
in two steps.
1) Under hugetlb lock, remove page from hugetlb lists and clear destructor
2) Outside lock, allocate vmemmap if necessary and call low level free
Between these two steps, the hugetlb page will appear as a normal
compound page. However, vmemmap for tail pages could be missing.
If a memory error occurs at this time, we could try to update page
flags non-existant page structs.
A much more detailed description is in the first patch.
The first patch addresses the race window. However, it adds a
hugetlb_lock lock/unlock cycle to every vmemmap optimized hugetlb page
free operation. This could lead to slowdowns if one is freeing a large
number of hugetlb pages.
The second path optimizes the update_and_free_pages_bulk routine to only
take the lock once in bulk operations.
The second patch is technically not a bug fix, but includes a Fixes tag
and Cc stable to avoid a performance regression. It can be combined with
the first, but was done separately make reviewing easier.
This patch (of 2):
Freeing a hugetlb page and releasing base pages back to the underlying
allocator such as buddy or cma is performed in two steps:
- remove_hugetlb_folio() is called to remove the folio from hugetlb
lists, get a ref on the page and remove hugetlb destructor. This
all must be done under the hugetlb lock. After this call, the page
can be treated as a normal compound page or a collection of base
size pages.
- update_and_free_hugetlb_folio() is called to allocate vmemmap if
needed and the free routine of the underlying allocator is called
on the resulting page. We can not hold the hugetlb lock here.
One issue with this scheme is that a memory error could occur between
these two steps. In this case, the memory error handling code treats
the old hugetlb page as a normal compound page or collection of base
pages. It will then try to SetPageHWPoison(page) on the page with an
error. If the page with error is a tail page without vmemmap, a write
error will occur when trying to set the flag.
Address this issue by modifying remove_hugetlb_folio() and
update_and_free_hugetlb_folio() such that the hugetlb destructor is not
cleared until after allocating vmemmap. Since clearing the destructor
requires holding the hugetlb lock, the clearing is done in
remove_hugetlb_folio() if the vmemmap is present. This saves a
lock/unlock cycle. Otherwise, destructor is cleared in
update_and_free_hugetlb_folio() after allocating vmemmap.
Note that this will leave hugetlb pages in a state where they are marked
free (by hugetlb specific page flag) and have a ref count. This is not
a normal state. The only code that would notice is the memory error
code, and it is set up to retry in such a case.
A subsequent patch will create a routine to do bulk processing of
vmemmap allocation. This will eliminate a lock/unlock cycle for each
hugetlb page in the case where we are freeing a large number of pages.
Link: https://lkml.kernel.org/r/20230711220942.43706-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20230711220942.43706-2-mike.kravetz@oracle.com
Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reviewed-by: Muchun Song <songmuchun(a)bytedance.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: James Houghton <jthoughton(a)google.com>
Cc: Jiaqi Yan <jiaqiyan(a)google.com>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 64a3239b6407..6da626bfb52e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1579,9 +1579,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio,
unsigned int order) { }
#endif
+static inline void __clear_hugetlb_destructor(struct hstate *h,
+ struct folio *folio)
+{
+ lockdep_assert_held(&hugetlb_lock);
+
+ /*
+ * Very subtle
+ *
+ * For non-gigantic pages set the destructor to the normal compound
+ * page dtor. This is needed in case someone takes an additional
+ * temporary ref to the page, and freeing is delayed until they drop
+ * their reference.
+ *
+ * For gigantic pages set the destructor to the null dtor. This
+ * destructor will never be called. Before freeing the gigantic
+ * page destroy_compound_gigantic_folio will turn the folio into a
+ * simple group of pages. After this the destructor does not
+ * apply.
+ *
+ */
+ if (hstate_is_gigantic(h))
+ folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
+ else
+ folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
+}
+
/*
- * Remove hugetlb folio from lists, and update dtor so that the folio appears
- * as just a compound page.
+ * Remove hugetlb folio from lists.
+ * If vmemmap exists for the folio, update dtor so that the folio appears
+ * as just a compound page. Otherwise, wait until after allocating vmemmap
+ * to update dtor.
*
* A reference is held on the folio, except in the case of demote.
*
@@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
}
/*
- * Very subtle
- *
- * For non-gigantic pages set the destructor to the normal compound
- * page dtor. This is needed in case someone takes an additional
- * temporary ref to the page, and freeing is delayed until they drop
- * their reference.
- *
- * For gigantic pages set the destructor to the null dtor. This
- * destructor will never be called. Before freeing the gigantic
- * page destroy_compound_gigantic_folio will turn the folio into a
- * simple group of pages. After this the destructor does not
- * apply.
- *
- * This handles the case where more than one ref is held when and
- * after update_and_free_hugetlb_folio is called.
- *
- * In the case of demote we do not ref count the page as it will soon
- * be turned into a page of smaller size.
+ * We can only clear the hugetlb destructor after allocating vmemmap
+ * pages. Otherwise, someone (memory error handling) may try to write
+ * to tail struct pages.
+ */
+ if (!folio_test_hugetlb_vmemmap_optimized(folio))
+ __clear_hugetlb_destructor(h, folio);
+
+ /*
+ * In the case of demote we do not ref count the page as it will soon
+ * be turned into a page of smaller size.
*/
if (!demote)
folio_ref_unfreeze(folio, 1);
- if (hstate_is_gigantic(h))
- folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
- else
- folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
h->nr_huge_pages--;
h->nr_huge_pages_node[nid]--;
@@ -1705,6 +1721,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
{
int i;
struct page *subpage;
+ bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio);
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
@@ -1735,6 +1752,16 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
if (unlikely(folio_test_hwpoison(folio)))
folio_clear_hugetlb_hwpoison(folio);
+ /*
+ * If vmemmap pages were allocated above, then we need to clear the
+ * hugetlb destructor under the hugetlb lock.
+ */
+ if (clear_dtor) {
+ spin_lock_irq(&hugetlb_lock);
+ __clear_hugetlb_destructor(h, folio);
+ spin_unlock_irq(&hugetlb_lock);
+ }
+
for (i = 0; i < pages_per_huge_page(h); i++) {
subpage = folio_page(folio, i);
subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
[ Upstream commit 4acfe3dfde685a5a9eaec5555351918e2d7266a1 ]
Dan Carpenter spotted a race condition in a couple of situations like
these in the test_firmware driver:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
u8 val;
int ret;
ret = kstrtou8(buf, 10, &val);
if (ret)
return ret;
mutex_lock(&test_fw_mutex);
*(u8 *)cfg = val;
mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
static ssize_t config_num_requests_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
int rc;
mutex_lock(&test_fw_mutex);
if (test_fw_config->reqs) {
pr_err("Must call release_all_firmware prior to changing config\n");
rc = -EINVAL;
mutex_unlock(&test_fw_mutex);
goto out;
}
mutex_unlock(&test_fw_mutex);
// NOTE: HERE is the race!!! Function can be preempted!
// test_fw_config->reqs can change between the release of
// the lock about and acquire of the lock in the
// test_dev_config_update_u8()
rc = test_dev_config_update_u8(buf, count,
&test_fw_config->num_requests);
out:
return rc;
}
static ssize_t config_read_fw_idx_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
return test_dev_config_update_u8(buf, count,
&test_fw_config->read_fw_idx);
}
The function test_dev_config_update_u8() is called from both the locked
and the unlocked context, function config_num_requests_store() and
config_read_fw_idx_store() which can both be called asynchronously as
they are driver's methods, while test_dev_config_update_u8() and siblings
change their argument pointed to by u8 *cfg or similar pointer.
To avoid deadlock on test_fw_mutex, the lock is dropped before calling
test_dev_config_update_u8() and re-acquired within test_dev_config_update_u8()
itself, but alas this creates a race condition.
Having two locks wouldn't assure a race-proof mutual exclusion.
This situation is best avoided by the introduction of a new, unlocked
function __test_dev_config_update_u8() which can be called from the locked
context and reducing test_dev_config_update_u8() to:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
int ret;
mutex_lock(&test_fw_mutex);
ret = __test_dev_config_update_u8(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
}
doing the locking and calling the unlocked primitive, which enables both
locked and unlocked versions without duplication of code.
Fixes: c92316bf8e948 ("test_firmware: add batched firmware tests")
Cc: Luis R. Rodriguez <mcgrof(a)kernel.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Russ Weight <russell.h.weight(a)intel.com>
Cc: Takashi Iwai <tiwai(a)suse.de>
Cc: Tianfei Zhang <tianfei.zhang(a)intel.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Colin Ian King <colin.i.king(a)gmail.com>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
Cc: linux-kselftest(a)vger.kernel.org
Cc: stable(a)vger.kernel.org # v5.4, 4.19, 4.14
Suggested-by: Dan Carpenter <error27(a)gmail.com>
Link: https://lore.kernel.org/r/20230509084746.48259-1-mirsad.todorovac@alu.unizg…
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ This is the patch to fix the racing condition in locking for the 5.4, ]
[ 4.19 and 4.14 stable branches. Not all the fixes from the upstream ]
[ commit apply, but those which do are verbatim equal to those in the ]
[ upstream commit. ]
---
v4:
verbatim the same patch as for the 5.4 stable tree which patchwork didn't apply
lib/test_firmware.c | 37 ++++++++++++++++++++++++++++---------
1 file changed, 28 insertions(+), 9 deletions(-)
diff --git a/lib/test_firmware.c b/lib/test_firmware.c
index 34210306ea66..d407e5e670f3 100644
--- a/lib/test_firmware.c
+++ b/lib/test_firmware.c
@@ -283,16 +283,26 @@ static ssize_t config_test_show_str(char *dst,
return len;
}
-static int test_dev_config_update_bool(const char *buf, size_t size,
- bool *cfg)
+static inline int __test_dev_config_update_bool(const char *buf, size_t size,
+ bool *cfg)
{
int ret;
- mutex_lock(&test_fw_mutex);
if (strtobool(buf, cfg) < 0)
ret = -EINVAL;
else
ret = size;
+
+ return ret;
+}
+
+static int test_dev_config_update_bool(const char *buf, size_t size,
+ bool *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_bool(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
@@ -322,7 +332,7 @@ static ssize_t test_dev_config_show_int(char *buf, int cfg)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+static inline int __test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
int ret;
long new;
@@ -334,14 +344,23 @@ static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
if (new > U8_MAX)
return -EINVAL;
- mutex_lock(&test_fw_mutex);
*(u8 *)cfg = new;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
+static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_u8(buf, size, cfg);
+ mutex_unlock(&test_fw_mutex);
+
+ return ret;
+}
+
static ssize_t test_dev_config_show_u8(char *buf, u8 cfg)
{
u8 val;
@@ -374,10 +393,10 @@ static ssize_t config_num_requests_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_u8(buf, count,
- &test_fw_config->num_requests);
+ rc = __test_dev_config_update_u8(buf, count,
+ &test_fw_config->num_requests);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
--
2.34.1
-----------------
Note, PLEASE TEST this kernel if you are on the 5.4.y tree before using
it in a real workload. This was a quick release due to the obvious
security fixes in it, and as such, it has not had very much testing "in
the wild". Please let us know of any problems seen. Also note that the
user/kernel api for the new security mitigations might be changing over
time, so do not get used to them being fixed in stone just yet.
-----------------
I'm announcing the release of the 5.4.252 kernel.
All users of the 5.4 kernel series must upgrade.
The updated 5.4.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.4.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Documentation/ABI/testing/sysfs-devices-system-cpu | 11
Documentation/admin-guide/hw-vuln/gather_data_sampling.rst | 109 ++++++
Documentation/admin-guide/hw-vuln/index.rst | 1
Documentation/admin-guide/kernel-parameters.txt | 39 +-
Makefile | 2
arch/Kconfig | 3
arch/alpha/include/asm/bugs.h | 20 -
arch/arm/Kconfig | 1
arch/arm/include/asm/bugs.h | 4
arch/arm/kernel/bugs.c | 3
arch/ia64/Kconfig | 1
arch/ia64/include/asm/bugs.h | 20 -
arch/ia64/kernel/setup.c | 3
arch/m68k/Kconfig | 1
arch/m68k/include/asm/bugs.h | 21 -
arch/m68k/kernel/setup_mm.c | 3
arch/mips/Kconfig | 1
arch/mips/include/asm/bugs.h | 17 -
arch/mips/kernel/setup.c | 13
arch/parisc/include/asm/bugs.h | 20 -
arch/powerpc/include/asm/bugs.h | 15
arch/sh/Kconfig | 1
arch/sh/include/asm/bugs.h | 78 ----
arch/sh/include/asm/processor.h | 2
arch/sh/kernel/idle.c | 1
arch/sh/kernel/setup.c | 55 +++
arch/sparc/Kconfig | 1
arch/sparc/include/asm/bugs.h | 18 -
arch/sparc/kernel/setup_32.c | 7
arch/um/Kconfig | 1
arch/um/include/asm/bugs.h | 7
arch/um/kernel/um_arch.c | 3
arch/x86/Kconfig | 20 +
arch/x86/include/asm/bugs.h | 2
arch/x86/include/asm/cpufeature.h | 10
arch/x86/include/asm/cpufeatures.h | 18 -
arch/x86/include/asm/disabled-features.h | 4
arch/x86/include/asm/fpu/internal.h | 2
arch/x86/include/asm/mem_encrypt.h | 2
arch/x86/include/asm/msr-index.h | 12
arch/x86/include/asm/required-features.h | 4
arch/x86/kernel/cpu/amd.c | 3
arch/x86/kernel/cpu/bugs.c | 209 +++++++++----
arch/x86/kernel/cpu/common.c | 122 ++++++-
arch/x86/kernel/cpu/cpu.h | 2
arch/x86/kernel/cpu/scattered.c | 3
arch/x86/kernel/fpu/init.c | 8
arch/x86/kernel/smpboot.c | 1
arch/x86/kvm/cpuid.h | 1
arch/x86/kvm/x86.c | 5
arch/x86/mm/init.c | 7
arch/x86/xen/smp_pv.c | 2
arch/xtensa/include/asm/bugs.h | 18 -
drivers/base/cpu.c | 8
drivers/net/xen-netback/netback.c | 15
include/asm-generic/bugs.h | 11
include/linux/cpu.h | 6
include/linux/sched/task.h | 2
init/main.c | 21 -
kernel/fork.c | 37 +-
tools/arch/x86/include/asm/cpufeatures.h | 18 -
tools/arch/x86/include/asm/disabled-features.h | 3
tools/arch/x86/include/asm/required-features.h | 3
63 files changed, 659 insertions(+), 402 deletions(-)
Arnaldo Carvalho de Melo (1):
tools headers cpufeatures: Sync with the kernel sources
Borislav Petkov (AMD) (1):
x86/bugs: Increase the x86 bugs vector size to two u32s
Daniel Sneddon (4):
x86/speculation: Add Gather Data Sampling mitigation
x86/speculation: Add force option to GDS mitigation
x86/speculation: Add Kconfig option for GDS
KVM: Add GDS_NO support to KVM
Dave Hansen (1):
Documentation/x86: Fix backwards on/off logic about YMM support
Greg Kroah-Hartman (2):
x86: fix backwards merge of GDS/SRSO bit
Linux 5.4.252
Juergen Gross (2):
x86/xen: Fix secondary processors' FPU initialization
x86/mm: fix poking_init() for Xen PV guests
Kim Phillips (1):
x86/cpu, kvm: Add support for CPUID_80000021_EAX
Peter Zijlstra (3):
x86/mm: Use mm_alloc() in poking_init()
mm: Move mm_cachep initialization to mm_init()
x86/mm: Initialize text poking earlier
Ross Lagerwall (1):
xen/netback: Fix buffer overrun triggered by unusual packet
Sean Christopherson (1):
x86/cpufeatures: Assign dedicated feature word for CPUID_0x8000001F[EAX]
Thomas Gleixner (15):
init: Provide arch_cpu_finalize_init()
x86/cpu: Switch to arch_cpu_finalize_init()
ARM: cpu: Switch to arch_cpu_finalize_init()
ia64/cpu: Switch to arch_cpu_finalize_init()
m68k/cpu: Switch to arch_cpu_finalize_init()
mips/cpu: Switch to arch_cpu_finalize_init()
sh/cpu: Switch to arch_cpu_finalize_init()
sparc/cpu: Switch to arch_cpu_finalize_init()
um/cpu: Switch to arch_cpu_finalize_init()
init: Remove check_bugs() leftovers
init: Invoke arch_cpu_finalize_init() earlier
init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init()
x86/fpu: Remove cpuinfo argument from init functions
x86/fpu: Mark init functions __init
x86/fpu: Move FPU initialization into arch_cpu_finalize_init()
Tom Lendacky (2):
x86/cpufeatures: Add SEV-ES CPU feature
x86/cpu: Add VM page flush MSR availablility as a CPUID feature
From: Remi Pommarel <repk(a)triplefau.lt>
When a client roamed back to a node before it got time to destroy the
pending local entry (i.e. within the same originator interval) the old
global one is directly removed from hash table and left as such.
But because this entry had an extra reference taken at lookup (i.e using
batadv_tt_global_hash_find) there is no way its memory will be reclaimed
at any time causing the following memory leak:
unreferenced object 0xffff0000073c8000 (size 18560):
comm "softirq", pid 0, jiffies 4294907738 (age 228.644s)
hex dump (first 32 bytes):
06 31 ac 12 c7 7a 05 00 01 00 00 00 00 00 00 00 .1...z..........
2c ad be 08 00 80 ff ff 6c b6 be 08 00 80 ff ff ,.......l.......
backtrace:
[<00000000ee6e0ffa>] kmem_cache_alloc+0x1b4/0x300
[<000000000ff2fdbc>] batadv_tt_global_add+0x700/0xe20
[<00000000443897c7>] _batadv_tt_update_changes+0x21c/0x790
[<000000005dd90463>] batadv_tt_update_changes+0x3c/0x110
[<00000000a2d7fc57>] batadv_tt_tvlv_unicast_handler_v1+0xafc/0xe10
[<0000000011793f2a>] batadv_tvlv_containers_process+0x168/0x2b0
[<00000000b7cbe2ef>] batadv_recv_unicast_tvlv+0xec/0x1f4
[<0000000042aef1d8>] batadv_batman_skb_recv+0x25c/0x3a0
[<00000000bbd8b0a2>] __netif_receive_skb_core.isra.0+0x7a8/0xe90
[<000000004033d428>] __netif_receive_skb_one_core+0x64/0x74
[<000000000f39a009>] __netif_receive_skb+0x48/0xe0
[<00000000f2cd8888>] process_backlog+0x174/0x344
[<00000000507d6564>] __napi_poll+0x58/0x1f4
[<00000000b64ef9eb>] net_rx_action+0x504/0x590
[<00000000056fa5e4>] _stext+0x1b8/0x418
[<00000000878879d6>] run_ksoftirqd+0x74/0xa4
unreferenced object 0xffff00000bae1a80 (size 56):
comm "softirq", pid 0, jiffies 4294910888 (age 216.092s)
hex dump (first 32 bytes):
00 78 b1 0b 00 00 ff ff 0d 50 00 00 00 00 00 00 .x.......P......
00 00 00 00 00 00 00 00 50 c8 3c 07 00 00 ff ff ........P.<.....
backtrace:
[<00000000ee6e0ffa>] kmem_cache_alloc+0x1b4/0x300
[<00000000d9aaa49e>] batadv_tt_global_add+0x53c/0xe20
[<00000000443897c7>] _batadv_tt_update_changes+0x21c/0x790
[<000000005dd90463>] batadv_tt_update_changes+0x3c/0x110
[<00000000a2d7fc57>] batadv_tt_tvlv_unicast_handler_v1+0xafc/0xe10
[<0000000011793f2a>] batadv_tvlv_containers_process+0x168/0x2b0
[<00000000b7cbe2ef>] batadv_recv_unicast_tvlv+0xec/0x1f4
[<0000000042aef1d8>] batadv_batman_skb_recv+0x25c/0x3a0
[<00000000bbd8b0a2>] __netif_receive_skb_core.isra.0+0x7a8/0xe90
[<000000004033d428>] __netif_receive_skb_one_core+0x64/0x74
[<000000000f39a009>] __netif_receive_skb+0x48/0xe0
[<00000000f2cd8888>] process_backlog+0x174/0x344
[<00000000507d6564>] __napi_poll+0x58/0x1f4
[<00000000b64ef9eb>] net_rx_action+0x504/0x590
[<00000000056fa5e4>] _stext+0x1b8/0x418
[<00000000878879d6>] run_ksoftirqd+0x74/0xa4
Releasing the extra reference from batadv_tt_global_hash_find even at
roam back when batadv_tt_global_free is called fixes this memory leak.
Cc: stable(a)vger.kernel.org
Fixes: 068ee6e204e1 ("batman-adv: roaming handling mechanism redesign")
Signed-off-by: Remi Pommarel <repk(a)triplefau.lt>
Signed-off-by; Sven Eckelmann <sven(a)narfation.org>
Signed-off-by: Simon Wunderlich <sw(a)simonwunderlich.de>
---
net/batman-adv/translation-table.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c
index 36ca31252a73..b95c36765d04 100644
--- a/net/batman-adv/translation-table.c
+++ b/net/batman-adv/translation-table.c
@@ -774,7 +774,6 @@ bool batadv_tt_local_add(struct net_device *soft_iface, const u8 *addr,
if (roamed_back) {
batadv_tt_global_free(bat_priv, tt_global,
"Roaming canceled");
- tt_global = NULL;
} else {
/* The global entry has to be marked as ROAMING and
* has to be kept for consistency purpose
--
2.39.2
From: Sven Eckelmann <sven(a)narfation.org>
If the user set an MTU value, it usually means that there are special
requirements for the MTU. But if an interface gots activated, the MTU was
always recalculated and then the user set value was overwritten.
The only reason why this user set value has to be overwritten, is when the
MTU has to be decreased because batman-adv is not able to transfer packets
with the user specified size.
Fixes: c6c8fea29769 ("net: Add batman-adv meshing protocol")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sven Eckelmann <sven(a)narfation.org>
Signed-off-by: Simon Wunderlich <sw(a)simonwunderlich.de>
---
net/batman-adv/hard-interface.c | 14 +++++++++++++-
net/batman-adv/soft-interface.c | 3 +++
net/batman-adv/types.h | 6 ++++++
3 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/net/batman-adv/hard-interface.c b/net/batman-adv/hard-interface.c
index ae5762af0146..24c9c0c3f316 100644
--- a/net/batman-adv/hard-interface.c
+++ b/net/batman-adv/hard-interface.c
@@ -630,7 +630,19 @@ int batadv_hardif_min_mtu(struct net_device *soft_iface)
*/
void batadv_update_min_mtu(struct net_device *soft_iface)
{
- dev_set_mtu(soft_iface, batadv_hardif_min_mtu(soft_iface));
+ struct batadv_priv *bat_priv = netdev_priv(soft_iface);
+ int limit_mtu;
+ int mtu;
+
+ mtu = batadv_hardif_min_mtu(soft_iface);
+
+ if (bat_priv->mtu_set_by_user)
+ limit_mtu = bat_priv->mtu_set_by_user;
+ else
+ limit_mtu = ETH_DATA_LEN;
+
+ mtu = min(mtu, limit_mtu);
+ dev_set_mtu(soft_iface, mtu);
/* Check if the local translate table should be cleaned up to match a
* new (and smaller) MTU.
diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c
index d3fdf82282af..85d00dc9ce32 100644
--- a/net/batman-adv/soft-interface.c
+++ b/net/batman-adv/soft-interface.c
@@ -153,11 +153,14 @@ static int batadv_interface_set_mac_addr(struct net_device *dev, void *p)
static int batadv_interface_change_mtu(struct net_device *dev, int new_mtu)
{
+ struct batadv_priv *bat_priv = netdev_priv(dev);
+
/* check ranges */
if (new_mtu < 68 || new_mtu > batadv_hardif_min_mtu(dev))
return -EINVAL;
dev->mtu = new_mtu;
+ bat_priv->mtu_set_by_user = new_mtu;
return 0;
}
diff --git a/net/batman-adv/types.h b/net/batman-adv/types.h
index ca9449ec9836..cf1a0eafe3ab 100644
--- a/net/batman-adv/types.h
+++ b/net/batman-adv/types.h
@@ -1546,6 +1546,12 @@ struct batadv_priv {
/** @soft_iface: net device which holds this struct as private data */
struct net_device *soft_iface;
+ /**
+ * @mtu_set_by_user: MTU was set once by user
+ * protected by rtnl_lock
+ */
+ int mtu_set_by_user;
+
/**
* @bat_counters: mesh internal traffic statistic counters (see
* batadv_counters)
--
2.39.2
lwt xmit hook does not expect positive return values in function
ip_finish_output2 and ip6_finish_output. However, BPF programs can
directly return positive statuses such like NET_XMIT_DROP, NET_RX_DROP,
and etc to the caller. Such return values would make the kernel continue
processing already freed skbs and eventually panic.
This set fixes the return values from BPF ops to unexpected continue
processing, and checks strictly on the correct continue condition for
future proof. In addition, add missing selftests for BPF_REDIRECT
and BPF_REROUTE cases for BPF-CI.
v4: https://lore.kernel.org/bpf/ZMD1sFTW8SFiex+x@debian.debian/T/
v3: https://lore.kernel.org/bpf/cover.1690255889.git.yan@cloudflare.com/
v2: https://lore.kernel.org/netdev/ZLdY6JkWRccunvu0@debian.debian/
v1: https://lore.kernel.org/bpf/ZLbYdpWC8zt9EJtq@debian.debian/
changes since v4:
* fixed same error on BPF_REROUTE path
* re-implemented selftests under BPF-CI requirement
changes since v3:
* minor change in commit message and changelogs
* tested by Jakub Sitnicki
changes since v2:
* subject name changed
* also covered redirect to ingress case
* added selftests
changes since v1:
* minor code style changes
Yan Zhai (4):
lwt: fix return values of BPF ops
lwt: check LWTUNNEL_XMIT_CONTINUE strictly
selftests/bpf: add lwt_xmit tests for BPF_REDIRECT
selftests/bpf: add lwt_xmit tests for BPF_REROUTE
include/net/lwtunnel.h | 5 +-
net/core/lwt_bpf.c | 7 +-
net/ipv4/ip_output.c | 2 +-
net/ipv6/ip6_output.c | 2 +-
.../selftests/bpf/prog_tests/lwt_helpers.h | 139 ++++++++
.../selftests/bpf/prog_tests/lwt_redirect.c | 319 ++++++++++++++++++
.../selftests/bpf/prog_tests/lwt_reroute.c | 256 ++++++++++++++
.../selftests/bpf/progs/test_lwt_redirect.c | 58 ++++
.../selftests/bpf/progs/test_lwt_reroute.c | 36 ++
9 files changed, 817 insertions(+), 7 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/lwt_helpers.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/lwt_redirect.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/lwt_reroute.c
create mode 100644 tools/testing/selftests/bpf/progs/test_lwt_redirect.c
create mode 100644 tools/testing/selftests/bpf/progs/test_lwt_reroute.c
--
2.30.2
A recent change in clang allows it to consider more expressions as
compile time constants, which causes it to point out an implicit
conversion in the scanf tests:
lib/test_scanf.c:661:2: warning: implicit conversion from 'int' to 'unsigned char' changes value from -168 to 88 [-Wconstant-conversion]
661 | test_number_prefix(unsigned char, "0xA7", "%2hhx%hhx", 0, 0xa7, 2, check_uchar);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lib/test_scanf.c:609:29: note: expanded from macro 'test_number_prefix'
609 | T result[2] = {~expect[0], ~expect[1]}; \
| ~ ^~~~~~~~~~
1 warning generated.
The result of the bitwise negation is the type of the operand after
going through the integer promotion rules, so this truncation is
expected but harmless, as the initial values in the result array get
overwritten by _test() anyways. Add an explicit cast to the expected
type in test_number_prefix() to silence the warning. There is no
functional change, as all the tests still pass with GCC 13.1.0 and clang
18.0.0.
Cc: stable(a)vger.kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/1899
Link: https://github.com/llvm/llvm-project/commit/610ec954e1f81c0e8fcadedcd25afe6…
Suggested-by: Nick Desaulniers <ndesaulniers(a)google.com>
Signed-off-by: Nathan Chancellor <nathan(a)kernel.org>
---
Changes in v2:
- Add spaces in initializer to match result (Andy)
- Add 'Cc: stable', as builds with CONFIG_WERROR will be broken, such as
allmodconfig
- Link to v1: https://lore.kernel.org/r/20230803-test_scanf-wconstant-conversion-v1-1-74d…
---
lib/test_scanf.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/test_scanf.c b/lib/test_scanf.c
index b620cf7de503..a2707af2951a 100644
--- a/lib/test_scanf.c
+++ b/lib/test_scanf.c
@@ -606,7 +606,7 @@ static void __init numbers_slice(void)
#define test_number_prefix(T, str, scan_fmt, expect0, expect1, n_args, fn) \
do { \
const T expect[2] = { expect0, expect1 }; \
- T result[2] = {~expect[0], ~expect[1]}; \
+ T result[2] = { (T)~expect[0], (T)~expect[1] }; \
\
_test(fn, &expect, str, scan_fmt, n_args, &result[0], &result[1]); \
} while (0)
---
base-commit: 5d0c230f1de8c7515b6567d9afba1f196fb4e2f4
change-id: 20230803-test_scanf-wconstant-conversion-d97efbf3bb5d
Best regards,
--
Nathan Chancellor <nathan(a)kernel.org>
This is the start of the stable review cycle for the 5.4.254 release.
There are 39 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Tue, 15 Aug 2023 21:16:53 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.254-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.4.254-rc1
Eric Dumazet <edumazet(a)google.com>
sch_netem: fix issues in netem_change() vs get_dist_table()
Masahiro Yamada <masahiroy(a)kernel.org>
alpha: remove __init annotation from exported page_is_ram()
Zhu Wang <wangzhu9(a)huawei.com>
scsi: core: Fix possible memory leak if device_add() fails
Zhu Wang <wangzhu9(a)huawei.com>
scsi: snic: Fix possible memory leak if device_add() fails
Alexandra Diupina <adiupina(a)astralinux.ru>
scsi: 53c700: Check that command slot is not NULL
Michael Kelley <mikelley(a)microsoft.com>
scsi: storvsc: Fix handling of virtual Fibre Channel timeouts
Tony Battersby <tonyb(a)cybernetics.com>
scsi: core: Fix legacy /proc parsing buffer overflow
Pablo Neira Ayuso <pablo(a)netfilter.org>
netfilter: nf_tables: report use refcount overflow
Ming Lei <ming.lei(a)redhat.com>
nvme-rdma: fix potential unbalanced freeze & unfreeze
Ming Lei <ming.lei(a)redhat.com>
nvme-tcp: fix potential unbalanced freeze & unfreeze
Josef Bacik <josef(a)toxicpanda.com>
btrfs: set cache_block_group_error if we find an error
Christoph Hellwig <hch(a)lst.de>
btrfs: don't stop integrity writeback too early
Nick Child <nnac123(a)linux.ibm.com>
ibmvnic: Handle DMA unmapping of login buffs in release functions
Daniel Jurgens <danielj(a)nvidia.com>
net/mlx5: Allow 0 for total host VFs
Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
dmaengine: mcf-edma: Fix a potential un-allocated memory access
Felix Fietkau <nbd(a)nbd.name>
wifi: cfg80211: fix sband iftype data lookup for AP_VLAN
Douglas Miller <doug.miller(a)cornelisnetworks.com>
IB/hfi1: Fix possible panic during hotplug remove
Andrew Kanner <andrew.kanner(a)gmail.com>
drivers: net: prevent tun_build_skb() to exceed the packet size limit
Eric Dumazet <edumazet(a)google.com>
dccp: fix data-race around dp->dccps_mss_cache
Ziyang Xuan <william.xuanziyang(a)huawei.com>
bonding: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves
Eric Dumazet <edumazet(a)google.com>
net/packet: annotate data-races around tp->status
Nathan Chancellor <nathan(a)kernel.org>
mISDN: Update parameter type of dsp_cmx_send()
Mark Brown <broonie(a)kernel.org>
selftests/rseq: Fix build with undefined __weak
Karol Herbst <kherbst(a)redhat.com>
drm/nouveau/disp: Revert a NULL check inside nouveau_connector_get_modes
Arnd Bergmann <arnd(a)arndb.de>
x86: Move gds_ucode_mitigated() declaration to header
Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
x86/mm: Fix VDSO and VVAR placement on 5-level paging machines
Cristian Ciocaltea <cristian.ciocaltea(a)collabora.com>
x86/cpu/amd: Enable Zenbleed fix for AMD Custom APU 0405
Prashanth K <quic_prashk(a)quicinc.com>
usb: common: usb-conn-gpio: Prevent bailing out if initial role is none
Elson Roy Serrao <quic_eserrao(a)quicinc.com>
usb: dwc3: Properly handle processing of pending events
Alan Stern <stern(a)rowland.harvard.edu>
usb-storage: alauda: Fix uninit-value in alauda_check_media()
Qi Zheng <zhengqi.arch(a)bytedance.com>
binder: fix memory leak in binder_init()
Yiyuan Guo <yguoaz(a)gmail.com>
iio: cros_ec: Fix the allocation size for cros_ec_command
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix use-after-free of nilfs_root in dirtying inodes via iput
Thomas Gleixner <tglx(a)linutronix.de>
x86/pkeys: Revert a5eff7259790 ("x86/pkeys: Add PKRU value to init_fpstate")
Colin Ian King <colin.i.king(a)gmail.com>
radix tree test suite: fix incorrect allocation size for pthreads
Karol Herbst <kherbst(a)redhat.com>
drm/nouveau/gr: enable memory loads on helper invocation on all channels
Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
dmaengine: pl330: Return DMA_PAUSED when transaction is paused
Maciej Żenczykowski <maze(a)google.com>
ipv6: adjust ndisc_is_useropt() to also return true for PIO
Sergei Antonov <saproj(a)gmail.com>
mmc: moxart: read scr register without changing byte order
-------------
Diffstat:
Makefile | 4 +-
arch/alpha/kernel/setup.c | 3 +-
arch/x86/entry/vdso/vma.c | 4 +-
arch/x86/include/asm/processor.h | 2 +
arch/x86/kernel/cpu/amd.c | 1 +
arch/x86/kernel/cpu/common.c | 5 -
arch/x86/kvm/x86.c | 2 -
arch/x86/mm/pkeys.c | 6 -
drivers/android/binder.c | 1 +
drivers/android/binder_alloc.c | 6 +
drivers/android/binder_alloc.h | 1 +
drivers/dma/mcf-edma.c | 13 +-
drivers/dma/pl330.c | 18 ++-
drivers/gpu/drm/nouveau/nouveau_connector.c | 2 +-
drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgf100.h | 1 +
drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgk104.c | 4 +-
drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgk110.c | 10 ++
drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgk110b.c | 1 +
drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgk208.c | 1 +
drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgm107.c | 1 +
.../common/cros_ec_sensors/cros_ec_sensors_core.c | 2 +-
drivers/infiniband/hw/hfi1/chip.c | 1 +
drivers/isdn/mISDN/dsp.h | 2 +-
drivers/isdn/mISDN/dsp_cmx.c | 2 +-
drivers/isdn/mISDN/dsp_core.c | 2 +-
drivers/mmc/host/moxart-mmc.c | 8 +-
drivers/net/bonding/bond_main.c | 4 +-
drivers/net/ethernet/ibm/ibmvnic.c | 15 +-
drivers/net/ethernet/mellanox/mlx5/core/sriov.c | 3 +-
drivers/net/tun.c | 2 +-
drivers/nvme/host/rdma.c | 3 +-
drivers/nvme/host/tcp.c | 3 +-
drivers/scsi/53c700.c | 2 +-
drivers/scsi/raid_class.c | 1 +
drivers/scsi/scsi_proc.c | 30 ++--
drivers/scsi/snic/snic_disc.c | 1 +
drivers/scsi/storvsc_drv.c | 4 -
drivers/usb/common/usb-conn-gpio.c | 6 +-
drivers/usb/dwc3/gadget.c | 9 +-
drivers/usb/storage/alauda.c | 9 +-
fs/btrfs/extent-tree.c | 5 +-
fs/btrfs/extent_io.c | 7 +-
fs/nilfs2/inode.c | 8 +
fs/nilfs2/segment.c | 2 +
fs/nilfs2/the_nilfs.h | 2 +
include/net/cfg80211.h | 3 +
include/net/netfilter/nf_tables.h | 31 +++-
net/dccp/output.c | 2 +-
net/dccp/proto.c | 10 +-
net/ipv6/ndisc.c | 3 +-
net/netfilter/nf_tables_api.c | 166 +++++++++++++--------
net/netfilter/nft_flow_offload.c | 6 +-
net/netfilter/nft_objref.c | 8 +-
net/packet/af_packet.c | 16 +-
net/sched/sch_netem.c | 59 ++++----
tools/testing/radix-tree/regression1.c | 2 +-
tools/testing/selftests/rseq/Makefile | 4 +-
tools/testing/selftests/rseq/rseq.c | 2 +
58 files changed, 337 insertions(+), 194 deletions(-)
This is the start of the stable review cycle for the 4.19.292 release.
There are 33 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Tue, 15 Aug 2023 21:16:53 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.19.292-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.19.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.19.292-rc1
Eric Dumazet <edumazet(a)google.com>
sch_netem: fix issues in netem_change() vs get_dist_table()
Masahiro Yamada <masahiroy(a)kernel.org>
alpha: remove __init annotation from exported page_is_ram()
Zhu Wang <wangzhu9(a)huawei.com>
scsi: core: Fix possible memory leak if device_add() fails
Zhu Wang <wangzhu9(a)huawei.com>
scsi: snic: Fix possible memory leak if device_add() fails
Alexandra Diupina <adiupina(a)astralinux.ru>
scsi: 53c700: Check that command slot is not NULL
Michael Kelley <mikelley(a)microsoft.com>
scsi: storvsc: Fix handling of virtual Fibre Channel timeouts
Tony Battersby <tonyb(a)cybernetics.com>
scsi: core: Fix legacy /proc parsing buffer overflow
Pablo Neira Ayuso <pablo(a)netfilter.org>
netfilter: nf_tables: report use refcount overflow
Pablo Neira Ayuso <pablo(a)netfilter.org>
netfilter: nf_tables: bogus EBUSY when deleting flowtable after flush
Christoph Hellwig <hch(a)lst.de>
btrfs: don't stop integrity writeback too early
Nick Child <nnac123(a)linux.ibm.com>
ibmvnic: Handle DMA unmapping of login buffs in release functions
Felix Fietkau <nbd(a)nbd.name>
wifi: cfg80211: fix sband iftype data lookup for AP_VLAN
Douglas Miller <doug.miller(a)cornelisnetworks.com>
IB/hfi1: Fix possible panic during hotplug remove
Andrew Kanner <andrew.kanner(a)gmail.com>
drivers: net: prevent tun_build_skb() to exceed the packet size limit
Eric Dumazet <edumazet(a)google.com>
dccp: fix data-race around dp->dccps_mss_cache
Ziyang Xuan <william.xuanziyang(a)huawei.com>
bonding: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves
Eric Dumazet <edumazet(a)google.com>
net/packet: annotate data-races around tp->status
Nathan Chancellor <nathan(a)kernel.org>
mISDN: Update parameter type of dsp_cmx_send()
Karol Herbst <kherbst(a)redhat.com>
drm/nouveau/disp: Revert a NULL check inside nouveau_connector_get_modes
Arnd Bergmann <arnd(a)arndb.de>
x86: Move gds_ucode_mitigated() declaration to header
Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
x86/mm: Fix VDSO and VVAR placement on 5-level paging machines
Cristian Ciocaltea <cristian.ciocaltea(a)collabora.com>
x86/cpu/amd: Enable Zenbleed fix for AMD Custom APU 0405
Elson Roy Serrao <quic_eserrao(a)quicinc.com>
usb: dwc3: Properly handle processing of pending events
Alan Stern <stern(a)rowland.harvard.edu>
usb-storage: alauda: Fix uninit-value in alauda_check_media()
Qi Zheng <zhengqi.arch(a)bytedance.com>
binder: fix memory leak in binder_init()
Yiyuan Guo <yguoaz(a)gmail.com>
iio: cros_ec: Fix the allocation size for cros_ec_command
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix use-after-free of nilfs_root in dirtying inodes via iput
Colin Ian King <colin.i.king(a)gmail.com>
radix tree test suite: fix incorrect allocation size for pthreads
Karol Herbst <kherbst(a)redhat.com>
drm/nouveau/gr: enable memory loads on helper invocation on all channels
Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
dmaengine: pl330: Return DMA_PAUSED when transaction is paused
Maciej Żenczykowski <maze(a)google.com>
ipv6: adjust ndisc_is_useropt() to also return true for PIO
Sergei Antonov <saproj(a)gmail.com>
mmc: moxart: read scr register without changing byte order
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
sparc: fix up arch_cpu_finalize_init() build breakage.
-------------
Diffstat:
Makefile | 4 +-
arch/alpha/kernel/setup.c | 3 +-
arch/sparc/Kconfig | 2 +-
arch/x86/entry/vdso/vma.c | 4 +-
arch/x86/include/asm/processor.h | 2 +
arch/x86/kernel/cpu/amd.c | 1 +
arch/x86/kvm/x86.c | 2 -
drivers/android/binder.c | 1 +
drivers/android/binder_alloc.c | 6 +
drivers/android/binder_alloc.h | 1 +
drivers/dma/pl330.c | 18 ++-
drivers/gpu/drm/nouveau/nouveau_connector.c | 2 +-
drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgf100.h | 1 +
drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgk104.c | 4 +-
drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgk110.c | 10 ++
drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgk110b.c | 1 +
drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgk208.c | 1 +
drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgm107.c | 1 +
.../common/cros_ec_sensors/cros_ec_sensors_core.c | 2 +-
drivers/infiniband/hw/hfi1/chip.c | 1 +
drivers/isdn/mISDN/dsp.h | 2 +-
drivers/isdn/mISDN/dsp_cmx.c | 2 +-
drivers/isdn/mISDN/dsp_core.c | 2 +-
drivers/mmc/host/moxart-mmc.c | 8 +-
drivers/net/bonding/bond_main.c | 4 +-
drivers/net/ethernet/ibm/ibmvnic.c | 15 +-
drivers/net/tun.c | 2 +-
drivers/scsi/53c700.c | 2 +-
drivers/scsi/raid_class.c | 1 +
drivers/scsi/scsi_proc.c | 30 ++--
drivers/scsi/snic/snic_disc.c | 1 +
drivers/scsi/storvsc_drv.c | 4 -
drivers/usb/dwc3/gadget.c | 9 +-
drivers/usb/storage/alauda.c | 9 +-
fs/btrfs/extent_io.c | 7 +-
fs/nilfs2/inode.c | 8 +
fs/nilfs2/segment.c | 2 +
fs/nilfs2/the_nilfs.h | 2 +
include/net/cfg80211.h | 3 +
include/net/netfilter/nf_tables.h | 35 +++-
net/dccp/output.c | 2 +-
net/dccp/proto.c | 10 +-
net/ipv6/ndisc.c | 3 +-
net/netfilter/nf_tables_api.c | 180 ++++++++++++++-------
net/netfilter/nft_flow_offload.c | 23 ++-
net/netfilter/nft_objref.c | 8 +-
net/packet/af_packet.c | 16 +-
net/sched/sch_netem.c | 59 +++----
tools/testing/radix-tree/regression1.c | 2 +-
49 files changed, 349 insertions(+), 169 deletions(-)
Some usb hubs will negotiate DisplayPort Alt mode with the device
but will then negotiate a data role swap after entering the alt
mode. The data role swap causes the device to unregister all alt
modes, however the usb hub will still send Attention messages
even after failing to reregister the Alt Mode. type_altmode_attention
currently does not verify whether or not a device's altmode partner
exists, which results in a NULL pointer error when dereferencing
the typec_altmode and typec_altmode_ops belonging to the altmode
partner.
Verify the presence of a device's altmode partner before sending
the Attention message to the Alt Mode driver.
Fixes: 8a37d87d72f0 ("usb: typec: Bus type for alternate modes")
Cc: stable(a)vger.kernel.org
Signed-off-by: RD Babiera <rdbabiera(a)google.com>
---
Changes since v1:
* Only assigns pdev if altmode partner exists in typec_altmode_attention
* Removed error return in typec_altmode_attention if Alt Mode does
not implement Attention messages.
* Changed tcpm_log message to indicate that altmode partner does not exist,
as it only logs in that case.
---
Changes since v2:
* Changed tcpm_log message to accurately reflect error
* Revised commit message
---
Changes since v3:
* Fixed nits
---
drivers/usb/typec/bus.c | 12 ++++++++++--
drivers/usb/typec/tcpm/tcpm.c | 3 ++-
include/linux/usb/typec_altmode.h | 2 +-
3 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/drivers/usb/typec/bus.c b/drivers/usb/typec/bus.c
index fe5b9a2e61f5..e95ec7e382bb 100644
--- a/drivers/usb/typec/bus.c
+++ b/drivers/usb/typec/bus.c
@@ -183,12 +183,20 @@ EXPORT_SYMBOL_GPL(typec_altmode_exit);
*
* Notifies the partner of @adev about Attention command.
*/
-void typec_altmode_attention(struct typec_altmode *adev, u32 vdo)
+int typec_altmode_attention(struct typec_altmode *adev, u32 vdo)
{
- struct typec_altmode *pdev = &to_altmode(adev)->partner->adev;
+ struct altmode *partner = to_altmode(adev)->partner;
+ struct typec_altmode *pdev;
+
+ if (!partner)
+ return -ENODEV;
+
+ pdev = &partner->adev;
if (pdev->ops && pdev->ops->attention)
pdev->ops->attention(pdev, vdo);
+
+ return 0;
}
EXPORT_SYMBOL_GPL(typec_altmode_attention);
diff --git a/drivers/usb/typec/tcpm/tcpm.c b/drivers/usb/typec/tcpm/tcpm.c
index 5a7d8cc04628..77fe16190766 100644
--- a/drivers/usb/typec/tcpm/tcpm.c
+++ b/drivers/usb/typec/tcpm/tcpm.c
@@ -1877,7 +1877,8 @@ static void tcpm_handle_vdm_request(struct tcpm_port *port,
}
break;
case ADEV_ATTENTION:
- typec_altmode_attention(adev, p[1]);
+ if (typec_altmode_attention(adev, p[1]))
+ tcpm_log(port, "typec_altmode_attention no port partner altmode");
break;
}
}
diff --git a/include/linux/usb/typec_altmode.h b/include/linux/usb/typec_altmode.h
index 350d49012659..28aeef8f9e7b 100644
--- a/include/linux/usb/typec_altmode.h
+++ b/include/linux/usb/typec_altmode.h
@@ -67,7 +67,7 @@ struct typec_altmode_ops {
int typec_altmode_enter(struct typec_altmode *altmode, u32 *vdo);
int typec_altmode_exit(struct typec_altmode *altmode);
-void typec_altmode_attention(struct typec_altmode *altmode, u32 vdo);
+int typec_altmode_attention(struct typec_altmode *altmode, u32 vdo);
int typec_altmode_vdm(struct typec_altmode *altmode,
const u32 header, const u32 *vdo, int count);
int typec_altmode_notify(struct typec_altmode *altmode, unsigned long conf,
base-commit: f176638af476c6d46257cc3303f5c7cf47d5967d
--
2.41.0.694.ge786442a9b-goog
A KVM guest using SEV-ES or SEV-SNP with multiple vCPUs can trigger
a double fetch race condition vulnerability and invoke the VMGEXIT
handler recursively.
sev_handle_vmgexit() maps the GHCB page using kvm_vcpu_map() and then
fetches the exit code using ghcb_get_sw_exit_code(). Soon after,
sev_es_validate_vmgexit() fetches the exit code again. Since the GHCB
page is shared with the guest, the guest is able to quickly swap the
values with another vCPU and hence bypass the validation. One vmexit code
that can be rejected by sev_es_validate_vmgexit() is SVM_EXIT_VMGEXIT;
if sev_handle_vmgexit() observes it in the second fetch, the call
to svm_invoke_exit_handler() will invoke sev_handle_vmgexit() again
recursively.
To avoid the race, always fetch the GHCB data from the places where
sev_es_sync_from_ghcb stores it.
Exploiting recursions on linux kernel has been proven feasible
in the past, but the impact is mitigated by stack guard pages
(CONFIG_VMAP_STACK). Still, if an attacker manages to call the handler
multiple times, they can theoretically trigger a stack overflow and
cause a denial-of-service, or potentially guest-to-host escape in kernel
configurations without stack guard pages.
Note that winning the race reliably in every iteration is very tricky
due to the very tight window of the fetches; depending on the compiler
settings, they are often consecutive because of optimization and inlining.
Tested by booting an SEV-ES RHEL9 guest.
Fixes: CVE-2023-4155
Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Cc: stable(a)vger.kernel.org
Reported-by: Andy Nguyen <theflow(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
---
arch/x86/kvm/svm/sev.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index e898f0b2b0ba..ca4ba5fe9a01 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2445,9 +2445,15 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
}
+static u64 kvm_ghcb_get_sw_exit_code(struct vmcb_control_area *control)
+{
+ return (((u64)control->exit_code_hi) << 32) | control->exit_code;
+}
+
static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
{
- struct kvm_vcpu *vcpu;
+ struct vmcb_control_area *control = &svm->vmcb->control;
+ struct kvm_vcpu *vcpu = &svm->vcpu;
struct ghcb *ghcb;
u64 exit_code;
u64 reason;
@@ -2458,7 +2464,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
* Retrieve the exit code now even though it may not be marked valid
* as it could help with debugging.
*/
- exit_code = ghcb_get_sw_exit_code(ghcb);
+ exit_code = kvm_ghcb_get_sw_exit_code(control);
/* Only GHCB Usage code 0 is supported */
if (ghcb->ghcb_usage) {
@@ -2473,7 +2479,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
!kvm_ghcb_sw_exit_info_2_is_valid(svm))
goto vmgexit_err;
- switch (ghcb_get_sw_exit_code(ghcb)) {
+ switch (exit_code) {
case SVM_EXIT_READ_DR7:
break;
case SVM_EXIT_WRITE_DR7:
@@ -2490,18 +2496,18 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
if (!kvm_ghcb_rax_is_valid(svm) ||
!kvm_ghcb_rcx_is_valid(svm))
goto vmgexit_err;
- if (ghcb_get_rax(ghcb) == 0xd)
+ if (vcpu->arch.regs[VCPU_REGS_RAX] == 0xd)
if (!kvm_ghcb_xcr0_is_valid(svm))
goto vmgexit_err;
break;
case SVM_EXIT_INVD:
break;
case SVM_EXIT_IOIO:
- if (ghcb_get_sw_exit_info_1(ghcb) & SVM_IOIO_STR_MASK) {
+ if (control->exit_info_1 & SVM_IOIO_STR_MASK) {
if (!kvm_ghcb_sw_scratch_is_valid(svm))
goto vmgexit_err;
} else {
- if (!(ghcb_get_sw_exit_info_1(ghcb) & SVM_IOIO_TYPE_MASK))
+ if (!(control->exit_info_1 & SVM_IOIO_TYPE_MASK))
if (!kvm_ghcb_rax_is_valid(svm))
goto vmgexit_err;
}
@@ -2509,7 +2515,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
case SVM_EXIT_MSR:
if (!kvm_ghcb_rcx_is_valid(svm))
goto vmgexit_err;
- if (ghcb_get_sw_exit_info_1(ghcb)) {
+ if (control->exit_info_1) {
if (!kvm_ghcb_rax_is_valid(svm) ||
!kvm_ghcb_rdx_is_valid(svm))
goto vmgexit_err;
@@ -2553,8 +2559,6 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
return 0;
vmgexit_err:
- vcpu = &svm->vcpu;
-
if (reason == GHCB_ERR_INVALID_USAGE) {
vcpu_unimpl(vcpu, "vmgexit: ghcb usage %#x is not valid\n",
ghcb->ghcb_usage);
@@ -2852,8 +2856,6 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
trace_kvm_vmgexit_enter(vcpu->vcpu_id, ghcb);
- exit_code = ghcb_get_sw_exit_code(ghcb);
-
sev_es_sync_from_ghcb(svm);
ret = sev_es_validate_vmgexit(svm);
if (ret)
@@ -2862,6 +2864,7 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
ghcb_set_sw_exit_info_1(ghcb, 0);
ghcb_set_sw_exit_info_2(ghcb, 0);
+ exit_code = kvm_ghcb_get_sw_exit_code(control);
switch (exit_code) {
case SVM_VMGEXIT_MMIO_READ:
ret = setup_vmgexit_scratch(svm, true, control->exit_info_2);
--
2.39.0
From: Laszlo Ersek <lersek(a)redhat.com>
stable inclusion
from stable-v5.10.189
commit 5ea23f1cb67e4468db7ff651627892c9217fec24
category: bugfix
bugzilla: 189104, https://gitee.com/src-openeuler/kernel/issues/I7QXHX
CVE: CVE-2023-4194
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id…
---------------------------
commit 9bc3047374d5bec163e83e743709e23753376f0c upstream.
Commit a096ccca6e50 initializes the "sk_uid" field in the protocol socket
(struct sock) from the "/dev/net/tun" device node's owner UID. Per
original commit 86741ec25462 ("net: core: Add a UID field to struct
sock.", 2016-11-04), that's wrong: the idea is to cache the UID of the
userspace process that creates the socket. Commit 86741ec25462 mentions
socket() and accept(); with "tun", the action that creates the socket is
open("/dev/net/tun").
Therefore the device node's owner UID is irrelevant. In most cases,
"/dev/net/tun" will be owned by root, so in practice, commit a096ccca6e50
has no observable effect:
- before, "sk_uid" would be zero, due to undefined behavior
(CVE-2023-1076),
- after, "sk_uid" would be zero, due to "/dev/net/tun" being owned by root.
What matters is the (fs)UID of the process performing the open(), so cache
that in "sk_uid".
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Lorenzo Colitti <lorenzo(a)google.com>
Cc: Paolo Abeni <pabeni(a)redhat.com>
Cc: Pietro Borrello <borrello(a)diag.uniroma1.it>
Cc: netdev(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Fixes: a096ccca6e50 ("tun: tun_chr_open(): correctly initialize socket uid")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2173435
Signed-off-by: Laszlo Ersek <lersek(a)redhat.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Dong Chenchen <dongchenchen2(a)huawei.com>
---
drivers/net/tun.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index f8feec522b32..50c2ce392cd1 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3456,7 +3456,7 @@ static int tun_chr_open(struct inode *inode, struct file * file)
tfile->socket.file = file;
tfile->socket.ops = &tun_socket_ops;
- sock_init_data_uid(&tfile->socket, &tfile->sk, inode->i_uid);
+ sock_init_data_uid(&tfile->socket, &tfile->sk, current_fsuid());
tfile->sk.sk_write_space = tun_sock_write_space;
tfile->sk.sk_sndbuf = INT_MAX;
--
2.25.1
From: Laszlo Ersek <lersek(a)redhat.com>
stable inclusion
from stable-v5.10.189
commit 33a339e717be2c88b7ad11375165168d5b40e38e
category: bugfix
bugzilla: 189104, https://gitee.com/src-openeuler/kernel/issues/I7QXHX
CVE: CVE-2023-4194
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id…
---------------------------
commit 5c9241f3ceab3257abe2923a59950db0dc8bb737 upstream.
Commit 66b2c338adce initializes the "sk_uid" field in the protocol socket
(struct sock) from the "/dev/tapX" device node's owner UID. Per original
commit 86741ec25462 ("net: core: Add a UID field to struct sock.",
2016-11-04), that's wrong: the idea is to cache the UID of the userspace
process that creates the socket. Commit 86741ec25462 mentions socket() and
accept(); with "tap", the action that creates the socket is
open("/dev/tapX").
Therefore the device node's owner UID is irrelevant. In most cases,
"/dev/tapX" will be owned by root, so in practice, commit 66b2c338adce has
no observable effect:
- before, "sk_uid" would be zero, due to undefined behavior
(CVE-2023-1076),
- after, "sk_uid" would be zero, due to "/dev/tapX" being owned by root.
What matters is the (fs)UID of the process performing the open(), so cache
that in "sk_uid".
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Lorenzo Colitti <lorenzo(a)google.com>
Cc: Paolo Abeni <pabeni(a)redhat.com>
Cc: Pietro Borrello <borrello(a)diag.uniroma1.it>
Cc: netdev(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Fixes: 66b2c338adce ("tap: tap_open(): correctly initialize socket uid")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2173435
Signed-off-by: Laszlo Ersek <lersek(a)redhat.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Dong Chenchen <dongchenchen2(a)huawei.com>
---
drivers/net/tap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index d9018d9fe310..2c9ae02ada3e 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -523,7 +523,7 @@ static int tap_open(struct inode *inode, struct file *file)
q->sock.state = SS_CONNECTED;
q->sock.file = file;
q->sock.ops = &tap_socket_ops;
- sock_init_data_uid(&q->sock, &q->sk, inode->i_uid);
+ sock_init_data_uid(&q->sock, &q->sk, current_fsuid());
q->sk.sk_write_space = tap_sock_write_space;
q->sk.sk_destruct = tap_sock_destruct;
q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
--
2.25.1
All small, fairly safe changes.
The following changes since commit 52a93d39b17dc7eb98b6aa3edb93943248e03b2f:
Linux 6.5-rc5 (2023-08-06 15:07:51 -0700)
are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
for you to fetch changes up to f55484fd7be923b740e8e1fc304070ba53675cb4:
virtio-mem: check if the config changed before fake offlining memory (2023-08-10 15:51:46 -0400)
----------------------------------------------------------------
virtio: bugfixes
just a bunch of bugfixes all over the place.
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
----------------------------------------------------------------
Allen Hubbe (2):
pds_vdpa: reset to vdpa specified mac
pds_vdpa: alloc irq vectors on DRIVER_OK
David Hildenbrand (4):
virtio-mem: remove unsafe unplug in Big Block Mode (BBM)
virtio-mem: convert most offline_and_remove_memory() errors to -EBUSY
virtio-mem: keep retrying on offline_and_remove_memory() errors in Sub Block Mode (SBM)
virtio-mem: check if the config changed before fake offlining memory
Dragos Tatulea (4):
vdpa: Enable strict validation for netlinks ops
vdpa/mlx5: Correct default number of queues when MQ is on
vdpa/mlx5: Fix mr->initialized semantics
vdpa/mlx5: Fix crash on shutdown for when no ndev exists
Eugenio Pérez (1):
vdpa/mlx5: Delete control vq iotlb in destroy_mr only when necessary
Feng Liu (1):
virtio-pci: Fix legacy device flag setting error in probe
Gal Pressman (1):
virtio-vdpa: Fix cpumask memory leak in virtio_vdpa_find_vqs()
Hawkins Jiawei (1):
virtio-net: Zero max_tx_vq field for VIRTIO_NET_CTRL_MQ_HASH_CONFIG case
Lin Ma (3):
vdpa: Add features attr to vdpa_nl_policy for nlattr length check
vdpa: Add queue index attr to vdpa_nl_policy for nlattr length check
vdpa: Add max vqp attr to vdpa_nl_policy for nlattr length check
Maxime Coquelin (1):
vduse: Use proper spinlock for IRQ injection
Mike Christie (3):
vhost-scsi: Fix alignment handling with windows
vhost-scsi: Rename vhost_scsi_iov_to_sgl
MAINTAINERS: add vhost-scsi entry and myself as a co-maintainer
Shannon Nelson (4):
pds_vdpa: protect Makefile from unconfigured debugfs
pds_vdpa: always allow offering VIRTIO_NET_F_MAC
pds_vdpa: clean and reset vqs entries
pds_vdpa: fix up debugfs feature bit printing
Wolfram Sang (1):
virtio-mmio: don't break lifecycle of vm_dev
MAINTAINERS | 11 ++-
drivers/net/virtio_net.c | 2 +-
drivers/vdpa/mlx5/core/mlx5_vdpa.h | 2 +
drivers/vdpa/mlx5/core/mr.c | 105 +++++++++++++++------
drivers/vdpa/mlx5/net/mlx5_vnet.c | 26 +++---
drivers/vdpa/pds/Makefile | 3 +-
drivers/vdpa/pds/debugfs.c | 15 ++-
drivers/vdpa/pds/vdpa_dev.c | 176 ++++++++++++++++++++++++----------
drivers/vdpa/pds/vdpa_dev.h | 5 +-
drivers/vdpa/vdpa.c | 9 +-
drivers/vdpa/vdpa_user/vduse_dev.c | 8 +-
drivers/vhost/scsi.c | 187 ++++++++++++++++++++++++++++++++-----
drivers/virtio/virtio_mem.c | 168 ++++++++++++++++++++++-----------
drivers/virtio/virtio_mmio.c | 5 +-
drivers/virtio/virtio_pci_common.c | 2 -
drivers/virtio/virtio_pci_legacy.c | 1 +
drivers/virtio/virtio_vdpa.c | 2 +
17 files changed, 519 insertions(+), 208 deletions(-)
This is the start of the stable review cycle for the 4.14.323 release.
There are 26 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Tue, 15 Aug 2023 21:16:53 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.323-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.14.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.14.323-rc1
Masahiro Yamada <masahiroy(a)kernel.org>
alpha: remove __init annotation from exported page_is_ram()
Zhu Wang <wangzhu9(a)huawei.com>
scsi: core: Fix possible memory leak if device_add() fails
Zhu Wang <wangzhu9(a)huawei.com>
scsi: snic: Fix possible memory leak if device_add() fails
Alexandra Diupina <adiupina(a)astralinux.ru>
scsi: 53c700: Check that command slot is not NULL
Michael Kelley <mikelley(a)microsoft.com>
scsi: storvsc: Fix handling of virtual Fibre Channel timeouts
Tony Battersby <tonyb(a)cybernetics.com>
scsi: core: Fix legacy /proc parsing buffer overflow
Pablo Neira Ayuso <pablo(a)netfilter.org>
netfilter: nf_tables: report use refcount overflow
Christoph Hellwig <hch(a)lst.de>
btrfs: don't stop integrity writeback too early
Douglas Miller <doug.miller(a)cornelisnetworks.com>
IB/hfi1: Fix possible panic during hotplug remove
Andrew Kanner <andrew.kanner(a)gmail.com>
drivers: net: prevent tun_build_skb() to exceed the packet size limit
Eric Dumazet <edumazet(a)google.com>
dccp: fix data-race around dp->dccps_mss_cache
Ziyang Xuan <william.xuanziyang(a)huawei.com>
bonding: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves
Eric Dumazet <edumazet(a)google.com>
net/packet: annotate data-races around tp->status
Karol Herbst <kherbst(a)redhat.com>
drm/nouveau/disp: Revert a NULL check inside nouveau_connector_get_modes
Arnd Bergmann <arnd(a)arndb.de>
x86: Move gds_ucode_mitigated() declaration to header
Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
x86/mm: Fix VDSO and VVAR placement on 5-level paging machines
Elson Roy Serrao <quic_eserrao(a)quicinc.com>
usb: dwc3: Properly handle processing of pending events
Alan Stern <stern(a)rowland.harvard.edu>
usb-storage: alauda: Fix uninit-value in alauda_check_media()
Yiyuan Guo <yguoaz(a)gmail.com>
iio: cros_ec: Fix the allocation size for cros_ec_command
Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
test_firmware: return ENOMEM instead of ENOSPC on failed memory allocation
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix use-after-free of nilfs_root in dirtying inodes via iput
Colin Ian King <colin.i.king(a)gmail.com>
radix tree test suite: fix incorrect allocation size for pthreads
Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
dmaengine: pl330: Return DMA_PAUSED when transaction is paused
Maciej Żenczykowski <maze(a)google.com>
ipv6: adjust ndisc_is_useropt() to also return true for PIO
Sergei Antonov <saproj(a)gmail.com>
mmc: moxart: read scr register without changing byte order
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
sparc: fix up arch_cpu_finalize_init() build breakage.
-------------
Diffstat:
Makefile | 4 +-
arch/alpha/kernel/setup.c | 3 +-
arch/sparc/Kconfig | 2 +-
arch/x86/entry/vdso/vma.c | 4 +-
arch/x86/include/asm/processor.h | 2 +
arch/x86/kvm/x86.c | 2 -
drivers/dma/pl330.c | 18 ++-
drivers/gpu/drm/nouveau/nouveau_connector.c | 2 +-
.../common/cros_ec_sensors/cros_ec_sensors_core.c | 2 +-
drivers/infiniband/hw/hfi1/chip.c | 1 +
drivers/mmc/host/moxart-mmc.c | 8 +-
drivers/net/bonding/bond_main.c | 4 +-
drivers/net/tun.c | 2 +-
drivers/scsi/53c700.c | 2 +-
drivers/scsi/raid_class.c | 1 +
drivers/scsi/scsi_proc.c | 30 +++--
drivers/scsi/snic/snic_disc.c | 1 +
drivers/scsi/storvsc_drv.c | 4 -
drivers/usb/dwc3/gadget.c | 9 +-
drivers/usb/storage/alauda.c | 9 +-
fs/btrfs/extent_io.c | 7 +-
fs/nilfs2/inode.c | 8 ++
fs/nilfs2/segment.c | 2 +
fs/nilfs2/the_nilfs.h | 2 +
include/net/netfilter/nf_tables.h | 27 +++-
lib/test_firmware.c | 8 +-
net/dccp/output.c | 2 +-
net/dccp/proto.c | 10 +-
net/ipv6/ndisc.c | 3 +-
net/netfilter/nf_tables_api.c | 143 +++++++++++++--------
net/netfilter/nft_objref.c | 8 +-
net/packet/af_packet.c | 16 ++-
tools/testing/radix-tree/regression1.c | 2 +-
33 files changed, 228 insertions(+), 120 deletions(-)
Dear all,
I found in all versions of Linux (at least for kernel version 4/5/6),
the following bug exists:
When a user is granted full access to a file of which he is not the
owner, he can read/write/delete the file, but cannot "change only its
last modification date". In particular, `touch -m` fails and Python's
`os.utime()` also fails with "Operation not permitted", but `touch`
without -m works.
This applies to both FACL extended permission as well as basic Linux
file permission.
Thank you for fixing this in the future!
Cheers,
Xuancong
The quilt patch titled
Subject: watchdog: Fix lockdep warning
has been removed from the -mm tree. Its filename was
watchdog-fix-lockdep-warning.patch
This patch was dropped because it was withdrawn
------------------------------------------------------
From: Helge Deller <deller(a)gmx.de>
Subject: watchdog: Fix lockdep warning
Date: Fri, 11 Aug 2023 19:11:46 +0200
Fully initialize detector_work work struct to avoid this kernel warning
when lockdep is enabled:
=====================================
WARNING: bad unlock balance detected!
6.5.0-rc5+ #687 Not tainted
-------------------------------------
swapper/0/1 is trying to release lock (detector_work) at:
[<000000004037e554>] __flush_work+0x60/0x658
but there are no more locks to release!
other info that might help us debug this:
no locks held by swapper/0/1.
stack backtrace:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.5.0-rc5+ #687
Hardware name: 9000/785/C3700
Backtrace:
[<0000000041455d5c>] print_unlock_imbalance_bug.part.0+0x20c/0x230
[<000000004040d5e8>] lock_release+0x2e8/0x3f8
[<000000004037e5cc>] __flush_work+0xd8/0x658
[<000000004037eb7c>] flush_work+0x30/0x60
[<000000004011f140>] lockup_detector_check+0x54/0x128
[<0000000040306430>] do_one_initcall+0x9c/0x408
[<0000000040102d44>] kernel_init_freeable+0x688/0x7f0
[<000000004146df68>] kernel_init+0x64/0x3a8
[<0000000040302020>] ret_from_kernel_thread+0x20/0x28
Signed-off-by: Helge Deller <deller(a)gmx.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/watchdog.c | 1 +
1 file changed, 1 insertion(+)
--- a/kernel/watchdog.c~watchdog-fix-lockdep-warning
+++ a/kernel/watchdog.c
@@ -1022,5 +1022,6 @@ void __init lockup_detector_init(void)
else
allow_lockup_detector_init_retry = true;
+ INIT_WORK(&detector_work, lockup_detector_delay_init);
lockup_detector_setup();
}
_
Patches currently in -mm which might be from deller(a)gmx.de are
The quilt patch titled
Subject: init: add lockdep annotation to kthreadd_done completer
has been removed from the -mm tree. Its filename was
init-add-lockdep-annotation-to-kthreadd_done-completer.patch
This patch was dropped because it was withdrawn
------------------------------------------------------
From: Helge Deller <deller(a)gmx.de>
Subject: init: add lockdep annotation to kthreadd_done completer
Date: Fri, 11 Aug 2023 18:04:22 +0200
Add the missing lockdep annotation to avoid this warning:
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.5.0-rc5+ #681
Hardware name: 9000/785/C3700
Backtrace:
[<000000004030bcd0>] show_stack+0x74/0xb0
[<0000000041469c7c>] dump_stack_lvl+0x104/0x180
[<0000000041469d2c>] dump_stack+0x34/0x48
[<000000004040e5b4>] register_lock_class+0xd24/0xd30
[<000000004040c21c>] __lock_acquire.isra.0+0xb4/0xac8
[<000000004040cd60>] lock_acquire+0x130/0x298
[<000000004146df54>] _raw_spin_lock_irq+0x60/0xb8
[<0000000041472044>] wait_for_completion+0xa0/0x2d0
[<000000004146b544>] kernel_init+0x48/0x3a8
[<0000000040302020>] ret_from_kernel_thread+0x20/0x28
Link: https://lkml.kernel.org/r/ZNZcBkiVkm87+Tvr@p100
Signed-off-by: Helge Deller <deller(a)gmx.de>
Cc: Mike Rapoport (IBM) <rppt(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
init/main.c | 2 ++
1 file changed, 2 insertions(+)
--- a/init/main.c~init-add-lockdep-annotation-to-kthreadd_done-completer
+++ a/init/main.c
@@ -682,6 +682,8 @@ noinline void __ref __noreturn rest_init
struct task_struct *tsk;
int pid;
+ init_completion(&kthreadd_done);
+
rcu_scheduler_starting();
/*
* We need to spawn init first so that it obtains pid 1, however
_
Patches currently in -mm which might be from deller(a)gmx.de are
watchdog-fix-lockdep-warning.patch
The quilt patch titled
Subject: mm: add lockdep annotation to pgdat_init_all_done_comp completer
has been removed from the -mm tree. Its filename was
mm-add-lockdep-annotation-to-pgdat_init_all_done_comp-completer.patch
This patch was dropped because it was withdrawn
------------------------------------------------------
From: Helge Deller <deller(a)gmx.de>
Subject: mm: add lockdep annotation to pgdat_init_all_done_comp completer
Date: Fri, 11 Aug 2023 18:06:19 +0200
Add the missing lockdep annotation to avoid this kernel warning:
smp: Brought up 1 node, 1 CPU
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?'
turning off the locking correctness validator.
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.5.0-rc5+ #683
Hardware name: 9000/785/C3700
Backtrace:
[<000000004030bcd0>] show_stack+0x74/0xb0
[<000000004146c63c>] dump_stack_lvl+0x104/0x180
[<000000004146c6ec>] dump_stack+0x34/0x48
[<000000004040e5b4>] register_lock_class+0xd24/0xd30
[<000000004040c21c>] __lock_acquire.isra.0+0xb4/0xac8
[<000000004040cd60>] lock_acquire+0x130/0x298
[<000000004147095c>] _raw_spin_lock_irq+0x60/0xb8
[<0000000041474a4c>] wait_for_completion+0xa0/0x2d0
[<000000004012bf04>] page_alloc_init_late+0xf8/0x2b0
[<0000000040102b20>] kernel_init_freeable+0x464/0x7f0
[<000000004146df68>] kernel_init+0x64/0x3a8
[<0000000040302020>] ret_from_kernel_thread+0x20/0x28
Link: https://lkml.kernel.org/r/ZNZce1KGxP1dxpTN@p100
Signed-off-by: Helge Deller <deller(a)gmx.de>
Cc: Mike Rapoport (IBM) <rppt(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mm_init.c | 1 +
1 file changed, 1 insertion(+)
--- a/mm/mm_init.c~mm-add-lockdep-annotation-to-pgdat_init_all_done_comp-completer
+++ a/mm/mm_init.c
@@ -2377,6 +2377,7 @@ void __init page_alloc_init_late(void)
int nid;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ init_completion(&pgdat_init_all_done_comp);
/* There will be num_node_state(N_MEMORY) threads */
atomic_set(&pgdat_init_n_undone, num_node_state(N_MEMORY));
_
Patches currently in -mm which might be from deller(a)gmx.de are
init-add-lockdep-annotation-to-kthreadd_done-completer.patch
watchdog-fix-lockdep-warning.patch
The patch titled
Subject: memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
has been added to the -mm mm-unstable branch. Its filename is
memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Aleksa Sarai <cyphar(a)cyphar.com>
Subject: memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
Date: Mon, 14 Aug 2023 18:41:00 +1000
This sysctl has the very unusual behaviour of not allowing any user (even
CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were
to set this sysctl to a more restrictive option in the host pidns you
would need to reboot your machine in order to reset it.
The justification given in [1] is that this is a security feature and thus
it should not be possible to disable. Aside from the fact that we have
plenty of security-related sysctls that can be disabled after being
enabled (fs.protected_symlinks for instance), the protection provided by
the sysctl is to stop users from being able to create a binary and then
execute it. A user with CAP_SYS_ADMIN can trivially do this without
memfd_create(2):
% cat mount-memfd.c
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/mount.h>
#define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"
int main(void)
{
int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
assert(fsfd >= 0);
assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2));
int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
assert(dfd >= 0);
int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0782);
assert(execfd >= 0);
assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
assert(!close(execfd));
char *execpath = NULL;
char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
assert(execfd >= 0);
assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
assert(!execve(execpath, argv, envp));
}
% ./mount-memfd
this file was executed from this totally private tmpfs: /proc/self/fd/5
%
Given that it is possible for CAP_SYS_ADMIN users to create executable
binaries without memfd_create(2) and without touching the host filesystem
(not to mention the many other things a CAP_SYS_ADMIN process would be
able to do that would be equivalent or worse), it seems strange to cause a
fair amount of headache to admins when there doesn't appear to be an
actual security benefit to blocking this. There appear to be concerns
about confused-deputy-esque attacks[2] but a confused deputy that can
write to arbitrary sysctls is a bigger security issue than executable
memfds.
/* New API */
The primary requirement from the original author appears to be more based
on the need to be able to restrict an entire system in a hierarchical
manner[3], such that child namespaces cannot re-enable executable memfds.
So, implement that behaviour explicitly -- the vm.memfd_noexec scope is
evaluated up the pidns tree to &init_pid_ns and you have the most
restrictive value applied to you. The new lower limit you can set
vm.memfd_noexec is whatever limit applies to your parent.
Note that a pidns will inherit a copy of the parent pidns's effective
vm.memfd_noexec setting at unshare() time. This matches the existing
behaviour, and it also ensures that a pidns will never have its
vm.memfd_noexec setting *lowered* behind its back (but it will be raised
if the parent raises theirs).
/* Backwards Compatibility */
As the previous version of the sysctl didn't allow you to lower the
setting at all, there are no backwards compatibility issues with this
aspect of the change.
However it should be noted that now that the setting is completely
hierarchical. Previously, a cloned pidns would just copy the current
pidns setting, meaning that if the parent's vm.memfd_noexec was changed it
wouldn't propoagate to existing pid namespaces. Now, the restriction
applies recursively. This is a uAPI change, however:
* The sysctl is very new, having been merged in 6.3.
* Several aspects of the sysctl were broken up until this patchset and
the other patchset by Jeff Xu last month.
And thus it seems incredibly unlikely that any real users would run into
this issue. In the worst case, if this causes userspace isues we could
make it so that modifying the setting follows the hierarchical rules but
the restriction checking uses the cached copy.
[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw…
[2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg…
[3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw…
Link: https://lkml.kernel.org/r/20230814-memfd-vm-noexec-uapi-fixes-v2-4-7ff9e3e1…
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Dominique Martinet <asmadeus(a)codewreck.org>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Daniel Verkamp <dverkamp(a)chromium.org>
Cc: Jeff Xu <jeffxu(a)google.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/pid_namespace.h | 23 ++++++++++++++++++++++-
kernel/pid.c | 3 +++
kernel/pid_namespace.c | 6 +++---
kernel/pid_sysctl.h | 30 +++++++++++++-----------------
mm/memfd.c | 3 ++-
5 files changed, 43 insertions(+), 22 deletions(-)
--- a/include/linux/pid_namespace.h~memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy
+++ a/include/linux/pid_namespace.h
@@ -39,7 +39,6 @@ struct pid_namespace {
int reboot; /* group exit code if this pidns was rebooted */
struct ns_common ns;
#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
- /* sysctl for vm.memfd_noexec */
int memfd_noexec_scope;
#endif
} __randomize_layout;
@@ -56,6 +55,23 @@ static inline struct pid_namespace *get_
return ns;
}
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+ int scope = MEMFD_NOEXEC_SCOPE_EXEC;
+
+ for (; ns; ns = ns->parent)
+ scope = max(scope, READ_ONCE(ns->memfd_noexec_scope));
+
+ return scope;
+}
+#else
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+ return 0;
+}
+#endif
+
extern struct pid_namespace *copy_pid_ns(unsigned long flags,
struct user_namespace *user_ns, struct pid_namespace *ns);
extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
@@ -70,6 +86,11 @@ static inline struct pid_namespace *get_
return ns;
}
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+ return 0;
+}
+
static inline struct pid_namespace *copy_pid_ns(unsigned long flags,
struct user_namespace *user_ns, struct pid_namespace *ns)
{
--- a/kernel/pid.c~memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy
+++ a/kernel/pid.c
@@ -83,6 +83,9 @@ struct pid_namespace init_pid_ns = {
#ifdef CONFIG_PID_NS
.ns.ops = &pidns_operations,
#endif
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+ .memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC,
+#endif
};
EXPORT_SYMBOL_GPL(init_pid_ns);
--- a/kernel/pid_namespace.c~memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy
+++ a/kernel/pid_namespace.c
@@ -110,9 +110,9 @@ static struct pid_namespace *create_pid_
ns->user_ns = get_user_ns(user_ns);
ns->ucounts = ucounts;
ns->pid_allocated = PIDNS_ADDING;
-
- initialize_memfd_noexec_scope(ns);
-
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+ ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
+#endif
return ns;
out_free_idr:
--- a/kernel/pid_sysctl.h~memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy
+++ a/kernel/pid_sysctl.h
@@ -5,33 +5,30 @@
#include <linux/pid_namespace.h>
#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
-static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns)
-{
- ns->memfd_noexec_scope =
- task_active_pid_ns(current)->memfd_noexec_scope;
-}
-
static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table,
int write, void *buf, size_t *lenp, loff_t *ppos)
{
struct pid_namespace *ns = task_active_pid_ns(current);
struct ctl_table table_copy;
+ int err, scope, parent_scope;
if (write && !ns_capable(ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
table_copy = *table;
- if (ns != &init_pid_ns)
- table_copy.data = &ns->memfd_noexec_scope;
-
- /*
- * set minimum to current value, the effect is only bigger
- * value is accepted.
- */
- if (*(int *)table_copy.data > *(int *)table_copy.extra1)
- table_copy.extra1 = table_copy.data;
- return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos);
+ /* You cannot set a lower enforcement value than your parent. */
+ parent_scope = pidns_memfd_noexec_scope(ns->parent);
+ /* Equivalent to pidns_memfd_noexec_scope(ns). */
+ scope = max(READ_ONCE(ns->memfd_noexec_scope), parent_scope);
+
+ table_copy.data = &scope;
+ table_copy.extra1 = &parent_scope;
+
+ err = proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos);
+ if (!err && write)
+ WRITE_ONCE(ns->memfd_noexec_scope, scope);
+ return err;
}
static struct ctl_table pid_ns_ctl_table_vm[] = {
@@ -51,7 +48,6 @@ static inline void register_pid_ns_sysct
register_sysctl("vm", pid_ns_ctl_table_vm);
}
#else
-static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns) {}
static inline void register_pid_ns_sysctl_table_vm(void) {}
#endif
--- a/mm/memfd.c~memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy
+++ a/mm/memfd.c
@@ -271,7 +271,8 @@ long memfd_fcntl(struct file *file, unsi
static int check_sysctl_memfd_noexec(unsigned int *flags)
{
#ifdef CONFIG_SYSCTL
- int sysctl = task_active_pid_ns(current)->memfd_noexec_scope;
+ struct pid_namespace *ns = task_active_pid_ns(current);
+ int sysctl = pidns_memfd_noexec_scope(ns);
if (!(*flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
if (sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)
_
Patches currently in -mm which might be from cyphar(a)cyphar.com are
selftests-memfd-error-out-test-process-when-child-test-fails.patch
memfd-do-not-eacces-old-memfd_create-users-with-vmmemfd_noexec=2.patch
memfd-improve-userspace-warnings-for-missing-exec-related-flags.patch
memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy.patch
selftests-improve-vmmemfd_noexec-sysctl-tests.patch
The patch titled
Subject: memfd: improve userspace warnings for missing exec-related flags
has been added to the -mm mm-unstable branch. Its filename is
memfd-improve-userspace-warnings-for-missing-exec-related-flags.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Aleksa Sarai <cyphar(a)cyphar.com>
Subject: memfd: improve userspace warnings for missing exec-related flags
Date: Mon, 14 Aug 2023 18:40:59 +1000
In order to incentivise userspace to switch to passing MFD_EXEC and
MFD_NOEXEC_SEAL, we need to provide a warning on each attempt to call
memfd_create() without the new flags. pr_warn_once() is not useful
because on most systems the one warning is burned up during the boot
process (on my system, systemd does this within the first second of boot)
and thus userspace will in practice never see the warnings to push them to
switch to the new flags.
The original patchset[1] used pr_warn_ratelimited(), however there were
concerns about the degree of spam in the kernel log[2,3]. The resulting
inability to detect every case was flagged as an issue at the time[4].
While we could come up with an alternative rate-limiting scheme such as
only outputting the message if vm.memfd_noexec has been modified, or only
outputting the message once for a given task, these alternatives have
downsides that don't make sense given how low-stakes a single kernel
warning message is. Switching to pr_info_ratelimited() instead should be
fine -- it's possible some monitoring tool will be unhappy with a stream
of warning-level messages but there's already plenty of info-level message
spam in dmesg.
[1]: https://lore.kernel.org/20221215001205.51969-4-jeffxu@google.com/
[2]: https://lore.kernel.org/202212161233.85C9783FB@keescook/
[3]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
[4]: https://lore.kernel.org/f185bb42-b29c-977e-312e-3349eea15383@linuxfoundatio…
Link: https://lkml.kernel.org/r/20230814-memfd-vm-noexec-uapi-fixes-v2-3-7ff9e3e1…
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Daniel Verkamp <dverkamp(a)chromium.org>
Cc: Dominique Martinet <asmadeus(a)codewreck.org>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memfd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/memfd.c~memfd-improve-userspace-warnings-for-missing-exec-related-flags
+++ a/mm/memfd.c
@@ -315,7 +315,7 @@ SYSCALL_DEFINE2(memfd_create,
return -EINVAL;
if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
- pr_warn_once(
+ pr_info_ratelimited(
"%s[%d]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set\n",
current->comm, task_pid_nr(current));
}
_
Patches currently in -mm which might be from cyphar(a)cyphar.com are
selftests-memfd-error-out-test-process-when-child-test-fails.patch
memfd-do-not-eacces-old-memfd_create-users-with-vmmemfd_noexec=2.patch
memfd-improve-userspace-warnings-for-missing-exec-related-flags.patch
memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy.patch
selftests-improve-vmmemfd_noexec-sysctl-tests.patch
The patch titled
Subject: memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2
has been added to the -mm mm-unstable branch. Its filename is
memfd-do-not-eacces-old-memfd_create-users-with-vmmemfd_noexec=2.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Aleksa Sarai <cyphar(a)cyphar.com>
Subject: memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2
Date: Mon, 14 Aug 2023 18:40:58 +1000
Given the difficulty of auditing all of userspace to figure out whether
every memfd_create() user has switched to passing MFD_EXEC and
MFD_NOEXEC_SEAL flags, it seems far less distruptive to make it possible
for older programs that don't make use of executable memfds to run under
vm.memfd_noexec=2. Otherwise, a small dependency change can result in
spurious errors. For programs that don't use executable memfds, passing
MFD_NOEXEC_SEAL is functionally a no-op and thus having the same
In addition, every failure under vm.memfd_noexec=2 needs to print to the
kernel log so that userspace can figure out where the error came from.
The concerns about pr_warn_ratelimited() spam that caused the switch to
pr_warn_once()[1,2] do not apply to the vm.memfd_noexec=2 case.
This is a user-visible API change, but as it allows programs to do
something that would be blocked before, and the sysctl itself was broken
and recently released, it seems unlikely this will cause any issues.
[1]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
[2]: https://lore.kernel.org/202212161233.85C9783FB@keescook/
Link: https://lkml.kernel.org/r/20230814-memfd-vm-noexec-uapi-fixes-v2-2-7ff9e3e1…
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Dominique Martinet <asmadeus(a)codewreck.org>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Daniel Verkamp <dverkamp(a)chromium.org>
Cc: Jeff Xu <jeffxu(a)google.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/pid_namespace.h | 16 ++--------
mm/memfd.c | 30 ++++++-------------
tools/testing/selftests/memfd/memfd_test.c | 22 ++++++++++---
3 files changed, 32 insertions(+), 36 deletions(-)
--- a/include/linux/pid_namespace.h~memfd-do-not-eacces-old-memfd_create-users-with-vmmemfd_noexec=2
+++ a/include/linux/pid_namespace.h
@@ -17,18 +17,10 @@
struct fs_pin;
#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
-/*
- * sysctl for vm.memfd_noexec
- * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
- * acts like MFD_EXEC was set.
- * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
- * acts like MFD_NOEXEC_SEAL was set.
- * 2: memfd_create() without MFD_NOEXEC_SEAL will be
- * rejected.
- */
-#define MEMFD_NOEXEC_SCOPE_EXEC 0
-#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1
-#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2
+/* modes for vm.memfd_noexec sysctl */
+#define MEMFD_NOEXEC_SCOPE_EXEC 0 /* MFD_EXEC implied if unset */
+#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 /* MFD_NOEXEC_SEAL implied if unset */
+#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 /* same as 1, except MFD_EXEC rejected */
#endif
struct pid_namespace {
--- a/mm/memfd.c~memfd-do-not-eacces-old-memfd_create-users-with-vmmemfd_noexec=2
+++ a/mm/memfd.c
@@ -271,30 +271,22 @@ long memfd_fcntl(struct file *file, unsi
static int check_sysctl_memfd_noexec(unsigned int *flags)
{
#ifdef CONFIG_SYSCTL
- char comm[TASK_COMM_LEN];
- int sysctl = MEMFD_NOEXEC_SCOPE_EXEC;
- struct pid_namespace *ns;
-
- ns = task_active_pid_ns(current);
- if (ns)
- sysctl = ns->memfd_noexec_scope;
+ int sysctl = task_active_pid_ns(current)->memfd_noexec_scope;
if (!(*flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
- if (sysctl == MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)
+ if (sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)
*flags |= MFD_NOEXEC_SEAL;
else
*flags |= MFD_EXEC;
}
- if (*flags & MFD_EXEC && sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED) {
- pr_warn_once(
- "memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d '%s'\n",
- task_pid_nr(current), get_task_comm(comm, current));
-
+ if (!(*flags & MFD_NOEXEC_SEAL) && sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED) {
+ pr_err_ratelimited(
+ "%s[%d]: memfd_create() requires MFD_NOEXEC_SEAL with vm.memfd_noexec=%d\n",
+ current->comm, task_pid_nr(current), sysctl);
return -EACCES;
}
#endif
-
return 0;
}
@@ -302,7 +294,6 @@ SYSCALL_DEFINE2(memfd_create,
const char __user *, uname,
unsigned int, flags)
{
- char comm[TASK_COMM_LEN];
unsigned int *file_seals;
struct file *file;
int fd, error;
@@ -325,12 +316,13 @@ SYSCALL_DEFINE2(memfd_create,
if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
pr_warn_once(
- "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n",
- task_pid_nr(current), get_task_comm(comm, current));
+ "%s[%d]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set\n",
+ current->comm, task_pid_nr(current));
}
- if (check_sysctl_memfd_noexec(&flags) < 0)
- return -EACCES;
+ error = check_sysctl_memfd_noexec(&flags);
+ if (error < 0)
+ return error;
/* length includes terminating zero */
len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
--- a/tools/testing/selftests/memfd/memfd_test.c~memfd-do-not-eacces-old-memfd_create-users-with-vmmemfd_noexec=2
+++ a/tools/testing/selftests/memfd/memfd_test.c
@@ -1145,11 +1145,23 @@ static void test_sysctl_child(void)
printf("%s sysctl 2\n", memfd_str);
sysctl_assert_write("2");
- mfd_fail_new("kern_memfd_sysctl_2",
- MFD_CLOEXEC | MFD_ALLOW_SEALING);
- mfd_fail_new("kern_memfd_sysctl_2_MFD_EXEC",
- MFD_CLOEXEC | MFD_EXEC);
- fd = mfd_assert_new("", 0, MFD_NOEXEC_SEAL);
+ mfd_fail_new("kern_memfd_sysctl_2_exec",
+ MFD_EXEC | MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+ fd = mfd_assert_new("kern_memfd_sysctl_2_dfl",
+ mfd_def_size,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+ mfd_assert_mode(fd, 0666);
+ mfd_assert_has_seals(fd, F_SEAL_EXEC);
+ mfd_fail_chmod(fd, 0777);
+ close(fd);
+
+ fd = mfd_assert_new("kern_memfd_sysctl_2_noexec_seal",
+ mfd_def_size,
+ MFD_NOEXEC_SEAL | MFD_CLOEXEC | MFD_ALLOW_SEALING);
+ mfd_assert_mode(fd, 0666);
+ mfd_assert_has_seals(fd, F_SEAL_EXEC);
+ mfd_fail_chmod(fd, 0777);
close(fd);
sysctl_fail_write("0");
_
Patches currently in -mm which might be from cyphar(a)cyphar.com are
selftests-memfd-error-out-test-process-when-child-test-fails.patch
memfd-do-not-eacces-old-memfd_create-users-with-vmmemfd_noexec=2.patch
memfd-improve-userspace-warnings-for-missing-exec-related-flags.patch
memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy.patch
selftests-improve-vmmemfd_noexec-sysctl-tests.patch
The patch titled
Subject: mm: multi-gen LRU: don't spin during memcg release
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-multi-gen-lru-dont-spin-during-memcg-release.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "T.J. Mercier" <tjmercier(a)google.com>
Subject: mm: multi-gen LRU: don't spin during memcg release
Date: Mon, 14 Aug 2023 15:16:36 +0000
When a memcg is in the process of being released mem_cgroup_tryget will
fail because its reference count has already reached 0. This can happen
during reclaim if the memcg has already been offlined, and we reclaim all
remaining pages attributed to the offlined memcg. shrink_many attempts to
skip the empty memcg in this case, and continue reclaiming from the
remaining memcgs in the old generation. If there is only one memcg
remaining, or if all remaining memcgs are in the process of being released
then shrink_many will spin until all memcgs have finished being released.
The release occurs through a workqueue, so it can take a while before
kswapd is able to make any further progress.
This fix results in reductions in kswapd activity and direct reclaim in
a test where 28 apps (working set size > total memory) are repeatedly
launched in a random sequence:
A B delta ratio(%)
allocstall_movable 5962 3539 -2423 -40.64
allocstall_normal 2661 2417 -244 -9.17
kswapd_high_wmark_hit_quickly 53152 7594 -45558 -85.71
pageoutrun 57365 11750 -45615 -79.52
Link: https://lkml.kernel.org/r/20230814151636.1639123-1-tjmercier@google.com
Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
Acked-by: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmscan.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
--- a/mm/vmscan.c~mm-multi-gen-lru-dont-spin-during-memcg-release
+++ a/mm/vmscan.c
@@ -4854,16 +4854,17 @@ void lru_gen_release_memcg(struct mem_cg
spin_lock_irq(&pgdat->memcg_lru.lock);
- VM_WARN_ON_ONCE(hlist_nulls_unhashed(&lruvec->lrugen.list));
+ if (hlist_nulls_unhashed(&lruvec->lrugen.list))
+ goto unlock;
gen = lruvec->lrugen.gen;
- hlist_nulls_del_rcu(&lruvec->lrugen.list);
+ hlist_nulls_del_init_rcu(&lruvec->lrugen.list);
pgdat->memcg_lru.nr_memcgs[gen]--;
if (!pgdat->memcg_lru.nr_memcgs[gen] && gen == get_memcg_gen(pgdat->memcg_lru.seq))
WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);
-
+unlock:
spin_unlock_irq(&pgdat->memcg_lru.lock);
}
}
@@ -5435,8 +5436,10 @@ restart:
rcu_read_lock();
hlist_nulls_for_each_entry_rcu(lrugen, pos, &pgdat->memcg_lru.fifo[gen][bin], list) {
- if (op)
+ if (op) {
lru_gen_rotate_memcg(lruvec, op);
+ op = 0;
+ }
mem_cgroup_put(memcg);
@@ -5444,7 +5447,7 @@ restart:
memcg = lruvec_memcg(lruvec);
if (!mem_cgroup_tryget(memcg)) {
- op = 0;
+ lru_gen_release_memcg(memcg);
memcg = NULL;
continue;
}
_
Patches currently in -mm which might be from tjmercier(a)google.com are
mm-multi-gen-lru-dont-spin-during-memcg-release.patch