The pmd_trans_huge() code in mfill_atomic() is wrong in two different
ways depending on kernel version:
1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit
the right two race windows) - I've tested this in a kernel build with
some extra mdelay() calls. See the commit message for a description
of the race scenario.
On older kernels (before 6.5), I think the same bug can even
theoretically lead to accessing transhuge page contents as a page table
if you hit the right 5 narrow race windows (I haven't tested this case).
2. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed
to yank page tables out from under us (though I haven't tested that),
so I think the BUG_ON() checks in mfill_atomic() are just wrong.
I decided to write two separate fixes for these, so that the first fix
can be backported to kernels affected by the first bug.
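For illustration, a minimal sketch of the approach the first fix takes
(simplified, with hypothetical error handling; not the literal patch): read
the PMD once with pmdp_get_lockless() and make every later decision on that
snapshot, so a concurrent THP collapse or split between two reads of
*dst_pmd can no longer make the checks disagree:

	/* Sketch only: snapshot the PMD and act on the snapshot. */
	pmd_t dst_pmdval = pmdp_get_lockless(dst_pmd);

	if (unlikely(pmd_none(dst_pmdval)) &&
	    unlikely(__pte_alloc(dst_mm, dst_pmd)))
		return -ENOMEM;

	/* Re-snapshot after the allocation instead of rereading *dst_pmd. */
	dst_pmdval = pmdp_get_lockless(dst_pmd);

	/*
	 * Fail gracefully instead of BUG_ON(): the PMD may have turned
	 * into a THP (or gone away) under us in the meantime.
	 */
	if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval)))
		return -EAGAIN;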
Signed-off-by: Jann Horn <jannh(a)google.com>
---
Jann Horn (2):
userfaultfd: Fix pmd_trans_huge() recheck race
userfaultfd: Don't BUG_ON() if khugepaged yanks our page table
mm/userfaultfd.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
---
base-commit: d4560686726f7a357922f300fc81f5964be8df04
change-id: 20240812-uffd-thp-flip-fix-20f91f1151b9
--
Jann Horn <jannh(a)google.com>
Sense data can be in either fixed format or descriptor format.
SAT-6 revision 1, 10.4.6 Control mode page, says that if the D_SENSE bit
is set to zero (i.e., fixed format sense data), then the SATL should
return fixed format sense data for ATA PASS-THROUGH commands.
A lot of user space programs incorrectly assume that the sense data is in
descriptor format, without checking the RESPONSE CODE field of the
returned sense data (to see which format the sense data is in).
The libata SATL has always kept D_SENSE set to zero by default.
(It is however possible to change the value using a MODE SELECT command.)
For failed ATA PASS-THROUGH commands, we correctly generated sense data
according to the D_SENSE bit. However, because of a bug, sense data for
successful ATA PASS-THROUGH commands was always generated in the
descriptor format.
This was fixed to consistently respect D_SENSE for both failed and
successful ATA PASS-THROUGH commands in commit 28ab9769117c ("ata:
libata-scsi: Honor the D_SENSE bit for CK_COND=1 and no error").
After commit 28ab9769117c ("ata: libata-scsi: Honor the D_SENSE bit for
CK_COND=1 and no error"), we started receiving bug reports that we broke
these user space programs. (These programs must never have encountered a
failing command, as the sense data for failed commands has always correctly
respected D_SENSE, which by default meant fixed format.)
Since a lot of user space programs seem to assume that the sense data is
in descriptor format (without checking the type), let's simply change the
default so that D_SENSE is set to one.
That way:
- Broken user space programs will see no regression.
- Both failed and successful ATA PASS-THROUGH commands will respect D_SENSE,
  as per SAT-6 revision 1.
- It is apparently far more common for user space applications to assume
  that the sense data is in descriptor format than in fixed format.
  (A user space program should of course support both, and check the
  RESPONSE CODE field to see which format the returned sense data is in;
  see the sketch below.)
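For reference, a minimal user space sketch of that check (illustrative, not
part of this patch); the RESPONSE CODE sits in bits 6:0 of the first sense
byte, with 70h/71h meaning fixed format and 72h/73h meaning descriptor
format:

	#include <stdbool.h>
	#include <stdint.h>

	/*
	 * Sketch: classify returned sense data instead of assuming one
	 * format. 0x70/0x71 = fixed (current/deferred errors),
	 * 0x72/0x73 = descriptor (current/deferred errors).
	 */
	static bool sense_is_descriptor_format(const uint8_t *sense)
	{
		uint8_t response_code = sense[0] & 0x7f;

		return response_code == 0x72 || response_code == 0x73;
	}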
Cc: stable(a)vger.kernel.org # 4.19+
Reported-by: Stephan Eisvogel <eisvogel(a)seitics.de>
Reported-by: Christian Heusel <christian(a)heusel.eu>
Closes: https://lore.kernel.org/linux-ide/0bf3f2f0-0fc6-4ba5-a420-c0874ef82d64@heus…
Fixes: 28ab9769117c ("ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and no error")
Signed-off-by: Niklas Cassel <cassel(a)kernel.org>
---
drivers/ata/libata-core.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index c7752dc80028..590bebe1354d 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -5368,6 +5368,13 @@ void ata_dev_init(struct ata_device *dev)
*/
spin_lock_irqsave(ap->lock, flags);
dev->flags &= ~ATA_DFLAG_INIT_MASK;
+
+ /*
+ * A lot of user space programs incorrectly assume that the sense data
+ * is in descriptor format, without checking the RESPONSE CODE field of
+ * the returned sense data (to see which format the sense data is in).
+ */
+ dev->flags |= ATA_DFLAG_D_SENSE;
dev->horkage = 0;
spin_unlock_irqrestore(ap->lock, flags);
--
2.46.0
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 0ecc5be200c84e67114f3640064ba2bae3ba2f5a
Gitweb: https://git.kernel.org/tip/0ecc5be200c84e67114f3640064ba2bae3ba2f5a
Author: Yuntao Wang <yuntao.wang(a)linux.dev>
AuthorDate: Tue, 13 Aug 2024 09:48:27 +08:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Tue, 13 Aug 2024 15:15:19 +02:00
x86/apic: Make x2apic_disable() work correctly
x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even
when the state is X2APIC_ON_LOCKED, which prevents the kernel from disabling
it, thereby creating inconsistent state.
Due to the early state check for X2APIC_ON, the code path which warns about
a locked X2APIC cannot be reached.
Test for state < X2APIC_ON instead and move the clearing of the state and
mode variables to the place which actually disables X2APIC.
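For context, the x2apic_state values are ordered so that a plain comparison
expresses "not enabled" (a sketch of the ordering in apic.c; the middle
comment is the point):

	enum {
		X2APIC_OFF,
		X2APIC_DISABLED,
		/* All states below here have X2APIC enabled */
		X2APIC_ON,
		X2APIC_ON_LOCKED,
	};

So "x2apic_state < X2APIC_ON" returns early only for OFF/DISABLED, while a
locked X2APIC (X2APIC_ON_LOCKED) now proceeds far enough to reach the
warning path.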
[ tglx: Massaged change log. Added Fixes tag. Moved clearing so it's at the
right place for back ports ]
Fixes: a57e456a7b28 ("x86/apic: Fix fallout from x2apic cleanup")
Signed-off-by: Yuntao Wang <yuntao.wang(a)linux.dev>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20240813014827.895381-1-yuntao.wang@linux.dev
---
arch/x86/kernel/apic/apic.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 66fd4b2..3736386 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1775,12 +1775,9 @@ static __init void apic_set_fixmap(bool read_apic);
static __init void x2apic_disable(void)
{
- u32 x2apic_id, state = x2apic_state;
+ u32 x2apic_id;
- x2apic_mode = 0;
- x2apic_state = X2APIC_DISABLED;
-
- if (state != X2APIC_ON)
+ if (x2apic_state < X2APIC_ON)
return;
x2apic_id = read_apic_id();
@@ -1793,6 +1790,10 @@ static __init void x2apic_disable(void)
}
__x2apic_disable();
+
+ x2apic_mode = 0;
+ x2apic_state = X2APIC_DISABLED;
+
/*
* Don't reread the APIC ID as it was already done from
* check_x2apic() and the APIC driver still is a x2APIC variant,
DW_HDMA_V0_LIE and DW_HDMA_V0_RIE are initialized as BIT(3) and BIT(4)
respectively in the dw_hdma_control enum. But per the HDMA registers, these
bits correspond to the LWIE and RWIE bits, i.e. local watermark interrupt
enable and remote watermark interrupt enable. In linked list mode, the LWIE
and RWIE bits only enable the local and remote watermark interrupts.
Since the watermark interrupts are enabled but not used, this leads to
spurious interrupts being generated. So remove the code that enables them
to avoid generating spurious watermark interrupts.
Also rename DW_HDMA_V0_LIE to DW_HDMA_V0_LWIE and DW_HDMA_V0_RIE to
DW_HDMA_V0_RWIE, as there are no LIE and RIE bits in HDMA; those bits
correspond to the LWIE and RWIE bits.
Fixes: e74c39573d35 ("dmaengine: dw-edma: Add support for native HDMA")
cc: stable(a)vger.kernel.org
Signed-off-by: Mrinmay Sarkar <quic_msarkar(a)quicinc.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Reviewed-by: Serge Semin <fancer.lancer(a)gmail.com>
---
drivers/dma/dw-edma/dw-hdma-v0-core.c | 17 +++--------------
1 file changed, 3 insertions(+), 14 deletions(-)
diff --git a/drivers/dma/dw-edma/dw-hdma-v0-core.c b/drivers/dma/dw-edma/dw-hdma-v0-core.c
index a0aabdd..c4c73c4 100644
--- a/drivers/dma/dw-edma/dw-hdma-v0-core.c
+++ b/drivers/dma/dw-edma/dw-hdma-v0-core.c
@@ -17,8 +17,8 @@ enum dw_hdma_control {
DW_HDMA_V0_CB = BIT(0),
DW_HDMA_V0_TCB = BIT(1),
DW_HDMA_V0_LLP = BIT(2),
- DW_HDMA_V0_LIE = BIT(3),
- DW_HDMA_V0_RIE = BIT(4),
+ DW_HDMA_V0_LWIE = BIT(3),
+ DW_HDMA_V0_RWIE = BIT(4),
DW_HDMA_V0_CCS = BIT(8),
DW_HDMA_V0_LLE = BIT(9),
};
@@ -195,25 +195,14 @@ static void dw_hdma_v0_write_ll_link(struct dw_edma_chunk *chunk,
static void dw_hdma_v0_core_write_chunk(struct dw_edma_chunk *chunk)
{
struct dw_edma_burst *child;
- struct dw_edma_chan *chan = chunk->chan;
u32 control = 0, i = 0;
- int j;
if (chunk->cb)
control = DW_HDMA_V0_CB;
- j = chunk->bursts_alloc;
- list_for_each_entry(child, &chunk->burst->list, list) {
- j--;
- if (!j) {
- control |= DW_HDMA_V0_LIE;
- if (!(chan->dw->chip->flags & DW_EDMA_CHIP_LOCAL))
- control |= DW_HDMA_V0_RIE;
- }
-
+ list_for_each_entry(child, &chunk->burst->list, list)
dw_hdma_v0_write_ll_data(chunk, i++, control, child->sz,
child->sar, child->dar);
- }
control = DW_HDMA_V0_LLP | DW_HDMA_V0_TCB;
if (!chunk->cb)
--
2.7.4
From: Mahesh Salgaonkar <mahesh(a)linux.ibm.com>
[ Upstream commit 0db880fc865ffb522141ced4bfa66c12ab1fbb70 ]
nmi_enter()/nmi_exit() touch per-cpu variables, which can lead to a kernel
crash when invoked during real mode interrupt handling (e.g. the early
HMI/MCE interrupt handlers) if the percpu allocation comes from the vmalloc
area.
Early HMI/MCE handlers are called through the DEFINE_INTERRUPT_HANDLER_NMI()
wrapper, which invokes nmi_enter()/nmi_exit(). We don't see any issue when
the percpu allocation is from the embedded first chunk. However, with
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled, there are chances that the
percpu allocation can come from the vmalloc area.
With the kernel command line "percpu_alloc=page" we can force the percpu
allocation to come from the vmalloc area and see a kernel crash in
machine_check_early:
[ 1.215714] NIP [c000000000e49eb4] rcu_nmi_enter+0x24/0x110
[ 1.215717] LR [c0000000000461a0] machine_check_early+0xf0/0x2c0
[ 1.215719] --- interrupt: 200
[ 1.215720] [c000000fffd73180] [0000000000000000] 0x0 (unreliable)
[ 1.215722] [c000000fffd731b0] [0000000000000000] 0x0
[ 1.215724] [c000000fffd73210] [c000000000008364] machine_check_early_common+0x134/0x1f8
Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
first chunk is not embedded.
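As a minimal sketch of the underlying rule (hypothetical helper name; the
actual guards in the diff combine this with the existing LPAR/radix checks):
percpu data in the vmalloc area is only safely reachable while data address
translation is on:

	/*
	 * Sketch only: with MSR[DR] clear (real mode), data addresses are
	 * not translated, so a page-mapped (vmalloc) percpu first chunk
	 * cannot be dereferenced -- and nmi_enter()/nmi_exit() touch
	 * percpu state.
	 */
	static inline bool percpu_safe_in_this_context(void)
	{
		return (mfmsr() & MSR_DR) || !percpu_first_chunk_is_paged;
	}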
CVE-2024-42126
Cc: stable(a)vger.kernel.org # 5.10.x
Cc: gregkh(a)linuxfoundation.org
Reviewed-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Tested-by: Shirisha Ganta <shirisha(a)linux.ibm.com>
Signed-off-by: Mahesh Salgaonkar <mahesh(a)linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://msgid.link/20240410043006.81577-1-mahesh@linux.ibm.com
[ Conflicts in arch/powerpc/include/asm/interrupt.h
because machine_check_early() and machine_check_exception()
have been refactored. ]
Signed-off-by: Jinjie Ruan <ruanjinjie(a)huawei.com>
---
v2:
- Also fix for CONFIG_PPC_BOOK3S_64 not enabled.
- Add Upstream.
- Cc stable(a)vger.kernel.org.
---
arch/powerpc/include/asm/percpu.h | 10 ++++++++++
arch/powerpc/kernel/mce.c | 14 +++++++++++---
arch/powerpc/kernel/setup_64.c | 2 ++
arch/powerpc/kernel/traps.c | 8 +++++++-
4 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/include/asm/percpu.h b/arch/powerpc/include/asm/percpu.h
index 8e5b7d0b851c..634970ce13c6 100644
--- a/arch/powerpc/include/asm/percpu.h
+++ b/arch/powerpc/include/asm/percpu.h
@@ -15,6 +15,16 @@
#endif /* CONFIG_SMP */
#endif /* __powerpc64__ */
+#if defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && defined(CONFIG_SMP)
+#include <linux/jump_label.h>
+DECLARE_STATIC_KEY_FALSE(__percpu_first_chunk_is_paged);
+
+#define percpu_first_chunk_is_paged \
+ (static_key_enabled(&__percpu_first_chunk_is_paged.key))
+#else
+#define percpu_first_chunk_is_paged false
+#endif /* CONFIG_PPC64 && CONFIG_SMP */
+
#include <asm-generic/percpu.h>
#include <asm/paca.h>
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 63702c0badb9..259343040e1b 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -594,8 +594,15 @@ long notrace machine_check_early(struct pt_regs *regs)
u8 ftrace_enabled = this_cpu_get_ftrace_enabled();
this_cpu_set_ftrace_enabled(0);
- /* Do not use nmi_enter/exit for pseries hpte guest */
- if (radix_enabled() || !firmware_has_feature(FW_FEATURE_LPAR))
+ /*
+ * Do not use nmi_enter/exit for pseries hpte guest
+ *
+ * Likewise, do not use it in real mode if percpu first chunk is not
+ * embedded. With CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there
+ * are chances where percpu allocation can come from vmalloc area.
+ */
+ if ((radix_enabled() || !firmware_has_feature(FW_FEATURE_LPAR)) &&
+ !percpu_first_chunk_is_paged)
nmi_enter();
hv_nmi_check_nonrecoverable(regs);
@@ -606,7 +613,8 @@ long notrace machine_check_early(struct pt_regs *regs)
if (ppc_md.machine_check_early)
handled = ppc_md.machine_check_early(regs);
- if (radix_enabled() || !firmware_has_feature(FW_FEATURE_LPAR))
+ if ((radix_enabled() || !firmware_has_feature(FW_FEATURE_LPAR)) &&
+ !percpu_first_chunk_is_paged)
nmi_exit();
this_cpu_set_ftrace_enabled(ftrace_enabled);
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 3f8426bccd16..899d87de0165 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -824,6 +824,7 @@ static int pcpu_cpu_distance(unsigned int from, unsigned int to)
unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
EXPORT_SYMBOL(__per_cpu_offset);
+DEFINE_STATIC_KEY_FALSE(__percpu_first_chunk_is_paged);
static void __init pcpu_populate_pte(unsigned long addr)
{
@@ -903,6 +904,7 @@ void __init setup_per_cpu_areas(void)
if (rc < 0)
panic("cannot initialize percpu area (err=%d)", rc);
+ static_key_enable(&__percpu_first_chunk_is_paged.key);
delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
for_each_possible_cpu(cpu) {
__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index b0e87dce2b9a..b4d108bef814 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -835,8 +835,14 @@ void machine_check_exception(struct pt_regs *regs)
* This is silly. The BOOK3S_64 should just call a different function
* rather than expecting semantics to magically change. Something
* like 'non_nmi_machine_check_exception()', perhaps?
+ *
+ * Do not use nmi_enter/exit in real mode if percpu first chunk is
+ * not embedded. With CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled
+ * there are chances where percpu allocation can come from
+ * vmalloc area.
*/
- const bool nmi = !IS_ENABLED(CONFIG_PPC_BOOK3S_64);
+ const bool nmi = !IS_ENABLED(CONFIG_PPC_BOOK3S_64) &&
+ !percpu_first_chunk_is_paged;
if (nmi) nmi_enter();
--
2.34.1
From: Mahesh Salgaonkar <mahesh(a)linux.ibm.com>
[ Upstream commit 0db880fc865ffb522141ced4bfa66c12ab1fbb70 ]
nmi_enter()/nmi_exit() touch per-cpu variables, which can lead to a kernel
crash when invoked during real mode interrupt handling (e.g. the early
HMI/MCE interrupt handlers) if the percpu allocation comes from the vmalloc
area.
Early HMI/MCE handlers are called through the DEFINE_INTERRUPT_HANDLER_NMI()
wrapper, which invokes nmi_enter()/nmi_exit(). We don't see any issue when
the percpu allocation is from the embedded first chunk. However, with
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled, there are chances that the
percpu allocation can come from the vmalloc area.
With the kernel command line "percpu_alloc=page" we can force the percpu
allocation to come from the vmalloc area and see a kernel crash in
machine_check_early:
[ 1.215714] NIP [c000000000e49eb4] rcu_nmi_enter+0x24/0x110
[ 1.215717] LR [c0000000000461a0] machine_check_early+0xf0/0x2c0
[ 1.215719] --- interrupt: 200
[ 1.215720] [c000000fffd73180] [0000000000000000] 0x0 (unreliable)
[ 1.215722] [c000000fffd731b0] [0000000000000000] 0x0
[ 1.215724] [c000000fffd73210] [c000000000008364] machine_check_early_common+0x134/0x1f8
Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
first chunk is not embedded.
CVE-2024-42126
Cc: stable(a)vger.kernel.org # 5.15.x
Cc: gregkh(a)linuxfoundation.org
Reviewed-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Tested-by: Shirisha Ganta <shirisha(a)linux.ibm.com>
Signed-off-by: Mahesh Salgaonkar <mahesh(a)linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://msgid.link/20240410043006.81577-1-mahesh@linux.ibm.com
[ Conflicts in arch/powerpc/include/asm/interrupt.h
because interrupt_nmi_enter_prepare() and interrupt_nmi_exit_prepare()
have been refactored. ]
Signed-off-by: Jinjie Ruan <ruanjinjie(a)huawei.com>
---
arch/powerpc/include/asm/interrupt.h | 14 ++++++++++----
arch/powerpc/include/asm/percpu.h | 10 ++++++++++
arch/powerpc/kernel/setup_64.c | 2 ++
3 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
index e592e65e7665..49285b147afe 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -285,18 +285,24 @@ static inline void interrupt_nmi_enter_prepare(struct pt_regs *regs, struct inte
/*
* Do not use nmi_enter() for pseries hash guest taking a real-mode
* NMI because not everything it touches is within the RMA limit.
+ *
+ * Likewise, do not use it in real mode if percpu first chunk is not
+ * embedded. With CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there
+ * are chances where percpu allocation can come from vmalloc area.
*/
- if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64) ||
+ if ((!IS_ENABLED(CONFIG_PPC_BOOK3S_64) ||
!firmware_has_feature(FW_FEATURE_LPAR) ||
- radix_enabled() || (mfmsr() & MSR_DR))
+ radix_enabled() || (mfmsr() & MSR_DR)) &&
+ !percpu_first_chunk_is_paged)
nmi_enter();
}
static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, struct interrupt_nmi_state *state)
{
- if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64) ||
+ if ((!IS_ENABLED(CONFIG_PPC_BOOK3S_64) ||
!firmware_has_feature(FW_FEATURE_LPAR) ||
- radix_enabled() || (mfmsr() & MSR_DR))
+ radix_enabled() || (mfmsr() & MSR_DR)) &&
+ !percpu_first_chunk_is_paged)
nmi_exit();
/*
diff --git a/arch/powerpc/include/asm/percpu.h b/arch/powerpc/include/asm/percpu.h
index 8e5b7d0b851c..634970ce13c6 100644
--- a/arch/powerpc/include/asm/percpu.h
+++ b/arch/powerpc/include/asm/percpu.h
@@ -15,6 +15,16 @@
#endif /* CONFIG_SMP */
#endif /* __powerpc64__ */
+#if defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && defined(CONFIG_SMP)
+#include <linux/jump_label.h>
+DECLARE_STATIC_KEY_FALSE(__percpu_first_chunk_is_paged);
+
+#define percpu_first_chunk_is_paged \
+ (static_key_enabled(&__percpu_first_chunk_is_paged.key))
+#else
+#define percpu_first_chunk_is_paged false
+#endif /* CONFIG_PPC64 && CONFIG_SMP */
+
#include <asm-generic/percpu.h>
#include <asm/paca.h>
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index eaa79a0996d1..37d5683ab298 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -825,6 +825,7 @@ static int pcpu_cpu_distance(unsigned int from, unsigned int to)
unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
EXPORT_SYMBOL(__per_cpu_offset);
+DEFINE_STATIC_KEY_FALSE(__percpu_first_chunk_is_paged);
static void __init pcpu_populate_pte(unsigned long addr)
{
@@ -904,6 +905,7 @@ void __init setup_per_cpu_areas(void)
if (rc < 0)
panic("cannot initialize percpu area (err=%d)", rc);
+ static_key_enable(&__percpu_first_chunk_is_paged.key);
delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
for_each_possible_cpu(cpu) {
__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
--
2.34.1