Since 5.16 and prior to 6.13 KVM can't be used with FSDAX
guest memory (PMD pages). To reproduce the issue you need to reserve
guest memory with `memmap=` cmdline, create and mount FS in DAX mode
(tested both XFS and ext4), see doc link below. ndctl command for test:
ndctl create-namespace -v -e namespace1.0 --map=dev --mode=fsdax -a 2M
Then pass memory object to qemu like:
-m 8G -object memory-backend-file,id=ram0,size=8G,\
mem-path=/mnt/pmem/guestmem,share=on,prealloc=on,dump=off,align=2097152 \
-numa node,memdev=ram0,cpus=0-1
QEMU fails to run guest with error: kvm run failed Bad address
and there are two warnings in dmesg:
WARN_ON_ONCE(!page_count(page)) in kvm_is_zone_device_page() and
WARN_ON_ONCE(folio_ref_count(folio) <= 0) in try_grab_folio() (v6.6.63)
It looks like in the past assumption was made that pfn won't change from
faultin_pfn() to release_pfn_clean(), e.g. see
commit 4cd071d13c5c ("KVM: x86/mmu: Move calls to thp_adjust() down a level")
But kvm_page_fault structure made pfn part of mutable state, so
now release_pfn_clean() can take hugepage-adjusted pfn.
And it works for all cases (/dev/shm, hugetlb, devdax) except fsdax.
Apparently in fsdax mode faultin-pfn and adjusted-pfn may refer to
different folios, so we're getting get_page/put_page imbalance.
To solve this preserve faultin pfn in separate kvm_page_fault
field and pass it in kvm_release_pfn_clean(). Patch tested for all
mentioned guest memory backends with tdp_mmu={0,1}.
No bug in upstream as it was solved fundamentally by
commit 8dd861cc07e2 ("KVM: x86/mmu: Put refcounted pages instead of blindly releasing pfns")
and related patch series.
Link: https://nvdimm.docs.kernel.org/2mib_fs_dax.html
Fixes: 2f6305dd5676 ("KVM: MMU: change kvm_tdp_mmu_map() arguments to kvm_page_fault")
Signed-off-by: Nikolay Kuratov <kniv(a)yandex-team.ru>
---
arch/x86/kvm/mmu/mmu.c | 5 +++--
arch/x86/kvm/mmu/mmu_internal.h | 2 ++
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
3 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 294775b7383b..2105f3bc2e59 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4321,6 +4321,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
smp_rmb();
ret = __kvm_faultin_pfn(vcpu, fault);
+ fault->faultin_pfn = fault->pfn;
if (ret != RET_PF_CONTINUE)
return ret;
@@ -4398,7 +4399,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ kvm_release_pfn_clean(fault->faultin_pfn);
return r;
}
@@ -4474,7 +4475,7 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
out_unlock:
read_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ kvm_release_pfn_clean(fault->faultin_pfn);
return r;
}
#endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index decc1f153669..a016b51f9c62 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -236,6 +236,8 @@ struct kvm_page_fault {
/* Outputs of kvm_faultin_pfn. */
unsigned long mmu_seq;
kvm_pfn_t pfn;
+ /* pfn copy for kvm_release_pfn_clean(), constant after kvm_faultin_pfn() */
+ kvm_pfn_t faultin_pfn;
hva_t hva;
bool map_writable;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c85255073f67..b945dde6e3be 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -848,7 +848,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ kvm_release_pfn_clean(fault->faultin_pfn);
return r;
}
--
2.34.1
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x d677ce521334d8f1f327cafc8b1b7854b0833158
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024120641-patriarch-wiring-1eed@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d677ce521334d8f1f327cafc8b1b7854b0833158 Mon Sep 17 00:00:00 2001
From: Nathan Chancellor <nathan(a)kernel.org>
Date: Wed, 30 Oct 2024 11:41:37 -0700
Subject: [PATCH] powerpc/vdso: Drop -mstack-protector-guard flags in 32-bit
files with clang
Under certain conditions, the 64-bit '-mstack-protector-guard' flags may
end up in the 32-bit vDSO flags, resulting in build failures due to the
structure of clang's argument parsing of the stack protector options,
which validates the arguments of the stack protector guard flags
unconditionally in the frontend, choking on the 64-bit values when
targeting 32-bit:
clang: error: invalid value 'r13' in 'mstack-protector-guard-reg=', expected one of: r2
clang: error: invalid value 'r13' in 'mstack-protector-guard-reg=', expected one of: r2
make[3]: *** [arch/powerpc/kernel/vdso/Makefile:85: arch/powerpc/kernel/vdso/vgettimeofday-32.o] Error 1
make[3]: *** [arch/powerpc/kernel/vdso/Makefile:87: arch/powerpc/kernel/vdso/vgetrandom-32.o] Error 1
Remove these flags by adding them to the CC32FLAGSREMOVE variable, which
already handles situations similar to this. Additionally, reformat and
align a comment better for the expanding CONFIG_CC_IS_CLANG block.
Cc: stable(a)vger.kernel.org # v6.1+
Signed-off-by: Nathan Chancellor <nathan(a)kernel.org>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://patch.msgid.link/20241030-powerpc-vdso-drop-stackp-flags-clang-v1-1…
diff --git a/arch/powerpc/kernel/vdso/Makefile b/arch/powerpc/kernel/vdso/Makefile
index 31ca5a547004..c568cad6a22e 100644
--- a/arch/powerpc/kernel/vdso/Makefile
+++ b/arch/powerpc/kernel/vdso/Makefile
@@ -54,10 +54,14 @@ ldflags-y += $(filter-out $(CC_AUTO_VAR_INIT_ZERO_ENABLER) $(CC_FLAGS_FTRACE) -W
CC32FLAGS := -m32
CC32FLAGSREMOVE := -mcmodel=medium -mabi=elfv1 -mabi=elfv2 -mcall-aixdesc
- # This flag is supported by clang for 64-bit but not 32-bit so it will cause
- # an unused command line flag warning for this file.
ifdef CONFIG_CC_IS_CLANG
+# This flag is supported by clang for 64-bit but not 32-bit so it will cause
+# an unused command line flag warning for this file.
CC32FLAGSREMOVE += -fno-stack-clash-protection
+# -mstack-protector-guard values from the 64-bit build are not valid for the
+# 32-bit one. clang validates the values passed to these arguments during
+# parsing, even when -fno-stack-protector is passed afterwards.
+CC32FLAGSREMOVE += -mstack-protector-guard%
endif
LD32FLAGS := -Wl,-soname=linux-vdso32.so.1
AS32FLAGS := -D__VDSO32__
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x d677ce521334d8f1f327cafc8b1b7854b0833158
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024120642-steep-reply-a3bd@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d677ce521334d8f1f327cafc8b1b7854b0833158 Mon Sep 17 00:00:00 2001
From: Nathan Chancellor <nathan(a)kernel.org>
Date: Wed, 30 Oct 2024 11:41:37 -0700
Subject: [PATCH] powerpc/vdso: Drop -mstack-protector-guard flags in 32-bit
files with clang
Under certain conditions, the 64-bit '-mstack-protector-guard' flags may
end up in the 32-bit vDSO flags, resulting in build failures due to the
structure of clang's argument parsing of the stack protector options,
which validates the arguments of the stack protector guard flags
unconditionally in the frontend, choking on the 64-bit values when
targeting 32-bit:
clang: error: invalid value 'r13' in 'mstack-protector-guard-reg=', expected one of: r2
clang: error: invalid value 'r13' in 'mstack-protector-guard-reg=', expected one of: r2
make[3]: *** [arch/powerpc/kernel/vdso/Makefile:85: arch/powerpc/kernel/vdso/vgettimeofday-32.o] Error 1
make[3]: *** [arch/powerpc/kernel/vdso/Makefile:87: arch/powerpc/kernel/vdso/vgetrandom-32.o] Error 1
Remove these flags by adding them to the CC32FLAGSREMOVE variable, which
already handles situations similar to this. Additionally, reformat and
align a comment better for the expanding CONFIG_CC_IS_CLANG block.
Cc: stable(a)vger.kernel.org # v6.1+
Signed-off-by: Nathan Chancellor <nathan(a)kernel.org>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://patch.msgid.link/20241030-powerpc-vdso-drop-stackp-flags-clang-v1-1…
diff --git a/arch/powerpc/kernel/vdso/Makefile b/arch/powerpc/kernel/vdso/Makefile
index 31ca5a547004..c568cad6a22e 100644
--- a/arch/powerpc/kernel/vdso/Makefile
+++ b/arch/powerpc/kernel/vdso/Makefile
@@ -54,10 +54,14 @@ ldflags-y += $(filter-out $(CC_AUTO_VAR_INIT_ZERO_ENABLER) $(CC_FLAGS_FTRACE) -W
CC32FLAGS := -m32
CC32FLAGSREMOVE := -mcmodel=medium -mabi=elfv1 -mabi=elfv2 -mcall-aixdesc
- # This flag is supported by clang for 64-bit but not 32-bit so it will cause
- # an unused command line flag warning for this file.
ifdef CONFIG_CC_IS_CLANG
+# This flag is supported by clang for 64-bit but not 32-bit so it will cause
+# an unused command line flag warning for this file.
CC32FLAGSREMOVE += -fno-stack-clash-protection
+# -mstack-protector-guard values from the 64-bit build are not valid for the
+# 32-bit one. clang validates the values passed to these arguments during
+# parsing, even when -fno-stack-protector is passed afterwards.
+CC32FLAGSREMOVE += -mstack-protector-guard%
endif
LD32FLAGS := -Wl,-soname=linux-vdso32.so.1
AS32FLAGS := -D__VDSO32__
Hi all,
Here are even more bugfixes for 6.13 that have been accumulating since
6.12 was released. Now with a few extra comments that were requested
by reviewers.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=pr…
---
Commits in this patchset:
* xfs: don't move nondir/nonreg temporary repair files to the metadir namespace
* xfs: don't crash on corrupt /quotas dirent
* xfs: check pre-metadir fields correctly
* xfs: fix zero byte checking in the superblock scrubber
* xfs: return from xfs_symlink_verify early on V4 filesystems
* xfs: port xfs_ioc_start_commit to multigrain timestamps
---
fs/xfs/libxfs/xfs_symlink_remote.c | 4 ++
fs/xfs/scrub/agheader.c | 71 ++++++++++++++++++++++++++++--------
fs/xfs/scrub/tempfile.c | 3 ++
fs/xfs/xfs_exchrange.c | 14 ++++---
fs/xfs/xfs_qm.c | 7 ++++
5 files changed, 76 insertions(+), 23 deletions(-)
From: Luca Stefani <luca.stefani.ge1(a)gmail.com>
commit 69313850dce33ce8c24b38576a279421f4c60996 upstream.
There are reports that system cannot suspend due to running trim because
the task responsible for trimming the device isn't able to finish in
time, especially since we have a free extent discarding phase, which can
trim a lot of unallocated space. There are no limits on the trim size
(unlike the block group part).
Since trime isn't a critical call it can be interrupted at any time,
in such cases we stop the trim, report the amount of discarded bytes and
return an error.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
CC: stable(a)vger.kernel.org # 5.15+
Signed-off-by: Luca Stefani <luca.stefani.ge1(a)gmail.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
---
fs/btrfs/extent-tree.c | 7 ++++++-
fs/btrfs/free-space-cache.c | 4 ++--
fs/btrfs/free-space-cache.h | 7 +++++++
3 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b3680e1c7054..599407120513 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1319,6 +1319,11 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
start += bytes_to_discard;
bytes_left -= bytes_to_discard;
*discarded_bytes += bytes_to_discard;
+
+ if (btrfs_trim_interrupted()) {
+ ret = -ERESTARTSYS;
+ break;
+ }
}
return ret;
@@ -6094,7 +6099,7 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
start += len;
*trimmed += bytes;
- if (fatal_signal_pending(current)) {
+ if (btrfs_trim_interrupted()) {
ret = -ERESTARTSYS;
break;
}
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 3bcf4a30cad7..9a6ec9344c3e 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3808,7 +3808,7 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
if (async && *total_trimmed)
break;
- if (fatal_signal_pending(current)) {
+ if (btrfs_trim_interrupted()) {
ret = -ERESTARTSYS;
break;
}
@@ -3999,7 +3999,7 @@ static int trim_bitmaps(struct btrfs_block_group *block_group,
}
block_group->discard_cursor = start;
- if (fatal_signal_pending(current)) {
+ if (btrfs_trim_interrupted()) {
if (start != offset)
reset_trimming_bitmap(ctl, offset);
ret = -ERESTARTSYS;
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index 33b4da3271b1..bd80c7b2af96 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -6,6 +6,8 @@
#ifndef BTRFS_FREE_SPACE_CACHE_H
#define BTRFS_FREE_SPACE_CACHE_H
+#include <linux/freezer.h>
+
/*
* This is the trim state of an extent or bitmap.
*
@@ -43,6 +45,11 @@ static inline bool btrfs_free_space_trimming_bitmap(
return (info->trim_state == BTRFS_TRIM_STATE_TRIMMING);
}
+static inline bool btrfs_trim_interrupted(void)
+{
+ return fatal_signal_pending(current) || freezing(current);
+}
+
/*
* Deltas are an effective way to populate global statistics. Give macro names
* to make it clear what we're doing. An example is discard_extents in
--
2.45.0
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x ef5f5e7b6f73f79538892a8be3a3bee2342acc9f
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024120613-exploring-lifter-215f@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ef5f5e7b6f73f79538892a8be3a3bee2342acc9f Mon Sep 17 00:00:00 2001
From: Jean-Baptiste Maneyrol <jean-baptiste.maneyrol(a)tdk.com>
Date: Mon, 21 Oct 2024 10:38:42 +0200
Subject: [PATCH] iio: invensense: fix multiple odr switch when FIFO is off
When multiple ODR switch happens during FIFO off, the change could
not be taken into account if you get back to previous FIFO on value.
For example, if you run sensor buffer at 50Hz, stop, change to
200Hz, then back to 50Hz and restart buffer, data will be timestamped
at 200Hz. This due to testing against mult and not new_mult.
To prevent this, let's just run apply_odr automatically when FIFO is
off. It will also simplify driver code.
Update inv_mpu6050 and inv_icm42600 to delete now useless apply_odr.
Fixes: 95444b9eeb8c ("iio: invensense: fix odr switching to same value")
Cc: stable(a)vger.kernel.org
Signed-off-by: Jean-Baptiste Maneyrol <jean-baptiste.maneyrol(a)tdk.com>
Link: https://patch.msgid.link/20241021-invn-inv-sensors-timestamp-fix-switch-fif…
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
diff --git a/drivers/iio/common/inv_sensors/inv_sensors_timestamp.c b/drivers/iio/common/inv_sensors/inv_sensors_timestamp.c
index f44458c380d9..37d0bdaa8d82 100644
--- a/drivers/iio/common/inv_sensors/inv_sensors_timestamp.c
+++ b/drivers/iio/common/inv_sensors/inv_sensors_timestamp.c
@@ -70,6 +70,10 @@ int inv_sensors_timestamp_update_odr(struct inv_sensors_timestamp *ts,
if (mult != ts->mult)
ts->new_mult = mult;
+ /* When FIFO is off, directly apply the new ODR */
+ if (!fifo)
+ inv_sensors_timestamp_apply_odr(ts, 0, 0, 0);
+
return 0;
}
EXPORT_SYMBOL_NS_GPL(inv_sensors_timestamp_update_odr, IIO_INV_SENSORS_TIMESTAMP);
diff --git a/drivers/iio/imu/inv_icm42600/inv_icm42600_accel.c b/drivers/iio/imu/inv_icm42600/inv_icm42600_accel.c
index 56ac19814250..7968aa27f9fd 100644
--- a/drivers/iio/imu/inv_icm42600/inv_icm42600_accel.c
+++ b/drivers/iio/imu/inv_icm42600/inv_icm42600_accel.c
@@ -200,7 +200,6 @@ static int inv_icm42600_accel_update_scan_mode(struct iio_dev *indio_dev,
{
struct inv_icm42600_state *st = iio_device_get_drvdata(indio_dev);
struct inv_icm42600_sensor_state *accel_st = iio_priv(indio_dev);
- struct inv_sensors_timestamp *ts = &accel_st->ts;
struct inv_icm42600_sensor_conf conf = INV_ICM42600_SENSOR_CONF_INIT;
unsigned int fifo_en = 0;
unsigned int sleep_temp = 0;
@@ -229,7 +228,6 @@ static int inv_icm42600_accel_update_scan_mode(struct iio_dev *indio_dev,
}
/* update data FIFO write */
- inv_sensors_timestamp_apply_odr(ts, 0, 0, 0);
ret = inv_icm42600_buffer_set_fifo_en(st, fifo_en | st->fifo.en);
out_unlock:
diff --git a/drivers/iio/imu/inv_icm42600/inv_icm42600_gyro.c b/drivers/iio/imu/inv_icm42600/inv_icm42600_gyro.c
index 938af5b640b0..c6bb68bf5e14 100644
--- a/drivers/iio/imu/inv_icm42600/inv_icm42600_gyro.c
+++ b/drivers/iio/imu/inv_icm42600/inv_icm42600_gyro.c
@@ -99,8 +99,6 @@ static int inv_icm42600_gyro_update_scan_mode(struct iio_dev *indio_dev,
const unsigned long *scan_mask)
{
struct inv_icm42600_state *st = iio_device_get_drvdata(indio_dev);
- struct inv_icm42600_sensor_state *gyro_st = iio_priv(indio_dev);
- struct inv_sensors_timestamp *ts = &gyro_st->ts;
struct inv_icm42600_sensor_conf conf = INV_ICM42600_SENSOR_CONF_INIT;
unsigned int fifo_en = 0;
unsigned int sleep_gyro = 0;
@@ -128,7 +126,6 @@ static int inv_icm42600_gyro_update_scan_mode(struct iio_dev *indio_dev,
}
/* update data FIFO write */
- inv_sensors_timestamp_apply_odr(ts, 0, 0, 0);
ret = inv_icm42600_buffer_set_fifo_en(st, fifo_en | st->fifo.en);
out_unlock:
diff --git a/drivers/iio/imu/inv_mpu6050/inv_mpu_trigger.c b/drivers/iio/imu/inv_mpu6050/inv_mpu_trigger.c
index 3bfeabab0ec4..5b1088cc3704 100644
--- a/drivers/iio/imu/inv_mpu6050/inv_mpu_trigger.c
+++ b/drivers/iio/imu/inv_mpu6050/inv_mpu_trigger.c
@@ -112,7 +112,6 @@ int inv_mpu6050_prepare_fifo(struct inv_mpu6050_state *st, bool enable)
if (enable) {
/* reset timestamping */
inv_sensors_timestamp_reset(&st->timestamp);
- inv_sensors_timestamp_apply_odr(&st->timestamp, 0, 0, 0);
/* reset FIFO */
d = st->chip_config.user_ctrl | INV_MPU6050_BIT_FIFO_RST;
ret = regmap_write(st->map, st->reg->user_ctrl, d);
Currently, the pointer stored in call->prog_array is loaded in
__uprobe_perf_func(), with no RCU annotation and no RCU protection, so the
loaded pointer can immediately be dangling. Later,
bpf_prog_run_array_uprobe() starts a RCU-trace read-side critical section,
but this is too late. It then uses rcu_dereference_check(), but this use of
rcu_dereference_check() does not actually dereference anything.
It looks like the intention was to pass a pointer to the member
call->prog_array into bpf_prog_run_array_uprobe() and actually dereference
the pointer in there. Fix the issue by actually doing that.
Fixes: 8c7dcb84e3b7 ("bpf: implement sleepable uprobes by chaining gps")
Cc: stable(a)vger.kernel.org
Signed-off-by: Jann Horn <jannh(a)google.com>
---
To reproduce, patch include/linux/bpf.h like this:
```
@@ -30,6 +30,7 @@
#include <linux/static_call.h>
#include <linux/memcontrol.h>
#include <linux/cfi.h>
+#include <linux/delay.h>
struct bpf_verifier_env;
struct bpf_verifier_log;
@@ -2203,6 +2204,7 @@ bpf_prog_run_array_uprobe(const struct bpf_prog_array __rcu *array_rcu,
struct bpf_trace_run_ctx run_ctx;
u32 ret = 1;
+ mdelay(10000);
might_fault();
rcu_read_lock_trace();
```
Build this userspace program:
```
$ cat dummy.c
#include <stdio.h>
int main(void) {
printf("hello world\n");
}
$ gcc -o dummy dummy.c
```
Then build this BPF program and load it (change the path to point to
the "dummy" binary you built):
```
$ cat bpf-uprobe-kern.c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
char _license[] SEC("license") = "GPL";
SEC("uprobe//home/user/bpf-uprobe-uaf/dummy:main")
int BPF_UPROBE(main_uprobe) {
bpf_printk("main uprobe triggered\n");
return 0;
}
$ clang -O2 -g -target bpf -c -o bpf-uprobe-kern.o bpf-uprobe-kern.c
$ sudo bpftool prog loadall bpf-uprobe-kern.o uprobe-test autoattach
```
Then run ./dummy in one terminal, and after launching it, run
`sudo umount uprobe-test` in another terminal. Once the 10-second
mdelay() is over, a use-after-free should occur, which may or may
not crash your kernel at the `prog->sleepable` check in
bpf_prog_run_array_uprobe() depending on your luck.
---
include/linux/bpf.h | 4 ++--
kernel/trace/trace_uprobe.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index eaee2a819f4c150a34a7b1075584711609682e4c..00b3c5b197df75a0386233b9885b480b2ce72f5f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2193,7 +2193,7 @@ bpf_prog_run_array(const struct bpf_prog_array *array,
* rcu-protected dynamically sized maps.
*/
static __always_inline u32
-bpf_prog_run_array_uprobe(const struct bpf_prog_array __rcu *array_rcu,
+bpf_prog_run_array_uprobe(struct bpf_prog_array __rcu **array_rcu,
const void *ctx, bpf_prog_run_fn run_prog)
{
const struct bpf_prog_array_item *item;
@@ -2210,7 +2210,7 @@ bpf_prog_run_array_uprobe(const struct bpf_prog_array __rcu *array_rcu,
run_ctx.is_uprobe = true;
- array = rcu_dereference_check(array_rcu, rcu_read_lock_trace_held());
+ array = rcu_dereference_check(*array_rcu, rcu_read_lock_trace_held());
if (unlikely(!array))
goto out;
old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index fed382b7881b82ee3c334ea77860cce77581a74d..c4eef1eb5ddb3c65205aa9d64af1c72d62fab87f 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1404,7 +1404,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
if (bpf_prog_array_valid(call)) {
u32 ret;
- ret = bpf_prog_run_array_uprobe(call->prog_array, regs, bpf_prog_run);
+ ret = bpf_prog_run_array_uprobe(&call->prog_array, regs, bpf_prog_run);
if (!ret)
return;
}
---
base-commit: 509df676c2d79c985ec2eaa3e3a3bbe557645861
change-id: 20241206-bpf-fix-uprobe-uaf-53d928bab3d0
--
Jann Horn <jannh(a)google.com>
Changes in v2:
- Fixup for uninitilized md5sig_count stack variable
(Oops! Kudos to kernel test robot <lkp(a)intel.com>)
- Correct space damage, add a missing Fixes tag &
reformat tcp_ulp_ops_size() (Kuniyuki Iwashima)
- Take out patch for maximum attribute length, see (4) below.
Going to send it later with the next TCP-AO-diag part
(Kuniyuki Iwashima)
- Link to v1: https://lore.kernel.org/r/20241106-tcp-md5-diag-prep-v1-0-d62debf3dded@gmai…
My original intent was to replace the last non-upstream Arista's TCP-AO
piece. That is per-netns procfs seqfile which lists AO keys. In my view
an acceptable upstream alternative would be TCP-AO-diag uAPI.
So, I started by looking and reviewing TCP-MD5-diag code. And straight
away I saw a bunch of issues:
1. Similarly to TCP_MD5SIG_EXT, which doesn't check tcpm_flags for
unknown flags and so being non-extendable setsockopt(), the same
way tcp_diag_put_md5sig() dumps md5 keys in an array of
tcp_diag_md5sig, which makes it ABI non-extendable structure
as userspace can't tolerate any new members in it.
2. Inet-diag allocates netlink message for sockets in
inet_diag_dump_one_icsk(), which uses a TCP-diag callback
.idiag_get_aux_size(), that pre-calculates the needed space for
TCP-diag related information. But as neither socket lock nor
rcu_readlock() are held between allocation and the actual TCP
info filling, the TCP-related space requirement may change before
reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on
a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will
return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().
3. Inet-diag "do" request* can create skb of any message required size.
But "dump" request* the skb size, since d35c99ff77ec ("netlink: do
not enter direct reclaim from netlink_dump()") is limited by
32 KB. Having in mind that sizeof(struct tcp_diag_md5sig) = 100 bytes,
dumps for sockets that have more than 327 keys are going to fail
(not counting other diag infos, which lower this limit futher).
That is much lower than the number of TCP-MD5 keys that can be
allocated on a socket with the current default
optmem_max limit (128Kb).
So, then I went and written selftests for TCP-MD5-diag and besides
confirming that (2) and (3) are not theoretical issues, I also
discovered another issues, that I didn't notice on code inspection:
4. nlattr::nla_len is __u16, which limits the largest netlink attibute
by 64Kb or by 655 tcp_diag_md5sig keys in the diag array. What
happens de-facto is that the netlink attribute gets u16 overflow,
breaking the userspace parsing - RTA_NEXT(), that should point
to the next attribute, points into the middle of md5 keys array.
In this patch set issues (2) and (4) are addressed.
(2) by not returning EMSGSIZE when the dump raced with modifying
TCP-MD5 keys on a socket, but mark the dump inconsistent by setting
NLM_F_DUMP_INTR nlmsg flag. Which changes uAPI in situations where
previously kernel did WARN() and errored the dump.
(4) by artificially limiting the maximum attribute size by U16_MAX - 1.
In order to remove the new limit from (4) solution, my plan is to
convert the dump of TCP-MD5 keys from an array to
NL_ATTR_TYPE_NESTED_ARRAY (or alike), which should also address (1).
And for (3), it's needed to teach tcp-diag how-to remember not only
socket on which previous recvmsg() stopped, but potentially TCP-MD5
key as well.
I plan in the next part of patch set address (3), (1) and the new limit
for (4), together with adding new TCP-AO-diag.
* Terminology from Documentation/userspace-api/netlink/intro.rst
Signed-off-by: Dmitry Safonov <0x7f454c46(a)gmail.com>
---
Dmitry Safonov (5):
net/diag: Do not race on dumping MD5 keys with adding new MD5 keys
net/diag: Warn only once on EMSGSIZE
net/diag: Pre-allocate optional info only if requested
net/diag: Always pre-allocate tcp_ulp info
net/netlink: Correct the comment on netlink message max cap
include/linux/inet_diag.h | 3 +-
include/net/tcp.h | 1 -
net/ipv4/inet_diag.c | 89 ++++++++++++++++++++++++++++++++++++++---------
net/ipv4/tcp_diag.c | 68 ++++++++++++++++++------------------
net/mptcp/diag.c | 20 -----------
net/netlink/af_netlink.c | 4 +--
net/tls/tls_main.c | 17 ---------
7 files changed, 110 insertions(+), 92 deletions(-)
---
base-commit: f1b785f4c7870c42330b35522c2514e39a1e28e7
change-id: 20241106-tcp-md5-diag-prep-2f0dcf371d90
Best regards,
--
Dmitry Safonov <0x7f454c46(a)gmail.com>