The supported codec bitmask is populated from the payload sent by the
venus firmware. There is a possible case where all the bits in the codec
bitmask are set. In that case, the core caps for the decoder are filled
and all MAX_CODEC_NUM entries are used up. Filling the caps for the
encoder then accesses the caps array beyond index 32, leading to an OOB
write.
Fix this by counting the supported encoders and decoders. If the combined
count exceeds the maximum, skip accessing the caps array.
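For illustration, the kernel's Hamming-weight helpers could express the
same guard without a local bit-counting routine. A minimal sketch,
assuming dec_codecs and enc_codecs are the unsigned long bitmasks used
with for_each_set_bit() below (not the code posted in this patch):

	/* bail out if the firmware reports more codecs than the caps array holds */
	if (hweight_long(core->dec_codecs) + hweight_long(core->enc_codecs) >
	    MAX_CODEC_NUM)
		return;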
Cc: stable(a)vger.kernel.org
Fixes: 1a73374a04e5 ("media: venus: hfi_parser: add common capability parser")
Signed-off-by: Vikash Garodia <quic_vgarodia(a)quicinc.com>
---
drivers/media/platform/qcom/venus/hfi_parser.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/drivers/media/platform/qcom/venus/hfi_parser.c b/drivers/media/platform/qcom/venus/hfi_parser.c
index ec73cac..651e215 100644
--- a/drivers/media/platform/qcom/venus/hfi_parser.c
+++ b/drivers/media/platform/qcom/venus/hfi_parser.c
@@ -14,11 +14,26 @@
typedef void (*func)(struct hfi_plat_caps *cap, const void *data,
unsigned int size);
+static int count_setbits(u32 input)
+{
+ u32 count = 0;
+
+ while (input > 0) {
+ if ((input & 1) == 1)
+ count++;
+ input >>= 1;
+ }
+ return count;
+}
+
static void init_codecs(struct venus_core *core)
{
struct hfi_plat_caps *caps = core->caps, *cap;
unsigned long bit;
+ if ((count_setbits(core->dec_codecs) + count_setbits(core->enc_codecs)) > MAX_CODEC_NUM)
+ return;
+
for_each_set_bit(bit, &core->dec_codecs, MAX_CODEC_NUM) {
cap = &caps[core->codecs_count++];
cap->codec = BIT(bit);
--
2.7.4
From: Arnd Bergmann <arnd(a)arndb.de>
bpf_probe_read_kernel() has a __weak definition in core.c and another
definition with an incompatible prototype in kernel/trace/bpf_trace.c,
when CONFIG_BPF_EVENTS is enabled.
Since the two are incompatible, there cannot be a shared declaration in
a header file, but the lack of a prototype causes a W=1 warning:
kernel/bpf/core.c:1638:12: error: no previous prototype for 'bpf_probe_read_kernel' [-Werror=missing-prototypes]
On 32-bit architectures, the local prototype
u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
passes its arguments in different registers than the one in bpf_trace.c
BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size,
const void *, unsafe_ptr)
which uses 64-bit arguments in pairs of registers.
As both versions of the function are fairly simple and only really
differ in one line, just move them into a header file as an inline
function that does not add any overhead for the bpf_trace.c callers
and actually avoids a function call for the other one.
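To make the prototype mismatch concrete: BPF_CALL_3() generates a wrapper
whose parameters are all u64, so the two definitions roughly compare as
follows (a hedged sketch of the macro expansion, not code from this
patch):

	/* what BPF_CALL_3(bpf_probe_read_kernel, ...) roughly declares: */
	u64 bpf_probe_read_kernel(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);

	/* the __weak fallback previously defined in kernel/bpf/core.c: */
	u64 bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr);

On a 32-bit target the u64 parameters occupy register pairs, so a call
compiled against one prototype but resolved to the other definition ends
up with "size" and "unsafe_ptr" in the wrong registers.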
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/ac25cb0f-b804-1649-3afb-1dc6138c2716@iogearbox.…
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
--
v4: rewrite again to use a shared inline helper
v3: clarify changelog text further.
v2: rewrite completely to fix the mismatch.
---
include/linux/bpf.h | 12 ++++++++++++
kernel/bpf/core.c | 10 ++--------
kernel/trace/bpf_trace.c | 11 -----------
3 files changed, 14 insertions(+), 19 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ceaa8c23287fc..abe75063630b8 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2661,6 +2661,18 @@ static inline void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr)
}
#endif /* CONFIG_BPF_SYSCALL */
+static __always_inline int
+bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr)
+{
+ int ret = -EFAULT;
+
+ if (IS_ENABLED(CONFIG_BPF_EVENTS))
+ ret = copy_from_kernel_nofault(dst, unsafe_ptr, size);
+ if (unlikely(ret < 0))
+ memset(dst, 0, size);
+ return ret;
+}
+
void __bpf_free_used_btfs(struct bpf_prog_aux *aux,
struct btf_mod_pair *used_btfs, u32 len);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index dd70c58c9d3a3..9cdf53bfb8bd3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1634,12 +1634,6 @@ bool bpf_opcode_in_insntable(u8 code)
}
#ifndef CONFIG_BPF_JIT_ALWAYS_ON
-u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
-{
- memset(dst, 0, size);
- return -EFAULT;
-}
-
/**
* ___bpf_prog_run - run eBPF program on a given context
* @regs: is the array of MAX_BPF_EXT_REG eBPF pseudo-registers
@@ -1930,8 +1924,8 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn)
DST = *(SIZE *)(unsigned long) (SRC + insn->off); \
CONT; \
LDX_PROBE_MEM_##SIZEOP: \
- bpf_probe_read_kernel(&DST, sizeof(SIZE), \
- (const void *)(long) (SRC + insn->off)); \
+ bpf_probe_read_kernel_common(&DST, sizeof(SIZE), \
+ (const void *)(long) (SRC + insn->off)); \
DST = *((SIZE *)&DST); \
CONT;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index c92eb8c6ff08d..83bde2475ae54 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -223,17 +223,6 @@ const struct bpf_func_proto bpf_probe_read_user_str_proto = {
.arg3_type = ARG_ANYTHING,
};
-static __always_inline int
-bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr)
-{
- int ret;
-
- ret = copy_from_kernel_nofault(dst, unsafe_ptr, size);
- if (unlikely(ret < 0))
- memset(dst, 0, size);
- return ret;
-}
-
BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size,
const void *, unsafe_ptr)
{
--
2.39.2
Bison and flex generate C files from the source (.y and .l) files. When
the O= option is used, the generated files are saved in a separate
directory, but the default build rule assumes the .c files are in the
source directory. So it might pick up stale files left over from an old
version. The same is true for the pmu-events files.
For example, the following command would cause a build failure:
$ git checkout v6.3
$ make -C tools/perf # build in the same directory
$ git checkout v6.5-rc2
$ mkdir build # create a build directory
$ make -C tools/perf O=build # build in a different directory but it
# refers to files in the source directory
Let's update the build rules to handle those cases explicitly and make
them depend on the files in the output directory.
Note that it's not a complete fix and it needs the next patch for the
include path too.
Fixes: 80eeb67fe577 ("perf jevents: Program to convert JSON file")
Cc: stable(a)vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung(a)kernel.org>
---
tools/build/Makefile.build | 8 ++++++++
tools/perf/pmu-events/Build | 4 ++++
2 files changed, 12 insertions(+)
diff --git a/tools/build/Makefile.build b/tools/build/Makefile.build
index 89430338a3d9..f9396696fcbf 100644
--- a/tools/build/Makefile.build
+++ b/tools/build/Makefile.build
@@ -117,6 +117,14 @@ $(OUTPUT)%.s: %.c FORCE
$(call rule_mkdir)
$(call if_changed_dep,cc_s_c)
+$(OUTPUT)%-bison.o: $(OUTPUT)%-bison.c FORCE
+ $(call rule_mkdir)
+ $(call if_changed_dep,$(host)cc_o_c)
+
+$(OUTPUT)%-flex.o: $(OUTPUT)%-flex.c FORCE
+ $(call rule_mkdir)
+ $(call if_changed_dep,$(host)cc_o_c)
+
# Gather build data:
# obj-y - list of build objects
# subdir-y - list of directories to nest
diff --git a/tools/perf/pmu-events/Build b/tools/perf/pmu-events/Build
index 150765f2baee..f38a27765604 100644
--- a/tools/perf/pmu-events/Build
+++ b/tools/perf/pmu-events/Build
@@ -35,3 +35,7 @@ $(PMU_EVENTS_C): $(JSON) $(JSON_TEST) $(JEVENTS_PY) $(METRIC_PY) $(METRIC_TEST_L
$(call rule_mkdir)
$(Q)$(call echo-cmd,gen)$(PYTHON) $(JEVENTS_PY) $(JEVENTS_ARCH) $(JEVENTS_MODEL) pmu-events/arch $@
endif
+
+$(OUTPUT)pmu-events/pmu-events.o: $(PMU_EVENTS_C)
+ $(call rule_mkdir)
+ $(call if_changed_dep,$(host)cc_o_c)
--
2.41.0.487.g6d72f3e995-goog
From: liubo <liubo254(a)huawei.com>
In commit 474098edac26 ("mm/gup: replace FOLL_NUMA by
gup_can_follow_protnone()"), FOLL_NUMA was removed and replaced by
the gup_can_follow_protnone interface.
However, when a user-space process uses transparent huge pages, the
memory usage reported through /proc/pid/smaps_rollup is not consistent
with the RSS in /proc/pid/status.
Related examples are as follows:
cat /proc/15427/status
VmRSS: 20973024 kB
RssAnon: 20971616 kB
RssFile: 1408 kB
RssShmem: 0 kB
cat /proc/15427/smaps_rollup
00400000-7ffcc372d000 ---p 00000000 00:00 0 [rollup]
Rss: 14419432 kB
Pss: 14418079 kB
Pss_Dirty: 14418016 kB
Pss_Anon: 14418016 kB
Pss_File: 63 kB
Pss_Shmem: 0 kB
Anonymous: 14418016 kB
LazyFree: 0 kB
AnonHugePages: 14417920 kB
The root cause is that, while walking the page tables, smaps_pmd_entry()
does not count pages mapped PROT_NONE, resulting in the discrepancy.
Therefore, when looking up pages through the follow_trans_huge_pmd
interface, add the FOLL_FORCE flag so that PROT_NONE-mapped pages are
counted as well, which solves the above problem.
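Why FOLL_FORCE has that effect: since the commit named in the Fixes tag,
follow_trans_huge_pmd() refuses to return PROT_NONE (NUMA-hinting)
entries unless the caller is allowed to follow them. A hedged sketch of
the relevant checks introduced by that commit (paraphrased from memory,
not code from this patch):

	/* include/linux/mm.h, roughly: */
	static inline bool gup_can_follow_protnone(unsigned int flags)
	{
		/* only FOLL_FORCE callers may look up PROT_NONE-mapped pages */
		return flags & FOLL_FORCE;
	}

	/* mm/huge_memory.c, follow_trans_huge_pmd(), roughly: */
	if (pmd_protnone(*pmd) && !gup_can_follow_protnone(flags))
		return NULL;

Passing FOLL_DUMP | FOLL_FORCE therefore lets smaps_pmd_entry() obtain
and count the page for a PROT_NONE-mapped THP.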
Signed-off-by: liubo <liubo254(a)huawei.com>
Cc: stable(a)vger.kernel.org
Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
Signed-off-by: David Hildenbrand <david(a)redhat.com> # AKPM fixups, cc stable
Signed-off-by: David Hildenbrand <david(a)redhat.com>
---
fs/proc/task_mmu.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c1e6531cb02a..7075ce11dc7d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -571,8 +571,12 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
bool migration = false;
if (pmd_present(*pmd)) {
- /* FOLL_DUMP will return -EFAULT on huge zero page */
- page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+ /*
+ * FOLL_DUMP will return -EFAULT on huge zero page
+ * FOLL_FORCE follow a PROT_NONE mapped page
+ */
+ page = follow_trans_huge_pmd(vma, addr, pmd,
+ FOLL_DUMP | FOLL_FORCE);
} else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
swp_entry_t entry = pmd_to_swp_entry(*pmd);
--
2.41.0
The quilt patch titled
Subject: mm/memory-failure: fix hardware poison check in unpoison_memory()
has been removed from the -mm tree. Its filename was
mm-memory-failure-fix-hardware-poison-check-in-unpoison_memory.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Sidhartha Kumar <sidhartha.kumar(a)oracle.com>
Subject: mm/memory-failure: fix hardware poison check in unpoison_memory()
Date: Mon, 17 Jul 2023 11:18:12 -0700
It was pointed out[1] that using folio_test_hwpoison() is wrong as we need
to check the individual page that has the poison. folio_test_hwpoison()
only checks the head page, so go back to using PageHWPoison().
User-visible effects include existing hwpoison-inject tests possibly
failing, as unpoisoning a single subpage could lead to unpoisoning an
entire folio. Memory unpoisoning could also not work as expected because
the function bails out early after checking only the head page rather
than the actually poisoned subpage.
[1]: https://lore.kernel.org/lkml/ZLIbZygG7LqSI9xe@casper.infradead.org/
Link: https://lkml.kernel.org/r/20230717181812.167757-1-sidhartha.kumar@oracle.com
Fixes: a6fddef49eef ("mm/memory-failure: convert unpoison_memory() to folios")
Signed-off-by: Sidhartha Kumar <sidhartha.kumar(a)oracle.com>
Reported-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Reviewed-by: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory-failure.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/memory-failure.c~mm-memory-failure-fix-hardware-poison-check-in-unpoison_memory
+++ a/mm/memory-failure.c
@@ -2487,7 +2487,7 @@ int unpoison_memory(unsigned long pfn)
goto unlock_mutex;
}
- if (!folio_test_hwpoison(folio)) {
+ if (!PageHWPoison(p)) {
unpoison_pr_info("Unpoison: Page was already unpoisoned %#lx\n",
pfn, &unpoison_rs);
goto unlock_mutex;
_
Patches currently in -mm which might be from sidhartha.kumar(a)oracle.com are
mm-increase-usage-of-folio_next_index-helper.patch
mm-memory-convert-do_page_mkwrite-to-use-folios.patch
mm-memory-convert-wp_page_shared-to-use-folios.patch
mm-memory-convert-do_shared_fault-to-folios.patch
mm-memory-convert-do_read_fault-to-use-folios.patch
mm-memory-pass-folio-into-do_page_mkwrite.patch
mm-hugetlb-get-rid-of-page_hstate.patch
The quilt patch titled
Subject: proc/vmcore: fix signedness bug in read_from_oldmem()
has been removed from the -mm tree. Its filename was
proc-vmcore-fix-signedness-bug-in-read_from_oldmem.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Dan Carpenter <dan.carpenter(a)linaro.org>
Subject: proc/vmcore: fix signedness bug in read_from_oldmem()
Date: Tue, 25 Jul 2023 20:03:16 +0300
The bug is in the error handling:
	if (tmp < nr_bytes) {
"tmp" can hold negative error codes, but because "nr_bytes" has type
size_t the negative error codes are treated as very high positive values
(success). Fix this by changing "nr_bytes" to type ssize_t. The
"nr_bytes" variable is used to store values between 1 and PAGE_SIZE,
which fit in ssize_t without any issue.
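As a stand-alone illustration of the pitfall (hypothetical values, not
code from this patch): when one operand of the comparison has type
size_t, the signed operand is converted to unsigned, so a negative error
code compares as a huge positive number and the error branch is never
taken:

	size_t nr_bytes = 64;
	ssize_t tmp = -EFAULT;		/* e.g. -14 */

	if (tmp < nr_bytes)		/* (size_t)-14 is huge, so this is false */
		pr_err("short read\n");	/* never reached */

Declaring nr_bytes as ssize_t keeps the comparison signed, so the error
is detected.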
Link: https://lkml.kernel.org/r/b55f7eed-1c65-4adc-95d1-6c7c65a54a6e@moroto.mount…
Fixes: 5d8de293c224 ("vmcore: convert copy_oldmem_page() to take an iov_iter")
Signed-off-by: Dan Carpenter <dan.carpenter(a)linaro.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Acked-by: Baoquan He <bhe(a)redhat.com>
Cc: Dave Young <dyoung(a)redhat.com>
Cc: Vivek Goyal <vgoyal(a)redhat.com>
Cc: Alexey Dobriyan <adobriyan(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/proc/vmcore.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/fs/proc/vmcore.c~proc-vmcore-fix-signedness-bug-in-read_from_oldmem
+++ a/fs/proc/vmcore.c
@@ -132,7 +132,7 @@ ssize_t read_from_oldmem(struct iov_iter
u64 *ppos, bool encrypted)
{
unsigned long pfn, offset;
- size_t nr_bytes;
+ ssize_t nr_bytes;
ssize_t read = 0, tmp;
int idx;
_
Patches currently in -mm which might be from dan.carpenter(a)linaro.org are
The quilt patch titled
Subject: mm: lock VMA in dup_anon_vma() before setting ->anon_vma
has been removed from the -mm tree. Its filename was
mm-lock-vma-in-dup_anon_vma-before-setting-anon_vma.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Jann Horn <jannh(a)google.com>
Subject: mm: lock VMA in dup_anon_vma() before setting ->anon_vma
Date: Fri, 21 Jul 2023 05:46:43 +0200
When VMAs are merged, dup_anon_vma() is called with `dst` pointing to the
VMA that is being expanded to cover the area previously occupied by
another VMA. This currently happens while `dst` is not write-locked.
This means that, in the `src->anon_vma && !dst->anon_vma` case, as soon as
the assignment `dst->anon_vma = src->anon_vma` has happened, concurrent
page faults can happen on `dst` under the per-VMA lock. This is already
icky in itself, since such page faults can now install pages into `dst`
that are attached to an `anon_vma` that is not yet tied back to the
`anon_vma` with an `anon_vma_chain`. But if `anon_vma_clone()` fails due
to an out-of-memory error, things get much worse: `anon_vma_clone()` then
reverts `dst->anon_vma` back to NULL, and `dst` remains completely
unconnected to the `anon_vma`, even though we can have pages in the area
covered by `dst` that point to the `anon_vma`.
This means the `anon_vma` of such pages can be freed while the pages are
still mapped into userspace, which leads to UAF when a helper like
folio_lock_anon_vma_read() tries to look up the anon_vma of such a page.
This theoretically is a security bug, but I believe it is really hard to
actually trigger as an unprivileged user because it requires that you can
make an order-0 GFP_KERNEL allocation fail, and the page allocator tries
pretty hard to prevent that.
I think doing the vma_start_write() call inside dup_anon_vma() is the most
straightforward fix for now.
For a kernel-assisted reproducer, see the notes section of the patch mail.
Link: https://lkml.kernel.org/r/20230721034643.616851-1-jannh@google.com
Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it")
Signed-off-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mmap.c | 1 +
1 file changed, 1 insertion(+)
--- a/mm/mmap.c~mm-lock-vma-in-dup_anon_vma-before-setting-anon_vma
+++ a/mm/mmap.c
@@ -615,6 +615,7 @@ static inline int dup_anon_vma(struct vm
* anon pages imported.
*/
if (src->anon_vma && !dst->anon_vma) {
+ vma_start_write(dst);
dst->anon_vma = src->anon_vma;
return anon_vma_clone(dst, src);
}
_
Patches currently in -mm which might be from jannh(a)google.com are
mm-dont-drop-vma-locks-in-mm_drop_all_locks.patch
The quilt patch titled
Subject: mm: fix memory ordering for mm_lock_seq and vm_lock_seq
has been removed from the -mm tree. Its filename was
mm-fix-memory-ordering-for-mm_lock_seq-and-vm_lock_seq.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Jann Horn <jannh(a)google.com>
Subject: mm: fix memory ordering for mm_lock_seq and vm_lock_seq
Date: Sat, 22 Jul 2023 00:51:07 +0200
mm->mm_lock_seq effectively functions as a read/write lock; therefore it
must be used with acquire/release semantics.
A specific example is the interaction between userfaultfd_register() and
lock_vma_under_rcu().
userfaultfd_register() does the following from the point where it changes
a VMA's flags to the point where concurrent readers are permitted again
(in a simple scenario where only a single private VMA is accessed and no
merging/splitting is involved):
userfaultfd_register
userfaultfd_set_vm_flags
vm_flags_reset
vma_start_write
down_write(&vma->vm_lock->lock)
vma->vm_lock_seq = mm_lock_seq [marks VMA as busy]
up_write(&vma->vm_lock->lock)
vm_flags_init
[sets VM_UFFD_* in __vm_flags]
vma->vm_userfaultfd_ctx.ctx = ctx
mmap_write_unlock
vma_end_write_all
WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1) [unlocks VMA]
There are no memory barriers in between the __vm_flags update and the
mm->mm_lock_seq update that unlocks the VMA, so the unlock can be
reordered to above the `vm_flags_init()` call, which means from the
perspective of a concurrent reader, a VMA can be marked as a userfaultfd
VMA while it is not VMA-locked. That's bad, we definitely need a
store-release for the unlock operation.
The non-atomic write to vma->vm_lock_seq in vma_start_write() is mostly
fine because all accesses to vma->vm_lock_seq that matter are always
protected by the VMA lock. There is a racy read in vma_start_read()
though that can tolerate false-positives, so we should be using
WRITE_ONCE() to keep things tidy and data-race-free (including for KCSAN).
On the other side, lock_vma_under_rcu() works as follows in the relevant
region for locking and userfaultfd check:
lock_vma_under_rcu
vma_start_read
vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [early bailout]
down_read_trylock(&vma->vm_lock->lock)
vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [main check]
userfaultfd_armed
checks vma->vm_flags & __VM_UFFD_FLAGS
Here, the interesting aspect is how far down the mm->mm_lock_seq read can
be reordered - if this read is reordered down below the vma->vm_flags
access, this could cause lock_vma_under_rcu() to partly operate on
information that was read while the VMA was supposed to be locked. To
prevent this kind of downwards bleeding of the mm->mm_lock_seq read, we
need to read it with a load-acquire.
Some of the comment wording is based on suggestions by Suren.
BACKPORT WARNING: One of the functions changed by this patch (which I've
written against Linus' tree) is vma_try_start_write(), but this function
no longer exists in mm/mm-everything. I don't know whether the merged
version of this patch will be ordered before or after the patch that
removes vma_try_start_write(). If you're backporting this patch to a tree
with vma_try_start_write(), make sure this patch changes that function.
Link: https://lkml.kernel.org/r/20230721225107.942336-1-jannh@google.com
Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it")
Signed-off-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 29 +++++++++++++++++++++++------
include/linux/mm_types.h | 28 ++++++++++++++++++++++++++++
include/linux/mmap_lock.h | 10 ++++++++--
3 files changed, 59 insertions(+), 8 deletions(-)
--- a/include/linux/mmap_lock.h~mm-fix-memory-ordering-for-mm_lock_seq-and-vm_lock_seq
+++ a/include/linux/mmap_lock.h
@@ -76,8 +76,14 @@ static inline void mmap_assert_write_loc
static inline void vma_end_write_all(struct mm_struct *mm)
{
mmap_assert_write_locked(mm);
- /* No races during update due to exclusive mmap_lock being held */
- WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
+ /*
+ * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
+ * mmap_lock being held.
+ * We need RELEASE semantics here to ensure that preceding stores into
+ * the VMA take effect before we unlock it with this store.
+ * Pairs with ACQUIRE semantics in vma_start_read().
+ */
+ smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
}
#else
static inline void vma_end_write_all(struct mm_struct *mm) {}
--- a/include/linux/mm.h~mm-fix-memory-ordering-for-mm_lock_seq-and-vm_lock_seq
+++ a/include/linux/mm.h
@@ -641,8 +641,14 @@ static inline void vma_numab_state_free(
*/
static inline bool vma_start_read(struct vm_area_struct *vma)
{
- /* Check before locking. A race might cause false locked result. */
- if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
+ /*
+ * Check before locking. A race might cause false locked result.
+ * We can use READ_ONCE() for the mm_lock_seq here, and don't need
+ * ACQUIRE semantics, because this is just a lockless check whose result
+ * we don't rely on for anything - the mm_lock_seq read against which we
+ * need ordering is below.
+ */
+ if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq))
return false;
if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
@@ -653,8 +659,13 @@ static inline bool vma_start_read(struct
* False unlocked result is impossible because we modify and check
* vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
* modification invalidates all existing locks.
+ *
+ * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
+ * racing with vma_end_write_all(), we only start reading from the VMA
+ * after it has been unlocked.
+ * This pairs with RELEASE semantics in vma_end_write_all().
*/
- if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
+ if (unlikely(vma->vm_lock_seq == smp_load_acquire(&vma->vm_mm->mm_lock_seq))) {
up_read(&vma->vm_lock->lock);
return false;
}
@@ -676,7 +687,7 @@ static bool __is_vma_write_locked(struct
* current task is holding mmap_write_lock, both vma->vm_lock_seq and
* mm->mm_lock_seq can't be concurrently modified.
*/
- *mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
+ *mm_lock_seq = vma->vm_mm->mm_lock_seq;
return (vma->vm_lock_seq == *mm_lock_seq);
}
@@ -688,7 +699,13 @@ static inline void vma_start_write(struc
return;
down_write(&vma->vm_lock->lock);
- vma->vm_lock_seq = mm_lock_seq;
+ /*
+ * We should use WRITE_ONCE() here because we can have concurrent reads
+ * from the early lockless pessimistic check in vma_start_read().
+ * We don't really care about the correctness of that early check, but
+ * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
+ */
+ WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
up_write(&vma->vm_lock->lock);
}
@@ -702,7 +719,7 @@ static inline bool vma_try_start_write(s
if (!down_write_trylock(&vma->vm_lock->lock))
return false;
- vma->vm_lock_seq = mm_lock_seq;
+ WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
up_write(&vma->vm_lock->lock);
return true;
}
--- a/include/linux/mm_types.h~mm-fix-memory-ordering-for-mm_lock_seq-and-vm_lock_seq
+++ a/include/linux/mm_types.h
@@ -514,6 +514,20 @@ struct vm_area_struct {
};
#ifdef CONFIG_PER_VMA_LOCK
+ /*
+ * Can only be written (using WRITE_ONCE()) while holding both:
+ * - mmap_lock (in write mode)
+ * - vm_lock->lock (in write mode)
+ * Can be read reliably while holding one of:
+ * - mmap_lock (in read or write mode)
+ * - vm_lock->lock (in read or write mode)
+ * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
+ * while holding nothing (except RCU to keep the VMA struct allocated).
+ *
+ * This sequence counter is explicitly allowed to overflow; sequence
+ * counter reuse can only lead to occasional unnecessary use of the
+ * slowpath.
+ */
int vm_lock_seq;
struct vma_lock *vm_lock;
@@ -679,6 +693,20 @@ struct mm_struct {
* by mmlist_lock
*/
#ifdef CONFIG_PER_VMA_LOCK
+ /*
+ * This field has lock-like semantics, meaning it is sometimes
+ * accessed with ACQUIRE/RELEASE semantics.
+ * Roughly speaking, incrementing the sequence number is
+ * equivalent to releasing locks on VMAs; reading the sequence
+ * number can be part of taking a read lock on a VMA.
+ *
+ * Can be modified under write mmap_lock using RELEASE
+ * semantics.
+ * Can be read with no other protection when holding write
+ * mmap_lock.
+ * Can be read with ACQUIRE semantics if not holding write
+ * mmap_lock.
+ */
int mm_lock_seq;
#endif
_
Patches currently in -mm which might be from jannh(a)google.com are
mm-dont-drop-vma-locks-in-mm_drop_all_locks.patch