This is a bit of a mess, to put it mildly. But, it's a bug
that seems to have gone unticked up to now, probably because
nobody uses MPX. The other alternative to this fix is to just
deprecate MPX, even in -stable kernels.
MPX has the arch_unmap() hook inside of munmap() because MPX
uses bounds tables that protect other areas of memory. When
memory is unmapped, there is also a need to unmap the MPX
bounds tables. Barring this, unused bounds tables can eat 80%
of the address space.
But, the recursive do_munmap() that gets called vi arch_unmap()
wreaks havoc with __do_munmap()'s state. It can result in
freeing populated page tables, accessing bogus VMA state,
double-freed VMAs and more.
To fix this, call arch_unmap() before __do_unmap() has a chance
to do anything meaningful. Also, remove the 'vma' argument
and force the MPX code to do its own, independent VMA lookup.
For the common success case this is functionally identical to
what was there before. For the munmap() failure case, it's
possible that some MPX tables will be zapped for memory that
continues to be in use. But, this is an extraordinarily
unlikely scenario and the harm would be that MPX provides no
protection since the bounds table got reset (zeroed).
I can't imagine anyone doing this:
ptr = mmap();
// use ptr
ret = munmap(ptr);
if (ret)
// oh, there was an error, I'll
// keep using ptr.
Because if you're doing munmap(), you are *done* with the
memory. There's probably no good data in there _anyway_.
This passes the original reproducer from Richard Biener as
well as the existing mpx selftests/.
====
The long story:
munmap() has a couple of pieces:
1. Find the affected VMA(s)
2. Split the start/end one(s) if neceesary
3. Pull the VMAs out of the rbtree
4. Actually zap the memory via unmap_region(), including
freeing page tables (or queueing them to be freed).
5. Fixup some of the accounting (like fput()) and actually
free the VMA itself.
I decided to put the arch_unmap() call right afer #3. This
was *just* before mmap_sem looked like it might get downgraded
(it won't in this context), but it looked right. It wasn't.
Richard Biener reported a test that shows this in dmesg:
[1216548.787498] BUG: Bad rss-counter state mm:0000000017ce560b idx:1 val:551
[1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576
What triggered this was the recursive do_munmap() called via
arch_unmap(). It was freeing page tables that has not been
properly zapped.
But, the problem was bigger than this. For one, arch_unmap()
can free VMAs. But, the calling __do_munmap() has variables
that *point* to VMAs and obviously can't handle them just
getting freed while the pointer is still valid.
I tried a couple of things here. First, I tried to fix the page
table freeing problem in isolation, but I then found the VMA
issue. I also tried having the MPX code return a flag if it
modified the rbtree which would force __do_munmap() to re-walk
to restart. That spiralled out of control in complexity pretty
fast.
Just moving arch_unmap() and accepting that the bonkers failure
case might eat some bounds tables seems like the simplest viable
fix.
Reported-by: Richard Biener <rguenther(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Andy Lutomirski <luto(a)amacapital.net>
Cc: x86(a)kernel.org
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Cc: stable(a)vger.kernel.org
---
b/arch/x86/include/asm/mmu_context.h | 6 +++---
b/arch/x86/include/asm/mpx.h | 5 ++---
b/arch/x86/mm/mpx.c | 10 ++++++----
b/include/asm-generic/mm_hooks.h | 1 -
b/mm/mmap.c | 15 ++++++++-------
5 files changed, 19 insertions(+), 18 deletions(-)
diff -puN mm/mmap.c~mpx-rss-pass-no-vma mm/mmap.c
--- a/mm/mmap.c~mpx-rss-pass-no-vma 2019-04-01 06:56:53.409411123 -0700
+++ b/mm/mmap.c 2019-04-01 06:56:53.423411123 -0700
@@ -2731,9 +2731,17 @@ int __do_munmap(struct mm_struct *mm, un
return -EINVAL;
len = PAGE_ALIGN(len);
+ end = start + len;
if (len == 0)
return -EINVAL;
+ /*
+ * arch_unmap() might do unmaps itself. It must be called
+ * and finish any rbtree manipulation before this code
+ * runs and also starts to manipulate the rbtree.
+ */
+ arch_unmap(mm, start, end);
+
/* Find the first overlapping VMA */
vma = find_vma(mm, start);
if (!vma)
@@ -2742,7 +2750,6 @@ int __do_munmap(struct mm_struct *mm, un
/* we have start < vma->vm_end */
/* if it doesn't overlap, we have nothing.. */
- end = start + len;
if (vma->vm_start >= end)
return 0;
@@ -2812,12 +2819,6 @@ int __do_munmap(struct mm_struct *mm, un
/* Detach vmas from rbtree */
detach_vmas_to_be_unmapped(mm, vma, prev, end);
- /*
- * mpx unmap needs to be called with mmap_sem held for write.
- * It is safe to call it before unmap_region().
- */
- arch_unmap(mm, vma, start, end);
-
if (downgrade)
downgrade_write(&mm->mmap_sem);
diff -puN arch/x86/include/asm/mmu_context.h~mpx-rss-pass-no-vma arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~mpx-rss-pass-no-vma 2019-04-01 06:56:53.412411123 -0700
+++ b/arch/x86/include/asm/mmu_context.h 2019-04-01 06:56:53.423411123 -0700
@@ -277,8 +277,8 @@ static inline void arch_bprm_mm_init(str
mpx_mm_init(mm);
}
-static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+static inline void arch_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end)
{
/*
* mpx_notify_unmap() goes and reads a rarely-hot
@@ -298,7 +298,7 @@ static inline void arch_unmap(struct mm_
* consistently wrong.
*/
if (unlikely(cpu_feature_enabled(X86_FEATURE_MPX)))
- mpx_notify_unmap(mm, vma, start, end);
+ mpx_notify_unmap(mm, start, end);
}
/*
diff -puN include/asm-generic/mm_hooks.h~mpx-rss-pass-no-vma include/asm-generic/mm_hooks.h
--- a/include/asm-generic/mm_hooks.h~mpx-rss-pass-no-vma 2019-04-01 06:56:53.414411123 -0700
+++ b/include/asm-generic/mm_hooks.h 2019-04-01 06:56:53.423411123 -0700
@@ -18,7 +18,6 @@ static inline void arch_exit_mmap(struct
}
static inline void arch_unmap(struct mm_struct *mm,
- struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
}
diff -puN arch/x86/mm/mpx.c~mpx-rss-pass-no-vma arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~mpx-rss-pass-no-vma 2019-04-01 06:56:53.416411123 -0700
+++ b/arch/x86/mm/mpx.c 2019-04-01 06:56:53.423411123 -0700
@@ -881,9 +881,10 @@ static int mpx_unmap_tables(struct mm_st
* the virtual address region start...end have already been split if
* necessary, and the 'vma' is the first vma in this range (start -> end).
*/
-void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+void mpx_notify_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end)
{
+ struct vm_area_struct *vma;
int ret;
/*
@@ -902,11 +903,12 @@ void mpx_notify_unmap(struct mm_struct *
* which should not occur normally. Being strict about it here
* helps ensure that we do not have an exploitable stack overflow.
*/
- do {
+ vma = find_vma(mm, start);
+ while (vma && vma->vm_start < end) {
if (vma->vm_flags & VM_MPX)
return;
vma = vma->vm_next;
- } while (vma && vma->vm_start < end);
+ }
ret = mpx_unmap_tables(mm, start, end);
if (ret)
diff -puN arch/x86/include/asm/mpx.h~mpx-rss-pass-no-vma arch/x86/include/asm/mpx.h
--- a/arch/x86/include/asm/mpx.h~mpx-rss-pass-no-vma 2019-04-01 06:56:53.418411123 -0700
+++ b/arch/x86/include/asm/mpx.h 2019-04-01 06:56:53.424411123 -0700
@@ -78,8 +78,8 @@ static inline void mpx_mm_init(struct mm
*/
mm->context.bd_addr = MPX_INVALID_BOUNDS_DIR;
}
-void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end);
+void mpx_notify_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end);
unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
unsigned long flags);
@@ -100,7 +100,6 @@ static inline void mpx_mm_init(struct mm
{
}
static inline void mpx_notify_unmap(struct mm_struct *mm,
- struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
}
_
From: Gao Xiang <hsiangkao(a)redhat.com>
EROFS has _only one_ ondisk timestamp (ctime is currently
documented and recorded, we might also record mtime instead
with a new compat feature if needed) for each extended inode
since EROFS isn't mainly for archival purposes so no need to
keep all timestamps on disk especially for Android scenarios
due to security concerns. Also, romfs/cramfs don't have their
own on-disk timestamp, and squashfs only records mtime instead.
Let's also derive access time from ondisk timestamp rather than
leaving it empty, and if mtime/atime for each file are really
needed for specific scenarios as well, we can also use xattrs
to record them then.
Reported-by: nl6720 <nl6720(a)gmail.com>
[ Gao Xiang: It'd be better to backport for user-friendly concern. ]
Fixes: 431339ba9042 ("staging: erofs: add inode operations")
Cc: stable <stable(a)vger.kernel.org> # 4.19+
Signed-off-by: Gao Xiang <hsiangkao(a)redhat.com>
---
fs/erofs/inode.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index 139d0bed42f8..3e21c0e8adae 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -107,11 +107,9 @@ static struct page *erofs_read_inode(struct inode *inode,
i_gid_write(inode, le32_to_cpu(die->i_gid));
set_nlink(inode, le32_to_cpu(die->i_nlink));
- /* ns timestamp */
- inode->i_mtime.tv_sec = inode->i_ctime.tv_sec =
- le64_to_cpu(die->i_ctime);
- inode->i_mtime.tv_nsec = inode->i_ctime.tv_nsec =
- le32_to_cpu(die->i_ctime_nsec);
+ /* extended inode has its own timestamp */
+ inode->i_ctime.tv_sec = le64_to_cpu(die->i_ctime);
+ inode->i_ctime.tv_nsec = le32_to_cpu(die->i_ctime_nsec);
inode->i_size = le64_to_cpu(die->i_size);
@@ -149,11 +147,9 @@ static struct page *erofs_read_inode(struct inode *inode,
i_gid_write(inode, le16_to_cpu(dic->i_gid));
set_nlink(inode, le16_to_cpu(dic->i_nlink));
- /* use build time to derive all file time */
- inode->i_mtime.tv_sec = inode->i_ctime.tv_sec =
- sbi->build_time;
- inode->i_mtime.tv_nsec = inode->i_ctime.tv_nsec =
- sbi->build_time_nsec;
+ /* use build time for compact inodes */
+ inode->i_ctime.tv_sec = sbi->build_time;
+ inode->i_ctime.tv_nsec = sbi->build_time_nsec;
inode->i_size = le32_to_cpu(dic->i_size);
if (erofs_inode_is_data_compressed(vi->datalayout))
@@ -167,6 +163,11 @@ static struct page *erofs_read_inode(struct inode *inode,
goto err_out;
}
+ inode->i_mtime.tv_sec = inode->i_ctime.tv_sec;
+ inode->i_atime.tv_sec = inode->i_ctime.tv_sec;
+ inode->i_mtime.tv_nsec = inode->i_ctime.tv_nsec;
+ inode->i_atime.tv_nsec = inode->i_ctime.tv_nsec;
+
if (!nblks)
/* measure inode.i_blocks as generic filesystems */
inode->i_blocks = roundup(inode->i_size, EROFS_BLKSIZ) >> 9;
--
2.24.0
Making perf with gcc-9.1.1 generates the following warning:
CC ui/browsers/hists.o
ui/browsers/hists.c: In function 'perf_evsel__hists_browse':
ui/browsers/hists.c:3078:61: error: '%d' directive output may be \
truncated writing between 1 and 11 bytes into a region of size \
between 2 and 12 [-Werror=format-truncation=]
3078 | "Max event group index to sort is %d (index from 0 to %d)",
| ^~
ui/browsers/hists.c:3078:7: note: directive argument in the range [-2147483648, 8]
3078 | "Max event group index to sort is %d (index from 0 to %d)",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/include/stdio.h:937,
from ui/browsers/hists.c:5:
IOW, the string in line 3078 might be too long for buf[] of 64 bytes.
Fix this by increasing the size of buf[] to 128.
Fixes: dbddf1747441 ("perf report/top TUI: Support hotkeys to let user select any event for sorting")
Cc: stable <stable(a)vger.kernel.org> # v5.7+
Cc: Jin Yao <yao.jin(a)linux.intel.com>
Cc: Jiri Olsa <jolsa(a)kernel.org>
Cc: Arnaldo Carvalho de Melo <acme(a)redhat.com>
Cc: Arnaldo Carvalho de Melo <acme(a)kernel.org>
Signed-off-by: Song Liu <songliubraving(a)fb.com>
---
tools/perf/ui/browsers/hists.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/ui/browsers/hists.c b/tools/perf/ui/browsers/hists.c
index a07626f072087..b0e1880cf992b 100644
--- a/tools/perf/ui/browsers/hists.c
+++ b/tools/perf/ui/browsers/hists.c
@@ -2963,7 +2963,7 @@ static int perf_evsel__hists_browse(struct evsel *evsel, int nr_events,
struct popup_action actions[MAX_OPTIONS];
int nr_options = 0;
int key = -1;
- char buf[64];
+ char buf[128];
int delay_secs = hbt ? hbt->refresh : 0;
#define HIST_BROWSER_HELP_COMMON \
--
2.24.1
From: Arnd Bergmann <arnd(a)arndb.de>
It sounds unwise to let user space pass an unchecked 32-bit
offset into a kernel structure in an ioctl. This is an unsigned
variable, so checking the upper bound for the size of the structure
it points into is sufficient to avoid data corruption, but as
the pointer might also be unaligned, it has to be written carefully
as well.
While I stumbled over this problem by reading the code, I did not
continue checking the function for further problems like it.
Cc: <stable(a)vger.kernel.org> # v2.6.15+
Fixes: c4a3e0a529ab ("[SCSI] MegaRAID SAS RAID: new driver")
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
---
drivers/scsi/megaraid/megaraid_sas_base.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 41cd66fc7d81..b1b9a8823c8c 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -8134,7 +8134,7 @@ megasas_mgmt_fw_ioctl(struct megasas_instance *instance,
int error = 0, i;
void *sense = NULL;
dma_addr_t sense_handle;
- unsigned long *sense_ptr;
+ void *sense_ptr;
u32 opcode = 0;
int ret = DCMD_SUCCESS;
@@ -8257,6 +8257,12 @@ megasas_mgmt_fw_ioctl(struct megasas_instance *instance,
}
if (ioc->sense_len) {
+ /* make sure the pointer is part of the frame */
+ if (ioc->sense_off > (sizeof(union megasas_frame) - sizeof(__le64))) {
+ error = -EINVAL;
+ goto out;
+ }
+
sense = dma_alloc_coherent(&instance->pdev->dev, ioc->sense_len,
&sense_handle, GFP_KERNEL);
if (!sense) {
@@ -8264,12 +8270,11 @@ megasas_mgmt_fw_ioctl(struct megasas_instance *instance,
goto out;
}
- sense_ptr =
- (unsigned long *) ((unsigned long)cmd->frame + ioc->sense_off);
+ sense_ptr = (void *)cmd->frame + ioc->sense_off;
if (instance->consistent_mask_64bit)
- *sense_ptr = cpu_to_le64(sense_handle);
+ put_unaligned_le64(sense_handle, sense_ptr);
else
- *sense_ptr = cpu_to_le32(sense_handle);
+ put_unaligned_le32(sense_handle, sense_ptr);
}
/*
--
2.27.0
Hi,
Please backport [1] to 4.4, 4.9, 4.14, 4.19.
It fixes a commit that has been backported to all the current stable releases but for some reason the fixup was only backported to 5.4 & 5.8.
Debian 9 no longer boots as a Xen HVM guest because it is missing the fixup patch.
Thanks,
Ross
[1]:
commit 0891fb39ba67bd7ae023ea0d367297ffff010781
Author: Juergen Gross <jgross(a)suse.com>
Date: Wed Sep 30 11:16:14 2020 +0200
xen/events: don't use chip_data for legacy IRQs
Since commit c330fb1ddc0a ("XEN uses irqdesc::irq_data_common::handler_data to store a per interrupt XEN data pointer which contains XEN specific information.")
Xen is using the chip_data pointer for storing IRQ specific data. When
running as a HVM domain this can result in problems for legacy IRQs, as
those might use chip_data for their own purposes.
Use a local array for this purpose in case of legacy IRQs, avoiding the
double use.
Cc: stable(a)vger.kernel.org
Fixes: c330fb1ddc0a ("XEN uses irqdesc::irq_data_common::handler_data to store a per interrupt XEN data pointer which contains XEN specific information.")
Signed-off-by: Juergen Gross <jgross(a)suse.com>
Tested-by: Stefan Bader <stefan.bader(a)canonical.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky(a)oracle.com>
Link: https://lore.kernel.org/r/20200930091614.13660-1-jgross@suse.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>