From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: cma: don't quit at first error when activating reserved areas
The routine cma_init_reserved_areas is designed to activate all
reserved cma areas. It quits when it first encounters an error.
This can leave some areas in a state where they are reserved but
not activated. There is no feedback to code which performed the
reservation. Attempting to allocate memory from areas in such a
state will result in a BUG.
Modify cma_init_reserved_areas to always attempt to activate all
areas. The called routine, cma_activate_area, is responsible for
leaving the area in a valid state. Nothing uses the returned error
codes, so change the routine's return type to void.
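For illustration only, the control flow described above can be sketched in userspace C. This is not the kernel code: struct area, area_activate(), areas_init(), area_alloc() and the simulate_failure hook are hypothetical names invented for the sketch, and it assumes the allocator refuses any area whose count is zero.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct cma; not the kernel type. */
struct area {
	const char *name;
	size_t count;          /* number of pages; 0 means "not usable" */
	unsigned char *bitmap; /* one byte per page, for simplicity */
	int simulate_failure;  /* test hook: force activation to fail */
};

/* Activate one area. On failure, leave it in a valid, empty state. */
static void area_activate(struct area *a)
{
	if (a->simulate_failure)
		goto out_error;

	a->bitmap = calloc(a->count, 1);
	if (!a->bitmap)
		goto out_error;
	return;

out_error:
	a->bitmap = NULL;
	a->count = 0;
	fprintf(stderr, "area %s could not be activated\n", a->name);
}

/* Try every area; one failure must not stop the rest. */
static void areas_init(struct area *areas, size_t n)
{
	assert(areas != NULL);
	for (size_t i = 0; i < n; i++)
		area_activate(&areas[i]);
}

/*
 * Claim the first free page. A failed area has count == 0, so the loop
 * body never runs and the NULL bitmap is never dereferenced.
 */
static int area_alloc(struct area *a, size_t *out_index)
{
	for (size_t i = 0; i < a->count; i++) {
		if (!a->bitmap[i]) {
			a->bitmap[i] = 1;
			*out_index = i;
			return 0;
		}
	}
	return -1;
}
```

With three areas where the second is forced to fail, areas_init() still activates the first and third, and area_alloc() on the second returns -1 instead of touching a NULL bitmap.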
How to reproduce: This example uses kernelcore, hugetlb and cma
as an easy way to reproduce. However, this is a more general cma
issue.
Two node x86 VM, 16GB total, 8GB per node.
Kernel command line parameters: kernelcore=4G hugetlb_cma=8G
Related boot time messages:
hugetlb_cma: reserve 8192 MiB, up to 4096 MiB per node
cma: Reserved 4096 MiB at 0x0000000100000000
hugetlb_cma: reserved 4096 MiB on node 0
cma: Reserved 4096 MiB at 0x0000000300000000
hugetlb_cma: reserved 4096 MiB on node 1
cma: CMA area hugetlb could not be activated
# echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
...
Call Trace:
bitmap_find_next_zero_area_off+0x51/0x90
cma_alloc+0x1a5/0x310
alloc_fresh_huge_page+0x78/0x1a0
alloc_pool_huge_page+0x6f/0xf0
set_max_huge_pages+0x10c/0x250
nr_hugepages_store_common+0x92/0x120
? __kmalloc+0x171/0x270
kernfs_fop_write+0xc1/0x1a0
vfs_write+0xc7/0x1f0
ksys_write+0x5f/0xe0
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Link: http://lkml.kernel.org/r/20200730163123.6451-1-mike.kravetz@oracle.com
Fixes: c64be2bb1c6e ("drivers: add Contiguous Memory Allocator")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Barry Song <song.bao.hua(a)hisilicon.com>
Cc: Marek Szyprowski <m.szyprowski(a)samsung.com>
Cc: Michal Nazarewicz <mina86(a)mina86.com>
Cc: Kyungmin Park <kyungmin.park(a)samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/cma.c | 23 +++++++++--------------
1 file changed, 9 insertions(+), 14 deletions(-)
--- a/mm/cma.c~cma-dont-quit-at-first-error-when-activating-reserved-areas
+++ a/mm/cma.c
@@ -93,17 +93,15 @@ static void cma_clear_bitmap(struct cma
 	mutex_unlock(&cma->lock);
 }
 
-static int __init cma_activate_area(struct cma *cma)
+static void __init cma_activate_area(struct cma *cma)
 {
 	unsigned long base_pfn = cma->base_pfn, pfn = base_pfn;
 	unsigned i = cma->count >> pageblock_order;
 	struct zone *zone;
 
 	cma->bitmap = bitmap_zalloc(cma_bitmap_maxno(cma), GFP_KERNEL);
-	if (!cma->bitmap) {
-		cma->count = 0;
-		return -ENOMEM;
-	}
+	if (!cma->bitmap)
+		goto out_error;
 
 	WARN_ON_ONCE(!pfn_valid(pfn));
 	zone = page_zone(pfn_to_page(pfn));
@@ -133,25 +131,22 @@ static int __init cma_activate_area(stru
 	spin_lock_init(&cma->mem_head_lock);
 #endif
 
-	return 0;
+	return;
 
 not_in_zone:
-	pr_err("CMA area %s could not be activated\n", cma->name);
 	bitmap_free(cma->bitmap);
+out_error:
 	cma->count = 0;
-	return -EINVAL;
+	pr_err("CMA area %s could not be activated\n", cma->name);
+	return;
 }
 
 static int __init cma_init_reserved_areas(void)
 {
 	int i;
 
-	for (i = 0; i < cma_area_count; i++) {
-		int ret = cma_activate_area(&cma_areas[i]);
-
-		if (ret)
-			return ret;
-	}
+	for (i = 0; i < cma_area_count; i++)
+		cma_activate_area(&cma_areas[i]);
 
 	return 0;
 }
_
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem
Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
in at least read mode. This is because the explicit locking in
huge_pmd_share (called by huge_pte_alloc) was removed. When restructuring
the code, the call to huge_pte_alloc in the else block at the beginning of
hugetlb_fault was missed.
Unfortunately, that else clause is exercised when there is no page table
entry. This will likely lead to a call to huge_pmd_share. If
huge_pmd_share thinks pmd sharing is possible, it will traverse the
mapping tree (i_mmap) without holding i_mmap_rwsem. If someone else is
modifying the tree, bad things such as addressing exceptions or worse
could happen.
Simply remove the else clause. It should have been removed previously.
The code following the else will call huge_pte_alloc with the appropriate
locking.
To prevent this type of issue in the future, add routines to assert that
i_mmap_rwsem is held, and call these routines in huge pmd sharing
routines.
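As a rough userspace sketch of the "assert the lock is held on entry" idea, assuming nothing beyond standard C: struct tracked_rwsem and every function below are hypothetical names, the struct only records how it is held and provides no mutual exclusion, and plain assert() stands in for the kernel's lockdep checks.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical, single-threaded stand-in for an rwsem. It only tracks
 * hold state; real code would wrap an actual lock and rely on lockdep.
 */
struct tracked_rwsem {
	int readers;	/* number of read holders */
	bool writer;	/* true while held for write */
};

static void tracked_down_read(struct tracked_rwsem *s)
{
	assert(!s->writer);
	s->readers++;
}

static void tracked_up_read(struct tracked_rwsem *s)
{
	assert(s->readers > 0);
	s->readers--;
}

static void tracked_down_write(struct tracked_rwsem *s)
{
	assert(!s->writer && s->readers == 0);
	s->writer = true;
}

static void tracked_up_write(struct tracked_rwsem *s)
{
	assert(s->writer);
	s->writer = false;
}

/* True if held in read or write mode. */
static bool tracked_held(const struct tracked_rwsem *s)
{
	return s->writer || s->readers > 0;
}

/* True only if held in write mode. */
static bool tracked_held_write(const struct tracked_rwsem *s)
{
	return s->writer;
}

/* Analogue of a sharing routine: needs the lock in at least read mode. */
static int share_entry(struct tracked_rwsem *s, int *shared_count)
{
	assert(tracked_held(s));
	return ++*shared_count;
}

/* Analogue of an unsharing routine: needs the lock in write mode. */
static int unshare_entry(struct tracked_rwsem *s, int *shared_count)
{
	assert(tracked_held_write(s));
	return --*shared_count;
}
```

A caller that forgets to take the lock trips the assertion at the top of share_entry() or unshare_entry(), rather than silently walking shared state unprotected.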
Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com
Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Suggested-by: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: "Kirill A.Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/fs.h | 10 ++++++++++
include/linux/hugetlb.h | 8 +++++---
mm/hugetlb.c | 15 +++++++--------
mm/rmap.c | 2 +-
4 files changed, 23 insertions(+), 12 deletions(-)
--- a/include/linux/fs.h~hugetlbfs-remove-call-to-huge_pte_alloc-without-i_mmap_rwsem
+++ a/include/linux/fs.h
@@ -518,6 +518,16 @@ static inline void i_mmap_unlock_read(st
up_read(&mapping->i_mmap_rwsem);
}
+static inline void i_mmap_assert_locked(struct address_space *mapping)
+{
+ lockdep_assert_held(&mapping->i_mmap_rwsem);
+}
+
+static inline void i_mmap_assert_write_locked(struct address_space *mapping)
+{
+ lockdep_assert_held_write(&mapping->i_mmap_rwsem);
+}
+
/*
* Might pages of this file be mapped into userspace?
*/
--- a/include/linux/hugetlb.h~hugetlbfs-remove-call-to-huge_pte_alloc-without-i_mmap_rwsem
+++ a/include/linux/hugetlb.h
@@ -164,7 +164,8 @@ pte_t *huge_pte_alloc(struct mm_struct *
unsigned long addr, unsigned long sz);
pte_t *huge_pte_offset(struct mm_struct *mm,
unsigned long addr, unsigned long sz);
-int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
+int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long *addr, pte_t *ptep);
void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
unsigned long *start, unsigned long *end);
struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
@@ -203,8 +204,9 @@ static inline struct address_space *huge
return NULL;
}
-static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
- pte_t *ptep)
+static inline int huge_pmd_unshare(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long *addr, pte_t *ptep)
{
return 0;
}
--- a/mm/hugetlb.c~hugetlbfs-remove-call-to-huge_pte_alloc-without-i_mmap_rwsem
+++ a/mm/hugetlb.c
@@ -3967,7 +3967,7 @@ void __unmap_hugepage_range(struct mmu_g
continue;
ptl = huge_pte_lock(h, mm, ptep);
- if (huge_pmd_unshare(mm, &address, ptep)) {
+ if (huge_pmd_unshare(mm, vma, &address, ptep)) {
spin_unlock(ptl);
/*
* We just unmapped a page of PMDs by clearing a PUD.
@@ -4554,10 +4554,6 @@ vm_fault_t hugetlb_fault(struct mm_struc
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
- } else {
- ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
- if (!ptep)
- return VM_FAULT_OOM;
}
/*
@@ -5034,7 +5030,7 @@ unsigned long hugetlb_change_protection(
if (!ptep)
continue;
ptl = huge_pte_lock(h, mm, ptep);
- if (huge_pmd_unshare(mm, &address, ptep)) {
+ if (huge_pmd_unshare(mm, vma, &address, ptep)) {
pages++;
spin_unlock(ptl);
shared_pmd = true;
@@ -5415,12 +5411,14 @@ out:
* returns: 1 successfully unmapped a shared pte page
* 0 the underlying pte page is not shared, or it is the last user
*/
-int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
+int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long *addr, pte_t *ptep)
{
pgd_t *pgd = pgd_offset(mm, *addr);
p4d_t *p4d = p4d_offset(pgd, *addr);
pud_t *pud = pud_offset(p4d, *addr);
+ i_mmap_assert_write_locked(vma->vm_file->f_mapping);
BUG_ON(page_count(virt_to_page(ptep)) == 0);
if (page_count(virt_to_page(ptep)) == 1)
return 0;
@@ -5438,7 +5436,8 @@ pte_t *huge_pmd_share(struct mm_struct *
return NULL;
}
-int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
+int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long *addr, pte_t *ptep)
{
return 0;
}
--- a/mm/rmap.c~hugetlbfs-remove-call-to-huge_pte_alloc-without-i_mmap_rwsem
+++ a/mm/rmap.c
@@ -1469,7 +1469,7 @@ static bool try_to_unmap_one(struct page
* do this outside rmap routines.
*/
VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
- if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
+ if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
/*
* huge_pmd_unshare unmapped an entire PMD
* page. There is no way of knowing exactly
_
OK, some patches in the series add buggy code which is then fixed by
follow-up patches, but none of the bugs fixed are severe regressions on
common configs (e.g. compiler warnings, lockdep/rt errors, or bugs in
new drivers). So I thought it was more important to preserve the credit
for the fixes.
I had to pull 5 patches from git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux mlx5-next
to get the mlx5 things to work. This seems to be how the Mellanox developers
always manage things, and they told me they are OK with it.
The following changes since commit bcf876870b95592b52519ed4aafcf9d95999bc9c:
Linux 5.8 (2020-08-02 14:21:45 -0700)
are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
for you to fetch changes up to 8a7c3213db068135e816a6a517157de6443290d6:
vdpa/mlx5: fix up endian-ness for mtu (2020-08-10 10:38:55 -0400)
----------------------------------------------------------------
virtio: fixes, features
IRQ bypass support for vdpa and IFC
MLX5 vdpa driver
Endian-ness fixes for virtio drivers
Misc other fixes
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
----------------------------------------------------------------
Alex Dewar (1):
vdpa/mlx5: Fix uninitialised variable in core/mr.c
Colin Ian King (1):
vdpa/mlx5: fix memory allocation failure checks
Dan Carpenter (2):
vdpa/mlx5: Fix pointer math in mlx5_vdpa_get_config()
vdpa: Fix pointer math bug in vdpasim_get_config()
Eli Cohen (9):
net/mlx5: Support setting access rights of dma addresses
net/mlx5: Add VDPA interface type to supported enumerations
net/mlx5: Add interface changes required for VDPA
net/vdpa: Use struct for set/get vq state
vdpa: Modify get_vq_state() to return error code
vdpa/mlx5: Add hardware descriptive header file
vdpa/mlx5: Add support library for mlx5 VDPA implementation
vdpa/mlx5: Add shared memory registration code
vdpa/mlx5: Add VDPA driver for supported mlx5 devices
Gustavo A. R. Silva (1):
vhost: Use flex_array_size() helper in copy_from_user()
Jason Wang (6):
vhost: vdpa: remove per device feature whitelist
vhost-vdpa: refine ioctl pre-processing
vhost: generialize backend features setting/getting
vhost-vdpa: support get/set backend features
vhost-vdpa: support IOTLB batching hints
vdpasim: support batch updating
Liao Pingfang (1):
virtio_pci_modern: Fix the comment of virtio_pci_find_capability()
Mao Wenan (1):
virtio_ring: Avoid loop when vq is broken in virtqueue_poll
Maor Gottlieb (2):
net/mlx5: Export resource dump interface
net/mlx5: Add support in query QP, CQ and MKEY segments
Max Gurtovoy (2):
vdpasim: protect concurrent access to iommu iotlb
vdpa: remove hard coded virtq num
Meir Lichtinger (1):
RDMA/mlx5: ConnectX-7 new capabilities to set relaxed ordering by UMR
Michael Guralnik (2):
net/mlx5: Enable QP number request when creating IPoIB underlay QP
net/mlx5: Enable count action for rules with allow action
Michael S. Tsirkin (44):
virtio: VIRTIO_F_IOMMU_PLATFORM -> VIRTIO_F_ACCESS_PLATFORM
virtio: virtio_has_iommu_quirk -> virtio_has_dma_quirk
virtio_balloon: fix sparse warning
virtio_ring: sparse warning fixup
virtio: allow __virtioXX, __leXX in config space
virtio_9p: correct tags for config space fields
virtio_balloon: correct tags for config space fields
virtio_blk: correct tags for config space fields
virtio_console: correct tags for config space fields
virtio_crypto: correct tags for config space fields
virtio_fs: correct tags for config space fields
virtio_gpu: correct tags for config space fields
virtio_input: correct tags for config space fields
virtio_iommu: correct tags for config space fields
virtio_mem: correct tags for config space fields
virtio_net: correct tags for config space fields
virtio_pmem: correct tags for config space fields
virtio_scsi: correct tags for config space fields
virtio_config: disallow native type fields
mlxbf-tmfifo: sparse tags for config access
vdpa: make sure set_features is invoked for legacy
vhost/vdpa: switch to new helpers
virtio_vdpa: legacy features handling
vdpa_sim: fix endian-ness of config space
virtio_config: cread/write cleanup
virtio_config: rewrite using _Generic
virtio_config: disallow native type fields (again)
virtio_config: LE config space accessors
virtio_caif: correct tags for config space fields
virtio_config: add virtio_cread_le_feature
virtio_balloon: use LE config space accesses
virtio_input: convert to LE accessors
virtio_fs: convert to LE accessors
virtio_crypto: convert to LE accessors
virtio_pmem: convert to LE accessors
drm/virtio: convert to LE accessors
virtio_mem: convert to LE accessors
virtio-iommu: convert to LE accessors
virtio_config: drop LE option from config space
virtio_net: use LE accessors for speed/duplex
Merge branch 'mlx5-next' of git://git.kernel.org/.../mellanox/linux into HEAD
virtio_config: fix up warnings on parisc
vdpa_sim: init iommu lock
vdpa/mlx5: fix up endian-ness for mtu
Parav Pandit (2):
net/mlx5: Avoid RDMA file inclusion in core driver
net/mlx5: Avoid eswitch header inclusion in fs core layer
Tariq Toukan (1):
net/mlx5: kTLS, Improve TLS params layout structures
Zhu Lingshan (7):
vhost: introduce vhost_vring_call
kvm: detect assigned device via irqbypass manager
vDPA: add get_vq_irq() in vdpa_config_ops
vhost_vdpa: implement IRQ offloading in vhost_vdpa
ifcvf: implement vdpa_config_ops.get_vq_irq()
irqbypass: do not start cons/prod when failed connect
vDPA: dont change vq irq after DRIVER_OK
arch/um/drivers/virtio_uml.c | 2 +-
arch/x86/kvm/x86.c | 12 +-
drivers/crypto/virtio/virtio_crypto_core.c | 46 +-
drivers/gpu/drm/virtio/virtgpu_kms.c | 16 +-
drivers/gpu/drm/virtio/virtgpu_object.c | 2 +-
drivers/gpu/drm/virtio/virtgpu_vq.c | 4 +-
drivers/iommu/virtio-iommu.c | 34 +-
drivers/net/ethernet/mellanox/mlx5/core/alloc.c | 11 +-
.../ethernet/mellanox/mlx5/core/diag/rsc_dump.c | 6 +
.../ethernet/mellanox/mlx5/core/diag/rsc_dump.h | 33 +-
drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h | 2 +-
.../ethernet/mellanox/mlx5/core/en_accel/ktls.h | 2 +-
.../ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c | 14 +-
.../mellanox/mlx5/core/en_accel/tls_rxtx.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/eswitch.h | 10 -
drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/fs_core.h | 10 +
.../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c | 7 +
drivers/net/ethernet/mellanox/mlx5/core/main.c | 3 +
drivers/net/virtio_net.c | 9 +-
drivers/nvdimm/virtio_pmem.c | 4 +-
drivers/platform/mellanox/mlxbf-tmfifo.c | 13 +-
drivers/scsi/virtio_scsi.c | 4 +-
drivers/vdpa/Kconfig | 19 +
drivers/vdpa/Makefile | 1 +
drivers/vdpa/ifcvf/ifcvf_base.c | 4 +-
drivers/vdpa/ifcvf/ifcvf_base.h | 6 +-
drivers/vdpa/ifcvf/ifcvf_main.c | 31 +-
drivers/vdpa/mlx5/Makefile | 4 +
drivers/vdpa/mlx5/core/mlx5_vdpa.h | 91 +
drivers/vdpa/mlx5/core/mlx5_vdpa_ifc.h | 168 ++
drivers/vdpa/mlx5/core/mr.c | 486 +++++
drivers/vdpa/mlx5/core/resources.c | 284 +++
drivers/vdpa/mlx5/net/main.c | 76 +
drivers/vdpa/mlx5/net/mlx5_vnet.c | 1974 ++++++++++++++++++++
drivers/vdpa/mlx5/net/mlx5_vnet.h | 24 +
drivers/vdpa/vdpa.c | 4 +
drivers/vdpa/vdpa_sim/vdpa_sim.c | 124 +-
drivers/vhost/Kconfig | 1 +
drivers/vhost/net.c | 22 +-
drivers/vhost/vdpa.c | 183 +-
drivers/vhost/vhost.c | 39 +-
drivers/vhost/vhost.h | 11 +-
drivers/virtio/virtio_balloon.c | 30 +-
drivers/virtio/virtio_input.c | 32 +-
drivers/virtio/virtio_mem.c | 30 +-
drivers/virtio/virtio_pci_modern.c | 1 +
drivers/virtio/virtio_ring.c | 7 +-
drivers/virtio/virtio_vdpa.c | 9 +-
fs/fuse/virtio_fs.c | 4 +-
include/linux/mlx5/cq.h | 1 -
include/linux/mlx5/device.h | 13 +-
include/linux/mlx5/driver.h | 2 +
include/linux/mlx5/mlx5_ifc.h | 134 +-
include/linux/mlx5/qp.h | 2 +-
include/linux/mlx5/rsc_dump.h | 51 +
include/linux/vdpa.h | 66 +-
include/linux/virtio_caif.h | 6 +-
include/linux/virtio_config.h | 191 +-
include/linux/virtio_ring.h | 19 +-
include/uapi/linux/vhost.h | 2 +
include/uapi/linux/vhost_types.h | 11 +
include/uapi/linux/virtio_9p.h | 4 +-
include/uapi/linux/virtio_balloon.h | 10 +-
include/uapi/linux/virtio_blk.h | 26 +-
include/uapi/linux/virtio_config.h | 10 +-
include/uapi/linux/virtio_console.h | 8 +-
include/uapi/linux/virtio_crypto.h | 26 +-
include/uapi/linux/virtio_fs.h | 2 +-
include/uapi/linux/virtio_gpu.h | 8 +-
include/uapi/linux/virtio_input.h | 18 +-
include/uapi/linux/virtio_iommu.h | 12 +-
include/uapi/linux/virtio_mem.h | 14 +-
include/uapi/linux/virtio_net.h | 8 +-
include/uapi/linux/virtio_pmem.h | 4 +-
include/uapi/linux/virtio_scsi.h | 20 +-
tools/virtio/linux/virtio_config.h | 6 +-
virt/lib/irqbypass.c | 16 +-
78 files changed, 4116 insertions(+), 487 deletions(-)
create mode 100644 drivers/vdpa/mlx5/Makefile
create mode 100644 drivers/vdpa/mlx5/core/mlx5_vdpa.h
create mode 100644 drivers/vdpa/mlx5/core/mlx5_vdpa_ifc.h
create mode 100644 drivers/vdpa/mlx5/core/mr.c
create mode 100644 drivers/vdpa/mlx5/core/resources.c
create mode 100644 drivers/vdpa/mlx5/net/main.c
create mode 100644 drivers/vdpa/mlx5/net/mlx5_vnet.c
create mode 100644 drivers/vdpa/mlx5/net/mlx5_vnet.h
create mode 100644 include/linux/mlx5/rsc_dump.h