A shmem folio can be either in page cache or in swap cache, but not at the
same time. Namely, once it is in swap cache, folio->mapping should be NULL,
and the folio is no longer in a shmem mapping.
In __folio_migrate_mapping(), folio_test_swapbacked() is used to determine
the number of xarray entries to update, but that conflates the shmem in
page cache case with the shmem in swap cache case. It leads to xarray
multi-index entry corruption, since it turns a sibling entry into a
normal entry during xas_store() (see [1] for a userspace reproduction).
Fix it by using only folio_test_swapcache() to determine whether the
xarray is storing swap cache entries, and thus to choose the right number
of xarray entries to update.
[1] https://lore.kernel.org/linux-mm/Z8idPCkaJW1IChjT@casper.infradead.org/
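The entry-count decision described above can be modelled in userspace
roughly as follows. This is only a sketch: struct folio_model and
entries_to_update() are hypothetical names invented for illustration, not
kernel API, and the real code operates on struct folio under the xarray
lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the folio state the patch tests. */
struct folio_model {
	bool swapbacked;	/* anon or shmem: backed by swap */
	bool swapcache;		/* currently stored in the swap cache */
	long nr_pages;		/* number of pages in the folio */
};

/*
 * Number of xarray entries to update on migration. Only the swapcache
 * bit matters: swapbacked is also set for shmem folios that sit in the
 * page cache, where a single multi-index entry covers the whole folio.
 */
static long entries_to_update(const struct folio_model *f)
{
	assert(f->nr_pages > 0);
	return f->swapcache ? f->nr_pages : 1;
}
```

A shmem folio in page cache has swapbacked set but swapcache clear, so the
model returns 1 for it, which is the case the old swapbacked test got wrong.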
Note:
In __split_huge_page(), folio_test_anon() && folio_test_swapcache() is used
to get the swap_cache address space, but that ignores the shmem folio in
swap cache case. It could lead to a NULL pointer dereference at
__xa_store() when an in-swap-cache shmem folio is split, since
!folio_test_anon() is true and folio->mapping is NULL. Fortunately, its
caller split_huge_page_to_list_to_order() bails out early with EBUSY when
folio->mapping is NULL, so there is no need to take care of it here.
Fixes: fc346d0a70a1 ("mm: migrate high-order folios in swap cache correctly")
Reported-by: Liu Shixin <liushixin2(a)huawei.com>
Closes: https://lore.kernel.org/all/28546fb4-5210-bf75-16d6-43e1f8646080@huawei.com/
Suggested-by: Hugh Dickins <hughd(a)google.com>
Signed-off-by: Zi Yan <ziy(a)nvidia.com>
Cc: stable(a)vger.kernel.org
---
mm/migrate.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index fb4afd31baf0..c0adea67cd62 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -518,15 +518,13 @@ static int __folio_migrate_mapping(struct address_space *mapping,
if (folio_test_anon(folio) && folio_test_large(folio))
mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, 1);
folio_ref_add(newfolio, nr); /* add cache reference */
- if (folio_test_swapbacked(folio)) {
+ if (folio_test_swapbacked(folio))
__folio_set_swapbacked(newfolio);
- if (folio_test_swapcache(folio)) {
- folio_set_swapcache(newfolio);
- newfolio->private = folio_get_private(folio);
- }
+ if (folio_test_swapcache(folio)) {
+ folio_set_swapcache(newfolio);
+ newfolio->private = folio_get_private(folio);
entries = nr;
} else {
- VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
entries = 1;
}
--
2.47.2
The current implementation of iommufd_device_do_replace() implicitly
assumes that the input device has already been attached. However, there
is no explicit check to verify this assumption. If another device within
the same group has been attached, the replace operation might succeed,
but the input device itself may not have been attached yet.
As a result, the input device might not be tracked in the
igroup->device_list, and its reserved IOVA might not be added. Despite
this, the caller might incorrectly assume that the device has been
successfully replaced, which could lead to unexpected behavior or errors.
To address this issue, add a check to ensure that the input device has
been attached before proceeding with the replace operation. This check
will help maintain the integrity of the device tracking system and prevent
potential issues arising from incorrect assumptions about the device's
attachment status.
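A minimal userspace sketch of the check, for illustration only. The
dev_model and group_model types and the plain singly linked list are
hypothetical stand-ins for struct iommufd_device, struct iommufd_group and
the kernel list helpers; the -EINVAL returned for the hwpt == NULL case is
an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the iommufd device and group structures. */
struct dev_model {
	struct dev_model *next;		/* link in the group's device list */
};

struct group_model {
	struct dev_model *device_list;	/* devices attached to hwpt */
	void *hwpt;			/* currently attached page table */
};

/* Walk the group's list and report whether dev itself is on it. */
static bool device_is_attached(const struct group_model *g,
			       const struct dev_model *dev)
{
	for (const struct dev_model *cur = g->device_list; cur; cur = cur->next)
		if (cur == dev)
			return true;
	return false;
}

/* Replace is refused unless this particular device is attached. */
static int do_replace(struct group_model *g, struct dev_model *dev,
		      void *new_hwpt)
{
	assert(g && dev);
	if (!g->hwpt)
		return -EINVAL;	/* assumption: nothing attached at all */
	if (!device_is_attached(g, dev))
		return -EINVAL;	/* a sibling is attached, dev is not */
	g->hwpt = new_hwpt;
	return 0;
}
```

With one device of a group attached, calling do_replace() on a second,
unattached device of the same group fails instead of silently succeeding.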
Fixes: e88d4ec154a8 ("iommufd: Add iommufd_device_replace()")
Cc: stable(a)vger.kernel.org
Reviewed-by: Kevin Tian <kevin.tian(a)intel.com>
Signed-off-by: Yi Liu <yi.l.liu(a)intel.com>
---
Change log:
v2:
- Add r-b tag (Kevin)
- Minor tweaks. I swapped the order of is_attach check with the
if (igroup->hwpt == NULL) check, hence no need to add WARN_ON.
v1: https://lore.kernel.org/linux-iommu/20250304120754.12450-1-yi.l.liu@intel.c…
---
drivers/iommu/iommufd/device.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index b2f0cb909e6d..bd50146e2ad0 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -471,6 +471,17 @@ iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
/* The device attach/detach/replace helpers for attach_handle */
+/* Check if idev is attached to igroup->hwpt */
+static bool iommufd_device_is_attached(struct iommufd_device *idev)
+{
+ struct iommufd_device *cur;
+
+ list_for_each_entry(cur, &idev->igroup->device_list, group_item)
+ if (cur == idev)
+ return true;
+ return false;
+}
+
static int iommufd_hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
struct iommufd_device *idev)
{
@@ -710,6 +721,11 @@ iommufd_device_do_replace(struct iommufd_device *idev,
goto err_unlock;
}
+ if (!iommufd_device_is_attached(idev)) {
+ rc = -EINVAL;
+ goto err_unlock;
+ }
+
if (hwpt == igroup->hwpt) {
mutex_unlock(&idev->igroup->lock);
return NULL;
--
2.34.1
Compared to the SNP Guest Request, the "Extended" version adds data pages
for receiving certificates. If not enough pages are provided, the HV can
report to the VM how many are needed so the VM can reallocate and repeat.
Commit ae596615d93d ("virt: sev-guest: Reduce the scope of SNP command
mutex") moved handling of the allocated/desired pages number out of scope
of said mutex and created a possibility for a race (multiple instances
trying to trigger an Extended request in a VM) as there is just one
instance of snp_msg_desc per /dev/sev-guest and no locking other than
snp_cmd_mutex.
Fix the issue by moving the data blob/size and the GHCB input struct
(snp_req_data) into snp_guest_req, which is now allocated on the stack
and accessed by the GHCB caller under that mutex.
Stop allocating SEV_FW_BLOB_MAX_SIZE in snp_msg_alloc() as only one of
four callers needs it. Free the received blob in get_ext_report() right
after it is copied to userspace. Possible future users of
snp_send_guest_request() are likely to have different ideas about
the buffer size anyway.
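The ownership change can be pictured with a small userspace model. It is a
sketch under stated assumptions: req_data_model, guest_req_model and
issue_request() are hypothetical names, and the "host" is reduced to a
function that writes back the page count it needs, mirroring the -ENOSPC
path in __handle_guest_request().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct snp_req_data. */
struct req_data_model {
	unsigned int data_npages;	/* in: pages offered, out: pages needed */
};

/* Per-request state, owned by the caller (e.g. on its stack). */
struct guest_req_model {
	struct req_data_model input;
	void *certs_data;
};

/*
 * Model of the host side: if the request offers too few pages, report
 * the required count back through the request's own input struct.
 */
static int issue_request(struct guest_req_model *req, unsigned int needed)
{
	assert(req);
	if (req->input.data_npages < needed) {
		req->input.data_npages = needed;
		return -ENOSPC;
	}
	return 0;
}
```

Because the page count lives in the request rather than in a descriptor
shared by every opener of the device, a value written back for one caller
cannot be observed or overwritten by another.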
Fixes: ae596615d93d ("virt: sev-guest: Reduce the scope of SNP command mutex")
Cc: stable(a)vger.kernel.org
Cc: Nikunj A Dadhania <nikunj(a)amd.com>
Signed-off-by: Alexey Kardashevskiy <aik(a)amd.com>
---
arch/x86/include/asm/sev.h | 6 ++--
arch/x86/coco/sev/core.c | 23 +++++--------
drivers/virt/coco/sev-guest/sev-guest.c | 34 ++++++++++++++++----
3 files changed, 39 insertions(+), 24 deletions(-)
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 1581246491b5..ba7999f66abe 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -203,6 +203,9 @@ struct snp_guest_req {
unsigned int vmpck_id;
u8 msg_version;
u8 msg_type;
+
+ struct snp_req_data input;
+ void *certs_data;
};
/*
@@ -263,9 +266,6 @@ struct snp_msg_desc {
struct snp_guest_msg secret_request, secret_response;
struct snp_secrets_page *secrets;
- struct snp_req_data input;
-
- void *certs_data;
struct aesgcm_ctx *ctx;
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 82492efc5d94..d02eea5e3d50 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -2853,19 +2853,8 @@ struct snp_msg_desc *snp_msg_alloc(void)
if (!mdesc->response)
goto e_free_request;
- mdesc->certs_data = alloc_shared_pages(SEV_FW_BLOB_MAX_SIZE);
- if (!mdesc->certs_data)
- goto e_free_response;
-
- /* initial the input address for guest request */
- mdesc->input.req_gpa = __pa(mdesc->request);
- mdesc->input.resp_gpa = __pa(mdesc->response);
- mdesc->input.data_gpa = __pa(mdesc->certs_data);
-
return mdesc;
-e_free_response:
- free_shared_pages(mdesc->response, sizeof(struct snp_guest_msg));
e_free_request:
free_shared_pages(mdesc->request, sizeof(struct snp_guest_msg));
e_unmap:
@@ -2885,7 +2874,6 @@ void snp_msg_free(struct snp_msg_desc *mdesc)
kfree(mdesc->ctx);
free_shared_pages(mdesc->response, sizeof(struct snp_guest_msg));
free_shared_pages(mdesc->request, sizeof(struct snp_guest_msg));
- free_shared_pages(mdesc->certs_data, SEV_FW_BLOB_MAX_SIZE);
iounmap((__force void __iomem *)mdesc->secrets);
memset(mdesc, 0, sizeof(*mdesc));
@@ -3054,7 +3042,7 @@ static int __handle_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_r
* sequence number must be incremented or the VMPCK must be deleted to
* prevent reuse of the IV.
*/
- rc = snp_issue_guest_request(req, &mdesc->input, rio);
+ rc = snp_issue_guest_request(req, &req->input, rio);
switch (rc) {
case -ENOSPC:
/*
@@ -3064,7 +3052,7 @@ static int __handle_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_r
* order to increment the sequence number and thus avoid
* IV reuse.
*/
- override_npages = mdesc->input.data_npages;
+ override_npages = req->input.data_npages;
req->exit_code = SVM_VMGEXIT_GUEST_REQUEST;
/*
@@ -3120,7 +3108,7 @@ static int __handle_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_r
}
if (override_npages)
- mdesc->input.data_npages = override_npages;
+ req->input.data_npages = override_npages;
return rc;
}
@@ -3158,6 +3146,11 @@ int snp_send_guest_request(struct snp_msg_desc *mdesc, struct snp_guest_req *req
*/
memcpy(mdesc->request, &mdesc->secret_request, sizeof(mdesc->secret_request));
+ /* initial the input address for guest request */
+ req->input.req_gpa = __pa(mdesc->request);
+ req->input.resp_gpa = __pa(mdesc->response);
+ req->input.data_gpa = req->certs_data ? __pa(req->certs_data) : 0;
+
rc = __handle_guest_request(mdesc, req, rio);
if (rc) {
if (rc == -EIO &&
diff --git a/drivers/virt/coco/sev-guest/sev-guest.c b/drivers/virt/coco/sev-guest/sev-guest.c
index 4699fdc9ed44..cf3fb61f4d5b 100644
--- a/drivers/virt/coco/sev-guest/sev-guest.c
+++ b/drivers/virt/coco/sev-guest/sev-guest.c
@@ -177,6 +177,7 @@ static int get_ext_report(struct snp_guest_dev *snp_dev, struct snp_guest_reques
struct snp_guest_req req = {};
int ret, npages = 0, resp_len;
sockptr_t certs_address;
+ struct page *page;
if (sockptr_is_null(io->req_data) || sockptr_is_null(io->resp_data))
return -EINVAL;
@@ -210,8 +211,20 @@ static int get_ext_report(struct snp_guest_dev *snp_dev, struct snp_guest_reques
* the host. If host does not supply any certs in it, then copy
* zeros to indicate that certificate data was not provided.
*/
- memset(mdesc->certs_data, 0, report_req->certs_len);
npages = report_req->certs_len >> PAGE_SHIFT;
+ page = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO,
+ get_order(report_req->certs_len));
+ if (!page)
+ return -ENOMEM;
+
+ req.certs_data = page_address(page);
+ ret = set_memory_decrypted((unsigned long)req.certs_data, npages);
+ if (ret) {
+ pr_err("failed to mark page shared, ret=%d\n", ret);
+ __free_pages(page, get_order(report_req->certs_len));
+ return -EFAULT;
+ }
+
cmd:
/*
* The intermediate response buffer is used while decrypting the
@@ -220,10 +233,12 @@ static int get_ext_report(struct snp_guest_dev *snp_dev, struct snp_guest_reques
*/
resp_len = sizeof(report_resp->data) + mdesc->ctx->authsize;
report_resp = kzalloc(resp_len, GFP_KERNEL_ACCOUNT);
- if (!report_resp)
- return -ENOMEM;
+ if (!report_resp) {
+ ret = -ENOMEM;
+ goto e_free_data;
+ }
- mdesc->input.data_npages = npages;
+ req.input.data_npages = npages;
req.msg_version = arg->msg_version;
req.msg_type = SNP_MSG_REPORT_REQ;
@@ -238,7 +253,7 @@ static int get_ext_report(struct snp_guest_dev *snp_dev, struct snp_guest_reques
/* If certs length is invalid then copy the returned length */
if (arg->vmm_error == SNP_GUEST_VMM_ERR_INVALID_LEN) {
- report_req->certs_len = mdesc->input.data_npages << PAGE_SHIFT;
+ report_req->certs_len = req.input.data_npages << PAGE_SHIFT;
if (copy_to_sockptr(io->req_data, report_req, sizeof(*report_req)))
ret = -EFAULT;
@@ -247,7 +262,7 @@ static int get_ext_report(struct snp_guest_dev *snp_dev, struct snp_guest_reques
if (ret)
goto e_free;
- if (npages && copy_to_sockptr(certs_address, mdesc->certs_data, report_req->certs_len)) {
+ if (npages && copy_to_sockptr(certs_address, req.certs_data, report_req->certs_len)) {
ret = -EFAULT;
goto e_free;
}
@@ -257,6 +272,13 @@ static int get_ext_report(struct snp_guest_dev *snp_dev, struct snp_guest_reques
e_free:
kfree(report_resp);
+e_free_data:
+ if (npages) {
+ if (set_memory_encrypted((unsigned long)req.certs_data, npages))
+ WARN_ONCE(ret, "failed to restore encryption mask (leak it)\n");
+ else
+ __free_pages(page, get_order(report_req->certs_len));
+ }
return ret;
}
--
2.47.1
On our Marvell OCTEON CN96XX board, we observed the following panic on
the latest kernel:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000080
Mem abort info:
ESR = 0x0000000096000005
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x05: level 1 translation fault
Data abort info:
ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
CM = 0, WnR = 0, TnD = 0, TagAccess = 0
GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[0000000000000080] user address but active_mm is swapper
Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
Modules linked in:
CPU: 9 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.13.0-rc7-00149-g9bffa1ad25b8 #1
Hardware name: Marvell OcteonTX CN96XX board (DT)
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : of_pci_add_properties+0x278/0x4c8
lr : of_pci_add_properties+0x258/0x4c8
sp : ffff8000822ef9b0
x29: ffff8000822ef9b0 x28: ffff000106dd8000 x27: ffff800081bc3b30
x26: ffff800081540118 x25: ffff8000813d2be0 x24: 0000000000000000
x23: ffff00010528a800 x22: ffff000107c50000 x21: ffff0001039c2630
x20: ffff0001039c2630 x19: 0000000000000000 x18: ffffffffffffffff
x17: 00000000a49c1b85 x16: 0000000084c07b58 x15: ffff000103a10f98
x14: ffffffffffffffff x13: ffff000103a10f96 x12: 0000000000000003
x11: 0101010101010101 x10: 000000000000002c x9 : ffff800080ca7acc
x8 : ffff0001038fd900 x7 : 0000000000000000 x6 : 0000000000696370
x5 : 0000000000000000 x4 : 0000000000000002 x3 : ffff8000822efa40
x2 : ffff800081341000 x1 : ffff000107c50000 x0 : 0000000000000000
Call trace:
of_pci_add_properties+0x278/0x4c8 (P)
of_pci_make_dev_node+0xe0/0x158
pci_bus_add_device+0x158/0x210
pci_bus_add_devices+0x40/0x98
pci_host_probe+0x94/0x118
pci_host_common_probe+0x120/0x1a0
platform_probe+0x70/0xf0
really_probe+0xb4/0x2a8
__driver_probe_device+0x80/0x140
driver_probe_device+0x48/0x170
__driver_attach+0x9c/0x1b0
bus_for_each_dev+0x7c/0xe8
driver_attach+0x2c/0x40
bus_add_driver+0xec/0x218
driver_register+0x68/0x138
__platform_driver_register+0x2c/0x40
gen_pci_driver_init+0x24/0x38
do_one_initcall+0x4c/0x278
kernel_init_freeable+0x1f4/0x3d0
kernel_init+0x28/0x1f0
ret_from_fork+0x10/0x20
Code: aa1603e1 f0005522 d2800044 91000042 (f94040a0)
This regression was introduced by commit 7246a4520b4b ("PCI: Use
preserve_config in place of pci_flags"). On our board, the 0002:00:07.0
bridge is misconfigured by the bootloader. Both its secondary and
subordinate bus numbers are initialized to 0, while its fixed secondary
bus number is set to 8. However, bus number 8 is also assigned to another
bridge (0002:00:0f.0). Although this is a bootloader issue, before the
change in commit 7246a4520b4b, the PCI_REASSIGN_ALL_BUS flag was
set by default when PCI_PROBE_ONLY was not enabled, ensuring that all the
bus numbers for these bridges were reassigned, avoiding any conflicts.
After the change introduced in commit 7246a4520b4b, the bus numbers
assigned by the bootloader are reused by all other bridges, except
the misconfigured 0002:00:07.0 bridge. The kernel attempts to reconfigure
0002:00:07.0 by reusing the fixed secondary bus number 8 assigned by the
bootloader. However, since a pci_bus has already been allocated for
bus 8 due to the probe of 0002:00:0f.0, no new pci_bus is allocated for
0002:00:07.0. This results in a PCI bridge device without a pci_bus
attached (pdev->subordinate == NULL). Consequently, accessing
pdev->subordinate in of_pci_prop_bus_range() leads to a NULL pointer
dereference.
To summarize, we need to restore the PCI_REASSIGN_ALL_BUS flag when
PCI_PROBE_ONLY is not enabled in order to work around issues like the one
described above.
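The flag logic being restored amounts to the following, shown here as a
userspace sketch. The bit values and the apply_host_common_flags() helper
are made up for illustration; the kernel uses pci_has_flag() and
pci_add_flags() on a global flag word.

```c
/* Hypothetical bit values, chosen only for this illustration. */
enum {
	MODEL_PROBE_ONLY	= 1u << 0,
	MODEL_REASSIGN_ALL_BUS	= 1u << 1,
};

/*
 * Request bus renumbering unless the platform asked for probe-only
 * behaviour, in which case the firmware's assignment is left alone.
 */
static unsigned int apply_host_common_flags(unsigned int flags)
{
	if (!(flags & MODEL_PROBE_ONLY))
		flags |= MODEL_REASSIGN_ALL_BUS;
	return flags;
}
```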
Cc: stable(a)vger.kernel.org
Fixes: 7246a4520b4b ("PCI: Use preserve_config in place of pci_flags")
Signed-off-by: Bo Sun <Bo.Sun.CN(a)windriver.com>
---
drivers/pci/controller/pci-host-common.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/pci/controller/pci-host-common.c b/drivers/pci/controller/pci-host-common.c
index cf5f59a745b3..615923acbc3e 100644
--- a/drivers/pci/controller/pci-host-common.c
+++ b/drivers/pci/controller/pci-host-common.c
@@ -73,6 +73,10 @@ int pci_host_common_probe(struct platform_device *pdev)
if (IS_ERR(cfg))
return PTR_ERR(cfg);
+ /* Do not reassign resources if probe only */
+ if (!pci_has_flag(PCI_PROBE_ONLY))
+ pci_add_flags(PCI_REASSIGN_ALL_BUS);
+
bridge->sysdata = cfg;
bridge->ops = (struct pci_ops *)&ops->pci_ops;
bridge->msi_domain = true;
--
2.48.1
Instead of writing a pte directly into the table, use the set_pte_at()
helper, which gives the arch visibility of the change.
In this instance we are guaranteed that the pte was originally none and
is being modified to a not-present pte, so there was unlikely to be a
bug in practice (at least not on arm64). But it's bad practice to write
the page table memory directly without arch involvement.
Cc: <stable(a)vger.kernel.org>
Fixes: 662df3e5c376 ("mm: madvise: implement lightweight guard page mechanism")
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
---
mm/madvise.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 388dc289b5d1..6170f4acc14f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1101,7 +1101,7 @@ static int guard_install_set_pte(unsigned long addr, unsigned long next,
unsigned long *nr_pages = (unsigned long *)walk->private;
/* Simply install a PTE marker, this causes segfault on access. */
- *ptep = make_pte_marker(PTE_MARKER_GUARD);
+ set_pte_at(walk->mm, addr, ptep, make_pte_marker(PTE_MARKER_GUARD));
(*nr_pages)++;
return 0;
--
2.43.0
On Tue, Feb 18, 2025 at 02:10:08AM +0100, Andrew Lunn wrote:
> On Tue, Feb 18, 2025 at 12:24:43AM +0000, Qasim Ijaz wrote:
> > In mii_nway_restart() during the line:
> >
> > bmcr = mii->mdio_read(mii->dev, mii->phy_id, MII_BMCR);
> >
> > The code attempts to call mii->mdio_read which is ch9200_mdio_read().
> >
> > ch9200_mdio_read() utilises a local buffer, which is initialised
> > with control_read():
> >
> > unsigned char buff[2];
> >
> > However buff is conditionally initialised inside control_read():
> >
> > if (err == size) {
> > memcpy(data, buf, size);
> > }
> >
> > If the condition of "err == size" is not met, then buff remains
> > uninitialised. Once this happens the uninitialised buff is accessed
> > and returned during ch9200_mdio_read():
> >
> > return (buff[0] | buff[1] << 8);
> >
> > The problem stems from the fact that ch9200_mdio_read() ignores the
> > return value of control_read(), leading to uninit-access of buff.
> >
> > To fix this we should check the return value of control_read()
> > and return early on error.
>
> What about get_mac_address()?
>
> If you find a bug, it is a good idea to look around and see if there
> are any more instances of the same bug. I could be wrong, but it seems
> like get_mac_address() suffers from the same problem?
Thank you for the feedback, Andrew. I checked get_mac_address() before
sending this patch, and to me it looks like it does check the return value
of control_read(). It accumulates the return value of each control_read()
call into rd_mac_len and then checks whether it is not equal to the
expected length (ETH_ALEN, which is 6). I believe each call should
return 2.
>
> Andrew
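To make the two patterns discussed in this thread concrete, here is a
userspace sketch. control_read_model() is a hypothetical mock that copies
data only on a full transfer, as the quoted snippet does; its first
argument simulates the USB transfer result, and mapping short transfers to
-EINVAL is an assumption. The other two functions model the proposed
ch9200_mdio_read() fix and the accumulate-and-compare check described for
get_mac_address().

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MODEL_ETH_ALEN 6

/* Mock: fills data only when the whole transfer succeeded. */
static int control_read_model(int result, void *data, int size)
{
	static const unsigned char wire[2] = { 0x34, 0x12 };

	assert(size == 2);
	if (result == size) {
		memcpy(data, wire, (size_t)size);
		return result;
	}
	return result < 0 ? result : -EINVAL;
}

/* Proposed fix: bail out before touching buff if the read failed. */
static int mdio_read_checked(int result)
{
	unsigned char buff[2];
	int ret = control_read_model(result, buff, (int)sizeof(buff));

	if (ret < 0)
		return ret;
	return buff[0] | buff[1] << 8;
}

/* get_mac_address() style: sum the return values, compare to ETH_ALEN. */
static int get_mac_model(const int results[3], unsigned char mac[MODEL_ETH_ALEN])
{
	int rd_mac_len = 0;

	for (int i = 0; i < 3; i++)
		rd_mac_len += control_read_model(results[i], mac + 2 * i, 2);
	return rd_mac_len == MODEL_ETH_ALEN ? 0 : -EINVAL;
}
```

In the model, any failed read makes the sum differ from 6, so the caller
never consumes the partially filled MAC buffer.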