This is a note to let you know that I've just added the patch titled
usb: dwc3: gadget: Fix null pointer exception
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 26288448120b28af1dfd85a6fa6b6d55a16c7f2f Mon Sep 17 00:00:00 2001
From: Albert Wang <albertccwang(a)google.com>
Date: Tue, 9 Nov 2021 17:26:42 +0800
Subject: usb: dwc3: gadget: Fix null pointer exception
In the endpoint interrupt functions
dwc3_gadget_endpoint_transfer_in_progress() and
dwc3_gadget_endpoint_trbs_complete() will dereference the endpoint
descriptor. But it could be cleared in __dwc3_gadget_ep_disable()
when accessory disconnected. So we need to check whether it is null
or not before dereferencing it.
Fixes: f09ddcfcb8c5 ("usb: dwc3: gadget: Prevent EP queuing while stopping transfers")
Cc: stable <stable(a)vger.kernel.org>
Reviewed-by: Jack Pham <quic_jackp(a)quicinc.com>
Signed-off-by: Albert Wang <albertccwang(a)google.com>
Link: https://lore.kernel.org/r/20211109092642.3507692-1-albertccwang@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/dwc3/gadget.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index daa8f8548a2e..7e3db00e9759 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -3263,6 +3263,9 @@ static bool dwc3_gadget_endpoint_trbs_complete(struct dwc3_ep *dep,
struct dwc3 *dwc = dep->dwc;
bool no_started_trb = true;
+ if (!dep->endpoint.desc)
+ return no_started_trb;
+
dwc3_gadget_ep_cleanup_completed_requests(dep, event, status);
if (dep->flags & DWC3_EP_END_TRANSFER_PENDING)
@@ -3310,6 +3313,9 @@ static void dwc3_gadget_endpoint_transfer_in_progress(struct dwc3_ep *dep,
{
int status = 0;
+ if (!dep->endpoint.desc)
+ return;
+
if (usb_endpoint_xfer_isoc(dep->endpoint.desc))
dwc3_gadget_endpoint_frame_from_event(dep, event);
--
2.34.0
This is a note to let you know that I've just added the patch titled
usb: dwc2: gadget: Fix ISOC flow for elapsed frames
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 7ad4a0b1d46b2612f4429a72afd8f137d7efa9a9 Mon Sep 17 00:00:00 2001
From: Minas Harutyunyan <Minas.Harutyunyan(a)synopsys.com>
Date: Thu, 4 Nov 2021 11:36:01 +0400
Subject: usb: dwc2: gadget: Fix ISOC flow for elapsed frames
Added updating of request frame number for elapsed frames,
otherwise frame number will remain as previous use of request.
This will allow function driver to correctly track frames in
case of Missed ISOC occurs.
Added setting request actual length to 0 for elapsed frames.
In Slave mode when pushing data to RxFIFO by dwords, request
actual length incrementing accordingly. But before whole packet
will be pushed into RxFIFO and send to host can occurs Missed
ISOC and data will not send to host. So, in this case request
actual length should be reset to 0.
Fixes: 91bb163e1e4f ("usb: dwc2: gadget: Fix ISOC flow for BDMA and Slave")
Cc: stable <stable(a)vger.kernel.org>
Reviewed-by: John Keeping <john(a)metanate.com>
Signed-off-by: Minas Harutyunyan <Minas.Harutyunyan(a)synopsys.com>
Link: https://lore.kernel.org/r/c356baade6e9716d312d43df08d53ae557cb8037.16360112…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/dwc2/gadget.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/drivers/usb/dwc2/gadget.c b/drivers/usb/dwc2/gadget.c
index 4ab4a1d5062b..ab8d7dad9f56 100644
--- a/drivers/usb/dwc2/gadget.c
+++ b/drivers/usb/dwc2/gadget.c
@@ -1198,6 +1198,8 @@ static void dwc2_hsotg_start_req(struct dwc2_hsotg *hsotg,
}
ctrl |= DXEPCTL_CNAK;
} else {
+ hs_req->req.frame_number = hs_ep->target_frame;
+ hs_req->req.actual = 0;
dwc2_hsotg_complete_request(hsotg, hs_ep, hs_req, -ENODATA);
return;
}
@@ -2857,9 +2859,12 @@ static void dwc2_gadget_handle_ep_disabled(struct dwc2_hsotg_ep *hs_ep)
do {
hs_req = get_ep_head(hs_ep);
- if (hs_req)
+ if (hs_req) {
+ hs_req->req.frame_number = hs_ep->target_frame;
+ hs_req->req.actual = 0;
dwc2_hsotg_complete_request(hsotg, hs_ep, hs_req,
-ENODATA);
+ }
dwc2_gadget_incr_frame_num(hs_ep);
/* Update current frame number value. */
hsotg->frame_number = dwc2_hsotg_read_frameno(hsotg);
@@ -2912,8 +2917,11 @@ static void dwc2_gadget_handle_out_token_ep_disabled(struct dwc2_hsotg_ep *ep)
while (dwc2_gadget_target_frame_elapsed(ep)) {
hs_req = get_ep_head(ep);
- if (hs_req)
+ if (hs_req) {
+ hs_req->req.frame_number = ep->target_frame;
+ hs_req->req.actual = 0;
dwc2_hsotg_complete_request(hsotg, ep, hs_req, -ENODATA);
+ }
dwc2_gadget_incr_frame_num(ep);
/* Update current frame number value. */
@@ -3002,8 +3010,11 @@ static void dwc2_gadget_handle_nak(struct dwc2_hsotg_ep *hs_ep)
while (dwc2_gadget_target_frame_elapsed(hs_ep)) {
hs_req = get_ep_head(hs_ep);
- if (hs_req)
+ if (hs_req) {
+ hs_req->req.frame_number = hs_ep->target_frame;
+ hs_req->req.actual = 0;
dwc2_hsotg_complete_request(hsotg, hs_ep, hs_req, -ENODATA);
+ }
dwc2_gadget_incr_frame_num(hs_ep);
/* Update current frame number value. */
--
2.34.0
This is a note to let you know that I've just added the patch titled
usb: dwc3: gadget: Ignore NoStream after End Transfer
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From d74dc3e9f58c28689cef1faccf918e06587367d3 Mon Sep 17 00:00:00 2001
From: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
Date: Mon, 25 Oct 2021 16:21:10 -0700
Subject: usb: dwc3: gadget: Ignore NoStream after End Transfer
The End Transfer command from a stream endpoint will generate a NoStream
event, and we should ignore it. Currently we set the flag
DWC3_EP_IGNORE_NEXT_NOSTREAM to track this prior to sending the command,
and it will be cleared on the next stream event. However, a stream event
may be generated before the End Transfer command completion and
prematurely clear the flag. Fix this by setting the flag on End Transfer
completion instead.
Fixes: 140ca4cfea8a ("usb: dwc3: gadget: Handle stream transfers")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
Link: https://lore.kernel.org/r/cee1253af4c3600edb878d11c9c08b040817ae23.16352039…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/dwc3/gadget.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 23de2a5a40d6..3d6f4adaa15a 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -3352,6 +3352,14 @@ static void dwc3_gadget_endpoint_command_complete(struct dwc3_ep *dep,
if (cmd != DWC3_DEPCMD_ENDTRANSFER)
return;
+ /*
+ * The END_TRANSFER command will cause the controller to generate a
+ * NoStream Event, and it's not due to the host DP NoStream rejection.
+ * Ignore the next NoStream event.
+ */
+ if (dep->stream_capable)
+ dep->flags |= DWC3_EP_IGNORE_NEXT_NOSTREAM;
+
dep->flags &= ~DWC3_EP_END_TRANSFER_PENDING;
dep->flags &= ~DWC3_EP_TRANSFER_STARTED;
dwc3_gadget_ep_cleanup_cancelled_requests(dep);
@@ -3574,14 +3582,6 @@ static void dwc3_stop_active_transfer(struct dwc3_ep *dep, bool force,
WARN_ON_ONCE(ret);
dep->resource_index = 0;
- /*
- * The END_TRANSFER command will cause the controller to generate a
- * NoStream Event, and it's not due to the host DP NoStream rejection.
- * Ignore the next NoStream event.
- */
- if (dep->stream_capable)
- dep->flags |= DWC3_EP_IGNORE_NEXT_NOSTREAM;
-
if (!interrupt)
dep->flags &= ~DWC3_EP_TRANSFER_STARTED;
else
--
2.34.0
This is a note to let you know that I've just added the patch titled
usb: dwc3: gadget: Check for L1/L2/U3 for Start Transfer
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 63c4c320ccf77074ffe9019ac596603133c1b517 Mon Sep 17 00:00:00 2001
From: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
Date: Mon, 25 Oct 2021 16:35:06 -0700
Subject: usb: dwc3: gadget: Check for L1/L2/U3 for Start Transfer
The programming guide noted that the driver needs to verify if the link
state is in U0 before executing the Start Transfer command. If it's not
in U0, the driver needs to perform remote wakeup. This is not accurate.
If the link state is in U1/U2, then the controller will not respond to
link recovery request from DCTL.ULSTCHNGREQ. The Start Transfer command
will trigger a link recovery if it is in U1/U2. A clarification will be
added to the programming guide for all controller versions.
The current implementation shouldn't cause any functional issue. It may
occasionally report an invalid time out warning from failed link
recovery request. The driver will still go ahead with the Start Transfer
command if the remote wakeup fails. The new change only initiates remote
wakeup where it is needed, which is when the link state is in L1/L2/U3.
Fixes: c36d8e947a56 ("usb: dwc3: gadget: put link to U0 before Start Transfer")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
Link: https://lore.kernel.org/r/05b4a5fbfbd0863fc9b1d7af934a366219e3d0b4.16352047…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/dwc3/gadget.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 3d6f4adaa15a..daa8f8548a2e 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -310,13 +310,24 @@ int dwc3_send_gadget_ep_cmd(struct dwc3_ep *dep, unsigned int cmd,
if (DWC3_DEPCMD_CMD(cmd) == DWC3_DEPCMD_STARTTRANSFER) {
int link_state;
+ /*
+ * Initiate remote wakeup if the link state is in U3 when
+ * operating in SS/SSP or L1/L2 when operating in HS/FS. If the
+ * link state is in U1/U2, no remote wakeup is needed. The Start
+ * Transfer command will initiate the link recovery.
+ */
link_state = dwc3_gadget_get_link_state(dwc);
- if (link_state == DWC3_LINK_STATE_U1 ||
- link_state == DWC3_LINK_STATE_U2 ||
- link_state == DWC3_LINK_STATE_U3) {
+ switch (link_state) {
+ case DWC3_LINK_STATE_U2:
+ if (dwc->gadget->speed >= USB_SPEED_SUPER)
+ break;
+
+ fallthrough;
+ case DWC3_LINK_STATE_U3:
ret = __dwc3_gadget_wakeup(dwc);
dev_WARN_ONCE(dwc->dev, ret, "wakeup failed --> %d\n",
ret);
+ break;
}
}
--
2.34.0
This is a note to let you know that I've just added the patch titled
usb: dwc3: core: Revise GHWPARAMS9 offset
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 250fdabec6ffcaf895c5e0dedca62706ef10d8f6 Mon Sep 17 00:00:00 2001
From: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
Date: Mon, 25 Oct 2021 16:15:32 -0700
Subject: usb: dwc3: core: Revise GHWPARAMS9 offset
During our predesign phase for DWC_usb32, the GHWPARAMS9 register offset
was 0xc680. We revised our final design, and the GHWPARAMS9 offset is
now moved to 0xc6e8 on release.
Fixes: 16710380d3aa ("usb: dwc3: Capture new capability register GHWPARAMS9")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
Link: https://lore.kernel.org/r/1541737108266a97208ff827805be1f32852590c.16352028…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/dwc3/core.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h
index 620c8d3914d7..5c491d0a19d7 100644
--- a/drivers/usb/dwc3/core.h
+++ b/drivers/usb/dwc3/core.h
@@ -143,7 +143,7 @@
#define DWC3_GHWPARAMS8 0xc600
#define DWC3_GUCTL3 0xc60c
#define DWC3_GFLADJ 0xc630
-#define DWC3_GHWPARAMS9 0xc680
+#define DWC3_GHWPARAMS9 0xc6e0
/* Device Registers */
#define DWC3_DCFG 0xc700
--
2.34.0
This is a note to let you know that I've just added the patch titled
staging: rtl8192e: Fix use after free in _rtl92e_pci_disconnect()
to my staging git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git
in the staging-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From b535917c51acc97fb0761b1edec85f1f3d02bda4 Mon Sep 17 00:00:00 2001
From: Dan Carpenter <dan.carpenter(a)oracle.com>
Date: Wed, 17 Nov 2021 10:20:16 +0300
Subject: staging: rtl8192e: Fix use after free in _rtl92e_pci_disconnect()
The free_rtllib() function frees the "dev" pointer so there is use
after free on the next line. Re-arrange things to avoid that.
Fixes: 66898177e7e5 ("staging: rtl8192e: Fix unload/reload problem")
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter(a)oracle.com>
Link: https://lore.kernel.org/r/20211117072016.GA5237@kili
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/staging/rtl8192e/rtl8192e/rtl_core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/staging/rtl8192e/rtl8192e/rtl_core.c b/drivers/staging/rtl8192e/rtl8192e/rtl_core.c
index d2e9df60e9ba..b9ce71848023 100644
--- a/drivers/staging/rtl8192e/rtl8192e/rtl_core.c
+++ b/drivers/staging/rtl8192e/rtl8192e/rtl_core.c
@@ -2549,13 +2549,14 @@ static void _rtl92e_pci_disconnect(struct pci_dev *pdev)
free_irq(dev->irq, dev);
priv->irq = 0;
}
- free_rtllib(dev);
if (dev->mem_start != 0) {
iounmap((void __iomem *)dev->mem_start);
release_mem_region(pci_resource_start(pdev, 1),
pci_resource_len(pdev, 1));
}
+
+ free_rtllib(dev);
}
pci_disable_device(pdev);
--
2.34.0
This is a note to let you know that I've just added the patch titled
staging: greybus: Add missing rwsem around snd_ctl_remove() calls
to my staging git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git
in the staging-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From ffcf7ae90f4489047d7b076539ba207024dea5f6 Mon Sep 17 00:00:00 2001
From: Takashi Iwai <tiwai(a)suse.de>
Date: Tue, 16 Nov 2021 08:20:27 +0100
Subject: staging: greybus: Add missing rwsem around snd_ctl_remove() calls
snd_ctl_remove() has to be called with card->controls_rwsem held (when
called after the card instantiation). This patch adds the missing
rwsem calls around it.
Fixes: 510e340efe0c ("staging: greybus: audio: Add helper APIs for dynamic audio modules")
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
Link: https://lore.kernel.org/r/20211116072027.18466-1-tiwai@suse.de
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/staging/greybus/audio_helper.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/staging/greybus/audio_helper.c b/drivers/staging/greybus/audio_helper.c
index 1ed4772d2771..843760675876 100644
--- a/drivers/staging/greybus/audio_helper.c
+++ b/drivers/staging/greybus/audio_helper.c
@@ -192,7 +192,11 @@ int gbaudio_remove_component_controls(struct snd_soc_component *component,
unsigned int num_controls)
{
struct snd_card *card = component->card->snd_card;
+ int err;
- return gbaudio_remove_controls(card, component->dev, controls,
- num_controls, component->name_prefix);
+ down_write(&card->controls_rwsem);
+ err = gbaudio_remove_controls(card, component->dev, controls,
+ num_controls, component->name_prefix);
+ up_write(&card->controls_rwsem);
+ return err;
}
--
2.34.0
As spotted and explained in commit c12ab8dbc492 ("powerpc/8xx: Fix
Oops with STRICT_KERNEL_RWX without DEBUG_RODATA_TEST"), the selection
of STRICT_KERNEL_RWX without selecting DEBUG_RODATA_TEST has spotted
the lack of the DIRTY bit in the pinned kernel data TLBs.
This problem should have been detected a lot earlier if things had
been working as expected. But due to an incredible level of chance or
mishap, this went undetected because of a set of bugs: In fact the
DTLBs were not pinned, because instead of setting the reserve bit
in MD_CTR, it was set in MI_CTR that is the register for ITLBs.
But then, another huge bug was there: the physical address was
reset to 0 at the boundary between RO and RW areas, leading to the
same physical space being mapped at both 0xc0000000 and 0xc8000000.
This had by miracle no consequence until now because the entry was
not really pinned so it was overwritten soon enough to go undetected.
Of course, now that we really pin the DTLBs, it must be fixed as well.
Depends-on: c12ab8dbc492 ("powerpc/8xx: Fix Oops with STRICT_KERNEL_RWX without DEBUG_RODATA_TEST")
Fixes: f76c8f6d257c ("powerpc/8xx: Add function to set pinned TLBs")
Cc: stable(a)vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
---
arch/powerpc/kernel/head_8xx.S | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 2d596881b70e..0d073b9fd52c 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -733,6 +733,7 @@ _GLOBAL(mmu_pin_tlb)
#ifdef CONFIG_PIN_TLB_DATA
LOAD_REG_IMMEDIATE(r6, PAGE_OFFSET)
LOAD_REG_IMMEDIATE(r7, MI_SVALID | MI_PS8MEG | _PMD_ACCESSED)
+ li r8, 0
#ifdef CONFIG_PIN_TLB_IMMR
li r0, 3
#else
@@ -741,26 +742,26 @@ _GLOBAL(mmu_pin_tlb)
mtctr r0
cmpwi r4, 0
beq 4f
- LOAD_REG_IMMEDIATE(r8, 0xf0 | _PAGE_RO | _PAGE_SPS | _PAGE_SH | _PAGE_PRESENT)
LOAD_REG_ADDR(r9, _sinittext)
2: ori r0, r6, MD_EVALID
+ ori r12, r8, 0xf0 | _PAGE_RO | _PAGE_SPS | _PAGE_SH | _PAGE_PRESENT
mtspr SPRN_MD_CTR, r5
mtspr SPRN_MD_EPN, r0
mtspr SPRN_MD_TWC, r7
- mtspr SPRN_MD_RPN, r8
+ mtspr SPRN_MD_RPN, r12
addi r5, r5, 0x100
addis r6, r6, SZ_8M@h
addis r8, r8, SZ_8M@h
cmplw r6, r9
bdnzt lt, 2b
-
-4: LOAD_REG_IMMEDIATE(r8, 0xf0 | _PAGE_DIRTY | _PAGE_SPS | _PAGE_SH | _PAGE_PRESENT)
+4:
2: ori r0, r6, MD_EVALID
+ ori r12, r8, 0xf0 | _PAGE_DIRTY | _PAGE_SPS | _PAGE_SH | _PAGE_PRESENT
mtspr SPRN_MD_CTR, r5
mtspr SPRN_MD_EPN, r0
mtspr SPRN_MD_TWC, r7
- mtspr SPRN_MD_RPN, r8
+ mtspr SPRN_MD_RPN, r12
addi r5, r5, 0x100
addis r6, r6, SZ_8M@h
addis r8, r8, SZ_8M@h
@@ -781,7 +782,7 @@ _GLOBAL(mmu_pin_tlb)
#endif
#if defined(CONFIG_PIN_TLB_IMMR) || defined(CONFIG_PIN_TLB_DATA)
lis r0, (MD_RSV4I | MD_TWAM)@h
- mtspr SPRN_MI_CTR, r0
+ mtspr SPRN_MD_CTR, r0
#endif
mtspr SPRN_SRR1, r10
mtspr SPRN_SRR0, r11
--
2.31.1
The conversion from __copy_from_user() to __get_user() by
commit d3ccc9781560 ("powerpc/signal: Use __get_user() to copy
sigset_t") introduced a regression in __get_user_sigset() for
powerpc/32. The bug was subsequently moved into
unsafe_get_user_sigset().
The bug is due to the copied 64 bit value being truncated to
32 bits while being assigned to dst->sig[0]
The regression was reported by users of the Xorg packages distributed in
Debian/powerpc --
"The symptoms are that the fb screen goes blank, with the backlight
remaining on and no errors logged in /var/log; wdm (or startx) run
with no effect (I tried logging in in the blind, with no effect).
And they are hard to kill, requiring 'kill -KILL ...'"
Fix the regression by copying each word of the sigset, not only the
first one.
__get_user_sigset() was tentatively optimised to copy 64 bits at once
in order to minimise KUAP unlock/lock impact, but the unsafe variant
doesn't suffer that, so it can just copy words.
Cc: Christopher M. Riedl <cmr(a)bluescreens.de>
Fixes: 887f3ceb51cd ("powerpc/signal32: Convert do_setcontext[_tm]() to user access block")
Cc: stable(a)vger.kernel.org
Reported-by: Finn Thain <fthain(a)linux-m68k.org>
Reported-and-tested-by: Stan Johnson <userm57(a)yahoo.com>
Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
---
arch/powerpc/kernel/signal.h | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
index 1f07317964e4..618aeccdf691 100644
--- a/arch/powerpc/kernel/signal.h
+++ b/arch/powerpc/kernel/signal.h
@@ -25,8 +25,14 @@ static inline int __get_user_sigset(sigset_t *dst, const sigset_t __user *src)
return __get_user(dst->sig[0], (u64 __user *)&src->sig[0]);
}
-#define unsafe_get_user_sigset(dst, src, label) \
- unsafe_get_user((dst)->sig[0], (u64 __user *)&(src)->sig[0], label)
+#define unsafe_get_user_sigset(dst, src, label) do { \
+ sigset_t *__dst = dst; \
+ const sigset_t __user *__src = src; \
+ int i; \
+ \
+ for (i = 0; i < _NSIG_WORDS; i++) \
+ unsafe_get_user(__dst->sig[i], &__src->sig[i], label); \
+} while (0)
#ifdef CONFIG_VSX
extern unsigned long copy_vsx_to_user(void __user *to,
--
2.31.1
Before commit 740499c78408 ("iomap: fix the iomap_readpage_actor return
value for inline data"), when hitting an IOMAP_INLINE extent,
iomap_readpage_actor would report having read the entire page. Since
then, it only reports having read the inline data (iomap->length).
This will force iomap_readpage into another iteration, and the
filesystem will report an unaligned hole after the IOMAP_INLINE extent.
But iomap_readpage_actor (now iomap_readpage_iter) isn't prepared to
deal with unaligned extents, it will get things wrong on filesystems
with a block size smaller than the page size, and we'll eventually run
into the following warning in iomap_iter_advance:
WARN_ON_ONCE(iter->processed > iomap_length(iter));
Fix that by changing iomap_readpage_iter to return 0 when hitting an
inline extent; this will cause iomap_iter to stop immediately.
To fix readahead as well, change iomap_readahead_iter to pass on
iomap_readpage_iter return values less than or equal to zero.
Fixes: 740499c78408 ("iomap: fix the iomap_readpage_actor return value for inline data")
Cc: stable(a)vger.kernel.org # v5.15+
Signed-off-by: Andreas Gruenbacher <agruenba(a)redhat.com>
---
fs/iomap/buffered-io.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 1753c26c8e76..fe10d8a30f6b 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -256,8 +256,13 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
unsigned poff, plen;
sector_t sector;
- if (iomap->type == IOMAP_INLINE)
- return min(iomap_read_inline_data(iter, page), length);
+ if (iomap->type == IOMAP_INLINE) {
+ loff_t ret = iomap_read_inline_data(iter, page);
+
+ if (ret < 0)
+ return ret;
+ return 0;
+ }
/* zero post-eof blocks as the page may be mapped */
iop = iomap_page_create(iter->inode, page);
@@ -370,6 +375,8 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
ctx->cur_page_in_bio = false;
}
ret = iomap_readpage_iter(iter, ctx, done);
+ if (ret <= 0)
+ return ret;
}
return done;
--
2.31.1
Allow the liteuart driver to be compile tested by fixing the broken
Kconfig dependencies.
Fixes: 1da81e5562fa ("drivers/tty/serial: add LiteUART driver")
Cc: stable(a)vger.kernel.org # 5.11
Cc: Filip Kokosinski <fkokosinski(a)antmicro.com>
Cc: Mateusz Holenko <mholenko(a)antmicro.com>
Cc: Stafford Horne <shorne(a)gmail.com>
Signed-off-by: Johan Hovold <johan(a)kernel.org>
---
drivers/tty/serial/Kconfig | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/tty/serial/Kconfig b/drivers/tty/serial/Kconfig
index 6ff94cfcd9db..67de892e0947 100644
--- a/drivers/tty/serial/Kconfig
+++ b/drivers/tty/serial/Kconfig
@@ -1531,9 +1531,9 @@ config SERIAL_MILBEAUT_USIO_CONSOLE
config SERIAL_LITEUART
tristate "LiteUART serial port support"
+ depends on LITEX || COMPILE_TEST
depends on HAS_IOMEM
- depends on OF || COMPILE_TEST
- depends on LITEX
+ depends on OF
select SERIAL_CORE
help
This driver is for the FPGA-based LiteUART serial controller from LiteX
--
2.32.0
From: Christian Brauner <christian.brauner(a)ubuntu.com>
When calling setattr_prepare() to determine the validity of the attributes the
ia_{g,u}id fields contain the value that will be written to inode->i_{g,u}id.
When the {g,u}id attribute of the file isn't altered and the caller's fs{g,u}id
matches the current {g,u}id attribute the attribute change is allowed.
The value in ia_{g,u}id does already account for idmapped mounts and will have
taken the relevant idmapping into account. So in order to verify that the
{g,u}id attribute isn't changed we simple need to compare the ia_{g,u}id value
against the inode's i_{g,u}id value.
This only has any meaning for idmapped mounts as idmapping helpers are
idempotent without them. And for idmapped mounts this really only has a meaning
when circular idmappings are used, i.e. mappings where e.g. id 1000 is mapped
to id 1001 and id 1001 is mapped to id 1000. Such ciruclar mappings can e.g. be
useful when sharing the same home directory between multiple users at the same
time.
As an example consider a directory with two files: /source/file1 owned by
{g,u}id 1000 and /source/file2 owned by {g,u}id 1001. Assume we create an
idmapped mount at /target with an idmapping that maps files owned by {g,u}id
1000 to being owned by {g,u}id 1001 and files owned by {g,u}id 1001 to being
owned by {g,u}id 1000. In effect, the idmapped mount at /target switches the
ownership of /source/file1 and source/file2, i.e. /target/file1 will be owned
by {g,u}id 1001 and /target/file2 will be owned by {g,u}id 1000.
This means that a user with fs{g,u}id 1000 must be allowed to setattr
/target/file2 from {g,u}id 1000 to {g,u}id 1000. Similar, a user with fs{g,u}id
1001 must be allowed to setattr /target/file1 from {g,u}id 1001 to {g,u}id
1001. Conversely, a user with fs{g,u}id 1000 must fail to setattr /target/file1
from {g,u}id 1001 to {g,u}id 1000. And a user with fs{g,u}id 1001 must fail to
setattr /target/file2 from {g,u}id 1000 to {g,u}id 1000. Both cases must fail
with EPERM for non-capable callers.
Before this patch we could end up denying legitimate attribute changes and
allowing invalid attribute changes when circular mappings are used. To even get
into this situation the caller must've been privileged both to create that
mapping and to create that idmapped mount.
This hasn't been seen in the wild anywhere but came up when expanding the
testsuite during work on a series of hardening patches. All idmapped fstests
pass without any regressions and we add new tests to verify the behavior of
circular mappings.
Fixes: 2f221d6f7b88 ("attr: handle idmapped mounts")
Cc: Seth Forshee <seth.forshee(a)digitalocean.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: stable(a)vger.kernel.org
CC: linux-fsdevel(a)vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner(a)ubuntu.com>
---
fs/attr.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/attr.c b/fs/attr.c
index 473d21b3a86d..66899b6e9bd8 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -35,7 +35,7 @@ static bool chown_ok(struct user_namespace *mnt_userns,
kuid_t uid)
{
kuid_t kuid = i_uid_into_mnt(mnt_userns, inode);
- if (uid_eq(current_fsuid(), kuid) && uid_eq(uid, kuid))
+ if (uid_eq(current_fsuid(), kuid) && uid_eq(uid, inode->i_uid))
return true;
if (capable_wrt_inode_uidgid(mnt_userns, inode, CAP_CHOWN))
return true;
@@ -62,7 +62,7 @@ static bool chgrp_ok(struct user_namespace *mnt_userns,
{
kgid_t kgid = i_gid_into_mnt(mnt_userns, inode);
if (uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode)) &&
- (in_group_p(gid) || gid_eq(gid, kgid)))
+ (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
return true;
if (capable_wrt_inode_uidgid(mnt_userns, inode, CAP_CHOWN))
return true;
base-commit: 8bb7eca972ad531c9b149c0a51ab43a417385813
--
2.30.2
Please include "cifs: fix memory leak of smb3_fs_context_dup::server_hostname"
Commit id:
869da64d071142d4ed562a3e909deb18e4e72c4e
It fixes a problem found with additional testing of:
commit 7be3248f3139 ("cifs: To match file servers, make sure the
server hostname matches") which was marked for stable.
--
Thanks,
Steve
When using performance policy, EPP value is restored to non "performance"
mode EPP after offline and online.
For example:
cat /sys/devices/system/cpu/cpu1/cpufreq/energy_performance_preference
performance
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online
cat /sys/devices/system/cpu/cpu1/cpufreq/energy_performance_preference
balance_performance
The commit 4adcf2e5829f ("cpufreq: intel_pstate: Add ->offline and ->online callbacks")
optimized save restore path of the HWP request MSR, when there is no
change in the policy. Also added special processing for performance mode
EPP. If EPP has been set to "performance" by the active mode "performance"
scaling algorithm, replace that value with the cached EPP. This ends up
replacing with cached EPP during offline, which is restored during online
again.
So add a change which will set cpu_data->epp_policy to zero, when in
performance policy and has non zero epp. In this way EPP is set to zero
again.
Fixes: 4adcf2e5829f ("cpufreq: intel_pstate: Add ->offline and ->online callbacks")
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada(a)linux.intel.com>
Cc: stable(a)vger.kernel.org # v5.9+
---
Update: Minor optimization to skip non performance policy code path
drivers/cpufreq/intel_pstate.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 815df3daae9d..6d7d73a0c66b 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -936,11 +936,17 @@ static void intel_pstate_hwp_set(unsigned int cpu)
max = cpu_data->max_perf_ratio;
min = cpu_data->min_perf_ratio;
- if (cpu_data->policy == CPUFREQ_POLICY_PERFORMANCE)
- min = max;
-
rdmsrl_on_cpu(cpu, MSR_HWP_REQUEST, &value);
+ if (cpu_data->policy == CPUFREQ_POLICY_PERFORMANCE) {
+ min = max;
+ epp = 0;
+ if (boot_cpu_has(X86_FEATURE_HWP_EPP))
+ epp = (value >> 24) & 0xff;
+ if (epp)
+ cpu_data->epp_policy = 0;
+ }
+
value &= ~HWP_MIN_PERF(~0L);
value |= HWP_MIN_PERF(min);
--
2.17.1
The patch titled
Subject: hugetlb, userfaultfd: fix reservation restore on userfaultfd error
has been added to the -mm tree. Its filename is
hugetlb-userfaultfd-fix-reservation-restore-on-userfaultfd-error.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/hugetlb-userfaultfd-fix-reservati…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/hugetlb-userfaultfd-fix-reservati…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Mina Almasry <almasrymina(a)google.com>
Subject: hugetlb, userfaultfd: fix reservation restore on userfaultfd error
Currently in the is_continue case in hugetlb_mcopy_atomic_pte(), if we
bail out using "goto out_release_unlock;" in the cases where idx >= size,
or !huge_pte_none(), the code will detect that new_pagecache_page ==
false, and so call restore_reserve_on_error(). In this case I see
restore_reserve_on_error() delete the reservation, and the following call
to remove_inode_hugepages() will increment h->resv_hugepages causing a
100% reproducible leak.
We should treat the is_continue case similar to adding a page into the
pagecache and set new_pagecache_page to true, to indicate that there is no
reservation to restore on the error path, and we need not call
restore_reserve_on_error().
Link: https://lkml.kernel.org/r/20211116235733.3774702-1-almasrymina@google.com
Fixes: c7b1850dfb41 ("hugetlb: don't pass page cache pages to restore_reserve_on_error")
Signed-off-by: Mina Almasry <almasrymina(a)google.com>
Reported-by: James Houghton <jthoughton(a)google.com>
Cc: Wei Xu <weixugc(a)google.com>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/mm/hugetlb.c~hugetlb-userfaultfd-fix-reservation-restore-on-userfaultfd-error
+++ a/mm/hugetlb.c
@@ -5743,6 +5743,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_s
page = find_lock_page(mapping, idx);
if (!page)
goto out;
+ /*
+ * Set new_pagecache_page to true, as we've added a page to the
+ * pagecache, but userfaultfd hasn't set up a mapping for this
+ * page yet. If we bail out before setting up the mapping, we
+ * want to indicate to restore_reserve_on_error() that we've
+ * added the page to the page cache.
+ */
+ new_pagecache_page = true;
} else if (!*pagep) {
/* If a page already exists, then it's UFFDIO_COPY for
* a non-missing case. Return -EEXIST.
_
Patches currently in -mm which might be from almasrymina(a)google.com are
hugetlb-userfaultfd-fix-reservation-restore-on-userfaultfd-error.patch
This reverts commit d07f3b081ee632268786601f55e1334d1f68b997.
pstore-blk was fixed to avoid the unwanted APIs in commit 7bb9557b48fc
("pstore/blk: Use the normal block device I/O path"), which landed in
the same release as the commit adding BROKEN.
Cc: Jens Axboe <axboe(a)kernel.dk>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: stable(a)vger.kernel.org
Signed-off-by: Kees Cook <keescook(a)chromium.org>
---
fs/pstore/Kconfig | 1 -
1 file changed, 1 deletion(-)
diff --git a/fs/pstore/Kconfig b/fs/pstore/Kconfig
index 328da35da390..8adabde685f1 100644
--- a/fs/pstore/Kconfig
+++ b/fs/pstore/Kconfig
@@ -173,7 +173,6 @@ config PSTORE_BLK
tristate "Log panic/oops to a block device"
depends on PSTORE
depends on BLOCK
- depends on BROKEN
select PSTORE_ZONE
default n
help
--
2.30.2
This series updates a few maintainer entries for VMware-maintained
subsystems and cleans up references to VMware's private mailing lists
to make it clear that they are effectively email-aliases to reach out
to reviewers.
Changes from v1->v3:
- Add Zack as the named maintainer for vmmouse driver
- Use R: to denote email-aliases for VMware reviewers
Regards,
Srivatsa
---
Srivatsa S. Bhat (VMware) (3):
MAINTAINERS: Update maintainers for paravirt ops and VMware hypervisor interface
MAINTAINERS: Add Zack as maintainer of vmmouse driver
MAINTAINERS: Mark VMware mailing list entries as email aliases
MAINTAINERS | 30 +++++++++++++++++-------------
1 file changed, 17 insertions(+), 13 deletions(-)
The SGX driver maintains a single global free page counter,
sgx_nr_free_pages, that reflects the number of free pages available
across all NUMA nodes. Correspondingly, a list of free pages is
associated with each NUMA node and sgx_nr_free_pages is updated
every time a page is added or removed from any of the free page
lists. The main usage of sgx_nr_free_pages is by the reclaimer
that runs when it (sgx_nr_free_pages) goes below a watermark
to ensure that there are always some free pages available to, for
example, support efficient page faults.
With sgx_nr_free_pages accessed and modified from a few places
it is essential to ensure that these accesses are done safely but
this is not the case. sgx_nr_free_pages is read without any
protection and updated with inconsistent protection by any one
of the spin locks associated with the individual NUMA nodes.
For example:
CPU_A CPU_B
----- -----
spin_lock(&nodeA->lock); spin_lock(&nodeB->lock);
... ...
sgx_nr_free_pages--; /* NOT SAFE */ sgx_nr_free_pages--;
spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock);
Since sgx_nr_free_pages may be protected by different spin locks
while being modified from different CPUs, the following scenario
is possible:
CPU_A CPU_B
----- -----
{sgx_nr_free_pages = 100}
spin_lock(&nodeA->lock); spin_lock(&nodeB->lock);
sgx_nr_free_pages--; sgx_nr_free_pages--;
/* LOAD sgx_nr_free_pages = 100 */ /* LOAD sgx_nr_free_pages = 100 */
/* sgx_nr_free_pages-- */ /* sgx_nr_free_pages-- */
/* STORE sgx_nr_free_pages = 99 */ /* STORE sgx_nr_free_pages = 99 */
spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock);
In the above scenario, sgx_nr_free_pages is decremented from two CPUs
but instead of sgx_nr_free_pages ending with a value that is two less
than it started with, it was only decremented by one while the number
of free pages were actually reduced by two. The consequence of
sgx_nr_free_pages not being protected is that its value may not
accurately reflect the actual number of free pages on the system,
impacting the availability of free pages in support of many flows.
The problematic scenario is when the reclaimer does not run because it
believes there to be sufficient free pages while any attempt to allocate
a page fails because there are no free pages available. In the SGX driver
the reclaimer's watermark is only 32 pages so after encountering the
above example scenario 32 times a user space hang is possible when there
are no more free pages because of repeated page faults caused by no
free pages made available.
The following flow was encountered:
asm_exc_page_fault
...
sgx_vma_fault()
sgx_encl_load_page()
sgx_encl_eldu() // Encrypted page needs to be loaded from backing
// storage into newly allocated SGX memory page
sgx_alloc_epc_page() // Allocate a page of SGX memory
__sgx_alloc_epc_page() // Fails, no free SGX memory
...
if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer
wake_up(&ksgxd_waitq);
return -EBUSY; // Return -EBUSY giving reclaimer time to run
return -EBUSY;
return -EBUSY;
return VM_FAULT_NOPAGE;
The reclaimer is triggered in above flow with the following code:
static bool sgx_should_reclaim(unsigned long watermark)
{
return sgx_nr_free_pages < watermark &&
!list_empty(&sgx_active_page_list);
}
In the problematic scenario there were no free pages available yet the
value of sgx_nr_free_pages was above the watermark. The allocation of
SGX memory thus always failed because of a lack of free pages while no
free pages were made available because the reclaimer is never started
because of sgx_nr_free_pages' incorrect value. The consequence was that
user space kept encountering VM_FAULT_NOPAGE that caused the same
address to be accessed repeatedly with the same result.
Change the global free page counter to an atomic type that
ensures simultaneous updates are done safely. While doing so, move
the updating of the variable outside of the spin lock critical
section to which it does not belong.
Cc: stable(a)vger.kernel.org
Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
Suggested-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Reviewed-by: Tony Luck <tony.luck(a)intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre(a)intel.com>
---
Changes since V2:
- V2:
https://lore.kernel.org/lkml/b2e69e9febcae5d98d331de094d9cc7ce3217e66.16364…
- Update changelog to provide example of unsafe variable modification (Jarkko).
Changes since V1:
- V1:
https://lore.kernel.org/lkml/373992d869cd356ce9e9afe43ef4934b70d604fd.16360…
- Add static to definition of sgx_nr_free_pages (Tony).
- Add Tony's signature.
- Provide detail about error scenario in changelog (Jarkko).
arch/x86/kernel/cpu/sgx/main.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 63d3de02bbcc..8471a8b9b48e 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -28,8 +28,7 @@ static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
static LIST_HEAD(sgx_active_page_list);
static DEFINE_SPINLOCK(sgx_reclaimer_lock);
-/* The free page list lock protected variables prepend the lock. */
-static unsigned long sgx_nr_free_pages;
+static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
/* Nodes with one or more EPC sections. */
static nodemask_t sgx_numa_mask;
@@ -403,14 +402,15 @@ static void sgx_reclaim_pages(void)
spin_lock(&node->lock);
list_add_tail(&epc_page->list, &node->free_page_list);
- sgx_nr_free_pages++;
spin_unlock(&node->lock);
+ atomic_long_inc(&sgx_nr_free_pages);
}
}
static bool sgx_should_reclaim(unsigned long watermark)
{
- return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
+ return atomic_long_read(&sgx_nr_free_pages) < watermark &&
+ !list_empty(&sgx_active_page_list);
}
static int ksgxd(void *p)
@@ -471,9 +471,9 @@ static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
page = list_first_entry(&node->free_page_list, struct sgx_epc_page, list);
list_del_init(&page->list);
- sgx_nr_free_pages--;
spin_unlock(&node->lock);
+ atomic_long_dec(&sgx_nr_free_pages);
return page;
}
@@ -625,9 +625,9 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
spin_lock(&node->lock);
list_add_tail(&page->list, &node->free_page_list);
- sgx_nr_free_pages++;
spin_unlock(&node->lock);
+ atomic_long_inc(&sgx_nr_free_pages);
}
static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
--
2.25.1
From: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
BugLink: https://bugs.launchpad.net/bugs/1950644
commit 032146cda85566abcd1c4884d9d23e4e30a07e9a upstream.
If we open a file without read access and then pass the fd to a syscall
whose implementation calls kernel_read_file_from_fd(), we get a warning
from __kernel_read():
if (WARN_ON_ONCE(!(file->f_mode & FMODE_READ)))
This currently affects both finit_module() and kexec_file_load(), but it
could affect other syscalls in the future.
Link: https://lkml.kernel.org/r/20211007220110.600005-1-willy@infradead.org
Fixes: b844f0ecbc56 ("vfs: define kernel_copy_file_from_fd()")
Signed-off-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Reported-by: Hao Sun <sunhao.th(a)gmail.com>
Reviewed-by: Kees Cook <keescook(a)chromium.org>
Acked-by: Christian Brauner <christian.brauner(a)ubuntu.com>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
(cherry picked from commit 0f218ba4c8aac7041cd8b81a5a893b0d121e6316 linux-5.4.y)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo(a)canonical.com>
---
fs/exec.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/exec.c b/fs/exec.c
index eeba096e8a38..006f7fb40b96 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1000,7 +1000,7 @@ int kernel_read_file_from_fd(int fd, void **buf, loff_t *size, loff_t max_size,
struct fd f = fdget(fd);
int ret = -EBADF;
- if (!f.file)
+ if (!f.file || !(f.file->f_mode & FMODE_READ))
goto out;
ret = kernel_read_file(f.file, buf, size, max_size, id);
--
2.32.0
From: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
BugLink: https://bugs.launchpad.net/bugs/1950644
If we open a file without read access and then pass the fd to a syscall
whose implementation calls kernel_read_file_from_fd(), we get a warning
from __kernel_read():
if (WARN_ON_ONCE(!(file->f_mode & FMODE_READ)))
This currently affects both finit_module() and kexec_file_load(), but it
could affect other syscalls in the future.
Link: https://lkml.kernel.org/r/20211007220110.600005-1-willy@infradead.org
Fixes: b844f0ecbc56 ("vfs: define kernel_copy_file_from_fd()")
Signed-off-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Reported-by: Hao Sun <sunhao.th(a)gmail.com>
Reviewed-by: Kees Cook <keescook(a)chromium.org>
Acked-by: Christian Brauner <christian.brauner(a)ubuntu.com>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Mimi Zohar <zohar(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
(cherry picked from commit 032146cda85566abcd1c4884d9d23e4e30a07e9a)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo(a)canonical.com>
---
fs/kernel_read_file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 90d255fbdd9b..c84d87f558cb 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -178,7 +178,7 @@ int kernel_read_file_from_fd(int fd, loff_t offset, void **buf,
struct fd f = fdget(fd);
int ret = -EBADF;
- if (!f.file)
+ if (!f.file || !(f.file->f_mode & FMODE_READ))
goto out;
ret = kernel_read_file(f.file, offset, buf, buf_size, file_size, id);
--
2.32.0
Currently, checks for whether VT-d PI can be used refer to the current
status of the feature in the current vCPU; or they more or less pick
vCPU 0 in case a specific vCPU is not available.
However, these checks do not attempt to synchronize with changes to
the IRTE. In particular, there is no path that updates the IRTE when
APICv is re-activated on vCPU 0; and there is no path to wakeup a CPU
that has APICv disabled, if the wakeup occurs because of an IRTE
that points to a posted interrupt.
To fix this, always go through the VT-d PI path as long as there are
assigned devices and APICv is available on both the host and the VM side.
Since the relevant condition was copied over three times, take the hint
and factor it into a separate function.
Suggested-by: Sean Christopherson <seanjc(a)google.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
---
arch/x86/kvm/vmx/posted_intr.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 5f81ef092bd4..b64dd1374ed9 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -5,6 +5,7 @@
#include <asm/cpu.h>
#include "lapic.h"
+#include "irq.h"
#include "posted_intr.h"
#include "trace.h"
#include "vmx.h"
@@ -77,13 +78,18 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
pi_set_on(pi_desc);
}
+static bool vmx_can_use_vtd_pi(struct kvm *kvm)
+{
+ return kvm_arch_has_assigned_device(kvm) &&
+ irq_remapping_cap(IRQ_POSTING_CAP) &&
+ irqchip_in_kernel(kvm) && enable_apicv;
+}
+
void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
- if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
- !irq_remapping_cap(IRQ_POSTING_CAP) ||
- !kvm_vcpu_apicv_active(vcpu))
+ if (!vmx_can_use_vtd_pi(vcpu->kvm))
return;
/* Set SN when the vCPU is preempted */
@@ -141,9 +147,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
struct pi_desc old, new;
struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
- if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
- !irq_remapping_cap(IRQ_POSTING_CAP) ||
- !kvm_vcpu_apicv_active(vcpu))
+ if (!vmx_can_use_vtd_pi(vcpu->kvm))
return 0;
WARN_ON(irqs_disabled());
@@ -270,9 +274,7 @@ int pi_update_irte(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq,
struct vcpu_data vcpu_info;
int idx, ret = 0;
- if (!kvm_arch_has_assigned_device(kvm) ||
- !irq_remapping_cap(IRQ_POSTING_CAP) ||
- !kvm_vcpu_apicv_active(kvm->vcpus[0]))
+ if (!vmx_can_use_vtd_pi(kvm))
return 0;
idx = srcu_read_lock(&kvm->irq_srcu);
--
2.27.0
Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains
exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the
'nomap' domains still in use under the powerpc arch. With this new
flag, the revmap_tree of the IRQ domain is not used anymore. This
change broke the support of shared LSIs [1] in the XIVE driver because
it was relying on a lookup in the revmap_tree to query previously
mapped interrupts. Linux now creates two distinct IRQ mappings on the
same HW IRQ which can lead to unexpected behavior in the drivers.
The XIVE IRQ domain is not a direct mapping domain and its HW IRQ
interrupt number space is rather large : 1M/socket on POWER9 and
POWER10, change the XIVE driver to use a 'tree' domain type instead.
[1] For instance, a linux KVM guest with virtio-rng and virtio-balloon
devices.
Cc: Marc Zyngier <maz(a)kernel.org>
Cc: stable(a)vger.kernel.org # v5.14+
Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive")
Signed-off-by: Cédric Le Goater <clg(a)kaod.org>
---
Marc,
The Fixes tag is there because the patch in question revealed that
something was broken in XIVE. genirq is not in cause. However, I
don't know for PS3 and Cell. May be less critical for now.
arch/powerpc/sysdev/xive/common.c | 3 +--
arch/powerpc/sysdev/xive/Kconfig | 1 -
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index fed6fd16c8f4..9d0f0fe25598 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1536,8 +1536,7 @@ static const struct irq_domain_ops xive_irq_domain_ops = {
static void __init xive_init_host(struct device_node *np)
{
- xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ,
- &xive_irq_domain_ops, NULL);
+ xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL);
if (WARN_ON(xive_irq_domain == NULL))
return;
irq_set_default_host(xive_irq_domain);
diff --git a/arch/powerpc/sysdev/xive/Kconfig b/arch/powerpc/sysdev/xive/Kconfig
index 97796c6b63f0..785c292d104b 100644
--- a/arch/powerpc/sysdev/xive/Kconfig
+++ b/arch/powerpc/sysdev/xive/Kconfig
@@ -3,7 +3,6 @@ config PPC_XIVE
bool
select PPC_SMP_MUXED_IPI
select HARDIRQS_SW_RESEND
- select IRQ_DOMAIN_NOMAP
config PPC_XIVE_NATIVE
bool
--
2.31.1
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 541ac97186d9ea88491961a46284de3603c914fd Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp(a)suse.de>
Date: Fri, 1 Oct 2021 21:41:20 +0200
Subject: [PATCH] x86/sev: Make the #VC exception stacks part of the default
stacks storage
The size of the exception stacks was increased by the commit in Fixes,
resulting in stack sizes greater than a page in size. The #VC exception
handling was only mapping the first (bottom) page, resulting in an
SEV-ES guest failing to boot.
Make the #VC exception stacks part of the default exception stacks
storage and allocate them with a CONFIG_AMD_MEM_ENCRYPT=y .config. Map
them only when a SEV-ES guest has been detected.
Rip out the custom VC stacks mapping and storage code.
[ bp: Steal and adapt Tom's commit message. ]
Fixes: 7fae4c24a2b8 ("x86: Increase exception stack sizes")
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Tested-by: Tom Lendacky <thomas.lendacky(a)amd.com>
Tested-by: Brijesh Singh <brijesh.singh(a)amd.com>
Link: https://lkml.kernel.org/r/YVt1IMjIs7pIZTRR@zn.tnic
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 3d52b094850a..dd5ea1bdf04c 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -10,6 +10,12 @@
#ifdef CONFIG_X86_64
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+#define VC_EXCEPTION_STKSZ EXCEPTION_STKSZ
+#else
+#define VC_EXCEPTION_STKSZ 0
+#endif
+
/* Macro to enforce the same ordering and stack sizes */
#define ESTACKS_MEMBERS(guardsize, optional_stack_size) \
char DF_stack_guard[guardsize]; \
@@ -28,7 +34,7 @@
/* The exception stacks' physical storage. No guard pages required */
struct exception_stacks {
- ESTACKS_MEMBERS(0, 0)
+ ESTACKS_MEMBERS(0, VC_EXCEPTION_STKSZ)
};
/* The effective cpu entry area mapping with guard pages. */
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 53a6837d354b..4d0d1c2b65e1 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -46,16 +46,6 @@ static struct ghcb __initdata *boot_ghcb;
struct sev_es_runtime_data {
struct ghcb ghcb_page;
- /* Physical storage for the per-CPU IST stack of the #VC handler */
- char ist_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
- /*
- * Physical storage for the per-CPU fall-back stack of the #VC handler.
- * The fall-back stack is used when it is not safe to switch back to the
- * interrupted stack in the #VC entry code.
- */
- char fallback_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
/*
* Reserve one page per CPU as backup storage for the unencrypted GHCB.
* It is needed when an NMI happens while the #VC handler uses the real
@@ -99,27 +89,6 @@ DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
/* Needed in vc_early_forward_exception */
void do_early_exception(struct pt_regs *regs, int trapnr);
-static void __init setup_vc_stacks(int cpu)
-{
- struct sev_es_runtime_data *data;
- struct cpu_entry_area *cea;
- unsigned long vaddr;
- phys_addr_t pa;
-
- data = per_cpu(runtime_data, cpu);
- cea = get_cpu_entry_area(cpu);
-
- /* Map #VC IST stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC);
- pa = __pa(data->ist_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-
- /* Map VC fall-back stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC2);
- pa = __pa(data->fallback_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-}
-
static __always_inline bool on_vc_stack(struct pt_regs *regs)
{
unsigned long sp = regs->sp;
@@ -787,7 +756,6 @@ void __init sev_es_init_vc_handling(void)
for_each_possible_cpu(cpu) {
alloc_runtime_data(cpu);
init_ghcb(cpu);
- setup_vc_stacks(cpu);
}
sev_es_setup_play_dead();
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index f5e1e60c9095..6c2f1b76a0b6 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -110,6 +110,13 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
cea_map_stack(NMI);
cea_map_stack(DB);
cea_map_stack(MCE);
+
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+ if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
+ cea_map_stack(VC);
+ cea_map_stack(VC2);
+ }
+ }
}
#else
static inline void percpu_setup_exception_stacks(unsigned int cpu)
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 541ac97186d9ea88491961a46284de3603c914fd Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp(a)suse.de>
Date: Fri, 1 Oct 2021 21:41:20 +0200
Subject: [PATCH] x86/sev: Make the #VC exception stacks part of the default
stacks storage
The size of the exception stacks was increased by the commit in Fixes,
resulting in stack sizes greater than a page in size. The #VC exception
handling was only mapping the first (bottom) page, resulting in an
SEV-ES guest failing to boot.
Make the #VC exception stacks part of the default exception stacks
storage and allocate them with a CONFIG_AMD_MEM_ENCRYPT=y .config. Map
them only when a SEV-ES guest has been detected.
Rip out the custom VC stacks mapping and storage code.
[ bp: Steal and adapt Tom's commit message. ]
Fixes: 7fae4c24a2b8 ("x86: Increase exception stack sizes")
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Tested-by: Tom Lendacky <thomas.lendacky(a)amd.com>
Tested-by: Brijesh Singh <brijesh.singh(a)amd.com>
Link: https://lkml.kernel.org/r/YVt1IMjIs7pIZTRR@zn.tnic
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 3d52b094850a..dd5ea1bdf04c 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -10,6 +10,12 @@
#ifdef CONFIG_X86_64
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+#define VC_EXCEPTION_STKSZ EXCEPTION_STKSZ
+#else
+#define VC_EXCEPTION_STKSZ 0
+#endif
+
/* Macro to enforce the same ordering and stack sizes */
#define ESTACKS_MEMBERS(guardsize, optional_stack_size) \
char DF_stack_guard[guardsize]; \
@@ -28,7 +34,7 @@
/* The exception stacks' physical storage. No guard pages required */
struct exception_stacks {
- ESTACKS_MEMBERS(0, 0)
+ ESTACKS_MEMBERS(0, VC_EXCEPTION_STKSZ)
};
/* The effective cpu entry area mapping with guard pages. */
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 53a6837d354b..4d0d1c2b65e1 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -46,16 +46,6 @@ static struct ghcb __initdata *boot_ghcb;
struct sev_es_runtime_data {
struct ghcb ghcb_page;
- /* Physical storage for the per-CPU IST stack of the #VC handler */
- char ist_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
- /*
- * Physical storage for the per-CPU fall-back stack of the #VC handler.
- * The fall-back stack is used when it is not safe to switch back to the
- * interrupted stack in the #VC entry code.
- */
- char fallback_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
/*
* Reserve one page per CPU as backup storage for the unencrypted GHCB.
* It is needed when an NMI happens while the #VC handler uses the real
@@ -99,27 +89,6 @@ DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
/* Needed in vc_early_forward_exception */
void do_early_exception(struct pt_regs *regs, int trapnr);
-static void __init setup_vc_stacks(int cpu)
-{
- struct sev_es_runtime_data *data;
- struct cpu_entry_area *cea;
- unsigned long vaddr;
- phys_addr_t pa;
-
- data = per_cpu(runtime_data, cpu);
- cea = get_cpu_entry_area(cpu);
-
- /* Map #VC IST stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC);
- pa = __pa(data->ist_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-
- /* Map VC fall-back stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC2);
- pa = __pa(data->fallback_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-}
-
static __always_inline bool on_vc_stack(struct pt_regs *regs)
{
unsigned long sp = regs->sp;
@@ -787,7 +756,6 @@ void __init sev_es_init_vc_handling(void)
for_each_possible_cpu(cpu) {
alloc_runtime_data(cpu);
init_ghcb(cpu);
- setup_vc_stacks(cpu);
}
sev_es_setup_play_dead();
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index f5e1e60c9095..6c2f1b76a0b6 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -110,6 +110,13 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
cea_map_stack(NMI);
cea_map_stack(DB);
cea_map_stack(MCE);
+
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+ if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
+ cea_map_stack(VC);
+ cea_map_stack(VC2);
+ }
+ }
}
#else
static inline void percpu_setup_exception_stacks(unsigned int cpu)
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 541ac97186d9ea88491961a46284de3603c914fd Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp(a)suse.de>
Date: Fri, 1 Oct 2021 21:41:20 +0200
Subject: [PATCH] x86/sev: Make the #VC exception stacks part of the default
stacks storage
The size of the exception stacks was increased by the commit in Fixes,
resulting in stack sizes greater than a page in size. The #VC exception
handling was only mapping the first (bottom) page, resulting in an
SEV-ES guest failing to boot.
Make the #VC exception stacks part of the default exception stacks
storage and allocate them with a CONFIG_AMD_MEM_ENCRYPT=y .config. Map
them only when a SEV-ES guest has been detected.
Rip out the custom VC stacks mapping and storage code.
[ bp: Steal and adapt Tom's commit message. ]
Fixes: 7fae4c24a2b8 ("x86: Increase exception stack sizes")
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Tested-by: Tom Lendacky <thomas.lendacky(a)amd.com>
Tested-by: Brijesh Singh <brijesh.singh(a)amd.com>
Link: https://lkml.kernel.org/r/YVt1IMjIs7pIZTRR@zn.tnic
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 3d52b094850a..dd5ea1bdf04c 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -10,6 +10,12 @@
#ifdef CONFIG_X86_64
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+#define VC_EXCEPTION_STKSZ EXCEPTION_STKSZ
+#else
+#define VC_EXCEPTION_STKSZ 0
+#endif
+
/* Macro to enforce the same ordering and stack sizes */
#define ESTACKS_MEMBERS(guardsize, optional_stack_size) \
char DF_stack_guard[guardsize]; \
@@ -28,7 +34,7 @@
/* The exception stacks' physical storage. No guard pages required */
struct exception_stacks {
- ESTACKS_MEMBERS(0, 0)
+ ESTACKS_MEMBERS(0, VC_EXCEPTION_STKSZ)
};
/* The effective cpu entry area mapping with guard pages. */
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 53a6837d354b..4d0d1c2b65e1 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -46,16 +46,6 @@ static struct ghcb __initdata *boot_ghcb;
struct sev_es_runtime_data {
struct ghcb ghcb_page;
- /* Physical storage for the per-CPU IST stack of the #VC handler */
- char ist_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
- /*
- * Physical storage for the per-CPU fall-back stack of the #VC handler.
- * The fall-back stack is used when it is not safe to switch back to the
- * interrupted stack in the #VC entry code.
- */
- char fallback_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
/*
* Reserve one page per CPU as backup storage for the unencrypted GHCB.
* It is needed when an NMI happens while the #VC handler uses the real
@@ -99,27 +89,6 @@ DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
/* Needed in vc_early_forward_exception */
void do_early_exception(struct pt_regs *regs, int trapnr);
-static void __init setup_vc_stacks(int cpu)
-{
- struct sev_es_runtime_data *data;
- struct cpu_entry_area *cea;
- unsigned long vaddr;
- phys_addr_t pa;
-
- data = per_cpu(runtime_data, cpu);
- cea = get_cpu_entry_area(cpu);
-
- /* Map #VC IST stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC);
- pa = __pa(data->ist_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-
- /* Map VC fall-back stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC2);
- pa = __pa(data->fallback_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-}
-
static __always_inline bool on_vc_stack(struct pt_regs *regs)
{
unsigned long sp = regs->sp;
@@ -787,7 +756,6 @@ void __init sev_es_init_vc_handling(void)
for_each_possible_cpu(cpu) {
alloc_runtime_data(cpu);
init_ghcb(cpu);
- setup_vc_stacks(cpu);
}
sev_es_setup_play_dead();
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index f5e1e60c9095..6c2f1b76a0b6 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -110,6 +110,13 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
cea_map_stack(NMI);
cea_map_stack(DB);
cea_map_stack(MCE);
+
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+ if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
+ cea_map_stack(VC);
+ cea_map_stack(VC2);
+ }
+ }
}
#else
static inline void percpu_setup_exception_stacks(unsigned int cpu)
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 541ac97186d9ea88491961a46284de3603c914fd Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp(a)suse.de>
Date: Fri, 1 Oct 2021 21:41:20 +0200
Subject: [PATCH] x86/sev: Make the #VC exception stacks part of the default
stacks storage
The size of the exception stacks was increased by the commit in Fixes,
resulting in stack sizes greater than a page in size. The #VC exception
handling was only mapping the first (bottom) page, resulting in an
SEV-ES guest failing to boot.
Make the #VC exception stacks part of the default exception stacks
storage and allocate them with a CONFIG_AMD_MEM_ENCRYPT=y .config. Map
them only when a SEV-ES guest has been detected.
Rip out the custom VC stacks mapping and storage code.
[ bp: Steal and adapt Tom's commit message. ]
Fixes: 7fae4c24a2b8 ("x86: Increase exception stack sizes")
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Tested-by: Tom Lendacky <thomas.lendacky(a)amd.com>
Tested-by: Brijesh Singh <brijesh.singh(a)amd.com>
Link: https://lkml.kernel.org/r/YVt1IMjIs7pIZTRR@zn.tnic
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 3d52b094850a..dd5ea1bdf04c 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -10,6 +10,12 @@
#ifdef CONFIG_X86_64
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+#define VC_EXCEPTION_STKSZ EXCEPTION_STKSZ
+#else
+#define VC_EXCEPTION_STKSZ 0
+#endif
+
/* Macro to enforce the same ordering and stack sizes */
#define ESTACKS_MEMBERS(guardsize, optional_stack_size) \
char DF_stack_guard[guardsize]; \
@@ -28,7 +34,7 @@
/* The exception stacks' physical storage. No guard pages required */
struct exception_stacks {
- ESTACKS_MEMBERS(0, 0)
+ ESTACKS_MEMBERS(0, VC_EXCEPTION_STKSZ)
};
/* The effective cpu entry area mapping with guard pages. */
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 53a6837d354b..4d0d1c2b65e1 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -46,16 +46,6 @@ static struct ghcb __initdata *boot_ghcb;
struct sev_es_runtime_data {
struct ghcb ghcb_page;
- /* Physical storage for the per-CPU IST stack of the #VC handler */
- char ist_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
- /*
- * Physical storage for the per-CPU fall-back stack of the #VC handler.
- * The fall-back stack is used when it is not safe to switch back to the
- * interrupted stack in the #VC entry code.
- */
- char fallback_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
/*
* Reserve one page per CPU as backup storage for the unencrypted GHCB.
* It is needed when an NMI happens while the #VC handler uses the real
@@ -99,27 +89,6 @@ DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
/* Needed in vc_early_forward_exception */
void do_early_exception(struct pt_regs *regs, int trapnr);
-static void __init setup_vc_stacks(int cpu)
-{
- struct sev_es_runtime_data *data;
- struct cpu_entry_area *cea;
- unsigned long vaddr;
- phys_addr_t pa;
-
- data = per_cpu(runtime_data, cpu);
- cea = get_cpu_entry_area(cpu);
-
- /* Map #VC IST stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC);
- pa = __pa(data->ist_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-
- /* Map VC fall-back stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC2);
- pa = __pa(data->fallback_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-}
-
static __always_inline bool on_vc_stack(struct pt_regs *regs)
{
unsigned long sp = regs->sp;
@@ -787,7 +756,6 @@ void __init sev_es_init_vc_handling(void)
for_each_possible_cpu(cpu) {
alloc_runtime_data(cpu);
init_ghcb(cpu);
- setup_vc_stacks(cpu);
}
sev_es_setup_play_dead();
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index f5e1e60c9095..6c2f1b76a0b6 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -110,6 +110,13 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
cea_map_stack(NMI);
cea_map_stack(DB);
cea_map_stack(MCE);
+
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+ if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
+ cea_map_stack(VC);
+ cea_map_stack(VC2);
+ }
+ }
}
#else
static inline void percpu_setup_exception_stacks(unsigned int cpu)
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 541ac97186d9ea88491961a46284de3603c914fd Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp(a)suse.de>
Date: Fri, 1 Oct 2021 21:41:20 +0200
Subject: [PATCH] x86/sev: Make the #VC exception stacks part of the default
stacks storage
The size of the exception stacks was increased by the commit in Fixes,
resulting in stack sizes greater than a page in size. The #VC exception
handling was only mapping the first (bottom) page, resulting in an
SEV-ES guest failing to boot.
Make the #VC exception stacks part of the default exception stacks
storage and allocate them with a CONFIG_AMD_MEM_ENCRYPT=y .config. Map
them only when a SEV-ES guest has been detected.
Rip out the custom VC stacks mapping and storage code.
[ bp: Steal and adapt Tom's commit message. ]
Fixes: 7fae4c24a2b8 ("x86: Increase exception stack sizes")
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Tested-by: Tom Lendacky <thomas.lendacky(a)amd.com>
Tested-by: Brijesh Singh <brijesh.singh(a)amd.com>
Link: https://lkml.kernel.org/r/YVt1IMjIs7pIZTRR@zn.tnic
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 3d52b094850a..dd5ea1bdf04c 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -10,6 +10,12 @@
#ifdef CONFIG_X86_64
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+#define VC_EXCEPTION_STKSZ EXCEPTION_STKSZ
+#else
+#define VC_EXCEPTION_STKSZ 0
+#endif
+
/* Macro to enforce the same ordering and stack sizes */
#define ESTACKS_MEMBERS(guardsize, optional_stack_size) \
char DF_stack_guard[guardsize]; \
@@ -28,7 +34,7 @@
/* The exception stacks' physical storage. No guard pages required */
struct exception_stacks {
- ESTACKS_MEMBERS(0, 0)
+ ESTACKS_MEMBERS(0, VC_EXCEPTION_STKSZ)
};
/* The effective cpu entry area mapping with guard pages. */
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 53a6837d354b..4d0d1c2b65e1 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -46,16 +46,6 @@ static struct ghcb __initdata *boot_ghcb;
struct sev_es_runtime_data {
struct ghcb ghcb_page;
- /* Physical storage for the per-CPU IST stack of the #VC handler */
- char ist_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
- /*
- * Physical storage for the per-CPU fall-back stack of the #VC handler.
- * The fall-back stack is used when it is not safe to switch back to the
- * interrupted stack in the #VC entry code.
- */
- char fallback_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
/*
* Reserve one page per CPU as backup storage for the unencrypted GHCB.
* It is needed when an NMI happens while the #VC handler uses the real
@@ -99,27 +89,6 @@ DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
/* Needed in vc_early_forward_exception */
void do_early_exception(struct pt_regs *regs, int trapnr);
-static void __init setup_vc_stacks(int cpu)
-{
- struct sev_es_runtime_data *data;
- struct cpu_entry_area *cea;
- unsigned long vaddr;
- phys_addr_t pa;
-
- data = per_cpu(runtime_data, cpu);
- cea = get_cpu_entry_area(cpu);
-
- /* Map #VC IST stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC);
- pa = __pa(data->ist_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-
- /* Map VC fall-back stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC2);
- pa = __pa(data->fallback_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-}
-
static __always_inline bool on_vc_stack(struct pt_regs *regs)
{
unsigned long sp = regs->sp;
@@ -787,7 +756,6 @@ void __init sev_es_init_vc_handling(void)
for_each_possible_cpu(cpu) {
alloc_runtime_data(cpu);
init_ghcb(cpu);
- setup_vc_stacks(cpu);
}
sev_es_setup_play_dead();
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index f5e1e60c9095..6c2f1b76a0b6 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -110,6 +110,13 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
cea_map_stack(NMI);
cea_map_stack(DB);
cea_map_stack(MCE);
+
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+ if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
+ cea_map_stack(VC);
+ cea_map_stack(VC2);
+ }
+ }
}
#else
static inline void percpu_setup_exception_stacks(unsigned int cpu)
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 541ac97186d9ea88491961a46284de3603c914fd Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp(a)suse.de>
Date: Fri, 1 Oct 2021 21:41:20 +0200
Subject: [PATCH] x86/sev: Make the #VC exception stacks part of the default
stacks storage
The size of the exception stacks was increased by the commit in Fixes,
resulting in stack sizes greater than a page in size. The #VC exception
handling was only mapping the first (bottom) page, resulting in an
SEV-ES guest failing to boot.
Make the #VC exception stacks part of the default exception stacks
storage and allocate them with a CONFIG_AMD_MEM_ENCRYPT=y .config. Map
them only when a SEV-ES guest has been detected.
Rip out the custom VC stacks mapping and storage code.
[ bp: Steal and adapt Tom's commit message. ]
Fixes: 7fae4c24a2b8 ("x86: Increase exception stack sizes")
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Tested-by: Tom Lendacky <thomas.lendacky(a)amd.com>
Tested-by: Brijesh Singh <brijesh.singh(a)amd.com>
Link: https://lkml.kernel.org/r/YVt1IMjIs7pIZTRR@zn.tnic
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 3d52b094850a..dd5ea1bdf04c 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -10,6 +10,12 @@
#ifdef CONFIG_X86_64
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+#define VC_EXCEPTION_STKSZ EXCEPTION_STKSZ
+#else
+#define VC_EXCEPTION_STKSZ 0
+#endif
+
/* Macro to enforce the same ordering and stack sizes */
#define ESTACKS_MEMBERS(guardsize, optional_stack_size) \
char DF_stack_guard[guardsize]; \
@@ -28,7 +34,7 @@
/* The exception stacks' physical storage. No guard pages required */
struct exception_stacks {
- ESTACKS_MEMBERS(0, 0)
+ ESTACKS_MEMBERS(0, VC_EXCEPTION_STKSZ)
};
/* The effective cpu entry area mapping with guard pages. */
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 53a6837d354b..4d0d1c2b65e1 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -46,16 +46,6 @@ static struct ghcb __initdata *boot_ghcb;
struct sev_es_runtime_data {
struct ghcb ghcb_page;
- /* Physical storage for the per-CPU IST stack of the #VC handler */
- char ist_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
- /*
- * Physical storage for the per-CPU fall-back stack of the #VC handler.
- * The fall-back stack is used when it is not safe to switch back to the
- * interrupted stack in the #VC entry code.
- */
- char fallback_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
/*
* Reserve one page per CPU as backup storage for the unencrypted GHCB.
* It is needed when an NMI happens while the #VC handler uses the real
@@ -99,27 +89,6 @@ DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
/* Needed in vc_early_forward_exception */
void do_early_exception(struct pt_regs *regs, int trapnr);
-static void __init setup_vc_stacks(int cpu)
-{
- struct sev_es_runtime_data *data;
- struct cpu_entry_area *cea;
- unsigned long vaddr;
- phys_addr_t pa;
-
- data = per_cpu(runtime_data, cpu);
- cea = get_cpu_entry_area(cpu);
-
- /* Map #VC IST stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC);
- pa = __pa(data->ist_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-
- /* Map VC fall-back stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC2);
- pa = __pa(data->fallback_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-}
-
static __always_inline bool on_vc_stack(struct pt_regs *regs)
{
unsigned long sp = regs->sp;
@@ -787,7 +756,6 @@ void __init sev_es_init_vc_handling(void)
for_each_possible_cpu(cpu) {
alloc_runtime_data(cpu);
init_ghcb(cpu);
- setup_vc_stacks(cpu);
}
sev_es_setup_play_dead();
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index f5e1e60c9095..6c2f1b76a0b6 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -110,6 +110,13 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
cea_map_stack(NMI);
cea_map_stack(DB);
cea_map_stack(MCE);
+
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+ if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
+ cea_map_stack(VC);
+ cea_map_stack(VC2);
+ }
+ }
}
#else
static inline void percpu_setup_exception_stacks(unsigned int cpu)
The patch below does not apply to the 5.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 541ac97186d9ea88491961a46284de3603c914fd Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp(a)suse.de>
Date: Fri, 1 Oct 2021 21:41:20 +0200
Subject: [PATCH] x86/sev: Make the #VC exception stacks part of the default
stacks storage
The size of the exception stacks was increased by the commit in Fixes,
resulting in stack sizes greater than a page in size. The #VC exception
handling was only mapping the first (bottom) page, resulting in an
SEV-ES guest failing to boot.
Make the #VC exception stacks part of the default exception stacks
storage and allocate them with a CONFIG_AMD_MEM_ENCRYPT=y .config. Map
them only when a SEV-ES guest has been detected.
Rip out the custom VC stacks mapping and storage code.
[ bp: Steal and adapt Tom's commit message. ]
Fixes: 7fae4c24a2b8 ("x86: Increase exception stack sizes")
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Tested-by: Tom Lendacky <thomas.lendacky(a)amd.com>
Tested-by: Brijesh Singh <brijesh.singh(a)amd.com>
Link: https://lkml.kernel.org/r/YVt1IMjIs7pIZTRR@zn.tnic
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 3d52b094850a..dd5ea1bdf04c 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -10,6 +10,12 @@
#ifdef CONFIG_X86_64
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+#define VC_EXCEPTION_STKSZ EXCEPTION_STKSZ
+#else
+#define VC_EXCEPTION_STKSZ 0
+#endif
+
/* Macro to enforce the same ordering and stack sizes */
#define ESTACKS_MEMBERS(guardsize, optional_stack_size) \
char DF_stack_guard[guardsize]; \
@@ -28,7 +34,7 @@
/* The exception stacks' physical storage. No guard pages required */
struct exception_stacks {
- ESTACKS_MEMBERS(0, 0)
+ ESTACKS_MEMBERS(0, VC_EXCEPTION_STKSZ)
};
/* The effective cpu entry area mapping with guard pages. */
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 53a6837d354b..4d0d1c2b65e1 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -46,16 +46,6 @@ static struct ghcb __initdata *boot_ghcb;
struct sev_es_runtime_data {
struct ghcb ghcb_page;
- /* Physical storage for the per-CPU IST stack of the #VC handler */
- char ist_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
- /*
- * Physical storage for the per-CPU fall-back stack of the #VC handler.
- * The fall-back stack is used when it is not safe to switch back to the
- * interrupted stack in the #VC entry code.
- */
- char fallback_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
/*
* Reserve one page per CPU as backup storage for the unencrypted GHCB.
* It is needed when an NMI happens while the #VC handler uses the real
@@ -99,27 +89,6 @@ DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
/* Needed in vc_early_forward_exception */
void do_early_exception(struct pt_regs *regs, int trapnr);
-static void __init setup_vc_stacks(int cpu)
-{
- struct sev_es_runtime_data *data;
- struct cpu_entry_area *cea;
- unsigned long vaddr;
- phys_addr_t pa;
-
- data = per_cpu(runtime_data, cpu);
- cea = get_cpu_entry_area(cpu);
-
- /* Map #VC IST stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC);
- pa = __pa(data->ist_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-
- /* Map VC fall-back stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC2);
- pa = __pa(data->fallback_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-}
-
static __always_inline bool on_vc_stack(struct pt_regs *regs)
{
unsigned long sp = regs->sp;
@@ -787,7 +756,6 @@ void __init sev_es_init_vc_handling(void)
for_each_possible_cpu(cpu) {
alloc_runtime_data(cpu);
init_ghcb(cpu);
- setup_vc_stacks(cpu);
}
sev_es_setup_play_dead();
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index f5e1e60c9095..6c2f1b76a0b6 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -110,6 +110,13 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
cea_map_stack(NMI);
cea_map_stack(DB);
cea_map_stack(MCE);
+
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+ if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
+ cea_map_stack(VC);
+ cea_map_stack(VC2);
+ }
+ }
}
#else
static inline void percpu_setup_exception_stacks(unsigned int cpu)
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 541ac97186d9ea88491961a46284de3603c914fd Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp(a)suse.de>
Date: Fri, 1 Oct 2021 21:41:20 +0200
Subject: [PATCH] x86/sev: Make the #VC exception stacks part of the default
stacks storage
The size of the exception stacks was increased by the commit in Fixes,
resulting in stack sizes greater than a page in size. The #VC exception
handling was only mapping the first (bottom) page, resulting in an
SEV-ES guest failing to boot.
Make the #VC exception stacks part of the default exception stacks
storage and allocate them with a CONFIG_AMD_MEM_ENCRYPT=y .config. Map
them only when a SEV-ES guest has been detected.
Rip out the custom VC stacks mapping and storage code.
[ bp: Steal and adapt Tom's commit message. ]
Fixes: 7fae4c24a2b8 ("x86: Increase exception stack sizes")
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Tested-by: Tom Lendacky <thomas.lendacky(a)amd.com>
Tested-by: Brijesh Singh <brijesh.singh(a)amd.com>
Link: https://lkml.kernel.org/r/YVt1IMjIs7pIZTRR@zn.tnic
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 3d52b094850a..dd5ea1bdf04c 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -10,6 +10,12 @@
#ifdef CONFIG_X86_64
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+#define VC_EXCEPTION_STKSZ EXCEPTION_STKSZ
+#else
+#define VC_EXCEPTION_STKSZ 0
+#endif
+
/* Macro to enforce the same ordering and stack sizes */
#define ESTACKS_MEMBERS(guardsize, optional_stack_size) \
char DF_stack_guard[guardsize]; \
@@ -28,7 +34,7 @@
/* The exception stacks' physical storage. No guard pages required */
struct exception_stacks {
- ESTACKS_MEMBERS(0, 0)
+ ESTACKS_MEMBERS(0, VC_EXCEPTION_STKSZ)
};
/* The effective cpu entry area mapping with guard pages. */
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 53a6837d354b..4d0d1c2b65e1 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -46,16 +46,6 @@ static struct ghcb __initdata *boot_ghcb;
struct sev_es_runtime_data {
struct ghcb ghcb_page;
- /* Physical storage for the per-CPU IST stack of the #VC handler */
- char ist_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
- /*
- * Physical storage for the per-CPU fall-back stack of the #VC handler.
- * The fall-back stack is used when it is not safe to switch back to the
- * interrupted stack in the #VC entry code.
- */
- char fallback_stack[EXCEPTION_STKSZ] __aligned(PAGE_SIZE);
-
/*
* Reserve one page per CPU as backup storage for the unencrypted GHCB.
* It is needed when an NMI happens while the #VC handler uses the real
@@ -99,27 +89,6 @@ DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
/* Needed in vc_early_forward_exception */
void do_early_exception(struct pt_regs *regs, int trapnr);
-static void __init setup_vc_stacks(int cpu)
-{
- struct sev_es_runtime_data *data;
- struct cpu_entry_area *cea;
- unsigned long vaddr;
- phys_addr_t pa;
-
- data = per_cpu(runtime_data, cpu);
- cea = get_cpu_entry_area(cpu);
-
- /* Map #VC IST stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC);
- pa = __pa(data->ist_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-
- /* Map VC fall-back stack */
- vaddr = CEA_ESTACK_BOT(&cea->estacks, VC2);
- pa = __pa(data->fallback_stack);
- cea_set_pte((void *)vaddr, pa, PAGE_KERNEL);
-}
-
static __always_inline bool on_vc_stack(struct pt_regs *regs)
{
unsigned long sp = regs->sp;
@@ -787,7 +756,6 @@ void __init sev_es_init_vc_handling(void)
for_each_possible_cpu(cpu) {
alloc_runtime_data(cpu);
init_ghcb(cpu);
- setup_vc_stacks(cpu);
}
sev_es_setup_play_dead();
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index f5e1e60c9095..6c2f1b76a0b6 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -110,6 +110,13 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
cea_map_stack(NMI);
cea_map_stack(DB);
cea_map_stack(MCE);
+
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+ if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
+ cea_map_stack(VC);
+ cea_map_stack(VC2);
+ }
+ }
}
#else
static inline void percpu_setup_exception_stacks(unsigned int cpu)
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3ec18fc7831e7d79e2d536dd1f3bc0d3ba425e8a Mon Sep 17 00:00:00 2001
From: Sven Schnelle <svens(a)stackframe.org>
Date: Sat, 13 Nov 2021 20:41:17 +0100
Subject: [PATCH] parisc/entry: fix trace test in syscall exit path
commit 8779e05ba8aa ("parisc: Fix ptrace check on syscall return")
fixed testing of TI_FLAGS. This uncovered a bug in the test mask.
syscall_restore_rfi is only used when the kernel needs to exit to
usespace with single or block stepping and the recovery counter
enabled. The test however used _TIF_SYSCALL_TRACE_MASK, which
includes a lot of bits that shouldn't be tested here.
Fix this by using TIF_SINGLESTEP and TIF_BLOCKSTEP directly.
I encountered this bug by enabling syscall tracepoints. Both in qemu and
on real hardware. As soon as i enabled the tracepoint (sys_exit_read,
but i guess it doesn't really matter which one), i got random page
faults in userspace almost immediately.
Signed-off-by: Sven Schnelle <svens(a)stackframe.org>
Signed-off-by: Helge Deller <deller(a)gmx.de>
diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S
index 57944d6f9ebb..88c188a965d8 100644
--- a/arch/parisc/kernel/entry.S
+++ b/arch/parisc/kernel/entry.S
@@ -1805,7 +1805,7 @@ syscall_restore:
/* Are we being ptraced? */
LDREG TASK_TI_FLAGS(%r1),%r19
- ldi _TIF_SYSCALL_TRACE_MASK,%r2
+ ldi _TIF_SINGLESTEP|_TIF_BLOCKSTEP,%r2
and,COND(=) %r19,%r2,%r0
b,n syscall_restore_rfi
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3ec18fc7831e7d79e2d536dd1f3bc0d3ba425e8a Mon Sep 17 00:00:00 2001
From: Sven Schnelle <svens(a)stackframe.org>
Date: Sat, 13 Nov 2021 20:41:17 +0100
Subject: [PATCH] parisc/entry: fix trace test in syscall exit path
commit 8779e05ba8aa ("parisc: Fix ptrace check on syscall return")
fixed testing of TI_FLAGS. This uncovered a bug in the test mask.
syscall_restore_rfi is only used when the kernel needs to exit to
usespace with single or block stepping and the recovery counter
enabled. The test however used _TIF_SYSCALL_TRACE_MASK, which
includes a lot of bits that shouldn't be tested here.
Fix this by using TIF_SINGLESTEP and TIF_BLOCKSTEP directly.
I encountered this bug by enabling syscall tracepoints. Both in qemu and
on real hardware. As soon as i enabled the tracepoint (sys_exit_read,
but i guess it doesn't really matter which one), i got random page
faults in userspace almost immediately.
Signed-off-by: Sven Schnelle <svens(a)stackframe.org>
Signed-off-by: Helge Deller <deller(a)gmx.de>
diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S
index 57944d6f9ebb..88c188a965d8 100644
--- a/arch/parisc/kernel/entry.S
+++ b/arch/parisc/kernel/entry.S
@@ -1805,7 +1805,7 @@ syscall_restore:
/* Are we being ptraced? */
LDREG TASK_TI_FLAGS(%r1),%r19
- ldi _TIF_SYSCALL_TRACE_MASK,%r2
+ ldi _TIF_SINGLESTEP|_TIF_BLOCKSTEP,%r2
and,COND(=) %r19,%r2,%r0
b,n syscall_restore_rfi
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3ec18fc7831e7d79e2d536dd1f3bc0d3ba425e8a Mon Sep 17 00:00:00 2001
From: Sven Schnelle <svens(a)stackframe.org>
Date: Sat, 13 Nov 2021 20:41:17 +0100
Subject: [PATCH] parisc/entry: fix trace test in syscall exit path
commit 8779e05ba8aa ("parisc: Fix ptrace check on syscall return")
fixed testing of TI_FLAGS. This uncovered a bug in the test mask.
syscall_restore_rfi is only used when the kernel needs to exit to
usespace with single or block stepping and the recovery counter
enabled. The test however used _TIF_SYSCALL_TRACE_MASK, which
includes a lot of bits that shouldn't be tested here.
Fix this by using TIF_SINGLESTEP and TIF_BLOCKSTEP directly.
I encountered this bug by enabling syscall tracepoints. Both in qemu and
on real hardware. As soon as i enabled the tracepoint (sys_exit_read,
but i guess it doesn't really matter which one), i got random page
faults in userspace almost immediately.
Signed-off-by: Sven Schnelle <svens(a)stackframe.org>
Signed-off-by: Helge Deller <deller(a)gmx.de>
diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S
index 57944d6f9ebb..88c188a965d8 100644
--- a/arch/parisc/kernel/entry.S
+++ b/arch/parisc/kernel/entry.S
@@ -1805,7 +1805,7 @@ syscall_restore:
/* Are we being ptraced? */
LDREG TASK_TI_FLAGS(%r1),%r19
- ldi _TIF_SYSCALL_TRACE_MASK,%r2
+ ldi _TIF_SINGLESTEP|_TIF_BLOCKSTEP,%r2
and,COND(=) %r19,%r2,%r0
b,n syscall_restore_rfi
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3ec18fc7831e7d79e2d536dd1f3bc0d3ba425e8a Mon Sep 17 00:00:00 2001
From: Sven Schnelle <svens(a)stackframe.org>
Date: Sat, 13 Nov 2021 20:41:17 +0100
Subject: [PATCH] parisc/entry: fix trace test in syscall exit path
commit 8779e05ba8aa ("parisc: Fix ptrace check on syscall return")
fixed testing of TI_FLAGS. This uncovered a bug in the test mask.
syscall_restore_rfi is only used when the kernel needs to exit to
usespace with single or block stepping and the recovery counter
enabled. The test however used _TIF_SYSCALL_TRACE_MASK, which
includes a lot of bits that shouldn't be tested here.
Fix this by using TIF_SINGLESTEP and TIF_BLOCKSTEP directly.
I encountered this bug by enabling syscall tracepoints. Both in qemu and
on real hardware. As soon as i enabled the tracepoint (sys_exit_read,
but i guess it doesn't really matter which one), i got random page
faults in userspace almost immediately.
Signed-off-by: Sven Schnelle <svens(a)stackframe.org>
Signed-off-by: Helge Deller <deller(a)gmx.de>
diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S
index 57944d6f9ebb..88c188a965d8 100644
--- a/arch/parisc/kernel/entry.S
+++ b/arch/parisc/kernel/entry.S
@@ -1805,7 +1805,7 @@ syscall_restore:
/* Are we being ptraced? */
LDREG TASK_TI_FLAGS(%r1),%r19
- ldi _TIF_SYSCALL_TRACE_MASK,%r2
+ ldi _TIF_SINGLESTEP|_TIF_BLOCKSTEP,%r2
and,COND(=) %r19,%r2,%r0
b,n syscall_restore_rfi
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3ec18fc7831e7d79e2d536dd1f3bc0d3ba425e8a Mon Sep 17 00:00:00 2001
From: Sven Schnelle <svens(a)stackframe.org>
Date: Sat, 13 Nov 2021 20:41:17 +0100
Subject: [PATCH] parisc/entry: fix trace test in syscall exit path
commit 8779e05ba8aa ("parisc: Fix ptrace check on syscall return")
fixed testing of TI_FLAGS. This uncovered a bug in the test mask.
syscall_restore_rfi is only used when the kernel needs to exit to
usespace with single or block stepping and the recovery counter
enabled. The test however used _TIF_SYSCALL_TRACE_MASK, which
includes a lot of bits that shouldn't be tested here.
Fix this by using TIF_SINGLESTEP and TIF_BLOCKSTEP directly.
I encountered this bug by enabling syscall tracepoints. Both in qemu and
on real hardware. As soon as i enabled the tracepoint (sys_exit_read,
but i guess it doesn't really matter which one), i got random page
faults in userspace almost immediately.
Signed-off-by: Sven Schnelle <svens(a)stackframe.org>
Signed-off-by: Helge Deller <deller(a)gmx.de>
diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S
index 57944d6f9ebb..88c188a965d8 100644
--- a/arch/parisc/kernel/entry.S
+++ b/arch/parisc/kernel/entry.S
@@ -1805,7 +1805,7 @@ syscall_restore:
/* Are we being ptraced? */
LDREG TASK_TI_FLAGS(%r1),%r19
- ldi _TIF_SYSCALL_TRACE_MASK,%r2
+ ldi _TIF_SINGLESTEP|_TIF_BLOCKSTEP,%r2
and,COND(=) %r19,%r2,%r0
b,n syscall_restore_rfi
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3ec18fc7831e7d79e2d536dd1f3bc0d3ba425e8a Mon Sep 17 00:00:00 2001
From: Sven Schnelle <svens(a)stackframe.org>
Date: Sat, 13 Nov 2021 20:41:17 +0100
Subject: [PATCH] parisc/entry: fix trace test in syscall exit path
commit 8779e05ba8aa ("parisc: Fix ptrace check on syscall return")
fixed testing of TI_FLAGS. This uncovered a bug in the test mask.
syscall_restore_rfi is only used when the kernel needs to exit to
usespace with single or block stepping and the recovery counter
enabled. The test however used _TIF_SYSCALL_TRACE_MASK, which
includes a lot of bits that shouldn't be tested here.
Fix this by using TIF_SINGLESTEP and TIF_BLOCKSTEP directly.
I encountered this bug by enabling syscall tracepoints. Both in qemu and
on real hardware. As soon as i enabled the tracepoint (sys_exit_read,
but i guess it doesn't really matter which one), i got random page
faults in userspace almost immediately.
Signed-off-by: Sven Schnelle <svens(a)stackframe.org>
Signed-off-by: Helge Deller <deller(a)gmx.de>
diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S
index 57944d6f9ebb..88c188a965d8 100644
--- a/arch/parisc/kernel/entry.S
+++ b/arch/parisc/kernel/entry.S
@@ -1805,7 +1805,7 @@ syscall_restore:
/* Are we being ptraced? */
LDREG TASK_TI_FLAGS(%r1),%r19
- ldi _TIF_SYSCALL_TRACE_MASK,%r2
+ ldi _TIF_SINGLESTEP|_TIF_BLOCKSTEP,%r2
and,COND(=) %r19,%r2,%r0
b,n syscall_restore_rfi
The patch below does not apply to the 5.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3ec18fc7831e7d79e2d536dd1f3bc0d3ba425e8a Mon Sep 17 00:00:00 2001
From: Sven Schnelle <svens(a)stackframe.org>
Date: Sat, 13 Nov 2021 20:41:17 +0100
Subject: [PATCH] parisc/entry: fix trace test in syscall exit path
commit 8779e05ba8aa ("parisc: Fix ptrace check on syscall return")
fixed testing of TI_FLAGS. This uncovered a bug in the test mask.
syscall_restore_rfi is only used when the kernel needs to exit to
usespace with single or block stepping and the recovery counter
enabled. The test however used _TIF_SYSCALL_TRACE_MASK, which
includes a lot of bits that shouldn't be tested here.
Fix this by using TIF_SINGLESTEP and TIF_BLOCKSTEP directly.
I encountered this bug by enabling syscall tracepoints. Both in qemu and
on real hardware. As soon as i enabled the tracepoint (sys_exit_read,
but i guess it doesn't really matter which one), i got random page
faults in userspace almost immediately.
Signed-off-by: Sven Schnelle <svens(a)stackframe.org>
Signed-off-by: Helge Deller <deller(a)gmx.de>
diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S
index 57944d6f9ebb..88c188a965d8 100644
--- a/arch/parisc/kernel/entry.S
+++ b/arch/parisc/kernel/entry.S
@@ -1805,7 +1805,7 @@ syscall_restore:
/* Are we being ptraced? */
LDREG TASK_TI_FLAGS(%r1),%r19
- ldi _TIF_SYSCALL_TRACE_MASK,%r2
+ ldi _TIF_SINGLESTEP|_TIF_BLOCKSTEP,%r2
and,COND(=) %r19,%r2,%r0
b,n syscall_restore_rfi
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3ec18fc7831e7d79e2d536dd1f3bc0d3ba425e8a Mon Sep 17 00:00:00 2001
From: Sven Schnelle <svens(a)stackframe.org>
Date: Sat, 13 Nov 2021 20:41:17 +0100
Subject: [PATCH] parisc/entry: fix trace test in syscall exit path
commit 8779e05ba8aa ("parisc: Fix ptrace check on syscall return")
fixed testing of TI_FLAGS. This uncovered a bug in the test mask.
syscall_restore_rfi is only used when the kernel needs to exit to
usespace with single or block stepping and the recovery counter
enabled. The test however used _TIF_SYSCALL_TRACE_MASK, which
includes a lot of bits that shouldn't be tested here.
Fix this by using TIF_SINGLESTEP and TIF_BLOCKSTEP directly.
I encountered this bug by enabling syscall tracepoints. Both in qemu and
on real hardware. As soon as i enabled the tracepoint (sys_exit_read,
but i guess it doesn't really matter which one), i got random page
faults in userspace almost immediately.
Signed-off-by: Sven Schnelle <svens(a)stackframe.org>
Signed-off-by: Helge Deller <deller(a)gmx.de>
diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S
index 57944d6f9ebb..88c188a965d8 100644
--- a/arch/parisc/kernel/entry.S
+++ b/arch/parisc/kernel/entry.S
@@ -1805,7 +1805,7 @@ syscall_restore:
/* Are we being ptraced? */
LDREG TASK_TI_FLAGS(%r1),%r19
- ldi _TIF_SYSCALL_TRACE_MASK,%r2
+ ldi _TIF_SINGLESTEP|_TIF_BLOCKSTEP,%r2
and,COND(=) %r19,%r2,%r0
b,n syscall_restore_rfi
In vcpu_load_eoi_exitmap(), currently the eoi_exit_bitmap[4] array is
initialized only when Hyper-V context is available, in other path it is
just passed to kvm_x86_ops.load_eoi_exitmap() directly from on the stack,
which would cause unexpected interrupt delivery/handling issues, e.g. an
*old* linux kernel that relies on PIT to do clock calibration on KVM might
randomly fail to boot.
Fix it by passing ioapic_handled_vectors to load_eoi_exitmap() when Hyper-V
context is not available.
Fixes: f2bc14b69c38 ("KVM: x86: hyper-v: Prepare to meet unallocated Hyper-V context")
Cc: stable(a)vger.kernel.org
Reviewed-by: Vitaly Kuznetsov <vkuznets(a)redhat.com>
Signed-off-by: Huang Le <huangle1(a)jd.com>
---
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dc7eb5fddfd3..26466f94e31a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9547,12 +9547,16 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
if (!kvm_apic_hw_enabled(vcpu->arch.apic))
return;
- if (to_hv_vcpu(vcpu))
+ if (to_hv_vcpu(vcpu)) {
bitmap_or((ulong *)eoi_exit_bitmap,
vcpu->arch.ioapic_handled_vectors,
to_hv_synic(vcpu)->vec_bitmap, 256);
+ static_call(kvm_x86_load_eoi_exitmap)(vcpu, eoi_exit_bitmap);
+ return;
+ }
- static_call(kvm_x86_load_eoi_exitmap)(vcpu, eoi_exit_bitmap);
+ static_call(kvm_x86_load_eoi_exitmap)(
+ vcpu, (u64 *)vcpu->arch.ioapic_handled_vectors);
}
void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
Access to netdev after free_netdev() will cause use-after-free bug.
Move debug log before free_netdev() call to avoid it.
Cc: stable(a)vger.kernel.org # v4.19+
Signed-off-by: Pavel Skripkin <paskripkin(a)gmail.com>
---
Note about Fixes: tag. The commit introduced it was before this driver
was moved out from staging. I guess, Fixes tag cannot be used here.
Please, let me know if I am wrong.
Cc: Dan Carpenter <dan.carpenter(a)oracle.com>
@Dan, is there a smatch checker for straigthforward use after free bugs?
Like acessing pointer after free was called? I think, adding
free_netdev() to check list might be good idea
I've skimmed througth smatch source and didn't find one, so can you,
please, point out to it if it exists.
With regards,
Pavel Skripkin
---
drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index e0c3c58e2ac7..abd833d94eb3 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -4540,10 +4540,10 @@ static int dpaa2_eth_remove(struct fsl_mc_device *ls_dev)
fsl_mc_portal_free(priv->mc_io);
- free_netdev(net_dev);
-
dev_dbg(net_dev->dev.parent, "Removed interface %s\n", net_dev->name);
+ free_netdev(net_dev);
+
return 0;
}
--
2.33.1
The patch titled
Subject: mm/damon/dbgfs: fix missed use of damon_dbgfs_lock
has been added to the -mm tree. Its filename is
mm-damon-dbgfs-fix-missed-use-of-damon_dbgfs_lock.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-damon-dbgfs-fix-missed-use-of-…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-damon-dbgfs-fix-missed-use-of-…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: SeongJae Park <sj(a)kernel.org>
Subject: mm/damon/dbgfs: fix missed use of damon_dbgfs_lock
DAMON debugfs is supposed to protect dbgfs_ctxs, dbgfs_nr_ctxs, and
dbgfs_dirs using damon_dbgfs_lock. However, some of the code is accessing
the variables without the protection. This commit fixes it by protecting
all such accesses.
Link: https://lkml.kernel.org/r/20211110145758.16558-3-sj@kernel.org
Fixes: 75c1c2b53c78 ("mm/damon/dbgfs: support multiple contexts")
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/damon/dbgfs.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
--- a/mm/damon/dbgfs.c~mm-damon-dbgfs-fix-missed-use-of-damon_dbgfs_lock
+++ a/mm/damon/dbgfs.c
@@ -877,12 +877,14 @@ static ssize_t dbgfs_monitor_on_write(st
return -EINVAL;
}
+ mutex_lock(&damon_dbgfs_lock);
if (!strncmp(kbuf, "on", count)) {
int i;
for (i = 0; i < dbgfs_nr_ctxs; i++) {
if (damon_targets_empty(dbgfs_ctxs[i])) {
kfree(kbuf);
+ mutex_unlock(&damon_dbgfs_lock);
return -EINVAL;
}
}
@@ -892,6 +894,7 @@ static ssize_t dbgfs_monitor_on_write(st
} else {
ret = -EINVAL;
}
+ mutex_unlock(&damon_dbgfs_lock);
if (!ret)
ret = count;
@@ -944,15 +947,16 @@ static int __init __damon_dbgfs_init(voi
static int __init damon_dbgfs_init(void)
{
- int rc;
+ int rc = -ENOMEM;
+ mutex_lock(&damon_dbgfs_lock);
dbgfs_ctxs = kmalloc(sizeof(*dbgfs_ctxs), GFP_KERNEL);
if (!dbgfs_ctxs)
- return -ENOMEM;
+ goto out;
dbgfs_ctxs[0] = dbgfs_new_ctx();
if (!dbgfs_ctxs[0]) {
kfree(dbgfs_ctxs);
- return -ENOMEM;
+ goto out;
}
dbgfs_nr_ctxs = 1;
@@ -963,6 +967,8 @@ static int __init damon_dbgfs_init(void)
pr_err("%s: dbgfs init failed\n", __func__);
}
+out:
+ mutex_unlock(&damon_dbgfs_lock);
return rc;
}
_
Patches currently in -mm which might be from sj(a)kernel.org are
mm-damon-dbgfs-use-__gfp_nowarn-for-user-specified-size-buffer-allocation.patch
mm-damon-dbgfs-fix-missed-use-of-damon_dbgfs_lock.patch
The patch titled
Subject: mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation
has been added to the -mm tree. Its filename is
mm-damon-dbgfs-use-__gfp_nowarn-for-user-specified-size-buffer-allocation.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-damon-dbgfs-use-__gfp_nowarn-f…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-damon-dbgfs-use-__gfp_nowarn-f…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: SeongJae Park <sj(a)kernel.org>
Subject: mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation
Patch series "DAMON fixes".
This patch (of 2):
DAMON users can trigger below warning in '__alloc_pages()' by invoking
write() to some DAMON debugfs files with arbitrarily high count argument,
because DAMON debugfs interface allocates some buffers based on the
user-specified 'count'.
if (unlikely(order >= MAX_ORDER)) {
WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
return NULL;
}
Because the DAMON debugfs interface code checks failure of the
'kmalloc()', this commit simply suppresses the warnings by adding
'__GFP_NOWARN' flag.
Link: https://lkml.kernel.org/r/20211110145758.16558-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211110145758.16558-2-sj@kernel.org
Fixes: 4bc05954d007 ("mm/damon: implement a debugfs-based user space interface")
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/damon/dbgfs.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
--- a/mm/damon/dbgfs.c~mm-damon-dbgfs-use-__gfp_nowarn-for-user-specified-size-buffer-allocation
+++ a/mm/damon/dbgfs.c
@@ -32,7 +32,7 @@ static char *user_input_str(const char _
if (*ppos)
return ERR_PTR(-EINVAL);
- kbuf = kmalloc(count + 1, GFP_KERNEL);
+ kbuf = kmalloc(count + 1, GFP_KERNEL | __GFP_NOWARN);
if (!kbuf)
return ERR_PTR(-ENOMEM);
@@ -133,7 +133,7 @@ static ssize_t dbgfs_schemes_read(struct
char *kbuf;
ssize_t len;
- kbuf = kmalloc(count, GFP_KERNEL);
+ kbuf = kmalloc(count, GFP_KERNEL | __GFP_NOWARN);
if (!kbuf)
return -ENOMEM;
@@ -452,7 +452,7 @@ static ssize_t dbgfs_init_regions_read(s
char *kbuf;
ssize_t len;
- kbuf = kmalloc(count, GFP_KERNEL);
+ kbuf = kmalloc(count, GFP_KERNEL | __GFP_NOWARN);
if (!kbuf)
return -ENOMEM;
@@ -578,7 +578,7 @@ static ssize_t dbgfs_kdamond_pid_read(st
char *kbuf;
ssize_t len;
- kbuf = kmalloc(count, GFP_KERNEL);
+ kbuf = kmalloc(count, GFP_KERNEL | __GFP_NOWARN);
if (!kbuf)
return -ENOMEM;
_
Patches currently in -mm which might be from sj(a)kernel.org are
mm-damon-dbgfs-use-__gfp_nowarn-for-user-specified-size-buffer-allocation.patch
mm-damon-dbgfs-fix-missed-use-of-damon_dbgfs_lock.patch
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 86432a6dca9bed79111990851df5756d3eb5f57c Mon Sep 17 00:00:00 2001
From: Gao Xiang <hsiangkao(a)linux.alibaba.com>
Date: Thu, 4 Nov 2021 02:20:06 +0800
Subject: [PATCH] erofs: fix unsafe pagevec reuse of hooked pclusters
There are pclusters in runtime marked with Z_EROFS_PCLUSTER_TAIL
before actual I/O submission. Thus, the decompression chain can be
extended if the following pcluster chain hooks such tail pcluster.
As the related comment mentioned, if some page is made of a hooked
pcluster and another followed pcluster, it can be reused for in-place
I/O (since I/O should be submitted anyway):
_______________________________________________________________
| tail (partial) page | head (partial) page |
|_____PRIMARY_HOOKED___|____________PRIMARY_FOLLOWED____________|
However, it's by no means safe to reuse as pagevec since if such
PRIMARY_HOOKED pclusters finally move into bypass chain without I/O
submission. It's somewhat hard to reproduce with LZ4 and I just found
it (general protection fault) by ro_fsstressing a LZMA image for long
time.
I'm going to actively clean up related code together with multi-page
folio adaption in the next few months. Let's address it directly for
easier backporting for now.
Call trace for reference:
z_erofs_decompress_pcluster+0x10a/0x8a0 [erofs]
z_erofs_decompress_queue.isra.36+0x3c/0x60 [erofs]
z_erofs_runqueue+0x5f3/0x840 [erofs]
z_erofs_readahead+0x1e8/0x320 [erofs]
read_pages+0x91/0x270
page_cache_ra_unbounded+0x18b/0x240
filemap_get_pages+0x10a/0x5f0
filemap_read+0xa9/0x330
new_sync_read+0x11b/0x1a0
vfs_read+0xf1/0x190
Link: https://lore.kernel.org/r/20211103182006.4040-1-xiang@kernel.org
Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support")
Cc: <stable(a)vger.kernel.org> # 4.19+
Reviewed-by: Chao Yu <chao(a)kernel.org>
Signed-off-by: Gao Xiang <hsiangkao(a)linux.alibaba.com>
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 11c7a1aaebad..eb51df4a9f77 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -373,8 +373,8 @@ static bool z_erofs_try_inplace_io(struct z_erofs_collector *clt,
/* callers must be with collection lock held */
static int z_erofs_attach_page(struct z_erofs_collector *clt,
- struct page *page,
- enum z_erofs_page_type type)
+ struct page *page, enum z_erofs_page_type type,
+ bool pvec_safereuse)
{
int ret;
@@ -384,9 +384,9 @@ static int z_erofs_attach_page(struct z_erofs_collector *clt,
z_erofs_try_inplace_io(clt, page))
return 0;
- ret = z_erofs_pagevec_enqueue(&clt->vector, page, type);
+ ret = z_erofs_pagevec_enqueue(&clt->vector, page, type,
+ pvec_safereuse);
clt->cl->vcnt += (unsigned int)ret;
-
return ret ? 0 : -EAGAIN;
}
@@ -729,7 +729,8 @@ static int z_erofs_do_read_page(struct z_erofs_decompress_frontend *fe,
tight &= (clt->mode >= COLLECT_PRIMARY_FOLLOWED);
retry:
- err = z_erofs_attach_page(clt, page, page_type);
+ err = z_erofs_attach_page(clt, page, page_type,
+ clt->mode >= COLLECT_PRIMARY_FOLLOWED);
/* should allocate an additional short-lived page for pagevec */
if (err == -EAGAIN) {
struct page *const newpage =
@@ -737,7 +738,7 @@ static int z_erofs_do_read_page(struct z_erofs_decompress_frontend *fe,
set_page_private(newpage, Z_EROFS_SHORTLIVED_PAGE);
err = z_erofs_attach_page(clt, newpage,
- Z_EROFS_PAGE_TYPE_EXCLUSIVE);
+ Z_EROFS_PAGE_TYPE_EXCLUSIVE, true);
if (!err)
goto retry;
}
diff --git a/fs/erofs/zpvec.h b/fs/erofs/zpvec.h
index dfd7fe0503bb..b05464f4a808 100644
--- a/fs/erofs/zpvec.h
+++ b/fs/erofs/zpvec.h
@@ -106,11 +106,18 @@ static inline void z_erofs_pagevec_ctor_init(struct z_erofs_pagevec_ctor *ctor,
static inline bool z_erofs_pagevec_enqueue(struct z_erofs_pagevec_ctor *ctor,
struct page *page,
- enum z_erofs_page_type type)
+ enum z_erofs_page_type type,
+ bool pvec_safereuse)
{
- if (!ctor->next && type)
- if (ctor->index + 1 == ctor->nr)
+ if (!ctor->next) {
+ /* some pages cannot be reused as pvec safely without I/O */
+ if (type == Z_EROFS_PAGE_TYPE_EXCLUSIVE && !pvec_safereuse)
+ type = Z_EROFS_VLE_PAGE_TYPE_TAIL_SHARED;
+
+ if (type != Z_EROFS_PAGE_TYPE_EXCLUSIVE &&
+ ctor->index + 1 == ctor->nr)
return false;
+ }
if (ctor->index >= ctor->nr)
z_erofs_pagevec_ctor_pagedown(ctor, false);
The patch below does not apply to the 5.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0706f74f719e6e72c3a862ab2990796578fa73cc Mon Sep 17 00:00:00 2001
From: Masahiro Yamada <masahiroy(a)kernel.org>
Date: Wed, 10 Nov 2021 00:01:45 +0900
Subject: [PATCH] MIPS: fix *-pkg builds for loongson2ef platform
Since commit 805b2e1d427a ("kbuild: include Makefile.compiler only when
compiler is needed"), package builds for the loongson2f platform fail.
$ make ARCH=mips CROSS_COMPILE=mips64-linux- lemote2f_defconfig bindeb-pkg
[ snip ]
sh ./scripts/package/builddeb
arch/mips/loongson2ef//Platform:36: *** only binutils >= 2.20.2 have needed option -mfix-loongson2f-nop. Stop.
cp: cannot stat '': No such file or directory
make[5]: *** [scripts/Makefile.package:87: intdeb-pkg] Error 1
make[4]: *** [Makefile:1558: intdeb-pkg] Error 2
make[3]: *** [debian/rules:13: binary-arch] Error 2
dpkg-buildpackage: error: debian/rules binary subprocess returned exit status 2
make[2]: *** [scripts/Makefile.package:83: bindeb-pkg] Error 2
make[1]: *** [Makefile:1558: bindeb-pkg] Error 2
make: *** [Makefile:350: __build_one_by_one] Error 2
The reason is because "make image_name" fails.
$ make ARCH=mips CROSS_COMPILE=mips64-linux- image_name
arch/mips/loongson2ef//Platform:36: *** only binutils >= 2.20.2 have needed option -mfix-loongson2f-nop. Stop.
In general, adding $(error ...) in the parse stage is troublesome,
and it is pointless to check toolchains even if we are not building
anything. Do not include Kbuild.platform in such cases.
Fixes: 805b2e1d427a ("kbuild: include Makefile.compiler only when compiler is needed")
Reported-by: Jason Self <jason(a)bluehome.net>
Signed-off-by: Masahiro Yamada <masahiroy(a)kernel.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend(a)alpha.franken.de>
diff --git a/arch/mips/Makefile b/arch/mips/Makefile
index ea3cd080a1c7..f7b58da2f388 100644
--- a/arch/mips/Makefile
+++ b/arch/mips/Makefile
@@ -254,7 +254,9 @@ endif
#
# Board-dependent options and extra files
#
+ifdef need-compiler
include $(srctree)/arch/mips/Kbuild.platforms
+endif
ifdef CONFIG_PHYSICAL_START
load-y = $(CONFIG_PHYSICAL_START)
commit 0706f74f719e6e72c3a862ab2990796578fa73cc upstream.
Since commit 805b2e1d427a ("kbuild: include Makefile.compiler only when
compiler is needed"), package builds for the loongson2f platform fail.
$ make ARCH=mips CROSS_COMPILE=mips64-linux- lemote2f_defconfig bindeb-pkg
[ snip ]
sh ./scripts/package/builddeb
arch/mips/loongson2ef//Platform:36: *** only binutils >= 2.20.2 have needed option -mfix-loongson2f-nop. Stop.
cp: cannot stat '': No such file or directory
make[5]: *** [scripts/Makefile.package:87: intdeb-pkg] Error 1
make[4]: *** [Makefile:1558: intdeb-pkg] Error 2
make[3]: *** [debian/rules:13: binary-arch] Error 2
dpkg-buildpackage: error: debian/rules binary subprocess returned exit status 2
make[2]: *** [scripts/Makefile.package:83: bindeb-pkg] Error 2
make[1]: *** [Makefile:1558: bindeb-pkg] Error 2
make: *** [Makefile:350: __build_one_by_one] Error 2
The reason is because "make image_name" fails.
$ make ARCH=mips CROSS_COMPILE=mips64-linux- image_name
arch/mips/loongson2ef//Platform:36: *** only binutils >= 2.20.2 have needed option -mfix-loongson2f-nop. Stop.
In general, adding $(error ...) in the parse stage is troublesome,
and it is pointless to check toolchains even if we are not building
anything. Do not include Kbuild.platform in such cases.
Fixes: 805b2e1d427a ("kbuild: include Makefile.compiler only when compiler is needed")
Reported-by: Jason Self <jason(a)bluehome.net>
Signed-off-by: Masahiro Yamada <masahiroy(a)kernel.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend(a)alpha.franken.de>
---
arch/mips/Makefile | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/mips/Makefile b/arch/mips/Makefile
index 653befc1b176..0dfef0beaaaa 100644
--- a/arch/mips/Makefile
+++ b/arch/mips/Makefile
@@ -254,7 +254,9 @@ endif
#
# Board-dependent options and extra files
#
+ifdef need-compiler
include arch/mips/Kbuild.platforms
+endif
ifdef CONFIG_PHYSICAL_START
load-y = $(CONFIG_PHYSICAL_START)
--
2.30.2
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 86432a6dca9bed79111990851df5756d3eb5f57c Mon Sep 17 00:00:00 2001
From: Gao Xiang <hsiangkao(a)linux.alibaba.com>
Date: Thu, 4 Nov 2021 02:20:06 +0800
Subject: [PATCH] erofs: fix unsafe pagevec reuse of hooked pclusters
There are pclusters in runtime marked with Z_EROFS_PCLUSTER_TAIL
before actual I/O submission. Thus, the decompression chain can be
extended if the following pcluster chain hooks such tail pcluster.
As the related comment mentioned, if some page is made of a hooked
pcluster and another followed pcluster, it can be reused for in-place
I/O (since I/O should be submitted anyway):
_______________________________________________________________
| tail (partial) page | head (partial) page |
|_____PRIMARY_HOOKED___|____________PRIMARY_FOLLOWED____________|
However, it's by no means safe to reuse as pagevec since if such
PRIMARY_HOOKED pclusters finally move into bypass chain without I/O
submission. It's somewhat hard to reproduce with LZ4 and I just found
it (general protection fault) by ro_fsstressing a LZMA image for long
time.
I'm going to actively clean up related code together with multi-page
folio adaption in the next few months. Let's address it directly for
easier backporting for now.
Call trace for reference:
z_erofs_decompress_pcluster+0x10a/0x8a0 [erofs]
z_erofs_decompress_queue.isra.36+0x3c/0x60 [erofs]
z_erofs_runqueue+0x5f3/0x840 [erofs]
z_erofs_readahead+0x1e8/0x320 [erofs]
read_pages+0x91/0x270
page_cache_ra_unbounded+0x18b/0x240
filemap_get_pages+0x10a/0x5f0
filemap_read+0xa9/0x330
new_sync_read+0x11b/0x1a0
vfs_read+0xf1/0x190
Link: https://lore.kernel.org/r/20211103182006.4040-1-xiang@kernel.org
Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support")
Cc: <stable(a)vger.kernel.org> # 4.19+
Reviewed-by: Chao Yu <chao(a)kernel.org>
Signed-off-by: Gao Xiang <hsiangkao(a)linux.alibaba.com>
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 11c7a1aaebad..eb51df4a9f77 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -373,8 +373,8 @@ static bool z_erofs_try_inplace_io(struct z_erofs_collector *clt,
/* callers must be with collection lock held */
static int z_erofs_attach_page(struct z_erofs_collector *clt,
- struct page *page,
- enum z_erofs_page_type type)
+ struct page *page, enum z_erofs_page_type type,
+ bool pvec_safereuse)
{
int ret;
@@ -384,9 +384,9 @@ static int z_erofs_attach_page(struct z_erofs_collector *clt,
z_erofs_try_inplace_io(clt, page))
return 0;
- ret = z_erofs_pagevec_enqueue(&clt->vector, page, type);
+ ret = z_erofs_pagevec_enqueue(&clt->vector, page, type,
+ pvec_safereuse);
clt->cl->vcnt += (unsigned int)ret;
-
return ret ? 0 : -EAGAIN;
}
@@ -729,7 +729,8 @@ static int z_erofs_do_read_page(struct z_erofs_decompress_frontend *fe,
tight &= (clt->mode >= COLLECT_PRIMARY_FOLLOWED);
retry:
- err = z_erofs_attach_page(clt, page, page_type);
+ err = z_erofs_attach_page(clt, page, page_type,
+ clt->mode >= COLLECT_PRIMARY_FOLLOWED);
/* should allocate an additional short-lived page for pagevec */
if (err == -EAGAIN) {
struct page *const newpage =
@@ -737,7 +738,7 @@ static int z_erofs_do_read_page(struct z_erofs_decompress_frontend *fe,
set_page_private(newpage, Z_EROFS_SHORTLIVED_PAGE);
err = z_erofs_attach_page(clt, newpage,
- Z_EROFS_PAGE_TYPE_EXCLUSIVE);
+ Z_EROFS_PAGE_TYPE_EXCLUSIVE, true);
if (!err)
goto retry;
}
diff --git a/fs/erofs/zpvec.h b/fs/erofs/zpvec.h
index dfd7fe0503bb..b05464f4a808 100644
--- a/fs/erofs/zpvec.h
+++ b/fs/erofs/zpvec.h
@@ -106,11 +106,18 @@ static inline void z_erofs_pagevec_ctor_init(struct z_erofs_pagevec_ctor *ctor,
static inline bool z_erofs_pagevec_enqueue(struct z_erofs_pagevec_ctor *ctor,
struct page *page,
- enum z_erofs_page_type type)
+ enum z_erofs_page_type type,
+ bool pvec_safereuse)
{
- if (!ctor->next && type)
- if (ctor->index + 1 == ctor->nr)
+ if (!ctor->next) {
+ /* some pages cannot be reused as pvec safely without I/O */
+ if (type == Z_EROFS_PAGE_TYPE_EXCLUSIVE && !pvec_safereuse)
+ type = Z_EROFS_VLE_PAGE_TYPE_TAIL_SHARED;
+
+ if (type != Z_EROFS_PAGE_TYPE_EXCLUSIVE &&
+ ctor->index + 1 == ctor->nr)
return false;
+ }
if (ctor->index >= ctor->nr)
z_erofs_pagevec_ctor_pagedown(ctor, false);
The patch titled
Subject: proc/vmcore: fix clearing user buffer by properly using clear_user()
has been added to the -mm tree. Its filename is
proc-vmcore-fix-clearing-user-buffer-by-properly-using-clear_user.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/proc-vmcore-fix-clearing-user-buf…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/proc-vmcore-fix-clearing-user-buf…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: proc/vmcore: fix clearing user buffer by properly using clear_user()
To clear a user buffer we cannot simply use memset, we have to use
clear_user(). With a virtio-mem device that registers a vmcore_cb and has
some logically unplugged memory inside an added Linux memory block, I can
easily trigger a BUG by copying the vmcore via "cp":
[ 11.327580] systemd[1]: Starting Kdump Vmcore Save Service...
[ 11.339697] kdump[420]: Kdump is using the default log level(3).
[ 11.370964] kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
[ 11.373997] kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
[ 11.385357] kdump[465]: saving vmcore-dmesg.txt complete
[ 11.386722] kdump[467]: saving vmcore
[ 16.531275] BUG: unable to handle page fault for address: 00007f2374e01000
[ 16.531705] #PF: supervisor write access in kernel mode
[ 16.532037] #PF: error_code(0x0003) - permissions violation
[ 16.532396] PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
[ 16.532872] Oops: 0003 [#1] PREEMPT SMP NOPTI
[ 16.533154] CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
[ 16.533513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
[ 16.534198] RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
[ 16.534552] Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
[ 16.535670] RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
[ 16.535998] RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
[ 16.536441] RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
[ 16.536878] RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
[ 16.537315] R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
[ 16.537755] R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
[ 16.538200] FS: 00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
[ 16.538696] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 16.539055] CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
[ 16.539510] Call Trace:
[ 16.539679] <TASK>
[ 16.539828] read_vmcore+0x236/0x2c0
[ 16.540063] ? enqueue_hrtimer+0x2f/0x80
[ 16.540323] ? inode_security+0x22/0x60
[ 16.540572] proc_reg_read+0x55/0xa0
[ 16.540807] vfs_read+0x95/0x190
[ 16.541022] ksys_read+0x4f/0xc0
[ 16.541238] do_syscall_64+0x3b/0x90
[ 16.541475] entry_SYSCALL_64_after_hwframe+0x44/0xae
Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access
Prevention (SMAP)", which is used to detect wrong access from the kernel
to user buffers like this: SMAP triggers a permissions violation on wrong
access. In the x86-64 variant of clear_user(), SMAP is properly handled
via clac()+stac().
To fix, properly use clear_user() when we're dealing with a user buffer.
Link: https://lkml.kernel.org/r/20211112092750.6921-1-david@redhat.com
Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages")
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Baoquan He <bhe(a)redhat.com>
Cc: Dave Young <dyoung(a)redhat.com>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Vivek Goyal <vgoyal(a)redhat.com>
Cc: Philipp Rudo <prudo(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/proc/vmcore.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
--- a/fs/proc/vmcore.c~proc-vmcore-fix-clearing-user-buffer-by-properly-using-clear_user
+++ a/fs/proc/vmcore.c
@@ -154,9 +154,13 @@ ssize_t read_from_oldmem(char *buf, size
nr_bytes = count;
/* If pfn is not ram, return zeros for sparse dump files */
- if (!pfn_is_ram(pfn))
- memset(buf, 0, nr_bytes);
- else {
+ if (!pfn_is_ram(pfn)) {
+ tmp = 0;
+ if (!userbuf)
+ memset(buf, 0, nr_bytes);
+ else if (clear_user(buf, nr_bytes))
+ tmp = -EFAULT;
+ } else {
if (encrypted)
tmp = copy_oldmem_page_encrypted(pfn, buf,
nr_bytes,
@@ -165,12 +169,12 @@ ssize_t read_from_oldmem(char *buf, size
else
tmp = copy_oldmem_page(pfn, buf, nr_bytes,
offset, userbuf);
-
- if (tmp < 0) {
- up_read(&vmcore_cb_rwsem);
- return tmp;
- }
}
+ if (tmp < 0) {
+ up_read(&vmcore_cb_rwsem);
+ return tmp;
+ }
+
*ppos += nr_bytes;
count -= nr_bytes;
buf += nr_bytes;
_
Patches currently in -mm which might be from david(a)redhat.com are
proc-vmcore-fix-clearing-user-buffer-by-properly-using-clear_user.patch
Booting to Android userspace on 5.14 or newer triggers the following
SELinux denial:
avc: denied { sys_nice } for comm="init" capability=23
scontext=u:r:init:s0 tcontext=u:r:init:s0 tclass=capability
permissive=0
Init is PID 0 running as root, so it already has CAP_SYS_ADMIN. For
better compatibility with older SEPolicy, check ADMIN before NICE.
Fixes: 9d3a39a5f1e4 ("block: grant IOPRIO_CLASS_RT to CAP_SYS_NICE")
Signed-off-by: Alistair Delva <adelva(a)google.com>
Cc: Khazhismel Kumykov <khazhy(a)google.com>
Cc: Bart Van Assche <bvanassche(a)acm.org>
Cc: Serge Hallyn <serge(a)hallyn.com>
Cc: Jens Axboe <axboe(a)kernel.dk>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Paul Moore <paul(a)paul-moore.com>
Cc: selinux(a)vger.kernel.org
Cc: linux-security-module(a)vger.kernel.org
Cc: kernel-team(a)android.com
Cc: stable(a)vger.kernel.org # v5.14+
---
v2: added comment requested by Jens
block/ioprio.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/block/ioprio.c b/block/ioprio.c
index 0e4ff245f2bf..313c14a70bbd 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -69,7 +69,14 @@ int ioprio_check_cap(int ioprio)
switch (class) {
case IOPRIO_CLASS_RT:
- if (!capable(CAP_SYS_NICE) && !capable(CAP_SYS_ADMIN))
+ /*
+ * Originally this only checked for CAP_SYS_ADMIN,
+ * which was implicitly allowed for pid 0 by security
+ * modules such as SELinux. Make sure we check
+ * CAP_SYS_ADMIN first to avoid a denial/avc for
+ * possibly missing CAP_SYS_NICE permission.
+ */
+ if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_NICE))
return -EPERM;
fallthrough;
/* rt has prio field too */
--
2.34.0.rc1.387.gb447b232ab-goog
The patch titled
Subject: hugetlb: fix hugetlb cgroup refcounting during mremap
has been added to the -mm tree. Its filename is
hugetlb-fix-hugetlb-cgroup-refcounting-during-mremap.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/hugetlb-fix-hugetlb-cgroup-refcou…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/hugetlb-fix-hugetlb-cgroup-refcou…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Bui Quang Minh <minhquangbui99(a)gmail.com>
Subject: hugetlb: fix hugetlb cgroup refcounting during mremap
When hugetlb_vm_op_open() is called during copy_vma(), we may take the
reference to resv_map->css. Later, when clearing the reservation pointer
of old_vma after transferring it to new_vma, we forget to drop the
reference to resv_map->css. This leads to a reference leak of css.
Fixes this by adding a check to drop reservation css reference in
clear_vma_resv_huge_pages()
Link: https://lkml.kernel.org/r/20211113154412.91134-1-minhquangbui99@gmail.com
Fixes: 550a7d60bd5e35 ("mm, hugepages: add mremap() support for hugepage backed vma")
Signed-off-by: Bui Quang Minh <minhquangbui99(a)gmail.com>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Mina Almasry <almasrymina(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/hugetlb_cgroup.h | 12 ++++++++++++
mm/hugetlb.c | 4 +++-
2 files changed, 15 insertions(+), 1 deletion(-)
--- a/include/linux/hugetlb_cgroup.h~hugetlb-fix-hugetlb-cgroup-refcounting-during-mremap
+++ a/include/linux/hugetlb_cgroup.h
@@ -128,6 +128,13 @@ static inline void resv_map_dup_hugetlb_
css_get(resv_map->css);
}
+static inline void resv_map_put_hugetlb_cgroup_uncharge_info(
+ struct resv_map *resv_map)
+{
+ if (resv_map->css)
+ css_put(resv_map->css);
+}
+
extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup **ptr);
extern int hugetlb_cgroup_charge_cgroup_rsvd(int idx, unsigned long nr_pages,
@@ -210,6 +217,11 @@ static inline void resv_map_dup_hugetlb_
struct resv_map *resv_map)
{
}
+
+static inline void resv_map_put_hugetlb_cgroup_uncharge_info(
+ struct resv_map *resv_map)
+{
+}
static inline int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup **ptr)
--- a/mm/hugetlb.c~hugetlb-fix-hugetlb-cgroup-refcounting-during-mremap
+++ a/mm/hugetlb.c
@@ -1037,8 +1037,10 @@ void clear_vma_resv_huge_pages(struct vm
*/
struct resv_map *reservations = vma_resv_map(vma);
- if (reservations && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
+ if (reservations && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
+ resv_map_put_hugetlb_cgroup_uncharge_info(reservations);
kref_put(&reservations->refs, resv_map_release);
+ }
reset_vma_resv_huge_pages(vma);
}
_
Patches currently in -mm which might be from minhquangbui99(a)gmail.com are
hugetlb-fix-hugetlb-cgroup-refcounting-during-mremap.patch
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3735459037114d31e5acd9894fad9aed104231a0 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Tue, 9 Nov 2021 14:53:57 +0100
Subject: [PATCH] PCI/MSI: Destroy sysfs before freeing entries
free_msi_irqs() frees the MSI entries before destroying the sysfs entries
which are exposing them. Nothing prevents a concurrent free while a sysfs
file is read and accesses the possibly freed entry.
Move the sysfs release ahead of freeing the entries.
Fixes: 1c51b50c2995 ("PCI/MSI: Export MSI mode using attributes, not kobjects")
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Bjorn Helgaas <helgaas(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/87sfw5305m.ffs@tglx
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 70433013897b..48e3f4e47b29 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -368,6 +368,11 @@ static void free_msi_irqs(struct pci_dev *dev)
for (i = 0; i < entry->nvec_used; i++)
BUG_ON(irq_has_action(entry->irq + i));
+ if (dev->msi_irq_groups) {
+ msi_destroy_sysfs(&dev->dev, dev->msi_irq_groups);
+ dev->msi_irq_groups = NULL;
+ }
+
pci_msi_teardown_msi_irqs(dev);
list_for_each_entry_safe(entry, tmp, msi_list, list) {
@@ -379,11 +384,6 @@ static void free_msi_irqs(struct pci_dev *dev)
list_del(&entry->list);
free_msi_entry(entry);
}
-
- if (dev->msi_irq_groups) {
- msi_destroy_sysfs(&dev->dev, dev->msi_irq_groups);
- dev->msi_irq_groups = NULL;
- }
}
static void pci_intx_for_msi(struct pci_dev *dev, int enable)
While working on supporting the Intel HDR backlight interface, I noticed
that there's a couple of laptops that will very rarely manage to boot up
without detecting Intel HDR backlight support - even though it's supported
on the system. One example of such a laptop is the Lenovo P17 1st
generation.
Following some investigation Ville Syrjälä did through the docs they have
available to them, they discovered that there's actually supposed to be a
30ms wait after writing the source OUI before we begin setting up the rest
of the backlight interface.
This seems to be correct, as adding this 30ms delay seems to have
completely fixed the probing issues I was previously seeing. So - let's
start performing a 30ms wait after writing the OUI, which we do in a manner
similar to how we keep track of PPS delays (e.g. record the timestamp of
the OUI write, and then wait for however many ms are left since that
timestamp right before we interact with the backlight) in order to avoid
waiting any longer then we need to. As well, this also avoids us performing
this delay on systems where we don't end up using the HDR backlight
interface.
Signed-off-by: Lyude Paul <lyude(a)redhat.com>
Fixes: 4a8d79901d5b ("drm/i915/dp: Enable Intel's HDR backlight interface (only SDR for now)")
Cc: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Cc: <stable(a)vger.kernel.org> # v5.12+
---
drivers/gpu/drm/i915/display/intel_display_types.h | 3 +++
drivers/gpu/drm/i915/display/intel_dp.c | 3 +++
drivers/gpu/drm/i915/display/intel_dp_aux_backlight.c | 11 +++++++++++
3 files changed, 17 insertions(+)
diff --git a/drivers/gpu/drm/i915/display/intel_display_types.h b/drivers/gpu/drm/i915/display/intel_display_types.h
index ea1e8a6e10b0..b9c967837872 100644
--- a/drivers/gpu/drm/i915/display/intel_display_types.h
+++ b/drivers/gpu/drm/i915/display/intel_display_types.h
@@ -1653,6 +1653,9 @@ struct intel_dp {
struct intel_dp_pcon_frl frl;
struct intel_psr psr;
+
+ /* When we last wrote the OUI for eDP */
+ unsigned long last_oui_write;
};
enum lspcon_vendor {
diff --git a/drivers/gpu/drm/i915/display/intel_dp.c b/drivers/gpu/drm/i915/display/intel_dp.c
index 0a424bf69396..77d9a9390c1e 100644
--- a/drivers/gpu/drm/i915/display/intel_dp.c
+++ b/drivers/gpu/drm/i915/display/intel_dp.c
@@ -29,6 +29,7 @@
#include <linux/i2c.h>
#include <linux/notifier.h>
#include <linux/slab.h>
+#include <linux/timekeeping.h>
#include <linux/types.h>
#include <asm/byteorder.h>
@@ -2010,6 +2011,8 @@ intel_edp_init_source_oui(struct intel_dp *intel_dp, bool careful)
if (drm_dp_dpcd_write(&intel_dp->aux, DP_SOURCE_OUI, oui, sizeof(oui)) < 0)
drm_err(&i915->drm, "Failed to write source OUI\n");
+
+ intel_dp->last_oui_write = jiffies;
}
/* If the device supports it, try to set the power state appropriately */
diff --git a/drivers/gpu/drm/i915/display/intel_dp_aux_backlight.c b/drivers/gpu/drm/i915/display/intel_dp_aux_backlight.c
index 569d17b4d00f..2c35b999ec2c 100644
--- a/drivers/gpu/drm/i915/display/intel_dp_aux_backlight.c
+++ b/drivers/gpu/drm/i915/display/intel_dp_aux_backlight.c
@@ -96,6 +96,13 @@
#define INTEL_EDP_BRIGHTNESS_OPTIMIZATION_1 0x359
/* Intel EDP backlight callbacks */
+static void
+wait_for_oui(struct drm_i915_private *i915, struct intel_dp *intel_dp)
+{
+ drm_dbg_kms(&i915->drm, "Performing OUI wait\n");
+ wait_remaining_ms_from_jiffies(intel_dp->last_oui_write, 30);
+}
+
static bool
intel_dp_aux_supports_hdr_backlight(struct intel_connector *connector)
{
@@ -106,6 +113,8 @@ intel_dp_aux_supports_hdr_backlight(struct intel_connector *connector)
int ret;
u8 tcon_cap[4];
+ wait_for_oui(i915, intel_dp);
+
ret = drm_dp_dpcd_read(aux, INTEL_EDP_HDR_TCON_CAP0, tcon_cap, sizeof(tcon_cap));
if (ret != sizeof(tcon_cap))
return false;
@@ -204,6 +213,8 @@ intel_dp_aux_hdr_enable_backlight(const struct intel_crtc_state *crtc_state,
int ret;
u8 old_ctrl, ctrl;
+ wait_for_oui(i915, intel_dp);
+
ret = drm_dp_dpcd_readb(&intel_dp->aux, INTEL_EDP_HDR_GETSET_CTRL_PARAMS, &old_ctrl);
if (ret != 1) {
drm_err(&i915->drm, "Failed to read current backlight control mode: %d\n", ret);
--
2.31.1
The patch titled
Subject: mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag
has been added to the -mm tree. Its filename is
mm-kmemleak-slob-respect-slab_noleaktrace-flag.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-kmemleak-slob-respect-slab_nol…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-kmemleak-slob-respect-slab_nol…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Rustam Kovhaev <rkovhaev(a)gmail.com>
Subject: mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag
When kmemleak is enabled for SLOB, system does not boot and does not print
anything to the console. At the very early stage in the boot process we
hit infinite recursion from kmemleak_init() and eventually kernel crashes.
kmemleak_init() specifies SLAB_NOLEAKTRACE for KMEM_CACHE(), but
kmem_cache_create_usercopy() removes it because CACHE_CREATE_MASK is not
valid for SLOB.
Let's fix CACHE_CREATE_MASK and make kmemleak work with SLOB
Link: https://lkml.kernel.org/r/20211115020850.3154366-1-rkovhaev@gmail.com
Fixes: d8843922fba4 ("slab: Ignore internal flags in cache creation")
Signed-off-by: Rustam Kovhaev <rkovhaev(a)gmail.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Glauber Costa <glommer(a)parallels.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slab.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/slab.h~mm-kmemleak-slob-respect-slab_noleaktrace-flag
+++ a/mm/slab.h
@@ -147,7 +147,7 @@ static inline slab_flags_t kmem_cache_fl
#define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE | SLAB_RECLAIM_ACCOUNT | \
SLAB_TEMPORARY | SLAB_ACCOUNT)
#else
-#define SLAB_CACHE_FLAGS (0)
+#define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE)
#endif
/* Common flags available with current configuration */
_
Patches currently in -mm which might be from rkovhaev(a)gmail.com are
mm-kmemleak-slob-respect-slab_noleaktrace-flag.patch
The patch titled
Subject: hexagon: clean up timer-regs.h
has been added to the -mm tree. Its filename is
hexagon-clean-up-timer-regsh.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/hexagon-clean-up-timer-regsh.patch
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/hexagon-clean-up-timer-regsh.patch
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Nathan Chancellor <nathan(a)kernel.org>
Subject: hexagon: clean up timer-regs.h
When building allmodconfig, there is a warning about TIMER_ENABLE being
redefined:
drivers/clocksource/timer-oxnas-rps.c:39:9: error: 'TIMER_ENABLE' macro redefined [-Werror,-Wmacro-redefined]
#define TIMER_ENABLE BIT(7)
^
./arch/hexagon/include/asm/timer-regs.h:13:9: note: previous definition is here
#define TIMER_ENABLE 0
^
1 error generated.
The values in this header are only used in one file each, if they are
used at all. Remove the header and sink all of the constants into their
respective files.
TCX0_CLK_RATE is only used in arch/hexagon/include/asm/timex.h
TIMER_ENABLE, RTOS_TIMER_INT, RTOS_TIMER_REGS_ADDR are only used in
arch/hexagon/kernel/time.c.
SLEEP_CLK_RATE and TIMER_CLR_ON_MATCH have both been unused since the
file's introduction in commit 71e4a47f32f4 ("Hexagon: Add time and timer
functions").
TIMER_ENABLE is redefined as BIT(0) so the shift is moved into the
definition, rather than its use.
Link: https://lkml.kernel.org/r/20211115174250.1994179-3-nathan@kernel.org
Signed-off-by: Nathan Chancellor <nathan(a)kernel.org>
Acked-by: Brian Cain <bcain(a)codeaurora.org>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/hexagon/include/asm/timer-regs.h | 26 ------------------------
arch/hexagon/include/asm/timex.h | 3 --
arch/hexagon/kernel/time.c | 12 +++++++++--
3 files changed, 11 insertions(+), 30 deletions(-)
--- a/arch/hexagon/include/asm/timer-regs.h
+++ /dev/null
@@ -1,26 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Timer support for Hexagon
- *
- * Copyright (c) 2010-2011, The Linux Foundation. All rights reserved.
- */
-
-#ifndef _ASM_TIMER_REGS_H
-#define _ASM_TIMER_REGS_H
-
-/* This stuff should go into a platform specific file */
-#define TCX0_CLK_RATE 19200
-#define TIMER_ENABLE 0
-#define TIMER_CLR_ON_MATCH 1
-
-/*
- * 8x50 HDD Specs 5-8. Simulator co-sim not fixed until
- * release 1.1, and then it's "adjustable" and probably not defaulted.
- */
-#define RTOS_TIMER_INT 3
-#ifdef CONFIG_HEXAGON_COMET
-#define RTOS_TIMER_REGS_ADDR 0xAB000000UL
-#endif
-#define SLEEP_CLK_RATE 32000
-
-#endif
--- a/arch/hexagon/include/asm/timex.h~hexagon-clean-up-timer-regsh
+++ a/arch/hexagon/include/asm/timex.h
@@ -7,11 +7,10 @@
#define _ASM_TIMEX_H
#include <asm-generic/timex.h>
-#include <asm/timer-regs.h>
#include <asm/hexagon_vm.h>
/* Using TCX0 as our clock. CLOCK_TICK_RATE scheduled to be removed. */
-#define CLOCK_TICK_RATE TCX0_CLK_RATE
+#define CLOCK_TICK_RATE 19200
#define ARCH_HAS_READ_CURRENT_TIMER
--- a/arch/hexagon/kernel/time.c~hexagon-clean-up-timer-regsh
+++ a/arch/hexagon/kernel/time.c
@@ -17,9 +17,10 @@
#include <linux/of_irq.h>
#include <linux/module.h>
-#include <asm/timer-regs.h>
#include <asm/hexagon_vm.h>
+#define TIMER_ENABLE BIT(0)
+
/*
* For the clocksource we need:
* pcycle frequency (600MHz)
@@ -33,6 +34,13 @@ cycles_t pcycle_freq_mhz;
cycles_t thread_freq_mhz;
cycles_t sleep_clk_freq;
+/*
+ * 8x50 HDD Specs 5-8. Simulator co-sim not fixed until
+ * release 1.1, and then it's "adjustable" and probably not defaulted.
+ */
+#define RTOS_TIMER_INT 3
+#define RTOS_TIMER_REGS_ADDR 0xAB000000UL
+
static struct resource rtos_timer_resources[] = {
{
.start = RTOS_TIMER_REGS_ADDR,
@@ -80,7 +88,7 @@ static int set_next_event(unsigned long
iowrite32(0, &rtos_timer->clear);
iowrite32(delta, &rtos_timer->match);
- iowrite32(1 << TIMER_ENABLE, &rtos_timer->enable);
+ iowrite32(TIMER_ENABLE, &rtos_timer->enable);
return 0;
}
_
Patches currently in -mm which might be from nathan(a)kernel.org are
hexagon-export-raw-i-o-routines-for-modules.patch
hexagon-clean-up-timer-regsh.patch
hexagon-ignore-vmlinuxlds.patch
The patch titled
Subject: hexagon: export raw I/O routines for modules
has been added to the -mm tree. Its filename is
hexagon-export-raw-i-o-routines-for-modules.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/hexagon-export-raw-i-o-routines-f…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/hexagon-export-raw-i-o-routines-f…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Nathan Chancellor <nathan(a)kernel.org>
Subject: hexagon: export raw I/O routines for modules
Patch series "Fixes for ARCH=hexagon allmodconfig", v2.
This series fixes some issues noticed with ARCH=hexagon allmodconfig.
This patch (of 3):
When building ARCH=hexagon allmodconfig, the following errors occur:
ERROR: modpost: "__raw_readsl" [drivers/i3c/master/svc-i3c-master.ko] undefined!
ERROR: modpost: "__raw_writesl" [drivers/i3c/master/dw-i3c-master.ko] undefined!
ERROR: modpost: "__raw_readsl" [drivers/i3c/master/dw-i3c-master.ko] undefined!
ERROR: modpost: "__raw_writesl" [drivers/i3c/master/i3c-master-cdns.ko] undefined!
ERROR: modpost: "__raw_readsl" [drivers/i3c/master/i3c-master-cdns.ko] undefined!
Export these symbols so that modules can use them without any errors.
Link: https://lkml.kernel.org/r/20211115174250.1994179-1-nathan@kernel.org
Link: https://lkml.kernel.org/r/20211115174250.1994179-2-nathan@kernel.org
Fixes: 013bf24c3829 ("Hexagon: Provide basic implementation and/or stubs for I/O routines.")
Signed-off-by: Nathan Chancellor <nathan(a)kernel.org>
Acked-by: Brian Cain <bcain(a)codeaurora.org>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/hexagon/lib/io.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/arch/hexagon/lib/io.c~hexagon-export-raw-i-o-routines-for-modules
+++ a/arch/hexagon/lib/io.c
@@ -27,6 +27,7 @@ void __raw_readsw(const void __iomem *ad
*dst++ = *src;
}
+EXPORT_SYMBOL(__raw_readsw);
/*
* __raw_writesw - read words a short at a time
@@ -47,6 +48,7 @@ void __raw_writesw(void __iomem *addr, c
}
+EXPORT_SYMBOL(__raw_writesw);
/* Pretty sure len is pre-adjusted for the length of the access already */
void __raw_readsl(const void __iomem *addr, void *data, int len)
@@ -62,6 +64,7 @@ void __raw_readsl(const void __iomem *ad
}
+EXPORT_SYMBOL(__raw_readsl);
void __raw_writesl(void __iomem *addr, const void *data, int len)
{
@@ -76,3 +79,4 @@ void __raw_writesl(void __iomem *addr, c
}
+EXPORT_SYMBOL(__raw_writesl);
_
Patches currently in -mm which might be from nathan(a)kernel.org are
hexagon-export-raw-i-o-routines-for-modules.patch
hexagon-clean-up-timer-regsh.patch
hexagon-ignore-vmlinuxlds.patch
When building allmodconfig, there is a warning about TIMER_ENABLE being
redefined:
drivers/clocksource/timer-oxnas-rps.c:39:9: error: 'TIMER_ENABLE' macro redefined [-Werror,-Wmacro-redefined]
#define TIMER_ENABLE BIT(7)
^
./arch/hexagon/include/asm/timer-regs.h:13:9: note: previous definition is here
#define TIMER_ENABLE 0
^
1 error generated.
The values in this header are only used in one file each, if they are
used at all. Remove the header and sink all of the constants into their
respective files.
TCX0_CLK_RATE is only used in arch/hexagon/include/asm/timex.h
TIMER_ENABLE, RTOS_TIMER_INT, RTOS_TIMER_REGS_ADDR are only used in
arch/hexagon/kernel/time.c.
SLEEP_CLK_RATE and TIMER_CLR_ON_MATCH have both been unused since the
file's introduction in commit 71e4a47f32f4 ("Hexagon: Add time and timer
functions").
TIMER_ENABLE is redefined as BIT(0) so the shift is moved into the
definition, rather than its use.
Cc: stable(a)vger.kernel.org
Signed-off-by: Nathan Chancellor <nathan(a)kernel.org>
Acked-by: Brian Cain <bcain(a)codeaurora.org>
---
arch/hexagon/include/asm/timer-regs.h | 26 --------------------------
arch/hexagon/include/asm/timex.h | 3 +--
arch/hexagon/kernel/time.c | 12 ++++++++++--
3 files changed, 11 insertions(+), 30 deletions(-)
delete mode 100644 arch/hexagon/include/asm/timer-regs.h
diff --git a/arch/hexagon/include/asm/timer-regs.h b/arch/hexagon/include/asm/timer-regs.h
deleted file mode 100644
index ee6c61423a05..000000000000
--- a/arch/hexagon/include/asm/timer-regs.h
+++ /dev/null
@@ -1,26 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Timer support for Hexagon
- *
- * Copyright (c) 2010-2011, The Linux Foundation. All rights reserved.
- */
-
-#ifndef _ASM_TIMER_REGS_H
-#define _ASM_TIMER_REGS_H
-
-/* This stuff should go into a platform specific file */
-#define TCX0_CLK_RATE 19200
-#define TIMER_ENABLE 0
-#define TIMER_CLR_ON_MATCH 1
-
-/*
- * 8x50 HDD Specs 5-8. Simulator co-sim not fixed until
- * release 1.1, and then it's "adjustable" and probably not defaulted.
- */
-#define RTOS_TIMER_INT 3
-#ifdef CONFIG_HEXAGON_COMET
-#define RTOS_TIMER_REGS_ADDR 0xAB000000UL
-#endif
-#define SLEEP_CLK_RATE 32000
-
-#endif
diff --git a/arch/hexagon/include/asm/timex.h b/arch/hexagon/include/asm/timex.h
index 8d4ec76fceb4..dfe69e118b2b 100644
--- a/arch/hexagon/include/asm/timex.h
+++ b/arch/hexagon/include/asm/timex.h
@@ -7,11 +7,10 @@
#define _ASM_TIMEX_H
#include <asm-generic/timex.h>
-#include <asm/timer-regs.h>
#include <asm/hexagon_vm.h>
/* Using TCX0 as our clock. CLOCK_TICK_RATE scheduled to be removed. */
-#define CLOCK_TICK_RATE TCX0_CLK_RATE
+#define CLOCK_TICK_RATE 19200
#define ARCH_HAS_READ_CURRENT_TIMER
diff --git a/arch/hexagon/kernel/time.c b/arch/hexagon/kernel/time.c
index feffe527ac92..febc95714d75 100644
--- a/arch/hexagon/kernel/time.c
+++ b/arch/hexagon/kernel/time.c
@@ -17,9 +17,10 @@
#include <linux/of_irq.h>
#include <linux/module.h>
-#include <asm/timer-regs.h>
#include <asm/hexagon_vm.h>
+#define TIMER_ENABLE BIT(0)
+
/*
* For the clocksource we need:
* pcycle frequency (600MHz)
@@ -33,6 +34,13 @@ cycles_t pcycle_freq_mhz;
cycles_t thread_freq_mhz;
cycles_t sleep_clk_freq;
+/*
+ * 8x50 HDD Specs 5-8. Simulator co-sim not fixed until
+ * release 1.1, and then it's "adjustable" and probably not defaulted.
+ */
+#define RTOS_TIMER_INT 3
+#define RTOS_TIMER_REGS_ADDR 0xAB000000UL
+
static struct resource rtos_timer_resources[] = {
{
.start = RTOS_TIMER_REGS_ADDR,
@@ -80,7 +88,7 @@ static int set_next_event(unsigned long delta, struct clock_event_device *evt)
iowrite32(0, &rtos_timer->clear);
iowrite32(delta, &rtos_timer->match);
- iowrite32(1 << TIMER_ENABLE, &rtos_timer->enable);
+ iowrite32(TIMER_ENABLE, &rtos_timer->enable);
return 0;
}
--
2.34.0.rc0
commit 415de44076640483648d6c0f6d645a9ee61328ad upstream.
Currently, Linux probes for X86_BUG_NULL_SEL unconditionally which
makes it unsafe to migrate in a virtualised environment as the
properties across the migration pool might differ.
To be specific, the case which goes wrong is:
1. Zen1 (or earlier) and Zen2 (or later) in a migration pool
2. Linux boots on Zen2, probes and finds the absence of X86_BUG_NULL_SEL
3. Linux is then migrated to Zen1
Linux is now running on a X86_BUG_NULL_SEL-impacted CPU while believing
that the bug is fixed.
The only way to address the problem is to fully trust the "no longer
affected" CPUID bit when virtualised, because in the above case it would
be clear deliberately to indicate the fact "you might migrate to
somewhere which has this behaviour".
Zen3 adds the NullSelectorClearsBase CPUID bit to indicate that loading
a NULL segment selector zeroes the base and limit fields, as well as
just attributes. Zen2 also has this behaviour but doesn't have the NSCB
bit.
[ bp: Minor touchups. ]
Signed-off-by: Jane Malalane <jane.malalane(a)citrix.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
CC: <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/20211021104744.24126-1-jane.malalane@citrix.com
---
Backport to 4.19. Drop Hygon modifications.
---
arch/x86/kernel/cpu/amd.c | 2 ++
arch/x86/kernel/cpu/common.c | 44 +++++++++++++++++++++++++++++++++++++-------
arch/x86/kernel/cpu/cpu.h | 1 +
3 files changed, 40 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index de69090ca142..98c23126f751 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -993,6 +993,8 @@ static void init_amd(struct cpuinfo_x86 *c)
if (cpu_has(c, X86_FEATURE_IRPERF) &&
!cpu_has_amd_erratum(c, amd_erratum_1054))
msr_set_bit(MSR_K7_HWCR, MSR_K7_HWCR_IRPERF_EN_BIT);
+
+ check_null_seg_clears_base(c);
}
#ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 2058e8c0e61d..7e7400af7797 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1254,9 +1254,8 @@ void __init early_cpu_init(void)
early_identify_cpu(&boot_cpu_data);
}
-static void detect_null_seg_behavior(struct cpuinfo_x86 *c)
+static bool detect_null_seg_behavior(void)
{
-#ifdef CONFIG_X86_64
/*
* Empirically, writing zero to a segment selector on AMD does
* not clear the base, whereas writing zero to a segment
@@ -1277,10 +1276,43 @@ static void detect_null_seg_behavior(struct cpuinfo_x86 *c)
wrmsrl(MSR_FS_BASE, 1);
loadsegment(fs, 0);
rdmsrl(MSR_FS_BASE, tmp);
- if (tmp != 0)
- set_cpu_bug(c, X86_BUG_NULL_SEG);
wrmsrl(MSR_FS_BASE, old_base);
-#endif
+ return tmp == 0;
+}
+
+void check_null_seg_clears_base(struct cpuinfo_x86 *c)
+{
+ /* BUG_NULL_SEG is only relevant with 64bit userspace */
+ if (!IS_ENABLED(CONFIG_X86_64))
+ return;
+
+ /* Zen3 CPUs advertise Null Selector Clears Base in CPUID. */
+ if (c->extended_cpuid_level >= 0x80000021 &&
+ cpuid_eax(0x80000021) & BIT(6))
+ return;
+
+ /*
+ * CPUID bit above wasn't set. If this kernel is still running
+ * as a HV guest, then the HV has decided not to advertize
+ * that CPUID bit for whatever reason. For example, one
+ * member of the migration pool might be vulnerable. Which
+ * means, the bug is present: set the BUG flag and return.
+ */
+ if (cpu_has(c, X86_FEATURE_HYPERVISOR)) {
+ set_cpu_bug(c, X86_BUG_NULL_SEG);
+ return;
+ }
+
+ /*
+ * Zen2 CPUs also have this behaviour, but no CPUID bit.
+ * 0x18 is the respective family for Hygon.
+ */
+ if ((c->x86 == 0x17 || c->x86 == 0x18) &&
+ detect_null_seg_behavior())
+ return;
+
+ /* All the remaining ones are affected */
+ set_cpu_bug(c, X86_BUG_NULL_SEG);
}
static void generic_identify(struct cpuinfo_x86 *c)
@@ -1316,8 +1348,6 @@ static void generic_identify(struct cpuinfo_x86 *c)
get_model_name(c); /* Default name */
- detect_null_seg_behavior(c);
-
/*
* ESPFIX is a strange bug. All real CPUs have it. Paravirt
* systems that run Linux at CPL > 0 may or may not have the
diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
index e89602d2aff5..4eb9bf68b122 100644
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -76,6 +76,7 @@ extern int detect_extended_topology_early(struct cpuinfo_x86 *c);
extern int detect_extended_topology(struct cpuinfo_x86 *c);
extern int detect_ht_early(struct cpuinfo_x86 *c);
extern void detect_ht(struct cpuinfo_x86 *c);
+extern void check_null_seg_clears_base(struct cpuinfo_x86 *c);
unsigned int aperfmperf_get_khz(int cpu);
--
2.11.0
commit 91adec9e0709 ("drm/amd/display: Look at firmware version to
determine using dmub on dcn21")
Newer DMUB firmware on Renoir and Green Sardine do not need to disable
dmcu and this actually causes problems with DP-C alt mode for a number of
machines.
Backport the fix from this from hand modified backport because mainline
switched to IP version checking which doesn't exist in linux-stable.
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1772
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1735
Signed-off-by: Mario Limonciello <mario.limonciello(a)amd.com>
Reviewed-by: Alex Deucher <alexander.deucher(a)amd.com>
---
Changes from v1->v2:
* Update commit message syntax for hand modified commit
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 1ea31dcc7a8b..084491afe540 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1141,8 +1141,15 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
case CHIP_RAVEN:
case CHIP_RENOIR:
init_data.flags.gpu_vm_support = true;
- if (ASICREV_IS_GREEN_SARDINE(adev->external_rev_id))
+ switch (adev->dm.dmcub_fw_version) {
+ case 0: /* development */
+ case 0x1: /* linux-firmware.git hash 6d9f399 */
+ case 0x01000000: /* linux-firmware.git hash 9a0b0f4 */
+ init_data.flags.disable_dmcu = false;
+ break;
+ default:
init_data.flags.disable_dmcu = true;
+ }
break;
case CHIP_VANGOGH:
case CHIP_YELLOW_CARP:
--
2.25.1
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 76224355db7570cbe6b6f75c8929a1558828dd55 Mon Sep 17 00:00:00 2001
From: Miklos Szeredi <mszeredi(a)redhat.com>
Date: Tue, 17 Aug 2021 21:05:16 +0200
Subject: [PATCH] fuse: truncate pagecache on atomic_o_trunc
fuse_finish_open() will be called with FUSE_NOWRITE in case of atomic
O_TRUNC. This can deadlock with fuse_wait_on_page_writeback() in
fuse_launder_page() triggered by invalidate_inode_pages2().
Fix by replacing invalidate_inode_pages2() in fuse_finish_open() with a
truncate_pagecache() call. This makes sense regardless of FOPEN_KEEP_CACHE
or fc->writeback cache, so do it unconditionally.
Reported-by: Xie Yongji <xieyongji(a)bytedance.com>
Reported-and-tested-by: syzbot+bea44a5189836d956894(a)syzkaller.appspotmail.com
Fixes: e4648309b85a ("fuse: truncate pending writes on O_TRUNC")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi(a)redhat.com>
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 97f860cfc195..5e5efb66c7d7 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -198,12 +198,11 @@ void fuse_finish_open(struct inode *inode, struct file *file)
struct fuse_file *ff = file->private_data;
struct fuse_conn *fc = get_fuse_conn(inode);
- if (!(ff->open_flags & FOPEN_KEEP_CACHE))
- invalidate_inode_pages2(inode->i_mapping);
if (ff->open_flags & FOPEN_STREAM)
stream_open(inode, file);
else if (ff->open_flags & FOPEN_NONSEEKABLE)
nonseekable_open(inode, file);
+
if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
struct fuse_inode *fi = get_fuse_inode(inode);
@@ -211,10 +210,14 @@ void fuse_finish_open(struct inode *inode, struct file *file)
fi->attr_version = atomic64_inc_return(&fc->attr_version);
i_size_write(inode, 0);
spin_unlock(&fi->lock);
+ truncate_pagecache(inode, 0);
fuse_invalidate_attr(inode);
if (fc->writeback_cache)
file_update_time(file);
+ } else if (!(ff->open_flags & FOPEN_KEEP_CACHE)) {
+ invalidate_inode_pages2(inode->i_mapping);
}
+
if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
fuse_link_write_file(file);
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From a4e17d65dafdd3513042d8f00404c9b6068a825c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pali=20Roh=C3=A1r?= <pali(a)kernel.org>
Date: Tue, 5 Oct 2021 20:09:41 +0200
Subject: [PATCH] PCI: aardvark: Fix PCIe Max Payload Size setting
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Change PCIe Max Payload Size setting in PCIe Device Control register to 512
bytes to align with PCIe Link Initialization sequence as defined in Marvell
Armada 3700 Functional Specification. According to the specification,
maximal Max Payload Size supported by this device is 512 bytes.
Without this kernel prints suspicious line:
pci 0000:01:00.0: Upstream bridge's Max Payload Size set to 256 (was 16384, max 512)
With this change it changes to:
pci 0000:01:00.0: Upstream bridge's Max Payload Size set to 256 (was 512, max 512)
Link: https://lore.kernel.org/r/20211005180952.6812-3-kabel@kernel.org
Fixes: 8c39d710363c ("PCI: aardvark: Add Aardvark PCI host controller driver")
Signed-off-by: Pali Rohár <pali(a)kernel.org>
Signed-off-by: Marek Behún <kabel(a)kernel.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi(a)arm.com>
Reviewed-by: Marek Behún <kabel(a)kernel.org>
Cc: stable(a)vger.kernel.org
diff --git a/drivers/pci/controller/pci-aardvark.c b/drivers/pci/controller/pci-aardvark.c
index 596ebcfcc82d..884510630bae 100644
--- a/drivers/pci/controller/pci-aardvark.c
+++ b/drivers/pci/controller/pci-aardvark.c
@@ -488,8 +488,9 @@ static void advk_pcie_setup_hw(struct advk_pcie *pcie)
reg = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + PCI_EXP_DEVCTL);
reg &= ~PCI_EXP_DEVCTL_RELAX_EN;
reg &= ~PCI_EXP_DEVCTL_NOSNOOP_EN;
+ reg &= ~PCI_EXP_DEVCTL_PAYLOAD;
reg &= ~PCI_EXP_DEVCTL_READRQ;
- reg |= PCI_EXP_DEVCTL_PAYLOAD; /* Set max payload size */
+ reg |= PCI_EXP_DEVCTL_PAYLOAD_512B;
reg |= PCI_EXP_DEVCTL_READRQ_512B;
advk_writel(pcie, reg, PCIE_CORE_PCIEXP_CAP + PCI_EXP_DEVCTL);
[Public]
On designs that that GPIO controller has a dedicated IRQ, wakeups aren't working properly from s0i3.
S0i3 is first supported on 5.14 on AMD, so it's a good idea to fix this there.
This is fixed by the following commits:
commit 7e6f8d6f4a42 ("pinctrl: amd: Add irq field data") # 5.14
commit acd47b9f28e5 ("pinctrl: amd: Handle wake-up interrupt") #5.14
Other designs that the GPIO controller shares the IRQ with the ACPI SCI have other fixes that will be sent to stable once they're in 5.16.
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 319fa1a52e438a6e028329187783a25ad498c4e6 Mon Sep 17 00:00:00 2001
From: Nathan Lynch <nathanl(a)linux.ibm.com>
Date: Wed, 20 Oct 2021 14:47:03 -0500
Subject: [PATCH] powerpc/pseries/mobility: ignore ibm, platform-facilities
updates
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
On VMs with NX encryption, compression, and/or RNG offload, these
capabilities are described by nodes in the ibm,platform-facilities device
tree hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
├── ibm,compression-v1
├── ibm,random-v1
└── ibm,sym-encryption-v1
3 directories
The acceleration functions that these nodes describe are not disrupted by
live migration, not even temporarily.
But the post-migration ibm,update-nodes sequence firmware always sends
"delete" messages for this hierarchy, followed by an "add" directive to
reconstruct it via ibm,configure-connector (log with debugging statements
enabled in mobility.c):
mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285
mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284
mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283
mobility: removing node /ibm,platform-facilities:4294967286
...
mobility: added node /ibm,platform-facilities:4294967286
Note we receive a single "add" message for the entire hierarchy, and what
we receive from the ibm,configure-connector sequence is the top-level
platform-facilities node along with its three children. The debug message
simply reports the parent node and not the whole subtree.
Also, significantly, the nodes added are almost completely equivalent to
the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in
the leaf nodes is the only property I've observed to differ, and Linux does
not use that. So in practice, the sum of update messages Linux receives for
this hierarchy is equivalent to minor property updates.
We succeed in removing the original hierarchy from the device tree. But the
vio bus code is ignorant of this, and does not unbind or relinquish its
references. The leaf nodes, still reachable through sysfs, of course still
refer to the now-freed ibm,platform-facilities parent node, which makes
use-after-free possible:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 3 PID: 1706 at lib/refcount.c:25 refcount_warn_saturate+0x164/0x1f0
refcount_warn_saturate+0x160/0x1f0 (unreliable)
kobject_get+0xf0/0x100
of_node_get+0x30/0x50
of_get_parent+0x50/0xb0
of_fwnode_get_parent+0x54/0x90
fwnode_count_parents+0x50/0x150
fwnode_full_name_string+0x30/0x110
device_node_string+0x49c/0x790
vsnprintf+0x1c0/0x4c0
sprintf+0x44/0x60
devspec_show+0x34/0x50
dev_attr_show+0x40/0xa0
sysfs_kf_seq_show+0xbc/0x200
kernfs_seq_show+0x44/0x60
seq_read_iter+0x2a4/0x740
kernfs_fop_read_iter+0x254/0x2e0
new_sync_read+0x120/0x190
vfs_read+0x1d0/0x240
Moreover, the "new" replacement subtree is not correctly added to the
device tree, resulting in ibm,platform-facilities parent node without the
appropriate leaf nodes, and broken symlinks in the sysfs device hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
0 directories
$ cd /sys/devices/vio ; find . -xtype l -exec file {} +
./ibm,sym-encryption-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1
./ibm,random-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1
./ibm,compression-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1
This is because add_dt_node() -> dlpar_attach_node() attaches only the
parent node returned from configure-connector, ignoring any children. This
should be corrected for the general case, but fixing that won't help with
the stale OF node references, which is the more urgent problem.
One way to address that would be to make the drivers respond to node
removal notifications, so that node references can be dropped
appropriately. But this would likely force the drivers to disrupt active
clients for no useful purpose: equivalent nodes are immediately re-added.
And recall that the acceleration capabilities described by the nodes remain
available throughout the whole process.
The solution I believe to be robust for this situation is to convert
remove+add of a node with an unchanged phandle to an update of the node's
properties in the Linux device tree structure. That would involve changing
and adding a fair amount of code, and may take several iterations to land.
Until that can be realized we have a confirmed use-after-free and the
possibility of memory corruption. So add a limited workaround that
discriminates on the node type, ignoring adds and removes. This should be
amenable to backporting in the meantime.
Fixes: 410bccf97881 ("powerpc/pseries: Partition migration in the kernel")
Cc: stable(a)vger.kernel.org
Signed-off-by: Nathan Lynch <nathanl(a)linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20211020194703.2613093-1-nathanl@linux.ibm.com
diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index e83e0891272d..210a37a065fb 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -63,6 +63,27 @@ static int mobility_rtas_call(int token, char *buf, s32 scope)
static int delete_dt_node(struct device_node *dn)
{
+ struct device_node *pdn;
+ bool is_platfac;
+
+ pdn = of_get_parent(dn);
+ is_platfac = of_node_is_type(dn, "ibm,platform-facilities") ||
+ of_node_is_type(pdn, "ibm,platform-facilities");
+ of_node_put(pdn);
+
+ /*
+ * The drivers that bind to nodes in the platform-facilities
+ * hierarchy don't support node removal, and the removal directive
+ * from firmware is always followed by an add of an equivalent
+ * node. The capability (e.g. RNG, encryption, compression)
+ * represented by the node is never interrupted by the migration.
+ * So ignore changes to this part of the tree.
+ */
+ if (is_platfac) {
+ pr_notice("ignoring remove operation for %pOFfp\n", dn);
+ return 0;
+ }
+
pr_debug("removing node %pOFfp\n", dn);
dlpar_detach_node(dn);
return 0;
@@ -222,6 +243,19 @@ static int add_dt_node(struct device_node *parent_dn, __be32 drc_index)
if (!dn)
return -ENOENT;
+ /*
+ * Since delete_dt_node() ignores this node type, this is the
+ * necessary counterpart. We also know that a platform-facilities
+ * node returned from dlpar_configure_connector() has children
+ * attached, and dlpar_attach_node() only adds the parent, leaking
+ * the children. So ignore these on the add side for now.
+ */
+ if (of_node_is_type(dn, "ibm,platform-facilities")) {
+ pr_notice("ignoring add operation for %pOF\n", dn);
+ dlpar_free_cc_nodes(dn);
+ return 0;
+ }
+
rc = dlpar_attach_node(dn, parent_dn);
if (rc)
dlpar_free_cc_nodes(dn);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 319fa1a52e438a6e028329187783a25ad498c4e6 Mon Sep 17 00:00:00 2001
From: Nathan Lynch <nathanl(a)linux.ibm.com>
Date: Wed, 20 Oct 2021 14:47:03 -0500
Subject: [PATCH] powerpc/pseries/mobility: ignore ibm, platform-facilities
updates
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
On VMs with NX encryption, compression, and/or RNG offload, these
capabilities are described by nodes in the ibm,platform-facilities device
tree hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
├── ibm,compression-v1
├── ibm,random-v1
└── ibm,sym-encryption-v1
3 directories
The acceleration functions that these nodes describe are not disrupted by
live migration, not even temporarily.
But the post-migration ibm,update-nodes sequence firmware always sends
"delete" messages for this hierarchy, followed by an "add" directive to
reconstruct it via ibm,configure-connector (log with debugging statements
enabled in mobility.c):
mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285
mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284
mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283
mobility: removing node /ibm,platform-facilities:4294967286
...
mobility: added node /ibm,platform-facilities:4294967286
Note we receive a single "add" message for the entire hierarchy, and what
we receive from the ibm,configure-connector sequence is the top-level
platform-facilities node along with its three children. The debug message
simply reports the parent node and not the whole subtree.
Also, significantly, the nodes added are almost completely equivalent to
the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in
the leaf nodes is the only property I've observed to differ, and Linux does
not use that. So in practice, the sum of update messages Linux receives for
this hierarchy is equivalent to minor property updates.
We succeed in removing the original hierarchy from the device tree. But the
vio bus code is ignorant of this, and does not unbind or relinquish its
references. The leaf nodes, still reachable through sysfs, of course still
refer to the now-freed ibm,platform-facilities parent node, which makes
use-after-free possible:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 3 PID: 1706 at lib/refcount.c:25 refcount_warn_saturate+0x164/0x1f0
refcount_warn_saturate+0x160/0x1f0 (unreliable)
kobject_get+0xf0/0x100
of_node_get+0x30/0x50
of_get_parent+0x50/0xb0
of_fwnode_get_parent+0x54/0x90
fwnode_count_parents+0x50/0x150
fwnode_full_name_string+0x30/0x110
device_node_string+0x49c/0x790
vsnprintf+0x1c0/0x4c0
sprintf+0x44/0x60
devspec_show+0x34/0x50
dev_attr_show+0x40/0xa0
sysfs_kf_seq_show+0xbc/0x200
kernfs_seq_show+0x44/0x60
seq_read_iter+0x2a4/0x740
kernfs_fop_read_iter+0x254/0x2e0
new_sync_read+0x120/0x190
vfs_read+0x1d0/0x240
Moreover, the "new" replacement subtree is not correctly added to the
device tree, resulting in ibm,platform-facilities parent node without the
appropriate leaf nodes, and broken symlinks in the sysfs device hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
0 directories
$ cd /sys/devices/vio ; find . -xtype l -exec file {} +
./ibm,sym-encryption-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1
./ibm,random-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1
./ibm,compression-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1
This is because add_dt_node() -> dlpar_attach_node() attaches only the
parent node returned from configure-connector, ignoring any children. This
should be corrected for the general case, but fixing that won't help with
the stale OF node references, which is the more urgent problem.
One way to address that would be to make the drivers respond to node
removal notifications, so that node references can be dropped
appropriately. But this would likely force the drivers to disrupt active
clients for no useful purpose: equivalent nodes are immediately re-added.
And recall that the acceleration capabilities described by the nodes remain
available throughout the whole process.
The solution I believe to be robust for this situation is to convert
remove+add of a node with an unchanged phandle to an update of the node's
properties in the Linux device tree structure. That would involve changing
and adding a fair amount of code, and may take several iterations to land.
Until that can be realized we have a confirmed use-after-free and the
possibility of memory corruption. So add a limited workaround that
discriminates on the node type, ignoring adds and removes. This should be
amenable to backporting in the meantime.
Fixes: 410bccf97881 ("powerpc/pseries: Partition migration in the kernel")
Cc: stable(a)vger.kernel.org
Signed-off-by: Nathan Lynch <nathanl(a)linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20211020194703.2613093-1-nathanl@linux.ibm.com
diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index e83e0891272d..210a37a065fb 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -63,6 +63,27 @@ static int mobility_rtas_call(int token, char *buf, s32 scope)
static int delete_dt_node(struct device_node *dn)
{
+ struct device_node *pdn;
+ bool is_platfac;
+
+ pdn = of_get_parent(dn);
+ is_platfac = of_node_is_type(dn, "ibm,platform-facilities") ||
+ of_node_is_type(pdn, "ibm,platform-facilities");
+ of_node_put(pdn);
+
+ /*
+ * The drivers that bind to nodes in the platform-facilities
+ * hierarchy don't support node removal, and the removal directive
+ * from firmware is always followed by an add of an equivalent
+ * node. The capability (e.g. RNG, encryption, compression)
+ * represented by the node is never interrupted by the migration.
+ * So ignore changes to this part of the tree.
+ */
+ if (is_platfac) {
+ pr_notice("ignoring remove operation for %pOFfp\n", dn);
+ return 0;
+ }
+
pr_debug("removing node %pOFfp\n", dn);
dlpar_detach_node(dn);
return 0;
@@ -222,6 +243,19 @@ static int add_dt_node(struct device_node *parent_dn, __be32 drc_index)
if (!dn)
return -ENOENT;
+ /*
+ * Since delete_dt_node() ignores this node type, this is the
+ * necessary counterpart. We also know that a platform-facilities
+ * node returned from dlpar_configure_connector() has children
+ * attached, and dlpar_attach_node() only adds the parent, leaking
+ * the children. So ignore these on the add side for now.
+ */
+ if (of_node_is_type(dn, "ibm,platform-facilities")) {
+ pr_notice("ignoring add operation for %pOF\n", dn);
+ dlpar_free_cc_nodes(dn);
+ return 0;
+ }
+
rc = dlpar_attach_node(dn, parent_dn);
if (rc)
dlpar_free_cc_nodes(dn);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 319fa1a52e438a6e028329187783a25ad498c4e6 Mon Sep 17 00:00:00 2001
From: Nathan Lynch <nathanl(a)linux.ibm.com>
Date: Wed, 20 Oct 2021 14:47:03 -0500
Subject: [PATCH] powerpc/pseries/mobility: ignore ibm, platform-facilities
updates
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
On VMs with NX encryption, compression, and/or RNG offload, these
capabilities are described by nodes in the ibm,platform-facilities device
tree hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
├── ibm,compression-v1
├── ibm,random-v1
└── ibm,sym-encryption-v1
3 directories
The acceleration functions that these nodes describe are not disrupted by
live migration, not even temporarily.
But the post-migration ibm,update-nodes sequence firmware always sends
"delete" messages for this hierarchy, followed by an "add" directive to
reconstruct it via ibm,configure-connector (log with debugging statements
enabled in mobility.c):
mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285
mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284
mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283
mobility: removing node /ibm,platform-facilities:4294967286
...
mobility: added node /ibm,platform-facilities:4294967286
Note we receive a single "add" message for the entire hierarchy, and what
we receive from the ibm,configure-connector sequence is the top-level
platform-facilities node along with its three children. The debug message
simply reports the parent node and not the whole subtree.
Also, significantly, the nodes added are almost completely equivalent to
the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in
the leaf nodes is the only property I've observed to differ, and Linux does
not use that. So in practice, the sum of update messages Linux receives for
this hierarchy is equivalent to minor property updates.
We succeed in removing the original hierarchy from the device tree. But the
vio bus code is ignorant of this, and does not unbind or relinquish its
references. The leaf nodes, still reachable through sysfs, of course still
refer to the now-freed ibm,platform-facilities parent node, which makes
use-after-free possible:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 3 PID: 1706 at lib/refcount.c:25 refcount_warn_saturate+0x164/0x1f0
refcount_warn_saturate+0x160/0x1f0 (unreliable)
kobject_get+0xf0/0x100
of_node_get+0x30/0x50
of_get_parent+0x50/0xb0
of_fwnode_get_parent+0x54/0x90
fwnode_count_parents+0x50/0x150
fwnode_full_name_string+0x30/0x110
device_node_string+0x49c/0x790
vsnprintf+0x1c0/0x4c0
sprintf+0x44/0x60
devspec_show+0x34/0x50
dev_attr_show+0x40/0xa0
sysfs_kf_seq_show+0xbc/0x200
kernfs_seq_show+0x44/0x60
seq_read_iter+0x2a4/0x740
kernfs_fop_read_iter+0x254/0x2e0
new_sync_read+0x120/0x190
vfs_read+0x1d0/0x240
Moreover, the "new" replacement subtree is not correctly added to the
device tree, resulting in ibm,platform-facilities parent node without the
appropriate leaf nodes, and broken symlinks in the sysfs device hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
0 directories
$ cd /sys/devices/vio ; find . -xtype l -exec file {} +
./ibm,sym-encryption-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1
./ibm,random-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1
./ibm,compression-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1
This is because add_dt_node() -> dlpar_attach_node() attaches only the
parent node returned from configure-connector, ignoring any children. This
should be corrected for the general case, but fixing that won't help with
the stale OF node references, which is the more urgent problem.
One way to address that would be to make the drivers respond to node
removal notifications, so that node references can be dropped
appropriately. But this would likely force the drivers to disrupt active
clients for no useful purpose: equivalent nodes are immediately re-added.
And recall that the acceleration capabilities described by the nodes remain
available throughout the whole process.
The solution I believe to be robust for this situation is to convert
remove+add of a node with an unchanged phandle to an update of the node's
properties in the Linux device tree structure. That would involve changing
and adding a fair amount of code, and may take several iterations to land.
Until that can be realized we have a confirmed use-after-free and the
possibility of memory corruption. So add a limited workaround that
discriminates on the node type, ignoring adds and removes. This should be
amenable to backporting in the meantime.
Fixes: 410bccf97881 ("powerpc/pseries: Partition migration in the kernel")
Cc: stable(a)vger.kernel.org
Signed-off-by: Nathan Lynch <nathanl(a)linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20211020194703.2613093-1-nathanl@linux.ibm.com
diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index e83e0891272d..210a37a065fb 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -63,6 +63,27 @@ static int mobility_rtas_call(int token, char *buf, s32 scope)
static int delete_dt_node(struct device_node *dn)
{
+ struct device_node *pdn;
+ bool is_platfac;
+
+ pdn = of_get_parent(dn);
+ is_platfac = of_node_is_type(dn, "ibm,platform-facilities") ||
+ of_node_is_type(pdn, "ibm,platform-facilities");
+ of_node_put(pdn);
+
+ /*
+ * The drivers that bind to nodes in the platform-facilities
+ * hierarchy don't support node removal, and the removal directive
+ * from firmware is always followed by an add of an equivalent
+ * node. The capability (e.g. RNG, encryption, compression)
+ * represented by the node is never interrupted by the migration.
+ * So ignore changes to this part of the tree.
+ */
+ if (is_platfac) {
+ pr_notice("ignoring remove operation for %pOFfp\n", dn);
+ return 0;
+ }
+
pr_debug("removing node %pOFfp\n", dn);
dlpar_detach_node(dn);
return 0;
@@ -222,6 +243,19 @@ static int add_dt_node(struct device_node *parent_dn, __be32 drc_index)
if (!dn)
return -ENOENT;
+ /*
+ * Since delete_dt_node() ignores this node type, this is the
+ * necessary counterpart. We also know that a platform-facilities
+ * node returned from dlpar_configure_connector() has children
+ * attached, and dlpar_attach_node() only adds the parent, leaking
+ * the children. So ignore these on the add side for now.
+ */
+ if (of_node_is_type(dn, "ibm,platform-facilities")) {
+ pr_notice("ignoring add operation for %pOF\n", dn);
+ dlpar_free_cc_nodes(dn);
+ return 0;
+ }
+
rc = dlpar_attach_node(dn, parent_dn);
if (rc)
dlpar_free_cc_nodes(dn);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 319fa1a52e438a6e028329187783a25ad498c4e6 Mon Sep 17 00:00:00 2001
From: Nathan Lynch <nathanl(a)linux.ibm.com>
Date: Wed, 20 Oct 2021 14:47:03 -0500
Subject: [PATCH] powerpc/pseries/mobility: ignore ibm, platform-facilities
updates
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
On VMs with NX encryption, compression, and/or RNG offload, these
capabilities are described by nodes in the ibm,platform-facilities device
tree hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
├── ibm,compression-v1
├── ibm,random-v1
└── ibm,sym-encryption-v1
3 directories
The acceleration functions that these nodes describe are not disrupted by
live migration, not even temporarily.
But the post-migration ibm,update-nodes sequence firmware always sends
"delete" messages for this hierarchy, followed by an "add" directive to
reconstruct it via ibm,configure-connector (log with debugging statements
enabled in mobility.c):
mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285
mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284
mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283
mobility: removing node /ibm,platform-facilities:4294967286
...
mobility: added node /ibm,platform-facilities:4294967286
Note we receive a single "add" message for the entire hierarchy, and what
we receive from the ibm,configure-connector sequence is the top-level
platform-facilities node along with its three children. The debug message
simply reports the parent node and not the whole subtree.
Also, significantly, the nodes added are almost completely equivalent to
the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in
the leaf nodes is the only property I've observed to differ, and Linux does
not use that. So in practice, the sum of update messages Linux receives for
this hierarchy is equivalent to minor property updates.
We succeed in removing the original hierarchy from the device tree. But the
vio bus code is ignorant of this, and does not unbind or relinquish its
references. The leaf nodes, still reachable through sysfs, of course still
refer to the now-freed ibm,platform-facilities parent node, which makes
use-after-free possible:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 3 PID: 1706 at lib/refcount.c:25 refcount_warn_saturate+0x164/0x1f0
refcount_warn_saturate+0x160/0x1f0 (unreliable)
kobject_get+0xf0/0x100
of_node_get+0x30/0x50
of_get_parent+0x50/0xb0
of_fwnode_get_parent+0x54/0x90
fwnode_count_parents+0x50/0x150
fwnode_full_name_string+0x30/0x110
device_node_string+0x49c/0x790
vsnprintf+0x1c0/0x4c0
sprintf+0x44/0x60
devspec_show+0x34/0x50
dev_attr_show+0x40/0xa0
sysfs_kf_seq_show+0xbc/0x200
kernfs_seq_show+0x44/0x60
seq_read_iter+0x2a4/0x740
kernfs_fop_read_iter+0x254/0x2e0
new_sync_read+0x120/0x190
vfs_read+0x1d0/0x240
Moreover, the "new" replacement subtree is not correctly added to the
device tree, resulting in ibm,platform-facilities parent node without the
appropriate leaf nodes, and broken symlinks in the sysfs device hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
0 directories
$ cd /sys/devices/vio ; find . -xtype l -exec file {} +
./ibm,sym-encryption-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1
./ibm,random-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1
./ibm,compression-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1
This is because add_dt_node() -> dlpar_attach_node() attaches only the
parent node returned from configure-connector, ignoring any children. This
should be corrected for the general case, but fixing that won't help with
the stale OF node references, which is the more urgent problem.
One way to address that would be to make the drivers respond to node
removal notifications, so that node references can be dropped
appropriately. But this would likely force the drivers to disrupt active
clients for no useful purpose: equivalent nodes are immediately re-added.
And recall that the acceleration capabilities described by the nodes remain
available throughout the whole process.
The solution I believe to be robust for this situation is to convert
remove+add of a node with an unchanged phandle to an update of the node's
properties in the Linux device tree structure. That would involve changing
and adding a fair amount of code, and may take several iterations to land.
Until that can be realized we have a confirmed use-after-free and the
possibility of memory corruption. So add a limited workaround that
discriminates on the node type, ignoring adds and removes. This should be
amenable to backporting in the meantime.
Fixes: 410bccf97881 ("powerpc/pseries: Partition migration in the kernel")
Cc: stable(a)vger.kernel.org
Signed-off-by: Nathan Lynch <nathanl(a)linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20211020194703.2613093-1-nathanl@linux.ibm.com
diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index e83e0891272d..210a37a065fb 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -63,6 +63,27 @@ static int mobility_rtas_call(int token, char *buf, s32 scope)
static int delete_dt_node(struct device_node *dn)
{
+ struct device_node *pdn;
+ bool is_platfac;
+
+ pdn = of_get_parent(dn);
+ is_platfac = of_node_is_type(dn, "ibm,platform-facilities") ||
+ of_node_is_type(pdn, "ibm,platform-facilities");
+ of_node_put(pdn);
+
+ /*
+ * The drivers that bind to nodes in the platform-facilities
+ * hierarchy don't support node removal, and the removal directive
+ * from firmware is always followed by an add of an equivalent
+ * node. The capability (e.g. RNG, encryption, compression)
+ * represented by the node is never interrupted by the migration.
+ * So ignore changes to this part of the tree.
+ */
+ if (is_platfac) {
+ pr_notice("ignoring remove operation for %pOFfp\n", dn);
+ return 0;
+ }
+
pr_debug("removing node %pOFfp\n", dn);
dlpar_detach_node(dn);
return 0;
@@ -222,6 +243,19 @@ static int add_dt_node(struct device_node *parent_dn, __be32 drc_index)
if (!dn)
return -ENOENT;
+ /*
+ * Since delete_dt_node() ignores this node type, this is the
+ * necessary counterpart. We also know that a platform-facilities
+ * node returned from dlpar_configure_connector() has children
+ * attached, and dlpar_attach_node() only adds the parent, leaking
+ * the children. So ignore these on the add side for now.
+ */
+ if (of_node_is_type(dn, "ibm,platform-facilities")) {
+ pr_notice("ignoring add operation for %pOF\n", dn);
+ dlpar_free_cc_nodes(dn);
+ return 0;
+ }
+
rc = dlpar_attach_node(dn, parent_dn);
if (rc)
dlpar_free_cc_nodes(dn);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 319fa1a52e438a6e028329187783a25ad498c4e6 Mon Sep 17 00:00:00 2001
From: Nathan Lynch <nathanl(a)linux.ibm.com>
Date: Wed, 20 Oct 2021 14:47:03 -0500
Subject: [PATCH] powerpc/pseries/mobility: ignore ibm, platform-facilities
updates
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
On VMs with NX encryption, compression, and/or RNG offload, these
capabilities are described by nodes in the ibm,platform-facilities device
tree hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
├── ibm,compression-v1
├── ibm,random-v1
└── ibm,sym-encryption-v1
3 directories
The acceleration functions that these nodes describe are not disrupted by
live migration, not even temporarily.
But the post-migration ibm,update-nodes sequence firmware always sends
"delete" messages for this hierarchy, followed by an "add" directive to
reconstruct it via ibm,configure-connector (log with debugging statements
enabled in mobility.c):
mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285
mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284
mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283
mobility: removing node /ibm,platform-facilities:4294967286
...
mobility: added node /ibm,platform-facilities:4294967286
Note we receive a single "add" message for the entire hierarchy, and what
we receive from the ibm,configure-connector sequence is the top-level
platform-facilities node along with its three children. The debug message
simply reports the parent node and not the whole subtree.
Also, significantly, the nodes added are almost completely equivalent to
the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in
the leaf nodes is the only property I've observed to differ, and Linux does
not use that. So in practice, the sum of update messages Linux receives for
this hierarchy is equivalent to minor property updates.
We succeed in removing the original hierarchy from the device tree. But the
vio bus code is ignorant of this, and does not unbind or relinquish its
references. The leaf nodes, still reachable through sysfs, of course still
refer to the now-freed ibm,platform-facilities parent node, which makes
use-after-free possible:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 3 PID: 1706 at lib/refcount.c:25 refcount_warn_saturate+0x164/0x1f0
refcount_warn_saturate+0x160/0x1f0 (unreliable)
kobject_get+0xf0/0x100
of_node_get+0x30/0x50
of_get_parent+0x50/0xb0
of_fwnode_get_parent+0x54/0x90
fwnode_count_parents+0x50/0x150
fwnode_full_name_string+0x30/0x110
device_node_string+0x49c/0x790
vsnprintf+0x1c0/0x4c0
sprintf+0x44/0x60
devspec_show+0x34/0x50
dev_attr_show+0x40/0xa0
sysfs_kf_seq_show+0xbc/0x200
kernfs_seq_show+0x44/0x60
seq_read_iter+0x2a4/0x740
kernfs_fop_read_iter+0x254/0x2e0
new_sync_read+0x120/0x190
vfs_read+0x1d0/0x240
Moreover, the "new" replacement subtree is not correctly added to the
device tree, resulting in ibm,platform-facilities parent node without the
appropriate leaf nodes, and broken symlinks in the sysfs device hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
0 directories
$ cd /sys/devices/vio ; find . -xtype l -exec file {} +
./ibm,sym-encryption-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1
./ibm,random-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1
./ibm,compression-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1
This is because add_dt_node() -> dlpar_attach_node() attaches only the
parent node returned from configure-connector, ignoring any children. This
should be corrected for the general case, but fixing that won't help with
the stale OF node references, which is the more urgent problem.
One way to address that would be to make the drivers respond to node
removal notifications, so that node references can be dropped
appropriately. But this would likely force the drivers to disrupt active
clients for no useful purpose: equivalent nodes are immediately re-added.
And recall that the acceleration capabilities described by the nodes remain
available throughout the whole process.
The solution I believe to be robust for this situation is to convert
remove+add of a node with an unchanged phandle to an update of the node's
properties in the Linux device tree structure. That would involve changing
and adding a fair amount of code, and may take several iterations to land.
Until that can be realized we have a confirmed use-after-free and the
possibility of memory corruption. So add a limited workaround that
discriminates on the node type, ignoring adds and removes. This should be
amenable to backporting in the meantime.
Fixes: 410bccf97881 ("powerpc/pseries: Partition migration in the kernel")
Cc: stable(a)vger.kernel.org
Signed-off-by: Nathan Lynch <nathanl(a)linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20211020194703.2613093-1-nathanl@linux.ibm.com
diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index e83e0891272d..210a37a065fb 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -63,6 +63,27 @@ static int mobility_rtas_call(int token, char *buf, s32 scope)
static int delete_dt_node(struct device_node *dn)
{
+ struct device_node *pdn;
+ bool is_platfac;
+
+ pdn = of_get_parent(dn);
+ is_platfac = of_node_is_type(dn, "ibm,platform-facilities") ||
+ of_node_is_type(pdn, "ibm,platform-facilities");
+ of_node_put(pdn);
+
+ /*
+ * The drivers that bind to nodes in the platform-facilities
+ * hierarchy don't support node removal, and the removal directive
+ * from firmware is always followed by an add of an equivalent
+ * node. The capability (e.g. RNG, encryption, compression)
+ * represented by the node is never interrupted by the migration.
+ * So ignore changes to this part of the tree.
+ */
+ if (is_platfac) {
+ pr_notice("ignoring remove operation for %pOFfp\n", dn);
+ return 0;
+ }
+
pr_debug("removing node %pOFfp\n", dn);
dlpar_detach_node(dn);
return 0;
@@ -222,6 +243,19 @@ static int add_dt_node(struct device_node *parent_dn, __be32 drc_index)
if (!dn)
return -ENOENT;
+ /*
+ * Since delete_dt_node() ignores this node type, this is the
+ * necessary counterpart. We also know that a platform-facilities
+ * node returned from dlpar_configure_connector() has children
+ * attached, and dlpar_attach_node() only adds the parent, leaking
+ * the children. So ignore these on the add side for now.
+ */
+ if (of_node_is_type(dn, "ibm,platform-facilities")) {
+ pr_notice("ignoring add operation for %pOF\n", dn);
+ dlpar_free_cc_nodes(dn);
+ return 0;
+ }
+
rc = dlpar_attach_node(dn, parent_dn);
if (rc)
dlpar_free_cc_nodes(dn);
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 319fa1a52e438a6e028329187783a25ad498c4e6 Mon Sep 17 00:00:00 2001
From: Nathan Lynch <nathanl(a)linux.ibm.com>
Date: Wed, 20 Oct 2021 14:47:03 -0500
Subject: [PATCH] powerpc/pseries/mobility: ignore ibm, platform-facilities
updates
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
On VMs with NX encryption, compression, and/or RNG offload, these
capabilities are described by nodes in the ibm,platform-facilities device
tree hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
├── ibm,compression-v1
├── ibm,random-v1
└── ibm,sym-encryption-v1
3 directories
The acceleration functions that these nodes describe are not disrupted by
live migration, not even temporarily.
But the post-migration ibm,update-nodes sequence firmware always sends
"delete" messages for this hierarchy, followed by an "add" directive to
reconstruct it via ibm,configure-connector (log with debugging statements
enabled in mobility.c):
mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285
mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284
mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283
mobility: removing node /ibm,platform-facilities:4294967286
...
mobility: added node /ibm,platform-facilities:4294967286
Note we receive a single "add" message for the entire hierarchy, and what
we receive from the ibm,configure-connector sequence is the top-level
platform-facilities node along with its three children. The debug message
simply reports the parent node and not the whole subtree.
Also, significantly, the nodes added are almost completely equivalent to
the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in
the leaf nodes is the only property I've observed to differ, and Linux does
not use that. So in practice, the sum of update messages Linux receives for
this hierarchy is equivalent to minor property updates.
We succeed in removing the original hierarchy from the device tree. But the
vio bus code is ignorant of this, and does not unbind or relinquish its
references. The leaf nodes, still reachable through sysfs, of course still
refer to the now-freed ibm,platform-facilities parent node, which makes
use-after-free possible:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 3 PID: 1706 at lib/refcount.c:25 refcount_warn_saturate+0x164/0x1f0
refcount_warn_saturate+0x160/0x1f0 (unreliable)
kobject_get+0xf0/0x100
of_node_get+0x30/0x50
of_get_parent+0x50/0xb0
of_fwnode_get_parent+0x54/0x90
fwnode_count_parents+0x50/0x150
fwnode_full_name_string+0x30/0x110
device_node_string+0x49c/0x790
vsnprintf+0x1c0/0x4c0
sprintf+0x44/0x60
devspec_show+0x34/0x50
dev_attr_show+0x40/0xa0
sysfs_kf_seq_show+0xbc/0x200
kernfs_seq_show+0x44/0x60
seq_read_iter+0x2a4/0x740
kernfs_fop_read_iter+0x254/0x2e0
new_sync_read+0x120/0x190
vfs_read+0x1d0/0x240
Moreover, the "new" replacement subtree is not correctly added to the
device tree, resulting in ibm,platform-facilities parent node without the
appropriate leaf nodes, and broken symlinks in the sysfs device hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
0 directories
$ cd /sys/devices/vio ; find . -xtype l -exec file {} +
./ibm,sym-encryption-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1
./ibm,random-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1
./ibm,compression-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1
This is because add_dt_node() -> dlpar_attach_node() attaches only the
parent node returned from configure-connector, ignoring any children. This
should be corrected for the general case, but fixing that won't help with
the stale OF node references, which is the more urgent problem.
One way to address that would be to make the drivers respond to node
removal notifications, so that node references can be dropped
appropriately. But this would likely force the drivers to disrupt active
clients for no useful purpose: equivalent nodes are immediately re-added.
And recall that the acceleration capabilities described by the nodes remain
available throughout the whole process.
The solution I believe to be robust for this situation is to convert
remove+add of a node with an unchanged phandle to an update of the node's
properties in the Linux device tree structure. That would involve changing
and adding a fair amount of code, and may take several iterations to land.
Until that can be realized we have a confirmed use-after-free and the
possibility of memory corruption. So add a limited workaround that
discriminates on the node type, ignoring adds and removes. This should be
amenable to backporting in the meantime.
Fixes: 410bccf97881 ("powerpc/pseries: Partition migration in the kernel")
Cc: stable(a)vger.kernel.org
Signed-off-by: Nathan Lynch <nathanl(a)linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20211020194703.2613093-1-nathanl@linux.ibm.com
diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index e83e0891272d..210a37a065fb 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -63,6 +63,27 @@ static int mobility_rtas_call(int token, char *buf, s32 scope)
static int delete_dt_node(struct device_node *dn)
{
+ struct device_node *pdn;
+ bool is_platfac;
+
+ pdn = of_get_parent(dn);
+ is_platfac = of_node_is_type(dn, "ibm,platform-facilities") ||
+ of_node_is_type(pdn, "ibm,platform-facilities");
+ of_node_put(pdn);
+
+ /*
+ * The drivers that bind to nodes in the platform-facilities
+ * hierarchy don't support node removal, and the removal directive
+ * from firmware is always followed by an add of an equivalent
+ * node. The capability (e.g. RNG, encryption, compression)
+ * represented by the node is never interrupted by the migration.
+ * So ignore changes to this part of the tree.
+ */
+ if (is_platfac) {
+ pr_notice("ignoring remove operation for %pOFfp\n", dn);
+ return 0;
+ }
+
pr_debug("removing node %pOFfp\n", dn);
dlpar_detach_node(dn);
return 0;
@@ -222,6 +243,19 @@ static int add_dt_node(struct device_node *parent_dn, __be32 drc_index)
if (!dn)
return -ENOENT;
+ /*
+ * Since delete_dt_node() ignores this node type, this is the
+ * necessary counterpart. We also know that a platform-facilities
+ * node returned from dlpar_configure_connector() has children
+ * attached, and dlpar_attach_node() only adds the parent, leaking
+ * the children. So ignore these on the add side for now.
+ */
+ if (of_node_is_type(dn, "ibm,platform-facilities")) {
+ pr_notice("ignoring add operation for %pOF\n", dn);
+ dlpar_free_cc_nodes(dn);
+ return 0;
+ }
+
rc = dlpar_attach_node(dn, parent_dn);
if (rc)
dlpar_free_cc_nodes(dn);
The patch below does not apply to the 5.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 8468e937df1f31411d1e127fa38db064af051fe5 Mon Sep 17 00:00:00 2001
From: Rongwei Wang <rongwei.wang(a)linux.alibaba.com>
Date: Fri, 5 Nov 2021 13:43:44 -0700
Subject: [PATCH] mm, thp: fix incorrect unmap behavior for private pages
When truncating pagecache on file THP, the private pages of a process
should not be unmapped mapping. This incorrect behavior on a dynamic
shared libraries which will cause related processes to happen core dump.
A simple test for a DSO (Prerequisite is the DSO mapped in file THP):
int main(int argc, char *argv[])
{
int fd;
fd = open(argv[1], O_WRONLY);
if (fd < 0) {
perror("open");
}
close(fd);
return 0;
}
The test only to open a target DSO, and do nothing. But this operation
will lead one or more process to happen core dump. This patch mainly to
fix this bug.
Link: https://lkml.kernel.org/r/20211025092134.18562-3-rongwei.wang@linux.alibaba…
Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Signed-off-by: Rongwei Wang <rongwei.wang(a)linux.alibaba.com>
Tested-by: Xu Yu <xuyu(a)linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Song Liu <song(a)kernel.org>
Cc: William Kucharski <william.kucharski(a)oracle.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Collin Fijalkovich <cfijalkovich(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/fs/open.c b/fs/open.c
index 9ec3cfca3b1a..e0df1536eb69 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -857,8 +857,17 @@ static int do_dentry_open(struct file *f,
*/
smp_mb();
if (filemap_nr_thps(inode->i_mapping)) {
+ struct address_space *mapping = inode->i_mapping;
+
filemap_invalidate_lock(inode->i_mapping);
- truncate_pagecache(inode, 0);
+ /*
+ * unmap_mapping_range just need to be called once
+ * here, because the private pages is not need to be
+ * unmapped mapping (e.g. data segment of dynamic
+ * shared libraries here).
+ */
+ unmap_mapping_range(mapping, 0, 0, 0);
+ truncate_inode_pages(mapping, 0);
filemap_invalidate_unlock(inode->i_mapping);
}
}
The patch below does not apply to the 5.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 55fc0d91746759c71bc165bba62a2db64ac98e35 Mon Sep 17 00:00:00 2001
From: Rongwei Wang <rongwei.wang(a)linux.alibaba.com>
Date: Fri, 5 Nov 2021 13:43:41 -0700
Subject: [PATCH] mm, thp: lock filemap when truncating page cache
Patch series "fix two bugs for file THP".
This patch (of 2):
Transparent huge page has supported read-only non-shmem files. The
file- backed THP is collapsed by khugepaged and truncated when written
(for shared libraries).
However, there is a race when multiple writers truncate the same page
cache concurrently.
In that case, subpage(s) of file THP can be revealed by find_get_entry
in truncate_inode_pages_range, which will trigger PageTail BUG_ON in
truncate_inode_page, as follows:
page:000000009e420ff2 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff pfn:0x50c3ff
head:0000000075ff816d order:9 compound_mapcount:0 compound_pincount:0
flags: 0x37fffe0000010815(locked|uptodate|lru|arch_1|head)
raw: 37fffe0000000000 fffffe0013108001 dead000000000122 dead000000000400
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
head: 37fffe0000010815 fffffe001066bd48 ffff000404183c20 0000000000000000
head: 0000000000000600 0000000000000000 00000001ffffffff ffff000c0345a000
page dumped because: VM_BUG_ON_PAGE(PageTail(page))
------------[ cut here ]------------
kernel BUG at mm/truncate.c:213!
Internal error: Oops - BUG: 0 [#1] SMP
Modules linked in: xfs(E) libcrc32c(E) rfkill(E) ...
CPU: 14 PID: 11394 Comm: check_madvise_d Kdump: ...
Hardware name: ECS, BIOS 0.0.0 02/06/2015
pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
Call trace:
truncate_inode_page+0x64/0x70
truncate_inode_pages_range+0x550/0x7e4
truncate_pagecache+0x58/0x80
do_dentry_open+0x1e4/0x3c0
vfs_open+0x38/0x44
do_open+0x1f0/0x310
path_openat+0x114/0x1dc
do_filp_open+0x84/0x134
do_sys_openat2+0xbc/0x164
__arm64_sys_openat+0x74/0xc0
el0_svc_common.constprop.0+0x88/0x220
do_el0_svc+0x30/0xa0
el0_svc+0x20/0x30
el0_sync_handler+0x1a4/0x1b0
el0_sync+0x180/0x1c0
Code: aa0103e0 900061e1 910ec021 9400d300 (d4210000)
This patch mainly to lock filemap when one enter truncate_pagecache(),
avoiding truncating the same page cache concurrently.
Link: https://lkml.kernel.org/r/20211025092134.18562-1-rongwei.wang@linux.alibaba…
Link: https://lkml.kernel.org/r/20211025092134.18562-2-rongwei.wang@linux.alibaba…
Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Signed-off-by: Xu Yu <xuyu(a)linux.alibaba.com>
Signed-off-by: Rongwei Wang <rongwei.wang(a)linux.alibaba.com>
Suggested-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Tested-by: Song Liu <song(a)kernel.org>
Cc: Collin Fijalkovich <cfijalkovich(a)google.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: William Kucharski <william.kucharski(a)oracle.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/fs/open.c b/fs/open.c
index daa324606a41..9ec3cfca3b1a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -856,8 +856,11 @@ static int do_dentry_open(struct file *f,
* of THPs into the page cache will fail.
*/
smp_mb();
- if (filemap_nr_thps(inode->i_mapping))
+ if (filemap_nr_thps(inode->i_mapping)) {
+ filemap_invalidate_lock(inode->i_mapping);
truncate_pagecache(inode, 0);
+ filemap_invalidate_unlock(inode->i_mapping);
+ }
}
return 0;
The patch below does not apply to the 5.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 8468f45091d2866affed6f6a7aecc20779139173 Mon Sep 17 00:00:00 2001
From: Coly Li <colyli(a)suse.de>
Date: Wed, 3 Nov 2021 14:49:17 +0800
Subject: [PATCH] bcache: fix use-after-free problem in bcache_device_free()
In bcache_device_free(), pointer disk is referenced still in
ida_simple_remove() after blk_cleanup_disk() gets called on this
pointer. This may cause a potential panic by use-after-free on the
disk pointer.
This patch fixes the problem by calling blk_cleanup_disk() after
ida_simple_remove().
Fixes: bc70852fd104 ("bcache: convert to blk_alloc_disk/blk_cleanup_disk")
Signed-off-by: Coly Li <colyli(a)suse.de>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Hannes Reinecke <hare(a)suse.de>
Cc: Ulf Hansson <ulf.hansson(a)linaro.org>
Cc: stable(a)vger.kernel.org # v5.14+
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Link: https://lore.kernel.org/r/20211103064917.67383-1-colyli@suse.de
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 84a48eed8e24..a7bb3355b776 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -885,9 +885,9 @@ static void bcache_device_free(struct bcache_device *d)
bcache_device_detach(d);
if (disk) {
- blk_cleanup_disk(disk);
ida_simple_remove(&bcache_device_idx,
first_minor_to_idx(disk->first_minor));
+ blk_cleanup_disk(disk);
}
bioset_exit(&d->bio_split);
This is a backport of the remaining patches from the below series:
https://lore.kernel.org/all/cover.1633464148.git.naveen.n.rao@linux.vnet.ib…
Kindly apply to the longterm tree for v5.10
Thanks,
Naveen
Naveen N. Rao (4):
powerpc/lib: Add helper to check if offset is within conditional
branch range
powerpc/bpf: Validate branch ranges
powerpc/security: Add a helper to query stf_barrier type
powerpc/bpf: Emit stf barrier instruction sequences for BPF_NOSPEC
arch/powerpc/include/asm/code-patching.h | 1 +
arch/powerpc/include/asm/security_features.h | 5 ++
arch/powerpc/kernel/security.c | 5 ++
arch/powerpc/lib/code-patching.c | 7 ++-
arch/powerpc/net/bpf_jit.h | 33 ++++++----
arch/powerpc/net/bpf_jit64.h | 8 +--
arch/powerpc/net/bpf_jit_comp64.c | 64 ++++++++++++++++++--
7 files changed, 100 insertions(+), 23 deletions(-)
base-commit: 5040520482a594e92d4f69141229a6dd26173511
--
2.33.1
The patch below does not apply to the 5.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d3e3c102d107bb84251455a298cf475f24bab995 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Thu, 11 Nov 2021 17:32:53 -0700
Subject: [PATCH] io-wq: serialize hash clear with wakeup
We need to ensure that we serialize the stalled and hash bits with the
wait_queue wait handler, or we could be racing with someone modifying
the hashed state after we find it busy, but before we then give up and
wait for it to be cleared. This can cause random delays or stalls when
handling buffered writes for many files, where some of these files cause
hash collisions between the worker threads.
Cc: stable(a)vger.kernel.org
Reported-by: Daniel Black <daniel(a)mariadb.org>
Fixes: e941894eae31 ("io-wq: make buffered file write hashed work map per-ctx")
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io-wq.c b/fs/io-wq.c
index afd955d53db9..88202de519f6 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -423,9 +423,10 @@ static inline unsigned int io_get_work_hash(struct io_wq_work *work)
return work->flags >> IO_WQ_HASH_SHIFT;
}
-static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
+static bool io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
{
struct io_wq *wq = wqe->wq;
+ bool ret = false;
spin_lock_irq(&wq->hash->wait.lock);
if (list_empty(&wqe->wait.entry)) {
@@ -433,9 +434,11 @@ static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
if (!test_bit(hash, &wq->hash->map)) {
__set_current_state(TASK_RUNNING);
list_del_init(&wqe->wait.entry);
+ ret = true;
}
}
spin_unlock_irq(&wq->hash->wait.lock);
+ return ret;
}
static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
@@ -475,14 +478,21 @@ static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
}
if (stall_hash != -1U) {
+ bool unstalled;
+
/*
* Set this before dropping the lock to avoid racing with new
* work being added and clearing the stalled bit.
*/
set_bit(IO_ACCT_STALLED_BIT, &acct->flags);
raw_spin_unlock(&wqe->lock);
- io_wait_on_hash(wqe, stall_hash);
+ unstalled = io_wait_on_hash(wqe, stall_hash);
raw_spin_lock(&wqe->lock);
+ if (unstalled) {
+ clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
+ if (wq_has_sleeper(&wqe->wq->hash->wait))
+ wake_up(&wqe->wq->hash->wait);
+ }
}
return NULL;
@@ -564,8 +574,11 @@ static void io_worker_handle_work(struct io_worker *worker)
io_wqe_enqueue(wqe, linked);
if (hash != -1U && !next_hashed) {
+ /* serialize hash clear with wake_up() */
+ spin_lock_irq(&wq->hash->wait.lock);
clear_bit(hash, &wq->hash->map);
clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
+ spin_unlock_irq(&wq->hash->wait.lock);
if (wq_has_sleeper(&wq->hash->wait))
wake_up(&wq->hash->wait);
raw_spin_lock(&wqe->lock);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 35e4c6c1a2fc2eb11b9306e95cda1fa06a511948 Mon Sep 17 00:00:00 2001
From: Shin'ichiro Kawasaki <shinichiro.kawasaki(a)wdc.com>
Date: Tue, 9 Nov 2021 19:47:23 +0900
Subject: [PATCH] block: Hold invalidate_lock in BLKZEROOUT ioctl
When BLKZEROOUT ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is modified to call "blkdiscard -z" command
and repeated hundreds of times.
This patch can be applied back to the stable kernel version v5.15.y.
Rework is required for older stable kernels.
Fixes: 22dd6d356628 ("block: invalidate the page cache when issuing BLKZEROOUT")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki(a)wdc.com>
Cc: stable(a)vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack(a)suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-3-shinichiro.kawasaki@wdc.c…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/block/ioctl.c b/block/ioctl.c
index 9fa87f64f703..0a1d10ac2e1a 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -154,6 +154,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
{
uint64_t range[2];
uint64_t start, end, len;
+ struct inode *inode = bdev->bd_inode;
int err;
if (!(mode & FMODE_WRITE))
@@ -176,12 +177,17 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
return -EINVAL;
/* Invalidate the page cache, including dirty pages */
+ filemap_invalidate_lock(inode->i_mapping);
err = truncate_bdev_range(bdev, mode, start, end);
if (err)
- return err;
+ goto fail;
+
+ err = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
+ BLKDEV_ZERO_NOUNMAP);
- return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
- BLKDEV_ZERO_NOUNMAP);
+fail:
+ filemap_invalidate_unlock(inode->i_mapping);
+ return err;
}
static int put_ushort(unsigned short __user *argp, unsigned short val)
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 35e4c6c1a2fc2eb11b9306e95cda1fa06a511948 Mon Sep 17 00:00:00 2001
From: Shin'ichiro Kawasaki <shinichiro.kawasaki(a)wdc.com>
Date: Tue, 9 Nov 2021 19:47:23 +0900
Subject: [PATCH] block: Hold invalidate_lock in BLKZEROOUT ioctl
When BLKZEROOUT ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is modified to call "blkdiscard -z" command
and repeated hundreds of times.
This patch can be applied back to the stable kernel version v5.15.y.
Rework is required for older stable kernels.
Fixes: 22dd6d356628 ("block: invalidate the page cache when issuing BLKZEROOUT")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki(a)wdc.com>
Cc: stable(a)vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack(a)suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-3-shinichiro.kawasaki@wdc.c…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/block/ioctl.c b/block/ioctl.c
index 9fa87f64f703..0a1d10ac2e1a 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -154,6 +154,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
{
uint64_t range[2];
uint64_t start, end, len;
+ struct inode *inode = bdev->bd_inode;
int err;
if (!(mode & FMODE_WRITE))
@@ -176,12 +177,17 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
return -EINVAL;
/* Invalidate the page cache, including dirty pages */
+ filemap_invalidate_lock(inode->i_mapping);
err = truncate_bdev_range(bdev, mode, start, end);
if (err)
- return err;
+ goto fail;
+
+ err = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
+ BLKDEV_ZERO_NOUNMAP);
- return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
- BLKDEV_ZERO_NOUNMAP);
+fail:
+ filemap_invalidate_unlock(inode->i_mapping);
+ return err;
}
static int put_ushort(unsigned short __user *argp, unsigned short val)
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 35e4c6c1a2fc2eb11b9306e95cda1fa06a511948 Mon Sep 17 00:00:00 2001
From: Shin'ichiro Kawasaki <shinichiro.kawasaki(a)wdc.com>
Date: Tue, 9 Nov 2021 19:47:23 +0900
Subject: [PATCH] block: Hold invalidate_lock in BLKZEROOUT ioctl
When BLKZEROOUT ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is modified to call "blkdiscard -z" command
and repeated hundreds of times.
This patch can be applied back to the stable kernel version v5.15.y.
Rework is required for older stable kernels.
Fixes: 22dd6d356628 ("block: invalidate the page cache when issuing BLKZEROOUT")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki(a)wdc.com>
Cc: stable(a)vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack(a)suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-3-shinichiro.kawasaki@wdc.c…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/block/ioctl.c b/block/ioctl.c
index 9fa87f64f703..0a1d10ac2e1a 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -154,6 +154,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
{
uint64_t range[2];
uint64_t start, end, len;
+ struct inode *inode = bdev->bd_inode;
int err;
if (!(mode & FMODE_WRITE))
@@ -176,12 +177,17 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
return -EINVAL;
/* Invalidate the page cache, including dirty pages */
+ filemap_invalidate_lock(inode->i_mapping);
err = truncate_bdev_range(bdev, mode, start, end);
if (err)
- return err;
+ goto fail;
+
+ err = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
+ BLKDEV_ZERO_NOUNMAP);
- return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
- BLKDEV_ZERO_NOUNMAP);
+fail:
+ filemap_invalidate_unlock(inode->i_mapping);
+ return err;
}
static int put_ushort(unsigned short __user *argp, unsigned short val)
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 7607c44c157d343223510c8ffdf7206fdd2a6213 Mon Sep 17 00:00:00 2001
From: Shin'ichiro Kawasaki <shinichiro.kawasaki(a)wdc.com>
Date: Tue, 9 Nov 2021 19:47:22 +0900
Subject: [PATCH] block: Hold invalidate_lock in BLKDISCARD ioctl
When BLKDISCARD ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is repeated hundreds of times.
This patch can be applied back to the stable kernel version v5.15.y
with slight patch edit. Rework is required for older stable kernels.
Fixes: 351499a172c0 ("block: Invalidate cache on discard v2")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki(a)wdc.com>
Cc: stable(a)vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack(a)suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-2-shinichiro.kawasaki@wdc.c…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/block/ioctl.c b/block/ioctl.c
index d6af0ac97e57..9fa87f64f703 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -113,6 +113,7 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
uint64_t range[2];
uint64_t start, len;
struct request_queue *q = bdev_get_queue(bdev);
+ struct inode *inode = bdev->bd_inode;
int err;
if (!(mode & FMODE_WRITE))
@@ -135,12 +136,17 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
if (start + len > bdev_nr_bytes(bdev))
return -EINVAL;
+ filemap_invalidate_lock(inode->i_mapping);
err = truncate_bdev_range(bdev, mode, start, start + len - 1);
if (err)
- return err;
+ goto fail;
+
+ err = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+ GFP_KERNEL, flags);
- return blkdev_issue_discard(bdev, start >> 9, len >> 9,
- GFP_KERNEL, flags);
+fail:
+ filemap_invalidate_unlock(inode->i_mapping);
+ return err;
}
static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 7607c44c157d343223510c8ffdf7206fdd2a6213 Mon Sep 17 00:00:00 2001
From: Shin'ichiro Kawasaki <shinichiro.kawasaki(a)wdc.com>
Date: Tue, 9 Nov 2021 19:47:22 +0900
Subject: [PATCH] block: Hold invalidate_lock in BLKDISCARD ioctl
When BLKDISCARD ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is repeated hundreds of times.
This patch can be applied back to the stable kernel version v5.15.y
with slight patch edit. Rework is required for older stable kernels.
Fixes: 351499a172c0 ("block: Invalidate cache on discard v2")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki(a)wdc.com>
Cc: stable(a)vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack(a)suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-2-shinichiro.kawasaki@wdc.c…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/block/ioctl.c b/block/ioctl.c
index d6af0ac97e57..9fa87f64f703 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -113,6 +113,7 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
uint64_t range[2];
uint64_t start, len;
struct request_queue *q = bdev_get_queue(bdev);
+ struct inode *inode = bdev->bd_inode;
int err;
if (!(mode & FMODE_WRITE))
@@ -135,12 +136,17 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
if (start + len > bdev_nr_bytes(bdev))
return -EINVAL;
+ filemap_invalidate_lock(inode->i_mapping);
err = truncate_bdev_range(bdev, mode, start, start + len - 1);
if (err)
- return err;
+ goto fail;
+
+ err = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+ GFP_KERNEL, flags);
- return blkdev_issue_discard(bdev, start >> 9, len >> 9,
- GFP_KERNEL, flags);
+fail:
+ filemap_invalidate_unlock(inode->i_mapping);
+ return err;
}
static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,